CS 575 Supercomputing for the Sciences
Ch. 13 Language Support for Performance

24 November 2003

Dr. Kris Stewart (stewart@rohan.sdsu.edu)
San Diego State University

MPI on Rohan 10Nov03

Consider the classic problem of a one-dimensional metal plate (a "rod") that is initially at 0 degrees Celsius. We place one end in 100 degree steam and the other end in 0 degree ice, and we want to simulate the heat flow into the adjacent cells of the discretized rod.

Fig. 13-1: Heat flow in a rod


© O'Reilly Publishers (Used with permission)

Simplistic implementation:
        PROGRAM HEATROD
        PARAMETER (MAXTIME=200)
        INTEGER TICKS, I, MAXTIME
        REAL*4 ROD(10)
        ROD(1) = 100.0
        DO I=2,9
                ROD(I) = 0.0
        ENDDO
        ROD(10)=0.0

        DO TICKS=1,MAXTIME
                IF (MOD(TICKS,20) .EQ. 1) PRINT 100, TICKS, (ROD(I), I=1,10)
                DO I=2,9
                        ROD(I) = (ROD(I-1) + ROD(I+1))/2
                ENDDO
        ENDDO
100     FORMAT (I4,10F7.2)
        END

We compile this code on rohan with the command f90 p274rod.f90 and execute it to obtain the output:

   1 100.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
  21 100.00  87.04  74.52  62.54  51.15  40.30  29.91  19.83   9.92   0.00
  41 100.00  88.74  77.51  66.32  55.19  44.10  33.05  22.02  11.01   0.00
  61 100.00  88.88  77.76  66.64  55.53  44.42  33.31  22.21  11.10   0.00
  81 100.00  88.89  77.78  66.66  55.55  44.44  33.33  22.22  11.11   0.00
 101 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 121 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 141 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 161 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 181 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
We have formatted the output to show only two digits past the decimal point. At that precision the rod has clearly reached steady state by step 101.
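
The steady-state row can be checked by hand. The following is a short derivation, not part of the original notes; the symbol T_i (temperature of cell i) is notation introduced here for illustration. At steady state every interior cell equals the average of its two neighbors, so the temperature drops by the same amount from cell to cell:

```latex
\begin{align*}
T_i &= \tfrac{1}{2}\,(T_{i-1} + T_{i+1}), \quad i = 2,\dots,9, \qquad T_1 = 100,\; T_{10} = 0 \\
&\Rightarrow\; T_{i+1} - T_i = T_i - T_{i-1} = -\tfrac{100}{9} \\
&\Rightarrow\; T_i = 100\,\frac{10 - i}{9}, \qquad
T_2 = \tfrac{800}{9} \approx 88.89,\quad
T_5 = \tfrac{500}{9} \approx 55.56,\quad
T_9 = \tfrac{100}{9} \approx 11.11
\end{align*}
```

These values agree with the rows printed from step 101 onward.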

We obtain a listing and request parallel optimization with the command f90 -Xlist -parallel p274rod.f90. The listing shows that the main loop was not vectorized, because each iteration reads ROD(I-1), which the previous iteration has just written:
p274rod.lst.

Fig. 13-2: Computing the new value for a cell

© O'Reilly Publishers (Used with permission)
A more vectorizable (and therefore parallelizable) technique uses the red-black order, in which we alternate between two vectors. Each new vector is computed only from the values in the other one, which more closely models the way heat physically progresses through the rod. This is illustrated below.

        PROGRAM HEATRED
        PARAMETER (MAXTIME=200)
        INTEGER TICKS, I, MAXTIME
        REAL*4 RED(10), BLACK(10)

        RED(1) = 100.0
        BLACK(1) = 100.0
        DO I=2,9
                RED(I) = 0.0
        ENDDO
        RED(10) = 0.0
        BLACK(10) = 0.0

        DO TICKS=1,MAXTIME,2
                IF (MOD(TICKS,20) .EQ. 1) PRINT 100, TICKS, (RED(I), I=1,10)
                DO I=2,9
                   BLACK(I) = (RED(I-1) + RED(I+1))/2
                ENDDO
                DO I=2,9
                   RED(I) = (BLACK(I-1) + BLACK(I+1))/2
                ENDDO
        ENDDO
100     FORMAT (I4,10F7.2)
        END

Fig. 13-3: Using two arrays to eliminate a dependency

© O'Reilly Publishers (Used with permission)

This yields the output:
rohan.stewart:~/cs575/spr01/diffusion> f90 -Xlist -parallel p275red.f90
f90: Warning: Optimizer level changed from 0 to 3 to support parallelized code
rohan.stewart:~/cs575/spr01/diffusion> a.out
   1 100.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
  21 100.00  82.38  66.34  50.30  38.18  26.06  18.20  10.35   5.18   0.00
  41 100.00  87.04  74.52  61.99  50.56  39.13  28.94  18.75   9.38   0.00
  61 100.00  88.36  76.84  65.32  54.12  42.91  32.07  21.22  10.61   0.00
  81 100.00  88.74  77.51  66.28  55.14  44.00  32.97  21.93  10.97   0.00
 101 100.00  88.84  77.70  66.55  55.44  44.32  33.23  22.14  11.07   0.00
 121 100.00  88.88  77.76  66.63  55.52  44.41  33.30  22.20  11.10   0.00
 141 100.00  88.89  77.77  66.66  55.55  44.43  33.32  22.22  11.11   0.00
 161 100.00  88.89  77.78  66.66  55.55  44.44  33.33  22.22  11.11   0.00
 181 100.00  88.89  77.78  66.67  55.55  44.44  33.33  22.22  11.11   0.00

To two digits, this version has only just about reached steady state by step 181 (one cell still reads 55.55 rather than 55.56), so convergence is somewhat slower. In exchange, we have eliminated the dependency in the loop, so it vectorizes and therefore parallelizes, as we see in the listing
p275red.lst.
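
The difference in convergence speed can also be measured instead of read off the printed rows. The program below is a minimal sketch, not part of the original lecture: the program name SWEEPS, the tolerance TOL, the limit MAXSWEEP, and the stopping rule (stop when no cell changes by more than TOL in one sweep) are all assumptions chosen for illustration. Note that "small change per sweep" is only a proxy for "close to steady state".

```fortran
! Sketch: count update sweeps until no cell changes by more than TOL.
! SWEEPS, TOL, MAXSWEEP and the stopping rule are illustrative assumptions.
PROGRAM SWEEPS
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10, MAXSWEEP = 1000
  REAL, PARAMETER :: TOL = 0.005
  REAL :: ROD(N), OLD(N)
  INTEGER :: I, NINPLACE, NTWOARR

  ! In-place update: cell I already sees the new value of cell I-1.
  ROD = 0.0
  ROD(1) = 100.0
  DO NINPLACE = 1, MAXSWEEP
     OLD = ROD
     DO I = 2, N-1
        ROD(I) = (ROD(I-1) + ROD(I+1))/2
     ENDDO
     IF (MAXVAL(ABS(ROD - OLD)) < TOL) EXIT
  ENDDO

  ! Two-array update: every cell uses only values from the previous sweep.
  ROD = 0.0
  ROD(1) = 100.0
  DO NTWOARR = 1, MAXSWEEP
     OLD = ROD
     DO I = 2, N-1
        ROD(I) = (OLD(I-1) + OLD(I+1))/2
     ENDDO
     IF (MAXVAL(ABS(ROD - OLD)) < TOL) EXIT
  ENDDO

  ! A count of MAXSWEEP+1 would mean the tolerance was never reached.
  PRINT *, 'in-place sweeps: ', NINPLACE
  PRINT *, 'two-array sweeps:', NTWOARR
END PROGRAM SWEEPS
```

Consistent with the printed outputs above, one would expect the in-place count to be the smaller of the two. The second inner loop has no dependency between iterations, which is what allows the compiler to vectorize it.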

Reality Check

The times to compile and run these examples are all recorded in the following script file:
Script for Lecture

Fortran 90

To see how best to use Fortran 90, we want to think in SIMD mode, focusing more on the data and less on the control flow.

Fig. 13-4: Data alignment and computations
© O'Reilly Publishers (Used with permission)

Using the array sections that Fortran 90 provides, we can easily rewrite the first version of the heat-in-the-rod program as
        PROGRAM HEATROD
        PARAMETER (MAXTIME=200)
        INTEGER TICKS, I, MAXTIME
        REAL*4 ROD(10)
        ROD(1) = 100.0
        DO I=2,9
                ROD(I) = 0.0
        ENDDO
        ROD(10)=0.0

        DO TICKS=1,MAXTIME
                IF (MOD(TICKS,20) .EQ. 1) PRINT 100, TICKS, (ROD(I), I=1,10)
                ROD(2:9) = (ROD(1:8) + ROD(3:10))/2
        ENDDO
100     FORMAT (I4,10F7.2)
        END
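Here the inner loop has been replaced with the single statement

ROD(2:9) = (ROD(1:8) + ROD(3:10))/2

Compiling and executing, we obtain output identical to that of the red-black ordering example rather than that of the original loop. This is because Fortran 90 evaluates the entire right-hand side of an array assignment before storing any element of the left-hand side, so every cell is computed from the values of the previous time step.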

f90 -Xlist -parallel p283heatrod.f90 -o p283heatrod
f90: Warning: Optimizer level changed from 0 to 3 to support parallelized code
rohan.stewart:~/cs575/spr01/diffusion> p283heatrod
   1 100.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
  21 100.00  82.38  66.34  50.30  38.18  26.06  18.20  10.35   5.18   0.00
  41 100.00  87.04  74.52  61.99  50.56  39.13  28.94  18.75   9.38   0.00
  61 100.00  88.36  76.84  65.32  54.12  42.91  32.07  21.22  10.61   0.00
  81 100.00  88.74  77.51  66.28  55.14  44.00  32.97  21.93  10.97   0.00
 101 100.00  88.84  77.70  66.55  55.44  44.32  33.23  22.14  11.07   0.00
 121 100.00  88.88  77.76  66.63  55.52  44.41  33.30  22.20  11.10   0.00
 141 100.00  88.89  77.77  66.66  55.55  44.43  33.32  22.22  11.11   0.00
 161 100.00  88.89  77.78  66.66  55.55  44.44  33.33  22.22  11.11   0.00
 181 100.00  88.89  77.78  66.67  55.55  44.44  33.33  22.22  11.11   0.00


We can see that the compiler listing flags the array operations in
p283heatrod.lst
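
A single update from the initial state makes the difference between the DO loop and the array-section statement visible. The program below is a minimal sketch for illustration only: the program name ONESTEP, the array names A and B, and the output format are assumptions, not part of the original examples.

```fortran
! Sketch: apply one update to the initial rod in two ways and print both.
! ONESTEP, A, B and the output format are illustrative assumptions.
PROGRAM ONESTEP
  IMPLICIT NONE
  REAL :: A(10), B(10)
  INTEGER :: I

  ! Same initial state for both copies: 100 at the steam end, 0 elsewhere.
  A = 0.0
  A(1) = 100.0
  B = A

  ! DO loop: A(I-1) has already been overwritten when A(I) is computed.
  DO I = 2, 9
     A(I) = (A(I-1) + A(I+1))/2
  ENDDO

  ! Array section: the whole right-hand side is evaluated from the old
  ! values of B before any element of B(2:9) is stored.
  B(2:9) = (B(1:8) + B(3:10))/2

  PRINT 100, A
  PRINT 100, B
100 FORMAT (10F10.5)
END PROGRAM ONESTEP
```

After this one update the DO loop has already carried heat into every interior cell (50, 25, 12.5, and so on, halving at each cell), while the array-section statement has changed only the second cell, to 50. That is the same behavior as the two-array version, which explains why the two outputs match.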

Fortran 90 Portable, Scalable computing
As our text states at the end of Chapter 13,

"The good news is that both FORTRAN 90 and HPF provide one road map to portable scalable computing that doesn't require explicit message passing. The only question is which road we users will choose."
