CS 575 Supercomputing - Lecture Outline
Chapter 8: Loop Optimizations

October 22, 2003

Dr. Kris Stewart (stewart@rohan.sdsu.edu)
San Diego State University

This URL is http://www.stewart.cs.sdsu.edu/cs575/lecs/ch08.html

Why so much emphasis on loops?

First we will discuss Vectors and Loops and use some of the diagrams from the Appendix of our text.

Loops can be recognized by the compiler and mapped directly onto the hardware we will be using in November.

Chapter 8 provides guidance about why source code may fail to be vectorized and therefore not be a candidate for parallel loops. Keep in mind that our textbook in this chapter focuses on Fortran examples, but the same issues apply to C programs; you just interchange the reference to a column in Fortran with a row in C.

Operation Counting
This is a concept you are already familiar with through the Diffusion Matrix example and its counting of the floating point arithmetic operations associated with the Gaussian Elimination algorithm. But there is also the operation mix: number of loads, stores, floating-point, integer and library calls per iteration of a loop. Good instruction mixes for one computing platform may not have the same properties when you move to a totally different computing platform. For now, we choose to become familiar with Sun platforms such as Rohan. Consider the sample loops from our text:

First instruction mix example (p. 148)
DO I=1,N
   A(I,J,K) = A(I,J,K) + B(J,I,K)
ENDDO
One floating point operation and three memory references (2 loads; 1 store). A 3:1 ratio of memory references to floating point operations means we expect no more than 1/3 of peak floating point performance, unless we have more than one path to memory. Bad news for performance, but good information to know.

Second instruction mix example (p. 148)
DO I=1,N
A(I) = A(I) + B(J)
ENDDO
One floating point operation and two memory references per iteration (1 load; 1 store), a 2:1 ratio of memory references to floating point operations. The compiler should be smart enough to notice that B(J) is loop-invariant and load it only once, outside the loop.
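To make the loop-invariant hoist concrete, here is a minimal C sketch of the transformation the compiler is expected to perform (the function name and arguments are illustrative, not from the text):

```c
#include <stddef.h>

/* Sketch of the hoisted form of A(I) = A(I) + B(J):
   b[j] does not change inside the loop, so it is loaded into
   a scalar once, leaving one load and one store per iteration. */
void add_invariant(double *a, const double *b, size_t n, size_t j)
{
    double bj = b[j];           /* loop-invariant load, done once */
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + bj;       /* 1 load, 1 store, 1 flop */
}
```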

Now consider the C code to multiply two vectors of complex numbers, where element i has real part xr[i] and imaginary part xi[i].


for (i=0; i < n; i++) {
   tr = xr[i];  /* save the original values: the updated xr[i] */
   ti = xi[i];  /* must not feed into the imaginary part below */
   xr[i] = tr*yr[i] - ti*yi[i];
   xi[i] = tr*yi[i] + ti*yr[i];
}

We now have 6 memory references (4 loads, 2 stores) and 6 floating point operations (2 adds and 4 multiplies) per iteration.

How would you see the effect of any detailed analysis? You would examine the generated assembly code, which the Sun compilers make easy to produce:

f90 -S sgefa.f

sgefa.s is the compiler-generated RISC assembly code.

Notice in this RISC assembly code what is generated at source line 59 for the DO loop at line 60.

Basic Loop Unrolling
To quote our text, "We're not suggesting that you unroll any loops by hand." (p. 149) Instead, we aim to be informed readers of compiler-generated assembly language, where loop unrolling is likely to have been done. The goal is to pack more arithmetic into each iteration of a loop. This will increase the size of the basic block.
Qualifying candidates for Loop Unrolling
Nested Loops
Loop Interchange
Memory Access Patterns
When Interchange Won't Work

Blocking to Ease Memory Access Patterns
Two-dimensional arrays are conceptualized as square groups of memory.

Fig. 8-1 Arrays A and B

© O'Reilly Publishers (Used with permission)

But these 2D arrangements of data are linearized for storage in the computer. (Fortran stores by columns; C stores by rows.)

Consider the array sum code
	DO I=1,N
	   DO J=1,N
	      A(J,I) = A(J,I) + B(I,J)
	   ENDDO
	ENDDO 
One array is referenced with unit stride, the other with a stride of N.

Fig. 8-2 How array elements are stored
© O'Reilly Publishers (Used with permission)

Note the cache line boundary. The references to array A touch elements likely to still be within a cache line, but the references to array B touch elements in separate columns, and therefore in separate cache loads (illustrated in the top image of Fig. 8-3).

Loop-unrolling of the inner and outer loop can help:

	DO I=1,N,2
	  DO J=1,N,2
		A(J,I) 	= A(J,I) + B(I,J)
		A(J+1,I) = A(J+1,I) + B(I,J+1)
		A(J,I+1) = A(J,I+1) + B(I+1,J)
		A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
	  ENDDO
	ENDDO

We help the compiler out by rearranging the loops to work on small 2 by 2 squares of arrays A and B, as in the bottom image of Fig. 8-3. Unrolling the loops ourselves helps ensure that the TLB (translation lookaside buffer), a cache of recently computed virtual-to-physical address translations, will hold the addresses of the memory data needed. In the next step we cut the original loop into two parts to save TLB entries.

Fig. 8-3 2x2 squares (loop-unrolling)

© O'Reilly Publishers (Used with permission)

We can take this even further by splitting the inner loop in half to work on blocks of the data, as illustrated below in Fig. 8-4.

	DO I=1,N,2
	  DO J=1,N/2,2
		A(J,I) = A(J,I) + B(I,J)
		A(J+1,I) = A(J+1,I) + B(I,J+1)
		A(J,I+1) = A(J,I+1) + B(I+1,J)
		A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
	  ENDDO
	ENDDO
	DO I=1,N,2
	  DO J=N/2+1,N,2
		A(J,I) = A(J,I) + B(I,J)
		A(J+1,I) = A(J+1,I) + B(I,J+1)
		A(J,I+1) = A(J,I+1) + B(I+1,J)
		A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
	  ENDDO
	ENDDO

Fig. 8-4 Picture of unblocked versus blocked references

© O'Reilly Publishers (Used with permission)