CS 575 Supercomputing
Forms for Linear Algebra
Suitable for High Performance Computing

November 3, 2002

This URL is http://www.stewart.cs.sdsu.edu/cs575/lecs/hpc_linalg.html

Dr. Kris Stewart (stewart@rohan.sdsu.edu)
San Diego State University

This material was first presented toward the end of Lab 15Sept03.

The variety of high performance computers now available motivates us to briefly examine some architectural points related to them. The viewpoint taken here combines Stewart's own observations and opinions with quotations from our text and the other references cited below.

Historical Perspective

In the current world of computing, the user is accustomed to letting software tools handle the details of getting good performance out of a particular hardware platform. A user's time is better spent working in a high level language, such as C, Fortran, or MATLAB, than in assembler language. In the early days of computing (the 1950s or so), achieving good performance meant that all programming had to be done in "raw machine code", i.e., hex or octal digits, depending on the machine. The computer was a new device and computer time was scarcer than programmer time, so the programmer was expected to "speak" the language of the particular machine that was available.

Luckily, time has passed. Computers have gotten faster and MUCH cheaper. Memory is cheaper and therefore more plentiful. Software tools such as compilers have advanced. There is effectively a universal operating system, UNIX, used on all high performance machines. The programmer has a reasonable expectation that once a solution has been developed and coded in Fortran or C, that same code will be transportable to other environments. "Reasonable" performance is expected if the compilers on the alternate platforms are good. There may still be some improved performance available by "assisting" the compiler with good programming structures.

                      ----------------     ------------
                 /--> | C or Fortran |---> | Compiler | --*
                /     | source file  |     ------------   |
               /      ----------------                    |
--------------         ----------          ------------  \|/
| Programmer | ------> | M-file | -------> | MATLAB   | --*
--------------         ----------          ------------   |
              \                                           |
               \     ---------------       ------------- \|/
                \--> | Assembler   | ----> | Assembler |--*
                     | Source file |       |           |  |
                     ---------------       -------------  |
                                                          |
                                                         \|/
                                                  ---------------
                                                  |             |
                                                  |  HARDWARE   |
                                                  |             |
                                                  ---------------
This diagram indicates how the programmer is removed from working directly with the particular hardware currently available. This makes the programmer's code-writing skills valuable and makes moving from one compute environment to another reasonably straightforward. In the earlier days of computing, a programmer would become "tied" to a particular architecture, since most programming had to be performed at the more detailed level of assembler language, and the expertise a programmer developed would not directly transfer to other environments.

Though application packages such as compilers and MATLAB relieve the programmer of many of the lower level issues of dealing with the hardware, it is important for all users to have some awareness of what the hardware is actually capable of. With the new compute platforms of High Performance Computing, new concerns arise, and the compiler cannot take care of everything.

We are taking a brief (hopefully not too specific) look at hardware.

               An Idealized Computer
----------------------------------------------------
|       CPU          Main Memory    File Space     |
|   -------------    -----------  --------------   |     ---------
|   |     ----- |    |   OS    |  |            |   |     |       |
|   |regis|   | |    |         |  |            |   |<--> |  I/O  |
|   | ters|   | |    |Compilers|  |  User's    |   |     |devices|
|   |     ----- |    |         |  |  file      |   |     |       |
|   |     ----- |    |Applica- |  |            |   |     ---------
|   |     |* +| |    | tion    |  |            |   |
|   |FPFUs| - | |    | Packages|  |            |   |     ---------
|   |     | / | |    |         |  |            |   |<--> |Network|
|   |     ----- |    | User    |  |            |   |     |connect|
|   |     |* +| |    |programs |  |            |   |     ---------
|   | IFUs| - | |    |         |  |            |   |       ^    ^
|   |     |mod| |    | User    |  |            |   |       |    |
|   |     ----- |    | data    |  |            |   |     modems |
|   |     |and| |    | (arrays)|  |            |   |            |
|   | LFUs| or| |    |         |  |            |   |            |
|   |     |...| |    |         |  |            |   |          other
|   |     ----- |    |         |  |            |   |        computers
|   | lots of   |    -----------  --------------   |
|   |neat stuff |                                  |
|   -------------                                  |
---------------------------------------------------- 
Registers each hold one word of main memory storage. The functional units operate on information in the registers to produce a new result, which is stored in a register and then, perhaps, in memory. Address computations are integer computations and are performed in the IFUs (the integer functional units). The "lots of neat stuff" above can include elaborate, special purpose hardware components, which we won't go into here since they vary greatly from platform to platform.

RISC = Reduced Instruction Set Computer (e.g., the CDC 6600, 1964)

Our text, High Performance Computing, 2nd Edition, by Kevin Dowd and Charles Severance (O'Reilly & Associates, Inc., 1998), states (p. 13):

"Characterizing RISC

RISC is more of a design philosophy than a set of goals. Of course every RISC processor has its own personality. However, there are a number of features commonly found in machines people consider to be RISC.

This list highlights the differences between RISC and CISC processors. Naturally, the two types of instruction set architectures have much in common - each uses registers, memory, etc. And many of these techniques are used in CISC machines too, such as caches and instruction pipelines. But it's the fundamental differences that give RISC its speed advantage: focusing on a smaller set of less powerful instructions makes it possible to build a faster computer."

On the Cray T90 there are no hardware operations that combine a value in a register with a value in memory. Therefore, every operand must first be loaded from memory into a register before any arithmetic can be performed.

CISC machines like the VAX and the IBM PC would allow a single instruction that combines a register operand with an operand still sitting in memory, but this violates the load/store architecture described above for RISC machines. It also violates the "uniform instruction length" characteristic: an operation involving only registers needs to encode just the register numbers (small integers), while one accessing memory would need to carry the memory address as part of the instruction. Depending on the size of memory, this could dictate large instruction sizes or place a limit on the amount of addressable memory.
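
As a rough illustration of the difference, consider how a single source statement might translate on the two styles of machine. The function name and the instruction mnemonics in the comments below are generic illustrations, not actual Cray T90 or VAX assembler.

/* A minimal sketch; the mnemonics in the comments are generic,
   not real T90 or VAX instructions. */
double add_scalars(double x, double z)
{
    double y = x + z;   /* one source-level statement */
    /* Load/store (RISC) style: every operand passes through a register.
           load   r1, x
           load   r2, z
           add    r3, r1, r2
           store  y,  r3
       Register-memory (CISC) style: one instruction may name a memory
       operand directly.
           load   r1, x
           add    r1, z      <- second operand fetched straight from memory
           store  y,  r1                                                  */
    return y;
}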

Effective Measures of Cost

We will examine the "counting of operations" to assess the amount of work involved in Gaussian elimination today.

FLOPS = "FL"oating point "OP"erations per "S"econd

is a rate of execution. The related flop count is simply the number of floating point adds, multiplies, and so on performed by a program; that count is what we use to assess the work in an algorithm.
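
As a preview of that counting, here is a small sketch in C that tallies the operations in the forward elimination phase of Gaussian elimination on an n-by-n matrix and compares the total with the familiar (2/3)n^3 estimate. The loop structure assumes the standard row-oriented elimination; the program counts work only, so no actual matrix is needed, and the right-hand side and back substitution (which add only O(n^2) more work) are ignored.

#include <stdio.h>

/* Sketch: count the floating point operations (one divide per
   multiplier, one multiply and one subtract per updated entry) in the
   forward elimination of an n-by-n matrix. */
int main(void)
{
    int n = 100;                            /* example system size        */
    long flops = 0;
    for (int k = 0; k < n - 1; k++)         /* pivot column               */
        for (int i = k + 1; i < n; i++) {
            flops += 1;                     /* multiplier a(i,k)/a(k,k)   */
            for (int j = k + 1; j < n; j++)
                flops += 2;                 /* a(i,j) = a(i,j) - m*a(k,j) */
        }
    printf("n = %d   flop count = %ld   (2/3)n^3 = %.0f\n",
           n, flops, 2.0 * n * n * n / 3.0);
    return 0;
}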

An excellent quote from Golub and Van Loan now leads us to additional considerations.

p. 20 "Flops counting is a necessarily crude approach to the measuring of program efficiency since it ignores subscripting, memory traffic, and the countless other overheads associated with program execution."

We are using the Sun SPARC 1, which is a "scalar machine". This means that when the compiler transforms your Fortran or C commands into the native machine code, the basic unit of work is associated with a single word of memory. Consider

y(2) = x(2) + z(2)

The address in memory of x(2) must be found (this is an address computation performed relative to wherever the first element of the vector x was placed). Then the contents of this memory location are loaded into a hardware register. Then the address of z(2) is computed, the location is accessed, and the contents are loaded into a second hardware register. These two registers are then added and the result is placed into the location for the variable y(2) (after its address in memory is computed, of course).
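
In C terms, the base-plus-offset address arithmetic described above looks roughly like the sketch below. This is illustrative only; the function is hypothetical and a real compiler chooses its own instruction sequence.

/* Sketch: the address of the i-th element of an array is the array's
   base address plus an offset of i words.  Each scalar reference
   repeats this computation, then a load (or a store for the result). */
void scalar_add(double y[], const double x[], const double z[], int i)
{
    const double *px = x + i;   /* addr(x(i)) = base(x) + i words */
    const double *pz = z + i;   /* addr(z(i)) = base(z) + i words */
    double       *py = y + i;   /* addr(y(i)) = base(y) + i words */
    *py = *px + *pz;            /* load, load, add, then store    */
}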

High performance computers, such as the Cray T90 formerly at the San Diego Supercomputer Center, achieve their enhanced performance through enhanced hardware. On the T90, a vector register can hold 128 words (each with 64 bits of precision). Arithmetic is "pipelined", and a vector load and a vector add instruction can be "overlapped". Suppose we examine a vector add

y = x + z

where x, y, and z are each vectors of length 128. The T90 must determine the start location for the vector x and then begin loading each of x's components into the first vector register, V1. Then the components of z are located and loaded into vector register V2. Then the two vector registers are added, with the result placed in a third vector register, V3. The components of V3 are then written back to memory, once the location of y is determined.


mem(x)->V1
     --------
     | x(1) | ------------------
     --------                  |
     | x(2) |                  |     FPFU(+)
     --------                  |   ---------------
     | x(3) | mem(z)->V2       --->|  x(1)+z(1)  |---     V3->mem(y)
     --------     --------     --->---------------  |  ---------
     | ...  |     | z(1) | ----|   |  x(2)+z(2)  |  -->|1st add|
     --------     --------         ---------------     ---------
     |x(128)|     | z(2) |         |  x(3)+z(3)  |     |2nd add|
     --------     --------         ---------------     ---------
                  | z(3) |         |     ...     |     |3rd add|
                  --------         ---------------     ---------
                  | ...  |         |x(128)+z(128)|     |  ...  |
                  --------         ---------------     ---------
                  |z(128)|                             |128 add|
                  --------                             ---------

Once the memory location of the start of a vector is known, it is a straightforward process to access the next component. So, although the vector load consumes a measurable amount of time, once started it runs much more quickly than performing an individual address computation and then a load for each component of the vector. A scalar machine must follow the process of address computation, then load, for every operand. But compilers on scalar machines are not stupid: optimizing compilers can often spot sequential loads like the above and avoid redundant address computations, using addr(x(2)) = addr(x(1)) + 1 for word addressing rather than computing addr(x(2)) from scratch.
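
Both ideas, working in chunks the size of a vector register and stepping to the next address rather than recomputing it from the base, can be sketched in C. The code below is only an illustration of the transformation a vectorizing or optimizing compiler performs itself; the function name is made up, and the chunk size of 128 comes from the T90 vector register length quoted above.

#include <stddef.h>

#define VL 128   /* vector register length on the T90 */

/* Sketch: add two length-n vectors in chunks of up to 128 elements,
   walking the pointers forward one word at a time instead of
   recomputing each address from the start of the array. */
void vector_add(double *y, const double *x, const double *z, size_t n)
{
    for (size_t i = 0; i < n; i += VL) {
        size_t len = (n - i < VL) ? n - i : VL;   /* one register's worth */
        for (size_t k = 0; k < len; k++)
            *y++ = *x++ + *z++;   /* next address by increment, not from scratch */
    }
}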

High performance hardware is also capable of overlapping operations, as indicated in the figure above; this is fine grain parallelism. On the Cray T90, the hardware allows several operations to proceed at the same time: in the same clock cycle it can be fetching a component of x, fetching a component of z, adding a pair of components that have already arrived, and storing an earlier sum. A simplified time sequence is given below:
clock
-----
1    addr(x(1))
2    fetch x(1)    addr(z(1))
3    fetch x(2)    fetch z(1)
4    fetch x(3)    fetch z(2)   add: x(1)+z(1)
5    fetch x(4)    fetch z(3)   store(x(1)+z(1))  add: x(2)+z(2)
6    fetch x(5)    fetch z(4)   store(x(2)+z(2))  add: x(3)+z(3)
Starting with clock 5, we begin to have one result stored on every clock cycle. This is pipelining, which is commonly designed into current hardware for high performance computers.
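
For a rough sense of the payoff, look at the sequence above: the first sum is stored on clock 5 and one more is stored on every clock after that, so all 128 sums reach memory by about clock 5 + 127 = 132. Handling the components one at a time, with each add waiting for its own address computations and fetches, would cost several clocks per result instead of one.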

Our new considerations:

  1. Organize our access to memory
  2. Avoid reloading something, if possible
    We are concerned with how many times the data has to be loaded from memory into the registers of the CPU. We are also concerned with how many times that information is subsequently used. There is only a finite amount of register space, so if memory access were our only measure of cost, we would want to load a quantity, keep it in a register until all computations involving it had been completed, and then release the register for some other use.
  3. Obtain a high "use" to "load" ratio with the data (see the sketch after this list)
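
As a small sketch of what a good "use" to "load" ratio looks like (illustrative only, not tied to any particular machine), here is a matrix-vector product written so that the running sum for each component of y stays in a register for the entire inner loop instead of being stored to memory and reloaded on every pass.

/* Sketch: y = A*x for an n-by-n matrix stored by rows.  The sum for
   y[i] is accumulated in the local variable "sum", which the compiler
   can keep in a single register: it enters the CPU once and is used n
   times, instead of y[i] being read from and written back to memory on
   every trip through the j loop. */
void matvec(double *y, const double *a, const double *x, int n)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;                 /* lives in a register      */
        for (int j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];   /* one load of a, one of x  */
        y[i] = sum;                       /* one store per row of A   */
    }
}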