CS 575 Supercomputing
Performance for High Performance Computing

September 15, 2002

Dr. Kris Stewart (stewart@sdsu.edu)
San Diego State University

We meet Wednesday (18 Sept 02) in BAM113 again to discuss the format for your first computational report, which will be written in HTML and posted on your personal web page on Rohan.

Next week's lecture (23 Sept 02) will cover Chapter 3, Memory.

Chapter 2 (Ch 2) of our text introduced us to computer hardware by describing the CISC and RISC design philosophies. Then our Wednesday lab (Lab1) introduced you to the campus Rohan machine and the UNIX timer, dtime, which is simpler to use than the timer, etime, mentioned in our text (Ch 6), for obtaining the timing of a portion of a program:

    etime (from text p. 105-106)
  1. Record the time before you start doing X
  2. Do X
  3. Record the time at completion of X
  4. Subtract the start time from the completion time
    dtime
  1. Initialize the timer before you start doing X
  2. Do X
  3. Record the time since the timer was initialized (at completion of X)
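
To make the two recipes concrete, here is a minimal sketch in Fortran 77. It assumes the etime and dtime library routines as provided by the Sun and GNU Fortran compilers (real functions that fill a 2-element array with user and system CPU time); the loop length N is a placeholder you would vary in your experiments.

      program timers
c     Minimal timing sketch. Assumes the Sun/GNU Fortran library
c     functions etime (CPU time since the program started) and
c     dtime (CPU time since the previous call to dtime).
      integer N
      parameter (N = 1000000)
      real b(N)
      real tarray(2), t0, t1, dt
      real etime, dtime
      integer i

c     etime style: record start, do X, record end, subtract.
      t0 = etime(tarray)
      do i = 1, N
         b(i) = sin(float(i))
      end do
      t1 = etime(tarray)
      print *, 'etime: elapsed = ', t1 - t0, ' seconds'

c     dtime style: one call initializes the timer; the next call
c     returns the time elapsed since the previous call.
      dt = dtime(tarray)
      do i = 1, N
         b(i) = sin(float(i))
      end do
      dt = dtime(tarray)
      print *, 'dtime: elapsed = ', dt, ' seconds'
      print *, 'rate = ', float(N)/dt, ' iterations/second'
      end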

Effective Measures of Cost

We will examine the "counting of operations" to assess the amount of work.

FLOPS = "FL"oating Point "OP"erations per "S" second

and is the rate of floating point adds and multiplies performed by a program. In your sample codes there are no simple adds and multiplies; instead you have system library calls, memory address calculations and storage overhead. Still, we expect the work to behave in a predictable manner for sufficiently large N. Your task is to discover how large N must be.
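
A purely illustrative example of the arithmetic (made-up numbers, not measurements): if one run with N = 1,000,000 trips through the loop takes 0.25 seconds, the observed rate is 1,000,000 / 0.25 = 4 million iterations per second. If the work per iteration is fixed, doubling N should roughly double the run time; the size of N at which that proportionality sets in tells you when the overhead has become negligible.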

We are using the Sun SunFire 4800 (Rohan), which is a "scalar machine". This means that when the compiler transforms your Fortran or C commands into native machine code, the basic unit of work is associated with a single word of memory. Consider our sample code from lab last week:

      do i = 1, N
         b(i) = sin(float(i))
      end do

What is accomplished in the loop you examined in lab last week?

  1. The integer index, i, is converted to a floating point value, temp1 = float(i).
  2. The system function to compute the sine is called, temp2 = sin(temp1).
  3. The memory address temp3 = addr(b(i)) is computed.
  4. The value of sin(float(i)) [temp2] is written to the memory address of b(i) [temp3].
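
The same loop body written out as explicit steps (a sketch only: temp1 and temp2 are illustrative names for compiler temporaries, not variables in your code, and step 3 has no source-level equivalent):

      do i = 1, N
c        1. convert the integer index to floating point
         temp1 = float(i)
c        2. call the library routine that computes the sine
         temp2 = sin(temp1)
c        3. (done by the compiled code, no source equivalent)
c           integer arithmetic forms the memory address of b(i)
c        4. store the result to that memory address
         b(i) = temp2
      end do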

Your first computational experiment, from last week's lab, is to run your sample code with a variety of inputs for the length of the loop. You want to investigate how easily you can generate timing data that can be related to the amount of work the program is expected to perform.

The variety of high performance computers available now motivates us to briefly examine some architectural points related to high performance computers as we cover Chapters 2, 6 and 3 of our text. The architecture viewpoint is both "personal" information and opinions from Stewart as well as quotations from our text.

Historical Perspective

In the current world of computing, the user is accustomed to using an interface tool to optimize performance on a particular hardware platform. Users' time is better spent working in a high level language, such as C or Fortran or MATLAB, than in working in assembler language. In the early days of computing (the 1950's or so), to achieve good performance all programming had to be done in "raw machine code", i.e. hex or octal digits depending on the machine. The computer was a new device and computer time was scarcer than programmer time, so the programmer was expected to "speak" the language of the particular machine that was available.

Luckily, time has passed. Computers have gotten faster and MUCH cheaper. Memory is cheaper and therefore more plentiful. Software tools such as compilers have advanced. There is effectively a universal operating system used on all high performance machines and it is called UNIX. The programmer has a reasonable expectation that when a solution has been developed and coded in Fortran or C, that same code will be transportable to other environments. "Reasonable" performance is expected if the compilers on the alternate platforms are good. There may be some improved performance still available by "assisting" the compiler with good programming structures.

                      ----------------     ------------
                 /--> | C or Fortran |---> | Compiler | --*
                /     | source file  |     ------------   |
               /      ----------------                    |
--------------         ----------          ------------  \|/
| Programmer | ------> | M-file | -------> | MATLAB   | --*
--------------         ----------          ------------   |
              \                                           |
               \     ---------------       ------------- \|/
                \--> | Assembler   | ----> | Assembler |--*
                     | Source file |       |           |  |
                     ---------------       -------------  |
                                                          |
                                                         \|/
                                                  ---------------
                                                  |             |
                                                  |  HARDWARE   |
                                                  |             |
                                                  ---------------
This diagram indicates how the programmer is removed from directly working with the particular hardware currently available. This makes the programmer's code writing skills valuable and makes moving from one compute environment to another reasonably straightforward. In the earlier days of computing, a programmer would become "tied" to a particular architecture, since most programming had to be performed at the more detailed level of assembler language. The expertise a programmer developed would not directly transfer to other environments.

Though application packages such as compilers and MATLAB relieve the programmer from concern about many of the lower level issues of dealing with the hardware, it is important for all users to have some awareness of what the hardware is actually capable of. With the new compute platforms of High Performance Computing, new concerns arise and the compiler cannot take care of everything.

We are taking a brief (hopefully not too specific) look at hardware.

               An Idealized Computer
----------------------------------------------------
|       CPU          Main Memory    File Space     |
|   -------------    -----------  --------------   |     ---------
|   |     ----- |    |   OS    |  |            |   |     |       |
|   |regis|   | |    |         |  |            |   |<--> |  I/O  |
|   | ters|   | |    |Compilers|  |  User's    |   |     |devices|
|   |     ----- |    |         |  |  file      |   |     |       |
|   |     ----- |    |Applica- |  |            |   |     ---------
|   |     |* +| |    | tion    |  |            |   |
|   |FPFUs| - | |    | Packages|  |            |   |     ---------
|   |     | / | |    |         |  |            |   |<--> |Network|
|   |     ----- |    | User    |  |            |   |     |connect|
|   |     |* +| |    |programs |  |            |   |     ---------
|   | IFUs| - | |    |         |  |            |   |       ^    ^
|   |     |mod| |    | User    |  |            |   |       |    |
|   |     ----- |    | data    |  |            |   |     modems |
|   |     |and| |    | (arrays)|  |            |   |            |
|   | LFUs| or| |    |         |  |            |   |            |
|   |     |...| |    |         |  |            |   |          other
|   |     ----- |    |         |  |            |   |        computers
|   | lots of   |    -----------  --------------   |
|   | special   |                                  |
|   | hardware  |                                  |
|   -------------                                  |
---------------------------------------------------- 
Registers hold one word of main memory storage. The functional units operate on information in the registers to produce a new result, which is stored in a register and then, perhaps, in memory. Address computations are integer computations and are performed in the IFUs. The "lots of special hardware" in the diagram can include elaborate, special purpose hardware components, which we won't go into here since they vary greatly from platform to platform.

RISC = Reduced Instruction Set Computer (CDC 6600, 1964)

Our text, High Performance Computing, 2nd Edition, by Kevin Dowd and Charles Severance (O'Reilly & Associates, Inc., 1998), states (p. 13):

"Characterizing RISC

RISC is more of a design philosophy than a set of goals. Of course, every RISC processor has its own personality. However, there are a number of features commonly found in machines people consider to be RISC:

  * Instruction pipelining
  * Pipelining floating-point operations
  * Uniform instruction length
  * Delayed branching
  * Load/store architecture
  * Simple addressing modes

This list highlights the differences between RISC and CISC processors. Naturally, the two types of instruction set architectures have much in common - each uses registers, memory, etc. And many of these techniques are used in CISC machines too, such as caches and instruction pipelines. But it's the fundamental differences that give RISC its speed advantage: focusing on a smaller set of less powerful instructions makes it possible to build a faster computer."

On a load/store machine like the Cray T90, there are no hardware operations that combine a value in a register with a value in memory. Therefore, every operand must first be loaded from memory into a register before any arithmetic can be performed.

CISC machines like the VAX and IBM PC would allow an arithmetic instruction to take one of its operands directly from memory, for example adding the contents of a memory address to a register in a single instruction, but this violates the load/store architecture listed above for RISC machines. It also violates "uniform instruction length": an operation involving only registers needs to encode just the register numbers (small integers), while one accessing memory would need to carry the memory address as part of the instruction. Depending on the size of memory, this could dictate large instruction sizes or place a limit on the amount of addressable memory.
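
To see the difference at the instruction level, consider what a single Fortran statement becomes on each style of machine. The pseudo-assembly in the comments is purely illustrative (made-up mnemonics, not any real machine's instruction set):

      program ldst
      real a, b, c
      b = 1.0
      c = 2.0
c     On a load/store (RISC) machine the statement below compiles
c     to separate instructions, roughly:
c         load  R1, b        (memory -> register)
c         load  R2, c        (memory -> register)
c         fadd  R3, R1, R2   (arithmetic on registers only)
c         store R3, a        (register -> memory)
c     A CISC machine could instead fold the memory operands into
c     the add itself, e.g. a single "add a, b, c" whose encoding
c     must carry three memory addresses, hence variable-length
c     instructions.
      a = b + c
      print *, a
      end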
