CS 575 Supercomputing
Performance for High Performance Computing
September 15, 2003

Dr. Kris Stewart (stewart@sdsu.edu)
San Diego State University

We meet Wednesday (17 Sept 03) in BAM110 again to discuss the format for your first computational report. The report will use the HTML template you receive Wednesday in lab and will be posted on your web page on Rohan under your CS 575 class account.

Lecture next week (22 Sept 03) will cover Chapter 3 of our text, Memory.

This Webpage:

Introduction to Timers
Effective Measures of Cost (FLOPS and Lab 1 results)
- Overview of System Timer
Scientific Notation
Historical Perspective on HPC (Revisit CISC and RISC using ASCII Diagrams)

Introduction to Timers

Chapter 2 (Ch 2) of our text introduced us to computer hardware by describing the CISC and RISC philosophies of design. Then our Wednesday lab (Lab1) introduced you to the campus Rohan machine and the UNIX timer, dtime, which is simpler to use than the timer, etime, mentioned in our text (Ch 6), to obtain timing for a portion of a program.

etime

Record the time before you start doing X
Do X
Record the time at completion of X
Subtract the start time from the completion time

dtime

Initialize the timer before you start doing X
Do X
Record the time since the timer was initialized (at completion of X)

What is User Time?
The time spent by the CPU executing the compiled and linked version of the user's program, the object code.

What is System Time?
The time spent by the CPU performing system tasks for the user's object code, such as I/O and programmer-generated interrupts.

Effective Measures of Cost

We will examine the "counting of operations" to assess the amount of work.

FLOPS = "FL"oating point "OP"erations per "S"econd

FLOPS is a rate: the number of floating point adds or multiplies performed by a program, divided by the time taken to perform them. In your sample codes, there are no simple adds and multiplies; instead you have system library calls, memory address calculations and storage overhead. Still, we expect the work to behave in a predictable manner for sufficiently large N. Your task is to discover how large N must be.

We are using the Sun SunFire 4800 (Rohan), which is a "scalar machine". This means that when the compiler transforms your Fortran or C statements into native machine code, the basic unit of work is associated with a single word of memory. Consider our sample code from lab last week:

loop on index i
b(i) = sin(float(i))

What is accomplished in the loop you examined in lab last week?

High points of the sample, in pseudocode, are:

The integer index, i, is converted to a floating point value, float(i) [Fortran]. It turned out that the C version did not need an explicit conversion here, but it did use "ascii to integer" to read n from the command line: n = atoi(av[1]); [C].
The system function to compute the sine is called, sin(float(i)).
The memory address addr(b(i)) is computed.
The value of sin(float(i)) is written to the memory address of b(i).

Write a single line of output to the screen.

Your first computational exploration, from last week's lab, is to run a variety of tests with your sample code with different input for the length of the loop. You want to investigate how easily you can generate timing data that can be related to the expected amount of work indicated in the program. You also want to become familiar with scientific notation and how to output these values.

csample 1024
Time first task timing:  6.06e-04 user, 2.93e-04 system
Time second task timing:  7.25e-05 user, 1.33e-04 system
csample 2048
Time first task timing:  1.08e-03 user, 2.59e-04 system
Time second task timing:  7.25e-05 user, 1.29e-04 system
csample 4096
Time first task timing:  2.08e-03 user, 2.59e-04 system
Time second task timing:  7.40e-05 user, 1.30e-04 system

fsample
1024
  time first task:  1.50808E-3  user   1.9672E-3  system
  time second task:  2.8432E-4  user  7.3144E-4  system
fsample
2048
  time first task:  1.73032E-3  user   5.5896E-4  system
  time second task:  2.3424E-4  user  2.436E-4  system
fsample
4096
  time first task:  2.60192E-3  user   9.0136E-4  system
  time second task:  2.4816E-4  user  2.608E-4  system

Scientific Notation

6.06e-04 means 6.06 times 10^-4, which represents .000606. Two digits follow the decimal point because the C program used the format %9.2e.

1.50808E-3 means 1.50808 times 10^-3, which represents .00150808. More digits appear because the Fortran program used the default format with print *. The equivalent Fortran format specifier would be 1PE11.5.

Note: The programmer has the final control over how much detail is output.

Historical Perspective

The variety of high performance computers available now motivates us to briefly examine some architecture points related to high performance computers as we cover Chapters 2, 6 and 3 of our text. The architecture viewpoint combines "personal" information and opinions from Stewart with quotations from our text.

In the current world of computing, the user is accustomed to using an interface tool to optimize performance on a particular hardware platform. Users' time is better spent working in a high level language, such as C, Fortran or MATLAB, than in assembler language. In the early days of computing (the 1950s or so), to achieve good performance all programming had to be done in "raw machine code", i.e. hex or octal digits depending on the machine. The computer was a new device and computer time was scarcer than programmer time, so the programmer was expected to "speak" the language of the particular machine that was available.

Luckily, time has passed. Computers have gotten faster and MUCH cheaper. Memory is cheaper and therefore more plentiful. Software tools such as compilers have advanced. There is effectively a universal operating system used on all high performance machines and it is called UNIX. The programmer has a reasonable expectation that when a solution has been developed and coded in Fortran or C, that same code will be transportable to other environments. "Reasonable" performance is expected if the compilers on the alternate platforms are good. There may be some improved performance still available by "assisting" the compiler with good programming structures.

                      ----------------     ------------
                 /--> | C or Fortran |---> | Compiler | --*
                /     | source file  |     ------------   |
               /      ----------------                    |
--------------         ----------          ------------  \|/
| Programmer | ------> | M-file | -------> | MATLAB   | --*
--------------         ----------          ------------   |
              \                                           |
               \     ---------------       ------------- \|/
                \--> | Assembler   | ----> | Assembler |--*
                     | Source file |       |           |  |
                     ---------------       -------------  |
                                                          |
                                                         \|/
                                                  ---------------
                                                  |             |
                                                  |  HARDWARE   |
                                                  |             |
                                                  ---------------

This diagram indicates how the programmer is removed from directly working with the particular hardware currently available. This makes the programmer's code writing skills valuable and makes moving from one compute environment to another reasonably straightforward. In the earlier days of computing, a programmer would become "tied" to a particular architecture since most programming had to be performed at the more detailed level of assembler language. The expertise a programmer would develop would not directly transfer to other environments.

Though the applications packages such as compilers and MATLAB relieve the programmer from concern about many of the lower level issues of dealing with the hardware, it is important for all users to have some awareness of what the hardware is actually capable of. With the new compute platforms of High Performance Computing, new concerns are addressed and the compiler cannot take care of everything.

We are taking a brief (hopefully not too specific) look at hardware.

CPU = Central Processing Unit
OS = Operating System
FPFU = Floating Point Functional Units
IFU = Integer Functional Units
LFU = Logical Functional Units

               An Idealized Computer
----------------------------------------------------
|       CPU          Main Memory    File Space     |
|   -------------    -----------  --------------   |     ---------
|   |     ----- |    |   OS    |  |            |   |     |       |
|   |regis|   | |    |         |  |            |   |<--> |  I/O  |
|   | ters|   | |    |Compilers|  |  User's    |   |     |devices|
|   |     ----- |    |         |  |  file      |   |     |       |
|   |     ----- |    |Applica- |  |            |   |     ---------
|   |     |* +| |    | tion    |  |            |   |
|   |FPFUs| - | |    | Packages|  |            |   |     ---------
|   |     | / | |    |         |  |            |   |<--> |Network|
|   |     ----- |    | User    |  |            |   |     |connect|
|   |     |* +| |    |programs |  |            |   |     ---------
|   | IFUs| - | |    |         |  |            |   |       ^    ^
|   |     |mod| |    | User    |  |            |   |       |    |
|   |     ----- |    | data    |  |            |   |     modems |
|   |     |and| |    | (arrays)|  |            |   |            |
|   | LFUs| or| |    |         |  |            |   |            |
|   |     |...| |    |         |  |            |   |          other
|   |     ----- |    |         |  |            |   |        computers
|   | lots of   |    -----------  --------------   |
|   | special   |                                  |
|   | hardware  |                                  |
|   -------------                                  |
----------------------------------------------------

Registers hold one word of main memory storage. The functional units operate on information in the registers to produce a new result, which is stored in a register and then, perhaps, in memory. Address computations are integer computations and are performed in the IFUs. The "lots of special hardware" above can include elaborate, special-purpose hardware components, which we won't go into here since they vary greatly from platform to platform.

RISC = Reduced Instruction Set Computers (CDC 6600, 1964)

Our text: High Performance Computing: 2nd Edition by Kevin Dowd and Charles Severance (O'Reilly & Associates, Inc. 1998) states (p. 13):

"Characterizing RISC

RISC is more of a design philosophy than a set of goals. Of course every RISC processor has its own personality. However, there are a number of features commonly found in machines people consider to be RISC.

instruction pipelining
pipelining floating-point execution
uniform instruction length
delayed branches
load store architecture
simple addressing modes

This list highlights the differences between RISC and CISC processors. Naturally, the two types of instruction set architectures have much in common - each uses registers, memory, etc. And many of these techniques are used in CISC machines too, such as caches and instruction pipelines. But it's the fundamental differences that give RISC its speed advantage: focussing on a smaller set of less powerful instructions makes it possible to build a faster computer."

There are no hardware operations to work with a value in a register and a value in memory on the Cray T90. Therefore, every operand must first be loaded from memory into a register before any arithmetic can be performed.

CISC machines like the VAX and IBM/PC would allow:

register-register operations
register-memory operations
memory-memory operations

but this violates the load store architecture listed above for RISC machines. It also violates the "uniform instruction length" feature: an operation involving only registers needs to store just the register numbers (small integers), while one accessing memory needs to have the memory address as part of the instruction. Depending on the size of memory, this could dictate large instruction sizes or place a limit on the amount of possible memory.
