This material was first presented toward the end of Lab 15Sept03.
The variety of high performance computers available now motivates us to briefly examine some points of architecture related to high performance computers. The architecture viewpoint combines "personal" information and opinions from Stewart with quotations from our text by Dowd and Severance (cited below).
Luckily, time has passed. Computers have gotten faster and MUCH cheaper. Memory is cheaper and therefore more plentiful. Software tools such as compilers have advanced. There is effectively a universal operating system used on all high performance machines and it is called UNIX. The programmer has a reasonable expectation that when a solution has been developed and coded in Fortran or C, that same code will be transportable to other environments. "Reasonable" performance is expected if the compilers on the alternate platforms are good. There may be some improved performance still available by "assisting" the compiler with good programming structures.
                    ----------------      ------------
               /--> | C or Fortran | ---> | Compiler |---\
              /     | source file  |      ------------    \
             /      ----------------                       \
 --------------        ----------        ------------       \     ------------
 | Programmer | -----> | M-file | -----> |  MATLAB  | --------+-->| HARDWARE |
 --------------        ----------        ------------       /     ------------
             \      ----------------     -------------     /
              \     | Assembler    |     | Assembler |    /
               \--> | source file  | --> |           |---/
                    ----------------     -------------

This diagram indicates how the programmer is removed from directly working with the particular hardware currently available. This makes the programmer's code writing skills valuable and makes moving from one compute environment to another reasonably straightforward. In the earlier days of computing, a programmer would become "tied" to a particular architecture, since most programming had to be performed at the more detailed level of assembler language. The expertise a programmer developed would not directly transfer to other environments.
Though application packages such as compilers and MATLAB relieve the programmer from concern about many of the lower level issues of dealing with the hardware, it is important for all users to have some awareness of what the hardware is actually capable of. On the newer high performance computing platforms, new concerns arise, and the compiler cannot take care of everything.
We are taking a brief (hopefully not too specific) look at hardware.
An Idealized Computer

 ------------------------------------------------------
 |  CPU            Main Memory          File Space    |
 |  -----------    --------------       ------------  |        ---------
 |  | regis-  |    | OS         |       |          |  |  <-->  |  I/O  |
 |  |   ters  |    | Compilers  |       | User's   |  |        |devices|
 |  -----------    | Applica-   |       |  files   |  |        ---------
 |  | FPFUs   |    |   tion     |       |          |  |        ---------
 |  | * + - / |    |   packages |       ------------  |  <-->  |Network|
 |  -----------    | User       |                     |        |connect|
 |  | IFUs    |    |   programs |                     |        ---------
 |  | * + -   |    | User data  |                     |          ^   ^
 |  -----------    |   (arrays) |                     |          |   |
 |  | LFUs    |    --------------                     |       modems, other
 |  | and,or,.|                                       |       computers
 |  -----------                                       |
 |  lots of                                           |
 |  neat stuff                                        |
 ------------------------------------------------------

Registers hold one word of main memory storage. The functional units (FPFUs for floating point, IFUs for integer, LFUs for logical operations) can operate on information in the registers to produce a new result, which is stored in a register and then, perhaps, in memory. Address computations are integer computations and are performed in the IFUs. The "lots of neat stuff" above can include elaborate, special purpose hardware components which we won't go into here, since they vary greatly from platform to platform.
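As a small aside (this example is mine, not from the lab, and the array and index are made up), the following C fragment shows that locating x(i) is just integer arithmetic on the base address of x:

    #include <stdio.h>

    int main(void)
    {
        double x[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
        int i = 2;

        /* The address of x[i] is the base address of x plus an offset of
           i words, an integer computation done before the floating point
           value itself can be loaded into a register. */
        printf("base address of x : %p\n", (void *)&x[0]);
        printf("address of x[%d]   : %p\n", i, (void *)&x[i]);
        printf("offset in bytes   : %ld\n", (long)((char *)&x[i] - (char *)&x[0]));
        printf("offset in words   : %ld\n", (long)(&x[i] - &x[0]));
        return 0;
    }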
RISC = Reduced Instruction Set Computer (CDC 6600, 1964)
Our text: High Performance Computing, 2nd Edition, by Kevin Dowd and Charles Severance (O'Reilly & Associates, Inc., 1998) states (p. 13):
"Characterizing RISC
RISC is more of a design philosophy than a set of goals. Of course, every RISC processor has its own personality. However, there are a number of features commonly found in machines people consider to be RISC.
On the Cray T90 there are no hardware operations that combine a value in a register with a value in memory. Therefore, every operand must first be loaded from memory into a register before any arithmetic can be performed.
CISC machines like the VAX and the IBM PC would allow, for example, a single instruction that adds a value in memory directly to a value held in a register.
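To make the contrast concrete, here is a sketch of my own (the instruction descriptions in the comments are illustrative, not actual VAX or SPARC code) of the same statement viewed both ways:

    #include <stdio.h>

    int main(void)
    {
        double a = 1.5, b = 2.5;

        /* Load/store (RISC) style: load a into a register, load b into a
           second register, add the two registers, store the result back to
           the memory location of a.
           Register-memory (CISC) style: a single "add memory operand to
           register" instruction can reference b in memory directly, so no
           explicit load of b appears in the generated code. */
        a = a + b;

        printf("a = %f\n", a);
        return 0;
    }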
Today we will examine the "counting of operations" to assess the amount of work involved in Gaussian elimination.
An excellent quote from Golub and Van Loan now leads us to discuss additional considerations.
p. 20 "Flops counting is a necessarily crude approach to the measuring of program efficiency since it ignores subscripting, memory traffic, and the countless other overheads associated with program execution."
We are using the Sun SPARC 1, which is a "scalar machine". This means that when the compiler transforms your Fortran or C commands into the native machine code, the basic unit of work performed is associated with a single word of memory. Consider

    y(2) = x(2) + z(2)
The address in memory of x(2) must be found (this is an address computation that is performed relative to wherever the first element of the vector x was placed). Then the contents of this memory location are loaded into a hardware register. Then the address of z(2) is computed, the location is accessed, and the contents loaded into a second hardware register. These two registers are then added and the result is placed into the location for the variable y(2) (after its address in memory is computed, of course).
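A small C version of this statement, with the steps above recorded as comments (the register names R1, R2, R3 are made up for illustration), might look like:

    #include <stdio.h>

    int main(void)
    {
        double x[3] = {1.0, 2.0, 3.0};
        double z[3] = {4.0, 5.0, 6.0};
        double y[3] = {0.0, 0.0, 0.0};

        /* The single scalar statement y(2) = x(2) + z(2) (Fortran numbering;
           index 1 in C).  Conceptually the generated code:
             1. compute addr(x(2)) relative to the start of x, load it into
                a hardware register, say R1
             2. compute addr(z(2)), load it into a second register, R2
             3. add R1 and R2 in a floating point functional unit, result R3
             4. compute addr(y(2)) and store R3 there                        */
        y[1] = x[1] + z[1];

        printf("y(2) = %f\n", y[1]);
        return 0;
    }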
High performance computers, such as the Cray T90 formerly at the San Diego Supercomputer Center, gain their greatly enhanced performance from greatly enhanced hardware. On the T90, a vector register can hold 128 words (each with 64 bits of precision). When arithmetic is performed, it is "pipelined". A vector load and a vector add instruction can be "overlapped". Suppose we examine a vector add

    y = x + z
where each of the quantities x, y, and z is a vector of length 128. The T90 must determine the start location of the vector x and then begin loading each of x's components into the first vector register, V1. Then the components of z are located and loaded into vector register V2. Then the two vector registers are added, with the result placed in a third vector register, V3. The components of V3 are then written back to memory, once the location of y is determined.
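Written as a loop in C (the data values here are arbitrary, and C is used instead of Fortran purely for illustration), the vector add is simply:

    #include <stdio.h>

    #define N 128

    int main(void)
    {
        double x[N], z[N], y[N];

        /* Arbitrary data just so the loop has something to add. */
        for (int i = 0; i < N; i++) {
            x[i] = (double)i;
            z[i] = 2.0 * (double)i;
        }

        /* The vector add y = x + z.  A vectorizing compiler can turn this
           into a vector load of x, a vector load of z, one pipelined vector
           add, and a vector store into y; a scalar machine performs 128
           separate address/load/add/store sequences. */
        for (int i = 0; i < N; i++)
            y[i] = x[i] + z[i];

        printf("y(1) = %f, y(128) = %f\n", y[0], y[N - 1]);
        return 0;
    }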
 mem(x)->V1                     FPFU(+)
 --------                  -----------------
 | x(1) |                  | x(1)+z(1)     | ---> V3->mem(y)
 --------                  -----------------
 | x(2) |    mem(z)->V2    | x(2)+z(2)     |      ---------
 --------    --------      -----------------      |1st add|
 | x(3) |    | z(1) |      | x(3)+z(3)     |      |2nd add|
 --------    --------      -----------------      |3rd add|
 | ...  |    | z(2) |      |     ...       |      |  ...  |
 --------    --------      -----------------      |128 add|
 |x(128)|    | z(3) |      | x(128)+z(128) |      ---------
 --------    --------      -----------------
             | ...  |
             --------
             |z(128)|
             --------

Once the memory location of the start of a vector is known, it is a straightforward process to access the next component. So although the vector loads consume a measurable amount of time, once started they run much more quickly than performing an individual address computation and then a load for each component of a vector. A scalar machine must follow the process of address computation then load for every operand. But compilers on scalar machines are not stupid. Optimizing compilers can often spot sequential loads like the above and avoid somewhat redundant address computations, using addr(x(2)) = addr(x(1)) + 1 for word addressing, to avoid computing addr(x(2)) from scratch.
High performance hardware is also capable of overlapping operations as indicated in the figure above; this is fine grain parallelism. On the Cray T90, the hardware allows several operations to be performed at the same time. In the same clock cycle, address computations, memory fetches, an add, and a store can all be in progress:
 clock
 -----
   1    addr(x(1))
   2    fetch x(1)   addr(z(1))
   3    fetch x(2)   fetch z(1)
   4    fetch x(3)   fetch z(2)   add: x(1)+z(1)
   5    fetch x(4)   fetch z(3)   store(x(1)+z(1))   add: x(2)+z(2)
   6    fetch x(5)   fetch z(4)   store(x(2)+z(2))   add: x(3)+z(3)

Starting with clock 5, we begin to have one result stored on every clock cycle. This is pipelining, which is commonly designed into current hardware for high performance computers.
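As a rough back-of-the-envelope sketch of why this matters (the clock counts below are assumptions for illustration, not measured T90 figures), compare the clocks needed with and without pipelining for a length-128 vector:

    #include <stdio.h>

    int main(void)
    {
        int n = 128;        /* vector length                                  */
        int startup = 5;    /* assumed clocks before the first result is
                               stored (address, two fetches, add, store), as
                               in the table above; not a measured T90 figure  */

        int pipelined     = startup + (n - 1);  /* then one result per clock  */
        int not_pipelined = startup * n;        /* each result waits for all
                                                   of its stages to finish    */

        printf("pipelined     : %d clocks\n", pipelined);
        printf("not pipelined : %d clocks\n", not_pipelined);
        return 0;
    }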
Our new considerations: