CS 596 Supercomputing for Undergraduates - Week 2 (charac-hpc.txt)
Kris Stewart, Assoc. Prof., SDSU (stewart@sdsu.edu)
Senior Fellow, SDSC (619) 942-1012

Before we begin our study of architecture, we will cover a very quick
overview of the characteristics of supercomputers. This will provide
guidance while working through the text by Hennessy and Patterson.
These notes come from Dan Sulzbach's presentation at the August 1990
Advanced Computing Institute at SDSC.

What is a supercomputer?
Supercomputers are the class of computers that are the most powerful
for scientific computation, with sufficient memory to get significant
throughput of large-scale applications at any given time.

Due to the cost of these machines ($10 million, for example), they must
have general purpose capability. Few applications can afford to
dedicate such an expensive resource to only one type of problem.
(Maybe DOD and NCAR can.)

Powerful
The emphasis is on speed. What devices comprising a supercomputer can
be timed?
   o Central Processing Unit (CPU)
   o Memory
   o I/O Subsystem

How to quantify speed? The commonly used unit of measurement is the
number of Floating Point Operations per Second (FLOPS).
   o MFLOPS (MegaFLOPS - 10^6)   appropriate measure in the 70's and 80's
   o GFLOPS (GigaFLOPS - 10^9)   appropriate for the 8 processor Cray Y-MP
   o TFLOPS (TeraFLOPS - 10^12)  tomorrow's machine
      - the goal of the Intel Hypercube is to have sustained
        computation in TFLOPS

Memory
Access to very large memory capabilities is needed for large-scale
applications. Very large memory is also needed to handle many modest to
large application problems in a multi-user system such as Unix. The
memory must be not only large, but also fast to avoid idle CPUs.

At any given time, the speed of computers (CPUs) has been increasing
continually, by a factor of 10 every 5 years or so since 1950. In
general, supercomputers are obsolete 3 to 5 years after their initial
commercial delivery.
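
As a rough illustration of the growth rule above, the sketch below
compounds a tenfold speedup every 5 years. The helper name
growth_factor and the fixed 5-year period are assumptions made for
this example; the notes only say "every 5 years or so".

```python
# Rough sketch of the "factor of 10 every 5 years" rule of thumb.
# growth_factor is a hypothetical helper, not part of the original notes.

def growth_factor(years, period_years=5.0, factor_per_period=10.0):
    """Cumulative speedup after `years`, assuming a constant
    `factor_per_period` increase every `period_years`."""
    return factor_per_period ** (years / period_years)


if __name__ == "__main__":
    # Compare CPU speed at a few dates against the 1950 baseline.
    for end_year in (1960, 1970, 1980, 1990):
        factor = growth_factor(end_year - 1950)
        print(f"1950-{end_year}: roughly {factor:.0e} times faster")
```

Under this assumption the 40 years from 1950 to 1990 compound to a
factor of 10^8, which is only an order-of-magnitude guide, not a
measured figure.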

A History of Supercomputers
   o Late 1950s    CDC-1604
   o Early 1960s   CDC-3600
   o Late 1960s    CDC-6600
   o Early 1970s   CDC-7600
   o Late 1970s    CRAY-1, CYBER 205
   o Early 1980s   CRAY X-MP, Fujitsu VP200
   o Late 1980s    CRAY-2, Y-MP, Fujitsu VP400, NEC SX-2,
                   Hitachi S-820, ETA10

What makes a Supercomputer Super?
   o Hardware
      - fast circuitry
   o Architecture
      - multiple operations can take place concurrently
   o Memory
      - large size and high speed

Computer Hardware and Time
   o Clock
      - A synchronizing device within a computer system. The clock
        signal is a repetitive signal that times, controls, and
        coordinates most operations in a computer.
   o Clock cycle
      - The interval between clock signals. Cycle time is often used
        as a measure of a computer's potential performance.

   Computer          Cycle Time (ns)
   -------------     -----------------
   CDC 6600          100      (1965)
   CDC 7600          25       (1970)
   VAX 11/750        320
   VAX 11/780        200
   VAX 8600          80
   VAX 8800          45
   Supertek S-1      55
   CYBER 205         20
   IBM 3090          14.5
   CRAY-1            12.5
   CRAY X-MP         8.5
   Fujitsu VP400E    7v, 14s
   CRAY Y-MP         6
   NEC SX-X          2.9
   CRAY-2            4.1
   Hitachi S-820     4v, 8s
   CRAY Y-MP 16      4        (predicted)
   CRAY 3            2        (predicted)
   SSI               2        (predicted)
   NEC SX-X ('95)    1        (predicted)

   (v = vector unit, s = scalar unit)

Peak Speed (Salesman's Number)
   o Maximum number of adds and multiplies performed in one second
   o Speed the computer cannot exceed
   o Peak Speed = (Ops/cycle) * (Cycles/sec) * CPUs
   o Peak Speed of Y-MP 8/864
      - 6 nanosecond cycle time (1 ns = 10^(-9) second)
      - 2 operations per cycle for each CPU
      - Independent functional units for + and *
      - 8 processors
      - Peak speed = 2 ops/cycle * (1/6) * 10^9 cycles/sec * 8 CPUs
                   = 2,667 MFLOPS = 2.7 GFLOPS

Peak Speeds

   Computer           MFLOPS
   ------------       -----------
   CYBER 205          400
   CRAY-1S            150
   Convex C-240       200     (4 processors)
   CRAY X-MP          940     (4 processors; 8.5 ns clock)
   IBM 3090/600J VF   828     (6 processors)
   Hitachi S-810      840
   -------------------------- GFLOPS --
   Fujitsu VP-400E    1700
   NEC SX-2           1300
   CRAY-2             1950    (4 processors)
   Cray Y-MP 8        2667    (8 processors)
   Hitachi S-820      3000

Note: Some have characterized peak speed MFLOPS as Most Flattering
Operational Performance Statistic.

Computer Architectural Advances - Historical Perspective
   o Separate processor to handle I/O
      - 1950's (IBM 709)
      - called an I/O "channel"
      - reduced CPU idle time
   o Instruction look-ahead
      - 1960 (IBM Stretch computer)
      - instruction buffers
      - instruction and operand pre-fetching
      - reduced CPU idle time
   o Multiple Functional Units
      - early 1960's (CDC 6600 - Seymour Cray)
      - capitalizes on instruction look-ahead
      - operations performed in parallel when possible
      - minimizes time to perform all operations; increases
        operations per cycle
   o Vector processing - Pipelining and Chaining
      - 1970's (CRAY 1 was first successful one)
      - vector is an ordered set of numbers
      - instruction set contains operations on vectors as well as
        scalars
      - segmented functional units allow "pipelining"; the operation
        in each segment takes one cycle. On the Y-MP:
           o addition:        7 cycles
           o multiplication:  8 cycles
           o reciprocal:     15 cycles
      - CPU may have vector registers
      - increases operations per cycle
   o Multiple CPUs - Parallel Processing
      - multiple CPUs can perform in parallel and independently
      - independent parts of a single program can be run
        simultaneously
      - increases operations per cycle

Supercomputer Memory Features
   o Hardware
      - Chip access times are important
   o Architecture
      - Interleaving: memory unit is divided into a number of
        modules, or banks, that can be accessed simultaneously
      - Paths to memory: this varies among supercomputers (e.g., the
        CRAY-2 has only one path to memory, so only one vector can be
        moved to/from memory at a time, but the CRAY Y-MP has three
        paths to memory for vectors, allowing 2 vector loads and
        1 vector store simultaneously)
      - Local memory: some supercomputers (CRAY-2) have local memory
        that can be accessed more rapidly than large common memory

What makes the Cray Y-MP 8 so fast?
   o Fast Hardware
      - 6 ns cycle time
      - high chip-packing density (entire CPU on one module)
      - sophisticated cooling system (Fluorinert cooling using
        chilled dielectric coolant for modules and power supplies,
        using a freon chiller unit)
   o Vector Processing
      - Pipelining using segmented functional units
      - 8 64-word vector registers per CPU
      - Register to register vector arithmetic
         - permits fast startup
      - Gather/Scatter instructions (for sparse matrices)
   o Multiple Simultaneous Operations
      - Overlapping of independent operations
      - Chaining of dependent operations
   o Parallel Processing
      - 8 CPUs
      - Microtasking using compiler directives
      - Autotasking using compiler directives
   o Large, Fast Memory
      - Up to 128 million 64-bit words (each job can have 64 million
        words in the large batch queue)
      - 256 interleaved banks in 4 sections, 32 subsections on
        Y-MP 8/864
      - 4 paths to memory for each CPU (2 vector loads, 1 vector
        store, 1 scalar/instruction)
      - Up to 512 million 64-bit words in a Solid State Device (SSD)
        with a transfer rate of 1250 Mbytes/second

Cray Y-MP 8 Physical Characteristics
   o Size
      - 98 sq. ft. floor space for mainframe
      - 15 sq. ft. floor space for IOS
      - 15 sq. ft. floor space for SSD
   o Weight
      - 2.5 tons mainframe
      - 1.5 tons IOS
      - 1.5 tons SSD
   o Cooling: Fluorinert cooling for mainframe modules and power
     supplies, using freon chiller unit
   o Power: 400 Hertz power from motor-generators. (High bills)
   o Memory: ECL
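
The peak-speed formula given earlier in these notes, Peak Speed =
(Ops/cycle) * (Cycles/sec) * CPUs, can be checked against the Y-MP 8
figures with a few lines of code. This is only a minimal sketch: the
helper name peak_mflops is an assumption introduced here, and the
X-MP line assumes 2 operations per cycle, which the notes state only
for the Y-MP.

```python
# Minimal sketch of: Peak Speed = (Ops/cycle) * (Cycles/sec) * CPUs
# peak_mflops is a hypothetical helper name, not from the original notes.

def peak_mflops(ops_per_cycle, cycle_time_ns, cpus):
    """Peak speed in MFLOPS for the given cycle time in nanoseconds."""
    cycles_per_sec = 1.0e9 / cycle_time_ns    # 1 ns = 10^(-9) second
    flops = ops_per_cycle * cycles_per_sec * cpus
    return flops / 1.0e6                      # convert FLOPS to MFLOPS


if __name__ == "__main__":
    # Y-MP 8/864: 2 ops/cycle per CPU, 6 ns cycle time, 8 processors.
    print(f"Cray Y-MP 8: {peak_mflops(2, 6.0, 8):.0f} MFLOPS")
    # X-MP: 4 processors, 8.5 ns clock; 2 ops/cycle is an assumption.
    print(f"CRAY X-MP:   {peak_mflops(2, 8.5, 4):.0f} MFLOPS")
```

With these inputs the Y-MP result rounds to the 2,667 MFLOPS quoted
above, and the X-MP result comes out within about 1 MFLOPS of the 940
listed in the Peak Speeds table.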