CS 596 Supercomputing for Undergraduates - Week 2           (charac-hpc.txt)

        Kris Stewart, Assoc. Prof., SDSU (stewart@sdsu.edu)
                      Senior Fellow, SDSC (619) 942-1012


Before we begin our study of architecture, we will cover a
very quick overview of the characteristics of
supercomputers. This will provide guidance while working
through the text by Hennessy and Patterson. These notes come
from Dan Sulzbach's presentation at the August 1990
Advanced Computing Institute at SDSC.

                 What is a supercomputer?
Supercomputers are the class of computers that, at any given
time, are the most powerful for scientific computation, with
sufficient memory to give significant throughput on
large-scale applications. Due to the cost of these
machines ($10 million, for example), they must have general
purpose capability. Few applications can afford to dedicate
such an expensive resource to only one type of problem.
(Maybe DOD and NCAR can.)

                         Powerful
The emphasis is on speed. Which components of a
supercomputer can be timed?
     o Central Processing Unit (CPU)
     o Memory
     o I/O Subsystem

                  How to quantify speed?
The common unit of measurement is the number of Floating
Point Operations per Second (FLOPS).
     o MFLOPS (MegaFLOPS - 10^6) -appropriate measure in the
       70's and 80's
     o GFLOPS (GigaFLOPS - 10^9) -appropriate for the
       8-processor Cray Y-MP
     o TFLOPS (TeraFLOPS - 10^12) -tomorrow's machine; the
       goal of the Intel Hypercube is to have sustained
       computation in TFLOPS
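
To make the units concrete, here is a minimal Python sketch
(not part of the original presentation). It counts the
floating point operations in a simple loop, divides by the
elapsed time, and prints the rate in the largest fitting
unit. The names format_flops and measure_flops are made up
for this illustration. Interpreted Python adds heavy
overhead, so the figure it reports is far below what the
hardware can do.

```python
import time

def format_flops(rate):
    """Express a rate in FLOPS using the largest unit that fits."""
    for unit, scale in (("TFLOPS", 1e12), ("GFLOPS", 1e9), ("MFLOPS", 1e6)):
        if rate >= scale:
            return f"{rate / scale:.2f} {unit}"
    return f"{rate:.0f} FLOPS"

def measure_flops(n=1_000_000):
    """Time n multiply-add pairs, i.e. 2*n floating point operations."""
    x, total = 1.000001, 0.0
    start = time.perf_counter()
    for _ in range(n):
        total += x * x          # one multiply and one add
    elapsed = time.perf_counter() - start
    return 2 * n / elapsed      # operations per second

print(format_flops(measure_flops()))
```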

                          Memory
Access to very large memory is needed for large-scale
applications. Very large memory is also needed to handle
many modest to large application problems at once in a
multi-user system such as Unix. The memory must be not only
large, but also fast, to avoid idle CPUs.

The speed of computers (CPUs) has been increasing
continually, by a factor of about 10 every 5 years since
1950. In general, supercomputers are
obsolete 3 to 5 years after their initial commercial
delivery.
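
The growth figure can be turned into a yearly rate and a
doubling time. The sketch below is plain arithmetic on the
"factor of 10 every 5 years" rule of thumb. The variable
and function names are illustrative only, and the rule is
treated as exact even though the notes say "or so".

```python
import math

FACTOR = 10.0        # speed-up per period, from the notes
PERIOD_YEARS = 5.0   # length of one period, from the notes

# Equivalent growth per year: 10^(1/5), about 1.58x per year.
annual_growth = FACTOR ** (1.0 / PERIOD_YEARS)

# Years needed for speed to double: 5 * log(2)/log(10), about 1.5.
doubling_time = PERIOD_YEARS * math.log(2.0) / math.log(FACTOR)

def speedup_over(years):
    """Total speed-up after the given number of years."""
    return FACTOR ** (years / PERIOD_YEARS)

print(f"annual growth factor: {annual_growth:.3f}")
print(f"doubling time (years): {doubling_time:.3f}")
print(f"speed-up 1950 to 1990: {speedup_over(40):.0e}")
```

At this rate a machine falls a factor of 4 to 10 behind
within 3 to 5 years, which fits the obsolescence estimate
above.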

                A History of Supercomputers
     o Late 1950s   CDC-1604
     o Early 1960s  CDC-3600
     o Late 1960s   CDC-6600
     o Early 1970s  CDC-7600
     o Late 1970s   CRAY-1, CYBER 205
     o Early 1980s  CRAY X-MP, Fujitsu VP200
     o Late 1980s   CRAY-2, CRAY Y-MP, Fujitsu VP400, NEC SX-2,
                    Hitachi S-820, ETA10

             What makes a Supercomputer Super?
     o Hardware -fast circuitry
     o Architecture -multiple operations can take place concurrently
     o Memory -large size and high speed

                Computer Hardware and Time
     o Clock -A synchronizing device within a computer
       system. The clock signal is a repetitive signal that times,
       controls, and coordinates most operations in a computer.
     o Clock cycle -The interval between clock signals.
       Cycle time is often used as a measure of a computer's
       potential performance.

       Computer             Cycle Time (ns)
     -------------         -----------------
      CDC 6600                 100 (1965)
      CDC 7600                  25 (1970)
      VAX 11/750               320
      VAX 11/780               200
      VAX 8600                  80
      VAX 8800                  45
      Supertek S-1              55
      CYBER 205                 20
      IBM 3090                  14.5
      CRAY-1                    12.5
      CRAY X-MP                  8.5
      Fujitsu VP400E             7 (vector), 14 (scalar)
      CRAY Y-MP                  6
      NEC SX-X                   2.9
      CRAY-2                     4.1
      Hitachi S-820              4 (vector), 8 (scalar)
      CRAY Y-MP 16               4 (predicted)
      CRAY 3                     2 (predicted)
      SSI                        2 (predicted)
      NEC SX-X ('95)             1 (predicted)
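
Cycle time and clock rate are reciprocals, so the table can
also be read as a list of clock frequencies. A small sketch
of the conversion follows. The function name clock_rate_mhz
is chosen here for illustration.

```python
def clock_rate_mhz(cycle_time_ns):
    """Convert a cycle time in nanoseconds to a clock rate in MHz.

    1 ns = 10^-9 s, so rate = 1 / (t * 10^-9) Hz = 1000 / t MHz.
    """
    return 1000.0 / cycle_time_ns

# A few entries from the cycle time table above.
for name, ns in (("CDC 6600", 100.0), ("CRAY-1", 12.5), ("CRAY Y-MP", 6.0)):
    print(f"{name:10s} {ns:6.1f} ns  ->  {clock_rate_mhz(ns):7.1f} MHz")
```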

              Peak Speed (Salesman's Number)
     o Maximum number of adds and multiplies performed in one second
     o Speed the computer cannot exceed
     o Peak Speed = (Ops/cycle) * (Cycles/sec) * CPUs
     o Peak Speed of Y-MP 8/864
          - 6 nanosecond cycle time (1 ns = 10^(-9) second)
          - 2 operations per cycle for each CPU
               - Independent functional units for + and *
          - 8 processors
          - Peak speed = 2 ops/cycle * (1/6) * 10^9 cycles/sec * 8 CPUs
                       = 2,667 MFLOPS = 2.7 GFLOPS
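
The peak speed formula is easy to check by machine. The
sketch below reproduces the Y-MP 8/864 figure from the
inputs listed above. The function name peak_mflops is
illustrative. The second call assumes 2 operations per
cycle for the 4-processor CRAY X-MP as well, which the
notes do not state.

```python
def peak_mflops(ops_per_cycle, cycle_time_ns, cpus):
    """Peak Speed = (Ops/cycle) * (Cycles/sec) * CPUs, in MFLOPS."""
    cycles_per_sec = 1e9 / cycle_time_ns      # 1 ns = 10^-9 second
    return ops_per_cycle * cycles_per_sec * cpus / 1e6

# Y-MP 8/864: 2 ops/cycle, 6 ns cycle time, 8 CPUs.
print(round(peak_mflops(2, 6.0, 8)))          # 2667

# X-MP: 4 CPUs, 8.5 ns clock; 2 ops/cycle is an assumption here.
print(round(peak_mflops(2, 8.5, 4)))          # 941
```

The assumed X-MP inputs give about 941 MFLOPS, close to
the 940 quoted in the Peak Speeds table.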

                   Peak Speeds
       Computer                   MFLOPS
     ------------              -----------
      CYBER 205                    400
      CRAY-1S                      150
      Convex C-240                 200
           (4 processors)
      CRAY X-MP                    940
           (4 processors; 8.5 ns clock)
      IBM 3090/600J VF             828
           (6 processors)
      Hitachi S-810                840
     ---- 1 GFLOPS and above (values still in MFLOPS) ----
      Fujitsu VP-400E             1700
      NEC SX-2                    1300
      CRAY-2                      1950
           (4 processors)
      Cray Y-MP 8                 2667
           (8 processors)
      Hitachi S-820               3000

Note:  Some have characterized peak speed MFLOPS as Most Flattering
Operational Performance Statistic.

  Computer Architectural Advances - Historical Perspective
     o Separate processor to handle I/O
          -1950's (IBM 709)
          -called an I/O "channel"
          -reduced CPU idle time
     o Instruction look-ahead
          -1960 (IBM Stretch computer)
          -instruction buffers
          -instruction and operand pre-fetching
          -reduced CPU idle time
     o Multiple Functional Units
          -early 1960's (CDC 6600 -Seymour Cray)
          -capitalizes on instruction look-ahead
          -operations performed in parallel when possible
          -minimizes time to perform all operations; increases
            operations per cycle
     o Vector processing -Pipelining and Chaining
          -1970's (CRAY 1 was first successful one)
          -vector is an ordered set of numbers
          -instruction set contains operations on vectors as
            well as scalars
          -segmented functional units allow "pipelining"
               operation in each segment takes one cycle
               on Y-MP:
                    o addition: 7 cycles
                    o multiplication: 8 cycles
                    o reciprocal: 15 cycles
          -CPU may have vector registers
          -increases operations per cycle
     o Multiple CPUs -Parallel Processing
          -multiple CPUs can perform in parallel and independently
          -independent parts of a single program can be run simultaneously
          -increases operations per cycle
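
The benefit of pipelining in the vector processing item
above can be estimated with a simple cycle count. The
sketch assumes an idealized model: the 7-cycle Y-MP
addition is read as a 7-segment unit, a new pair of
operands enters every cycle, and startup overhead and
memory delays are ignored. The function names are
illustrative.

```python
def unpipelined_cycles(n, segments):
    """Each result must finish all segments before the next starts."""
    return n * segments

def pipelined_cycles(n, segments):
    """First result after `segments` cycles, then one result per cycle."""
    return segments + (n - 1)

N = 64          # one full Y-MP vector register (64 words)
ADD_SEGMENTS = 7

slow = unpipelined_cycles(N, ADD_SEGMENTS)
fast = pipelined_cycles(N, ADD_SEGMENTS)
print(f"unpipelined: {slow} cycles, pipelined: {fast} cycles, "
      f"speed-up: {slow / fast:.1f}x")
```

For long vectors the speed-up approaches the number of
segments, which is why pipelining raises operations per
cycle.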

               Supercomputer Memory Features
     o Hardware -chip access times are important
     o Architecture
          -Interleaving: memory unit is divided into a number
            of modules, or banks, that can be accessed simultaneously
          -Paths to memory: this varies among supercomputers (e.g.,
            CRAY-2 has only one path to memory, so only one vector
            can be moved to/from memory at a time, but CRAY Y-MP has
            three paths to memory for vectors, allowing 2 vector loads
            and 1 vector store simultaneously)
          -Local memory: some supercomputers (CRAY-2) have local
            memory that can be accessed more rapidly than large common
            memory
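
Interleaving can be illustrated with a toy address mapping.
The sketch below assumes the simplest scheme, in which
consecutive word addresses go to consecutive banks (address
modulo the number of banks). The notes do not say which
mapping a given machine uses, so treat this as a model, not
a description of real hardware. The bank count of 256 is
the Y-MP 8/864 figure quoted later in these notes.

```python
def bank_of(address, n_banks):
    """Bank holding a word under simple low-order interleaving."""
    return address % n_banks

def banks_touched(start, stride, count, n_banks):
    """Number of distinct banks used by a strided vector access."""
    return len({bank_of(start + i * stride, n_banks) for i in range(count)})

N_BANKS = 256
for stride in (1, 2, 128, 256):
    used = banks_touched(0, stride, 64, N_BANKS)
    print(f"stride {stride:3d}: 64 words spread over {used} bank(s)")
```

A stride of 1 spreads the 64 words over 64 banks, so the
accesses can overlap. A stride of 256 sends every word to
the same bank, so each access must wait for the previous
one.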

            What makes the Cray Y-MP 8 so fast?
     o Fast Hardware
          -6 ns cycle time
          -high chip-packing density (entire CPU on one module)
          -sophisticated cooling system (Fluorinert cooling using
            chilled dielectric coolant for modules and power supplies,
            using a freon chiller unit)
     o Vector Processing
          -Pipelining using segmented functional units
          -8 64-word vector registers per CPU
          -Register to register vector arithmetic -permits fast startup
          -Gather/Scatter instructions (for sparse matrices)
     o Multiple Simultaneous Operations
          -Overlapping of independent operations
          -Chaining of dependent operations
     o Parallel Processing
          -8 CPUs
          -Microtasking using compiler directives
          -Autotasking using compiler directives
     o Large, Fast Memory
          -Up to 128 million 64-bit words (each job can have 64
            million words in the large batch queue)
          -256 interleaved banks in 4 sections, 32 subsections on
            Y-MP 8/864
          -4 paths to memory for each CPU (2 vector loads, 1 vector
            store, 1 scalar/instruction)
          -Up to 512 million 64-bit words in a Solid State Device
            (SSD) with a transfer rate of 1250 Mbytes/second
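
The gather/scatter item above deserves a short
illustration, since it is what makes sparse matrix work
vectorizable. The sketch shows the idea in plain Python: a
gather collects elements at listed index positions into a
compact vector, and a scatter writes a compact vector back
to those positions. This models the concept only; it is not
Cray code, and the function names are illustrative.

```python
def gather(source, indices):
    """Collect source[i] for each listed index into a dense vector."""
    return [source[i] for i in indices]

def scatter(target, indices, values):
    """Store each value at its listed index position in target."""
    for i, v in zip(indices, values):
        target[i] = v

# A sparse vector stored as (index, value) pairs.
idx = [1, 4, 6]
vals = [2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Sparse dot product: gather the matching entries of x, then multiply-add.
dot = sum(v * xi for v, xi in zip(vals, gather(x, idx)))

# Expand the sparse vector into dense storage with a scatter.
dense = [0.0] * len(x)
scatter(dense, idx, vals)
print(dot, dense)
```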

           Cray Y-MP 8 Physical Characteristics
     o Size
          -98 sq. ft. floor space for mainframe
          -15 sq. ft. floor space for IOS
          -15 sq. ft. floor space for SSD
     o Weight
          -2.5 tons mainframe
          -1.5 tons IOS
          -1.5 tons SSD
     o Cooling: Fluorinert cooling for mainframe modules and power
       supplies, using freon chiller unit
     o Power: 400 Hertz power from motor-generators. (High bills)
     o Memory: ECL