Matrix-matrix multiply: C = A B
The guiding principle is that once data has been moved from memory into the cache or the CPU registers, the program should be organized to perform as many of the operations involving that data as possible before the data is displaced.
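One common way to apply this principle to C = A B is to work on small square tiles of the matrices, so that each tile is reused many times while it is still in cache. The following is a minimal sketch, not a tuned implementation. It assumes square n x n matrices stored in row-major order; the function name matmul_blocked, the helper min_size, and the tile size parameter bs are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; used to clip tiles at the matrix edge. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Accumulates C += A * B for n x n row-major matrices, one bs x bs tile
   at a time. The caller must zero C beforehand to obtain C = A * B.
   (Name, signature, and tile parameter are assumptions for this sketch.) */
void matmul_blocked(size_t n, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        for (size_t kk = 0; kk < n; kk += bs) {
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_size(ii + bs, n);
                size_t k_end = min_size(kk + bs, n);
                size_t j_end = min_size(jj + bs, n);
                /* All work that touches the current tiles is done here,
                   before moving on to the next tiles. */
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        /* Load A[i][k] once and reuse it across the row. */
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The tile size bs would normally be chosen so that the tiles of A, B, and C in use fit in cache together; a suitable value depends on the machine and is best found by measurement.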
Fig. A-3 Vector processor registers
© O'Reilly Publishers
(Used with permission)
Fig. A-4 A vector processor at work
© O'Reilly Publishers
(Used with permission)