History
The first successful implementations of a vector processor appear to be the CDC Cyber 100 and the Texas Instruments Advanced Scientific Computer (ASC). The Cyber was otherwise slower than CDC's own supercomputers like the CDC 7600 (though much smaller too), but at data-related tasks it could be quite a bit faster. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.
The technique was first fully exploited in the famous Cray-1. Instead of leaving the data in memory like the Cyber and ASC, the Cray design had eight "vector registers", each of which held sixty-four 64-bit words. The vector instructions were applied between registers, which is much faster than talking to main memory. In addition, the design had completely separate pipelines for different instructions (addition was implemented in different hardware than multiplication, for instance), allowing a batch of vector instructions themselves to be pipelined, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS, a respectable number for years afterwards.
Other examples followed. CDC tried once again with its ETA-10 machine, but it sold poorly, and CDC took that as an opportunity to leave the supercomputing field entirely. Various Japanese companies (Fujitsu, Hitachi and NEC) introduced register-based vector machines similar to the Cray-1, typically slightly faster and much smaller. However, Cray continued to be the performance leader, continually besting the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then the supercomputer market has focussed much more on massively parallel processing than on better implementations of vector processors.
Today the average computer at home crunches as much data watching a short QuickTime video as did all of the supercomputers in the 1970s. Vector processor elements have since been added to almost all modern CPU designs, although they are typically referred to as SIMD. In these implementations the vector processor runs beside the main scalar CPU, and is fed data by programs written to take advantage of it.
Description
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C".
The data for A, B and C could, in theory at least, be encoded directly into the instruction. However, things are rarely that simple. In fact the data is rarely sent in raw form, and is almost always "pointed to" by passing in the address of a memory location that holds the data. Decoding this address and getting the data out of memory takes some time, and this delay has grown relatively more costly as CPU speeds have increased.
In order to reduce the amount of time this takes, most modern CPUs use a technique known as instruction pipelining, in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next fetches the values, and the next does the math. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in assembly-line fashion, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete (known as the latency), but the CPU can process an entire batch of instructions much faster than if it handled them one at a time.
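The effect of pipelining on a batch of instructions can be sketched with a simple cycle count. The model below is an idealization rather than a description of any real CPU: it assumes the three sub-units mentioned above, that each takes exactly one cycle, and that nothing ever stalls. The function name `cycles_needed` is hypothetical.

```python
def cycles_needed(n_instructions, n_stages, pipelined):
    """Idealized cycle count: every stage takes one cycle and nothing stalls."""
    if n_instructions == 0:
        return 0
    if pipelined:
        # The first instruction needs n_stages cycles to pass through;
        # after that, one more instruction completes on every cycle.
        return n_stages + (n_instructions - 1)
    # Without pipelining, each instruction occupies the whole CPU until done.
    return n_stages * n_instructions


if __name__ == "__main__":
    stages = 3  # assumed: decode the address, fetch the values, do the math
    for n in (1, 10, 1000):
        print(n, cycles_needed(n, stages, False), cycles_needed(n, stages, True))
```

In this model a single instruction takes three cycles either way, which is the unchanged latency; only the time for the whole batch shrinks.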
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. They are fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there".
From that point the CPU can process this array (or vector) of data much faster. Instead of constantly having to decode addresses and wait for the results, it "knows" that the next address will be one larger than the last. This allows for significant savings in decoding time.
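As a rough illustration of why a known address pattern saves work, the sketch below generates the addresses of one vector operand from a start address and a length, and compares how many addresses would need decoding under a toy cost model. The model is an assumption made for illustration only: three operands (A, B and C) and one decode per operand address. The names `vector_addresses` and `address_decodes` are hypothetical.

```python
def vector_addresses(start, length):
    """Addresses touched by one vector operand: each is one larger than the last."""
    return [start + i for i in range(length)]


def address_decodes(length, vector):
    """Toy cost model (assumed): one decode per operand address."""
    operands = 3  # A, B and C
    if vector:
        # Only the three starting addresses are decoded; the rest are implied.
        return operands
    # A scalar loop decodes all three addresses again for every element.
    return operands * length


if __name__ == "__main__":
    print(vector_addresses(100, 4))
    print(address_decodes(10, vector=False), address_decodes(10, vector=True))
```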
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language you would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
    get this number
    get that number
    add them
    put the result here
    get this number
    get that number
    [and so on]
Each of these instructions has to be decoded and flow through the CPU's pipeline before completing, so the entire operation is sped up only by a small amount.
But to a vector processor, this task looks considerably different:
    get the 10 numbers here and add them to the numbers there
Completing that single instruction may take longer than the simple add-two-numbers instruction in the general-purpose CPU. However, this single instruction represents many instructions from the other CPU, so not only can it skip all of those address decodes, it also has only a single instruction to decode.
But more than that, the vector processor typically has some form of superscalar implementation, meaning there isn't one part of the CPU adding up those 10 numbers, but perhaps two or four of them. Since the result for one pair of numbers does not depend on the result for any other pair, those two (for instance) parts can each add 5 of the numbers, thereby completing the whole operation in half the time.
Not all problems can be attacked with this sort of solution. Adding these sorts of instructions adds complexity to the core CPU, which typically suffers in more mundane parts of its performance, i.e. whenever it's not adding up 10 numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions like "if".
In fact, vector processors work best only when there are large amounts of data to work on. This is why these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were found in places like weather prediction centres and physics labs, where huge amounts of data of exactly this kind are "crunched".
The NEC SX-6 supercomputer architecture is a NUMA architecture built out of SMP machines with 8 vector processors each.
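The ten-number addition discussed in this section can be sketched as a toy model in Python. It is not a simulation of any real machine: it simply counts four decoded instructions per pair for the scalar loop (get, get, add, put), one decoded instruction for the vector version, and shares the independent additions among an assumed number of adders. The names `scalar_add`, `vector_add` and `lanes` are hypothetical.

```python
def scalar_add(a, b):
    """Add element by element, counting the instructions of the scalar loop."""
    result = []
    decoded = 0
    for x, y in zip(a, b):
        # get this number, get that number, add them, put the result here
        decoded += 4
        result.append(x + y)
    return result, decoded


def vector_add(a, b, lanes=2):
    """One vector instruction; the pairs are shared out among `lanes` adders."""
    decoded = 1  # "get the 10 numbers here and add them to the numbers there"
    result = [None] * len(a)
    steps = 0
    for first in range(0, len(a), lanes):
        # In each step every adder handles one pair; no pair depends on another.
        for i in range(first, min(first + lanes, len(a))):
            result[i] = a[i] + b[i]
        steps += 1
    return result, decoded, steps


if __name__ == "__main__":
    a = list(range(10))
    b = list(range(10, 20))
    print(scalar_add(a, b))
    print(vector_add(a, b, lanes=2))
```

With two adders the ten additions finish in five steps rather than ten, which is the halving described above. The model interleaves the pairs between the adders rather than giving each a block of five, which does not change the step count.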