Pipelines
and Prefetchers
Copyright 1991,
Jack G. Ganssle
Abstract
All modern processors
use pipelines and/or prefetchers to increase performance. Here's how they
work.
Published in
Embedded Systems Programming, July 1991
There's more
than "crude" in them thar pipelines. While Congress fiddles
with trying to keep oil pipelines full, those of us
building the information economy realize that bandwidth is a far more valuable
commodity, though one that will probably never generate the
same sort of frenzied trading in the pits of Chicago.
High performance
systems are ultimately bound by memory speed. The economics are compelling:
it's pretty easy (where speed is an issue) to
justify one relatively expensive super-fast CPU. Equally fast memory is generally
out of the question, since hundreds or even thousands of
chips make up the storage array.
Usually memory
is the high performance system's biggest bottleneck. Every single instruction
must be fetched from the memory every time it
is needed. When the CPU is faster than the RAM chips, the connection between
them is the limiting performance parameter.
RF engineers
use the word "bandwidth" to describe the information capacity
of a radio channel. The stink over HDTV is
essentially a bandwidth issue: how can one cram additional picture information
onto the current 6 Mhz TV channel allocation? Bandwidth is
just as big an issue in computer systems, where we transfer millions of bytes/second
over a data bus.
Computer history
is filled with innovative and sometimes desperate attempts to maximize the
bandwidth of the CPU-to-memory connection.
Back in the 60's megabuck machines relied on interleaved core. All even addresses
went to one physical memory bank, and the odds went to
another. The operation of each bank could run in parallel, effectively doubling
the bus rate.
Now PCs and workstations
rely on huge caches, which are nothing more than RAMs stuck between the main
memory and the CPU. The cache RAM
runs at the same speed the processor does, pushing the bandwidth limitations
further away from the computer. If the next instruction
needed is in cache, the CPU can fetch it without waiting. As soon as it needs
one outside of the cached region the entire computer
effectively comes to a halt as it waits for the relatively slow memory to
cycle.
Conventional
(i.e., CISC) processors make very strange demands on the memory array. When
executing short instructions like register to
register ADDs, the CPU is usually much quicker than memory and sucks instructions
as fast as the RAM can supply them. Frequently the CPU
will slow down as it handles a slow instruction. It might take a number of
machine cycles to handle a complex addressing mode, or perhaps
hundreds to do a multiply. During these periods the bus is idle. Prefetchers
I believe the
8088 was the first widely available microprocessor to address this issue.
Intel divided the CPU in two. An Execution Unit
contained all of the traditional processor-like features. The Bus Interface
Unit (BIU) sat between the Execution Unit and the device's
pins. If the 8088 were a simple 8080 or 6800, the BIU would just pass memory
requests from the Execution Unit to the outside world.
However, Intel decide to exploit the idle times resulting from the execution
of long instructions by adding intelligence to the BIU.
Code generally
executes from a low address to a bigger one. Sure, jumps and calls reverse
the monotonically increasing fetch sequence, but
even in a short loop more often than not the next instruction byte is located
right after the one just fetched. The 8088's BIU used this
fact to keep the CPU to memory interface busy (i.e., maximize the bandwidth).
The BIU is really
rather stupid. It just blindly keeps fetching the next byte from RAM, storing
it in a little FIFO between it and the
Execution Unit. When the EU is ready for the next instruction, with luck it
is already on-chip in the fast FIFO. A lot of memory fetch
delays are thus avoided, greatly increasing the processor's throughput. The
FIFO is small, holding only 4 bytes (6 on the 8086), but the
typical match between RAM speed, CPU speed, and FIFO size is such that most
of the time the next byte is there when needed. Not
surprisingly, this process is called prefetching.
Once in a while
the CPU will decode and execute some sort of branch operation. Presumably
the BIU will have at least partially filled the
FIFO with bytes located sequentially beyond the jump; bytes that just are
not needed. Jumps and calls flush the FIFO, erasing these
unneeded entries. The BIU then starts fetching from the new execution address.
Program transfers therefore essentially stall the CPU; the
processor must wait for the first instruction at the jump destination before
proceeding, just like any simple non-prefetching computer
would. Soon, however, the BIU will again fill the FIFO, keeping a bit ahead
of the EU's needs.
Some instructions
read or write data to the memory array. Non-Harvard machines (like the 8088)
share one data bus for both instructions
and data, so a load or store operation causes a momentary disruption in normal
prefetching sequence. Loads and stores work around the
FIFO; they temporarily suspend prefetching, transfer the data, and then resume
without corrupting the FIFO's contents.
In general, a
deeper FIFO lets the BIU get further ahead of the Execution Unit, maximizing
the memory channel bandwidth. It's hard to say
just what the most efficient FIFO size is. Different CPUs have different depths;
presumably this reflects the design philosophies of the
vendors (or perhaps just their ability to cram more transistors on a die).
The Pipeline
Some microprocessor
go a step further and incorporate a so-called "Pipeline".
The pipeline is similar to a prefetcher, in that
it blindly and dumbly reads ahead, trying too fill a queue with sequential
instructions that might be needed. It offers one important
speed-enhancing refinement - it partially executes every queued instruction.
Where with a prefetcher the queued instructions just hang
around in a FIFO, the pipeline contains logic that starts to interpret each
prefetched instruction.
RISC processors
depend on pipelines to meet their much vaunted one clock per instruction specification.
Consider the R3000. This nifty
RISC processor includes a 5 stage pipeline, segmented into the following operations:
Fetch the instruction
Read operands from registers and decode instruction
Do the operation (like ADD, OR, etc)
Access memory, if needed
Write results back to CPU registers
The R3000's pipeline
is a 5 stage affair, each stage corresponding to one of the above states.
At any time the CPU is working on 5
instructions at the same time: an instruction is being handled in each stage.
After every clock all 5 march ahead to the next stage.
This arrangement
permits an effective execution rate of one instruction every clock, but the
first one takes 5 clocks to complete. It's
rather like the shipyards during World War 2 that cranked out a ship every
48 hours. Each ship took months to build, but many were in
various stages of completion. The Downside
Prefetchers and
pipelines are truly wonderful speed-enhancing features for any computer. Be
aware, however, that in some conditions their
performance is less than thrilling. Rarely, though, will the addition of either
result in code that runs slower than on a conventional
fetch-then-execute machine (like the Z80).
Remember that
both pipelines and prefetchers (which I'll just call prefetchers here unless
a distinction is really needed) flush their
contents whenever a program branch occurs. Obviously this happens for all
jumps, calls, and returns. Perhaps not so obviously, every
interrupt will flush them. Thus, the prefetcher offers no improvement for
interrupt latency, a big problem in embedded systems.
Some CPUs have
multiple address spaces. Both the 68000 and Z280 include user and system areas
that are quite different. Be careful about
changing modes (say, from system to user) without executing some sort of jump
that will clear the prefetcher. A simple mode transfer might
not cause a flush, leaving incorrect bytes behind.
Conditional jumps
may behave a bit differently than you are used to. The 64180 doesn't have
a prefetcher of any sort, yet if it knows it
will not take a conditional jump (it can decide this after reading the first
byte), then it will not read the third instruction byte. This
frees up some memory bandwidth, and is a nice, intelligent way to deal with
conditionals. A prefetching processor will usually read all of
the bytes before making the conditional jump decision, so some implicit bandwidth
is lost.
Prefetcher stall
is a very real problem. All memory bus activity stops as the CPU goes to RAM
for the argument of a load or store
operation. If the RAM asserts wait states, the entire system, including prefetcher,
is shut down for the memory transaction. Where speed
is a big problem and prefetcher stalls cannot be tolerated, put frequently
used variables in very fast memory.
Debugging with
a prefetching processor can be the stuff of horror movies. Don't figure on
using a logic analyzer to capture the execution
stream. The prefetcher grabs instructions in a stupid, monotonically increasing
sequence of addresses. Branches or other program transfers
cause this to abruptly halt and restart at the destination address. As a result,
some instructions will only be partially fetched,
completely confusing the analyzer's disassembler. Worse, an instruction is
fetched long before it is executed. If it transfers data the
load/store argument will appear in the trace data quite a bit after the instruction
proper. How will you align arguments with
instructions?
Suppose you use
a logic analyzer to trigger a pseudo breakpoint. The analyzer only sees bus
transactions, so it will break on the
instruction fetch, not its execution. This is simply wrong, since the fetched
instruction is often discarded when the prefetcher is
flushed after a jump. For example:
loop: do something
jnz loop
add bx,cx
If you wish to
break when the loop completes, a breakpoint on the ADD instruction should
do the trick - shouldn't it? Actually, a logic
analyzer will trigger the break condition every time the jump executes, since
the prefetcher will surely read the ADD. Some tools will
handle this properly, but be sure to check with the vendor. Other Problems
Be wary of especially
smart CPUs that have prefetchers - particularly those with Memory Management
Units. Suppose the MMU can produce an
address violation error if some illegal location is referenced. If code is
immediately adjacent to an illegal area, the prefetcher might
read ahead into the bad zone and create a false trigger of the violation trap.
Even without an MMU this is a problem if your system uses a
watchdog circuit that monitors the address bus.
Similarly, be
careful about memory mapped I/O when using a prefetcher. Code that runs right
next to an I/O port could cause an extraneous
prefetch read of the port.
Avoid all of
these problems by keeping your code at least one prefetch depth away from
the end of a memory protection segment, or from
memory mapped I/O addresses.
Finally, be careful
about self modifying code. If you really need to use it (and yes, in the embedded
world it is sometimes awfully hard
to avoid, despite the hysterical pleadings of an army of structured programming
advocates) be sure that when you modify the code in memory
you don't wind up with an unmodified copy in the prefetch queue. I've been
burned by this - it's not easy to find. Conclusion
Though I've drawn
a distinction between the terms "pipeline" and "prefetcher",
most designers use them
interchangeably. The engineers at Zilog are always quick to correct my use
of "prefetcher" to "pipeline" when talking
about their Z280. Intel is just as adamant to insist I call their 80x88 implementation
a prefetcher. Perhaps I'm just an easy target. Is
there a true industry-accepted distinction between the terms? My definitions
are not from Webster, but gleaned from studying many vendor
manuals.