Bus Cycles
Copyright 1995,
Jack G. Ganssle
Abstract
Software folks
need to understand how a microprocessor handles data on its busses. Here's
the short intro.
Published in
Embedded Systems Programming, April, 1995
My October column
about DMA brought a number of interesting replies. Quite a few readers commented
that they just did not have a good idea "what all this bus cycle stuff
is about". After all, this is a software magazine; it makes sense that
hardware details, while critical to the embedded world, are a bit fuzzy to
many readers.
How many engineers
understand, in detail, a computer operation? Over the years I've asked each
technical person I've hired for a blow-by-blow description of how a microprocessor
fetches, decodes, and executes instructions. What does the instruction pointer
do? How are JMPs decoded and processed? Surprisingly few can give an accurate,
detailed explanation. Engineers are taught to treat complex components like
microprocessors as block diagrams. Connect the right peripherals according
to the cookbook and you'll never need to understand the block's internal operations.
Though modern
society is built on this concept of increasing abstraction, I maintain that
some sort of understanding of the operation of fundamental devices (like computers)
is essential to efficient solving of complex problems. Knowledge - of anything
- makes life more interesting. Unfortunately a thorough understanding of computer
operation is not terribly likely to enhance your cocktail party repartee:
"exactly how much setup time does your system need?", she asked,
breathlessly, a slight tinge of pink coloring her cheeks. The party's hum
faded into the background; my vision tunneled into sharp focus on her... Uh,
wrong magazine. Suffice to say, inquiring minds want to know.
Even the purest
software type working in the embedded industry will hear no end of discussion
about bus cycles, wait states, reads and writes, and the like when waiting
for the hardware weenies to put yet another Band-Aid on the malfunctioning
prototype hardware. Here's the 10 minute introduction to bus cycles, so you
can at least look knowledgeable.
Tic Toc
Everyone bandies
the word "clock" around. It's common knowledge that the faster your
clock runs the more information processing gets done per unit time. How does
the clock rate relate to processor speed, and why is one even needed?
A science wag
half-facetiously commented that time is what keeps everything from happening
all at once. The same reasoning applies to a computer system. The clock signal,
which is produced by a simple oscillator on the computer board, sequences
processor operations to give the circuits time to complete one operation before
proceeding to the next. Logic takes a while to do anything: RAM may need 50-100
nanoseconds to retrieve a byte, an adder might require 10 or more nanoseconds
to compute a result. The system clock ensures each operation finishes before
the next takes place.
Time is related
to clock frequency by: time=1/frequency. A 10 MHz oscillator gives 100 nanoseconds
per clock cycle. Just to confuse things, many processors divide the input
clock frequency, most of the time by 2. Your 10 MHz oscillator going into
an 80186 actually creates a 5 MHz, or 200 nsec, cycle time.
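The arithmetic above can be sketched in a few lines of Python. The function name and the `divider` parameter are illustrative assumptions, not part of any vendor tool; the divide-by-2 value comes from the 80186 example in the text.

```python
def t_state_ns(oscillator_hz, divider=1):
    """Return the length of one clock cycle (T state) in nanoseconds.

    divider models a processor that divides its input clock internally
    (2 for the 80186 example in the text; hypothetical parameter name).
    """
    cpu_hz = oscillator_hz / divider   # frequency the CPU actually runs at
    return 1e9 / cpu_hz                # time = 1 / frequency, scaled to ns


print(t_state_ns(10e6))       # 10 MHz oscillator, no divider
print(t_state_ns(10e6, 2))    # 10 MHz oscillator into a divide-by-2 part
```

With a 10 MHz oscillator this gives 100 nsec undivided and 200 nsec after a divide-by-2, matching the figures quoted above.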
Each clock period
(the time required for one complete cycle) is called a "T state",
and is the basic unit of processor timing. Nothing completes in less than
a T state, though propagation times through individual components will generally
be faster than the T state time. Designers select a T state interval greater
than the sum of all propagation delays in the memory and I/O paths. (We will
see that memories are notoriously slow; you can inject wait states to add
T states to read or write cycles to avoid the use of very expensive fast devices.)
During a single
T state the processor can do very little - perhaps output the address of the
next memory location needed. An entire memory read or write cycle requires
2 or more T states, depending on the processor selected. A "machine cycle"
is the entire time - 2, 3, or more T states - required to perform a single
read or write.
(RISC systems
are generally single-T state machines. They complete an entire instruction
in one clock cycle, generally by overlapping several operations at once.)
The venerable
Z80 uses 4 T states per machine cycle. Run it at 10 MHz (100 nsec) and you'll
need 400 nsec per machine cycle. Zilog's Z180 is an improved Z80 that, among
other things, is considerably speedier. Most of the performance gain comes
from both a faster clock and a reduction in the number of T states from 4
to 3.
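To put numbers on the Z80 versus Z180 comparison, here is a minimal sketch that uses only the T state counts given above. The function name is an assumption made for illustration.

```python
def machine_cycle_ns(clock_hz, t_states):
    """Length of one machine cycle in nanoseconds."""
    return t_states * 1e9 / clock_hz


z80_ns = machine_cycle_ns(10e6, 4)    # 4 T states per machine cycle
z180_ns = machine_cycle_ns(10e6, 3)   # 3 T states per machine cycle

print(z80_ns, z180_ns)
print(z80_ns / z180_ns)   # gain from the T state reduction alone
```

At the same clock rate, dropping from 4 to 3 T states shortens every machine cycle by a quarter; any increase in clock rate multiplies on top of that.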
The 386 is a
two T state machine. If its instruction set and bus width were the same as
a Z80 (dear Intel - please forgive me!), it would run at twice the speed of
the 4 T state Z80 at the same clock rate.
Instructions
We've talked
about T states and machine cycles - how long does an instruction take?
Most instructions
are composed of the following pieces: a "fetch" process, instruction
decoding, perhaps some more fetching, and execution. Let's look at a few typical
operations to see how these work, using the Z80 for simplicity's sake.
A NOP is opcode
00. Despite the obvious fact that this is a "do nothing" command,
the processor must at the very least read the instruction from memory (the
"fetch" cycle), decode it, and then very cleverly execute it ("do
nothing").
The NOP executes
in one machine cycle. Essentially all of the cycle is devoted to fetching
the 00 from memory. As I mentioned, this requires 4 T states on a Z80. After
completing the fetch the processor very quickly decodes the instruction, realizes
that there's nothing to do, and then starts another machine cycle to get the
next instruction.
Now consider
a JMP 0100. An absolute jump to a 16 bit destination address, regardless of
processor, needs at least (and in this case exactly) a three byte instruction:
the opcode itself (C3 on the Z80), and two bytes of destination address.
On the Z80 three
machine cycles handle the jump. The first is a fetch that reads the C3 opcode.
The CPU quickly (near the end of this first cycle) realizes that C3 implies
a jump requiring a two byte operand. It therefore issues two back-to-back
reads - each an entire machine cycle - to bring in the destination address.
Only then can the Z80 load its instruction pointer with 0100 and start sucking
in code from the new address. A simple JMP takes 12 T states: three machine
cycles at 4 T states each. With a 10 MHz clock, we're looking at 1.2 microseconds
execution time.
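The NOP and JMP timings can be reproduced with a small lookup table. This is a sketch of the simplified model used in this column (every machine cycle costs 4 T states); the table and function names are assumptions, and real parts vary the length of each cycle type, so datasheet figures can differ.

```python
T_STATES_PER_MACHINE_CYCLE = 4   # simplified Z80 figure used in the text

# Machine cycles per instruction, as described in the text.
MACHINE_CYCLES = {
    "NOP": 1,        # opcode fetch only
    "JMP 0100": 3,   # opcode fetch plus two operand reads
}


def instruction_time_us(mnemonic, clock_hz=10e6):
    """Execution time in microseconds under the simplified model."""
    t_states = MACHINE_CYCLES[mnemonic] * T_STATES_PER_MACHINE_CYCLE
    return t_states * 1e6 / clock_hz


for name in MACHINE_CYCLES:
    print(name, instruction_time_us(name), "usec")
```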
Here's where
you find the first benefit of 16 or 32 bit computers: the wider bus reduces
the number of fetches needed to read long instructions. Each machine cycle
takes time... a lot of time. Any time you can eliminate these by using smaller,
smarter instructions or a wider bus you'll get substantial performance improvements.
Now consider
LD A,(HL). This one byte opcode tells the Z80 to read from memory at the address
contained in register pair HL. The single byte opcode takes but a single machine
cycle to fetch. Once read and decoded, though, the CPU must put the contents
of HL on the address bus and read whatever is found there into register A
- requiring another machine cycle.
Taking this one
step further, execute a POP HL. Again, this single byte opcode needs only
one fetch cycle. After decoding the meaning of the byte, the Z80 realizes
that two bytes (16 bits) are required from the stack. It starts a second read
cycle, now with the stack pointer providing the address. A third then commences,
this time at address SP+1. Here, a single byte opcode needed three machine
cycles. Of course, if you were clever enough to use a 16 bit processor the
entire operation could complete in two: a fetch, and a word-wide read at the
SP address.
Here's a case
where a 32 bit processor brings no implicit advantage. It still needs two
cycles, even though we're only transferring 24 bits (8 bits of opcode and
16 of stack data), because the opcode is at one address and the stack is (hopefully!)
somewhere else. The CPU can only issue a single address - a single memory
operation - at a time. Of course, Harvard architecture machines, like most
DSPs, have separate data and instruction busses, and can run simultaneous
transfers. There's a corresponding performance improvement.
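The POP HL example can be sketched as a list of bus cycles for different bus widths. This is an illustrative model only: the function name, the sample addresses, and the assumption that the stack word is aligned so a wide bus can read it in one cycle are all mine, and it models a single shared bus rather than a Harvard machine.

```python
def pop_hl_cycles(pc, sp, bus_bytes=1):
    """Return the machine cycles for POP HL as (kind, address) pairs.

    bus_bytes is the data bus width in bytes (1 = 8 bit, 2 = 16 bit,
    4 = 32 bit). Assumes the stack word is suitably aligned.
    """
    cycles = [("fetch", pc)]                # the one byte opcode
    for offset in range(0, 2, bus_bytes):   # two bytes of stack data
        cycles.append(("read", sp + offset))
    return cycles


# Hypothetical addresses chosen only for the demonstration.
for width in (1, 2, 4):
    print(width, pop_hl_cycles(pc=0x0100, sp=0x8000, bus_bytes=width))
```

The 8 bit bus needs three cycles, while both the 16 and 32 bit buses need two, since the opcode and the stack data live at different addresses.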
In these examples
the instruction execution time was buried in the read and fetch cycles. A
JMP, POP, ADD, and most other operations are quite simple. Others are not.
The Z180 includes integer multiplies and divides, which use the "shift
and add/subtract" algorithm. Execution time is a function of the operands
supplied. A single machine cycle can take many, many clocks while the bus lies
idle (nothing to transfer between the processor and memory) and the CPU
whirs along, thinking very hard.
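For readers who have not met it, here is the textbook shift-and-add multiply in Python. It is a sketch of the general algorithm, not a description of any particular chip's internal logic; the function name and the returned step count are assumptions added to show how the work depends on the operands.

```python
def shift_and_add_multiply(a, b):
    """Multiply two non-negative integers by shifting and adding.

    Returns (product, add_steps), where add_steps counts how many
    additions were needed for this particular multiplier.
    """
    product = 0
    add_steps = 0
    while b:
        if b & 1:          # low bit of multiplier set: add multiplicand
            product += a
            add_steps += 1
        a <<= 1            # shift multiplicand left one place
        b >>= 1            # move on to the next multiplier bit
    return product, add_steps


print(shift_and_add_multiply(13, 11))   # multiplier 1011b
print(shift_and_add_multiply(7, 8))     # multiplier 1000b
```

A multiplier with many 1 bits needs more additions than one with few, which is one way execution time can vary with the data.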
Here's where
adding transistors to a device improves performance. Use a barrel shifter
(a sort of parallel shifter that works in a single clock cycle), and the multiply
times approach zero. Bus width reduces machine cycles by doing more at once;
transistors shorten long machine cycles by completing complex operations faster.
Fetch, Read or Write?
Though some CPUs
support oddball machine cycles (like DRAM refresh), virtually every cycle
is either a Fetch, Read, or Write. Fetches read instructions - always from
memory. Read and write cycles transfer operand data, from memory or I/O.
Fetch and memory
read cycles intuitively feel the same. Both read bytes from RAM or ROM. The
difference is subtle. Many processors have a "fetch" signal that
differentiates the two. Sometimes, as on the Z80, the exact timing may differ
a bit between the two. As the industry evolves, though, the difference in
timing and signals is disappearing. Often it's all but impossible to tell
what the CPU is doing by watching the bus, unless you notice that fetches
generally are from increasing addresses (programs execute from low addresses
to higher ones, unless there's a program transfer), while memory reads occur
much less frequently, generally from addresses not near the code. The hardware
doesn't care or need to know what's going on. An address comes out, the CPU
asserts the read signal, and the selected memory device transfers data back.
In the preceding
discussion we've looked at common instructions, and have found that each one
is nothing more than a sequence of machine cycles. If we ignore interrupts,
refresh, and other infrequent intruders, there are only two basic kinds of
machine cycles: reads and writes. Let's look at what goes on during a cycle.
The figure shows
a Z180 read cycle. This is a typical timing diagram of the sort hardware
folks sweat over, representing how the signals on the computer bus change
over time. If you connected the signals shown to a logic analyzer you'd see
just this sort of display.
The top signal,
clock, provides the basic timing reference to the system. Each cycle is one
T-state, as indicated by the "T1", "T2", and "T3"
designations.
Shortly after
the cycle begins the processor provides an address. I've represented the 16
address lines by one bus; if it wanted to read from 0100, then A15 to A0 are
all zeroes except for A8. At about the same time the CPU also asserts its
Memory Read signal.
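If the mapping from an address to individual address lines is not obvious, this short sketch lists which lines are high for a given address. It assumes the 0100 in the text is hexadecimal; the function name is made up for illustration.

```python
def asserted_address_lines(address, width=16):
    """Return the names of the address lines that are high."""
    return [f"A{bit}" for bit in range(width) if address & (1 << bit)]


print(asserted_address_lines(0x0100))   # only one line is high
print(asserted_address_lines(0x8001))   # top and bottom lines high
```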
The processor
is telling the memory array to return a byte from address 0100. The CPU drives
address and Memory Read; now it wants memory to drive data back to it on the
data bus. You'll notice that the addresses remain valid during the entire
time Memory Read is asserted. These stable signals go into a ROM, say, giving
the ROM time to pull data from the selected address and put it on the bus.
Remember, memories are slow.
Sooner or later
the ROM data will be valid. The processor specification tells the system designer
how much time is allowed before valid data must appear. The timing diagram
shows that the ROM had better respond a little bit before Read goes away. This
is called the minimum "setup" time.
When the processor
removes Memory Read the cycle is almost over. Another specification, "hold
time", specifies how long the data from the ROM should remain on the
processor's data bus after Read disappears.
Setup and hold
times are truly critical. Violate the minimums, and your system will erratically
crash.
But... that's
it. What could be simpler?
Now, getting
the timing just right can be tedious, since good design implies using memories
that are fast enough, but not too fast (speed is expensive). High speed clocks
cause all sorts of trouble in ensuring the setup and hold times are met. I
don't want to denigrate the problems faced by a hardware designer! It is important,
though, to realize that the basics of timing are really quite simple. Ah,
just as the basics of programming are no great mystery. Well, OK, maybe no one
with less than a genius IQ will be able to figure out your code, but...
Write and I/O
cycles are very similar. The timing might shift a little, but the concepts
are the same. The biggest difference is that during a Write cycle the processor
drives address, Memory Write, and the data lines to memory or I/O.
Suppose the ROMs
are too slow? Since you can't speed up a slow memory chip you have to slow
down the computer. A wait state stretches the time during which Read or Write
is asserted, giving the device more time to decode an address. Each processor
has a wait state input, and associated specification for driving this line.
If you assert it by a particular time in the cycle, the processor will generate
additional T-states while keeping the address and read or write valid. In
effect, using the example in the figure, you'll get extra T2 states for as
long as the Wait input is asserted.
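The stretching effect can be modeled as a sequence of T states. This is a simplified sketch of the description above: it assumes the processor samples the Wait input once per T state after T2 and repeats T2 while it is asserted. The function name and the list-of-samples interface are hypothetical, and real parts define their own sampling points.

```python
def read_cycle_states(wait_samples):
    """Return the T states of one read cycle.

    wait_samples is a sequence of booleans: the level of the Wait
    input each time the (modeled) processor samples it.
    """
    states = ["T1", "T2"]
    for wait_asserted in wait_samples:
        if not wait_asserted:
            break                 # Wait released: cycle may finish
        states.append("T2")       # stretch the cycle by one T state
    states.append("T3")
    return states


print(read_cycle_states([]))                    # no waits
print(read_cycle_states([True, True, False]))   # two wait states
```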
The penalty for
using a wait state depends on the processor. One wait state on a Z80 stretches
the machine cycle from 4 to 5 T states - not a tremendous change. Add one
wait to a two T state machine like the 386, and each bus cycle grows from 2 to 3
T states, making it 50% longer. Ugh!
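A quick calculation shows why the same single wait state hurts some processors more than others. The function name is an assumption; the T state counts are the ones quoted above.

```python
def wait_state_penalty(base_t_states, wait_states):
    """Fractional increase in machine cycle length caused by waits."""
    return wait_states / base_t_states


print(wait_state_penalty(4, 1))   # Z80: 4 T states stretched to 5
print(wait_state_penalty(2, 1))   # 386: 2 T states stretched to 3
```

The 4 T state machine sees each cycle grow by 25%, the 2 T state machine by 50%.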
Since we use
waits simply as a way to save money by using cheap, slow memories, you can
avoid this performance hit by using cache RAM. Cache is a smallish chunk of
very fast (read, expensive) RAM that runs with zero wait states. Given that
computers often run in small loops, smart hardware can track these loops,
keeping the most recently-used parts of the code in very fast cache. Any access
outside of the cache will incur a wait state, but, if cleverly implemented,
better than 90% "hit" rates are not uncommon.
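The payoff from a cache can be estimated with a weighted average. This sketch assumes a very simple model: cache hits run with zero wait states, every miss pays a fixed number of waits, and other costs of a miss are ignored. The function name and parameters are illustrative.

```python
def average_t_states(base_t_states, wait_states, hit_rate):
    """Average T states per bus cycle with a cache in front of slow RAM.

    Hits cost base_t_states; misses cost base_t_states + wait_states.
    """
    miss_rate = 1.0 - hit_rate
    return base_t_states + miss_rate * wait_states


print(average_t_states(2, 1, 0.90))   # two T state CPU, 90% hit rate
print(average_t_states(2, 1, 0.0))    # no cache hits at all
```

Under these assumptions a 90% hit rate cuts the average cost of one wait state on a two T state machine from a full extra T state to about a tenth of one.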
Your 33 MHz 486
most likely has a quarter meg or so of 25 nsec cache (very costly RAM), yet
lives very happily with many MB of cheap 70 nsec DRAMs running with one or
two wait states. We'd all like 32 MB of fast RAM; not many of us can afford
it.
Conclusion
Software folks
are well aware that very simple instructions, executed at a mind-boggling
rate, give the complex actions of a sophisticated program. The hardware is
no different. Simple T states build machine cycles which result in instruction
execution. Each component is trivial, but when repeated millions of times
per second makes a computer the wonderful widget it is.
I leave you with
one thought: when in the throes of fighting some nasty, intransigent bug,
when the CPU seems to have a malicious mind of its own, remember it's only
a very simple machine. You are smarter than it is.