What Happens at 100 MHz?
Copyright 1996, Jack G. Ganssle
Abstract
As CPU speeds
climb towards infinity, our debugging strategies must change. Here are some
ideas.
In "From
the Earth to the Moon" Jules Verne painted a wonderful image of the battle
between cannon manufacturers and armor makers. Armorers always operate in a
catch-up mode, trying desperately to build something resistant to the newest
speedy shell. When a group builds a gigantic cannon to shoot a man to the moon
the armorers all but revolt, complaining that nothing could keep up with
this latest technological advance.
Every time I
pick up an electronics magazine I'm reminded of this story. The chip vendors
announce in glittering prose their latest masterpiece of silicon speediness,
mostly without a comment about how the poor designer should develop code for
it. In real life I run a business that makes emulators, and am left feeling
a bit like those poor armorers of old.
At some point
- 100, 200, or surely by 300 MHz - the tools we have relied on for all these
years just won't keep up.
Consider my old
favorite, the in-circuit emulator (ICE). Some sort of hardware replaces the
CPU in the product you're working on. Electronics - a lot of electronics -
uses either the same sort of CPU, or a special "bond-out" version,
to control your system.
At reasonable
speeds this is probably the best way to get visibility into your code and
hardware. The ICE is intrusive, in that it can toggle bits and exercise your
memory and I/O independent of the execution of your code. In other words,
you can stop the code and examine the state of every part of the system. Conversely,
a decent ICE runs your code non-intrusively. It's as if the emulator is the
world's most expensive microprocessor, running your code exactly as it should.
Perhaps breakpoints are pending or trace is armed; these debug features either
have no impact on the code, or at least none till they take effect.
This dual identity
- intrusive and non-intrusive behavior - comes from two characteristics of
the emulator. The first is bus isolation: the microprocessor in the ICE is
physically separated from your target system by a data bus buffer. Turn it
on, and every bus cycle generated by the CPU on the ICE's pod is mirrored
to your own electronics. The emulator does this to access a target resource.
Turn it off,
and the CPU can run ICE-specific code that your target never sees. Perhaps
the emulator's hardware/firmware is extracting the contents of the registers
so you can see these in a window on your terminal. Maybe it's getting ready
to read one byte from your target.
The second characteristic
is simply the sad fact that features (breakpoints, trace, and the like) have
a cost. An ICE uses quite a bit of electronics that has to be mounted somewhere,
somehow. As you're not likely to design an emulator into your product, we
emulator vendors design them as instruments you plug into the CPU socket.
At low speeds
there's no problem. Once the clock rate starts to climb to the crazy levels
we're seeing on the horizon, the physical size of the system - a pod, a connection
via (perhaps) an adapter to your target - becomes large compared to the distance
a signal can travel in one clock cycle. Every connector, connection, and PC board
track imposes a significant additional delay as the signal propagates from your
target up to the guts of the ICE.
Signals propagate
at around 2 nsec per foot in wire. We just can't push them faster. Perhaps
with new understandings of quantum mechanics we'll one day exploit the "spooky
action at a distance" (Einstein's words) beyond-lightspeed collapse of
the quantum wave function to transmit data between the target and ICE infinitely
fast. Since this seems unlikely in our lifetimes, we'll have to bow to reality
and develop alternative approaches.
The previously-mentioned
data bus buffer is a source of headaches as well. The very fastest devices
need over 2 nsec to switch directions. If you add perhaps another 2 nsec for
prop times through the connections, 40% of the 10 nsec cycle of a 100 MHz one T-state
system is eaten up just moving the data between target and ICE.
Things degenerate
quickly at higher speeds. A 200 MHz single T-state CPU has a 5 nsec bus cycle.
Though one can argue that cache designs reduce external bus rates to more
reasonable levels, few designers are willing to give up breakpoints in cache,
or real time trace, just to make life easier for the tools.
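The arithmetic behind these figures can be sketched in a few lines of C. This is a rough illustration rather than a timing model: the 2 nsec buffer turnaround and 2 nsec propagation numbers come from the text above, while the function names and the choice of picoseconds as the unit are assumptions made for the example.

```c
#include <assert.h>

/* Bus cycle time in picoseconds for a CPU that needs t_states clocks
   per bus cycle. One clock at 1 MHz lasts 1,000,000 ps. */
long bus_cycle_ps(long clock_mhz, long t_states)
{
    assert(clock_mhz > 0 && t_states > 0);
    return t_states * 1000000L / clock_mhz;
}

/* Percentage of one bus cycle consumed by the emulator's data bus
   buffer plus the propagation delay through pod and connectors. */
long ice_overhead_percent(long clock_mhz, long t_states,
                          long buffer_ps, long prop_ps)
{
    long cycle = bus_cycle_ps(clock_mhz, t_states);
    return (buffer_ps + prop_ps) * 100L / cycle;
}
```

With 2000 ps for each delay, `ice_overhead_percent(100, 1, 2000, 2000)` works out to 40, and the same call at 200 MHz works out to 80: four of the five nanoseconds are gone before the target's memory has done anything.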
So, like Chicken
Little, I'm predicting that the roof will eventually fall in on high speed
embedded design. There are some options. First, though, it's important to
consider what very high speed embedded systems imply.
More Speed Now!
I'm told that
a number of embedded apps will always push the performance envelope. V.34
modems apparently already need 30+ MHz 8 bit CPUs. Disk drives will eat every
bit of performance available, as the ability to suck data from the platter
in real time reduces disk cache memory sizes... and memory is expensive.
Intel preaches
the virtues of raw horsepower to reduce system costs by eliminating the need
for external DSP chips in fax machines, cellular phones, and other communications
devices.
Real time compression
and encryption needs ever faster processors, especially as data comm rates
continue to increase. Of course, if the Feds have their way, we'll always be
limited to inadequate 64 bit keys (they'll have a copy tucked away, just in
case), which won't demand as much CPU performance as the 1024 bitters so many
of us use now via PGP.
Inexpensive 16
bit CPUs at speeds of 40 MHz are available today. 8 bitters passed 20 MHz
years ago. Raw clock rate specifications, though, mean little.
Though the chip
vendors are excruciatingly honest about specifying their clocks as real CPU
bus rate, many designers still don't understand that the crystal frequency
may be twice the bus rate. Most processors divide the crystal by two. That
40 MHz crystal, then, may be driving your processor at a more sedate 20 MHz.
It's hard to
build very high frequency crystals, so many of the speediest CPUs divide the
input by 1. Some don't accept a crystal. On the 386/486/Pentium, for example,
a single clock pin accepts a perfect TTL waveform only, generated by an external
oscillator.
Others multiply
the input using a phase locked loop. Many of Motorola's chips can use a wristwatch
crystal - 32768 Hertz - to run at their full rated 16 MHz. This has two advantages.
Watch crystals are cheap and very small. Even better, the 32.768 kHz input
is ideal for tracking the time of day in the real time clock modules included
on many of these parts.
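The relationship between crystal frequency and bus rate is simple enough to sketch in C. The function names here are hypothetical, and the multiplier of 512 is an assumption picked so the result lands near 16 MHz; check the data sheet for the factor a real part's PLL actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Bus rate for a CPU that divides its crystal input (divisor 2 is the
   common case described above; divisor 1 for the speediest parts). */
uint32_t bus_hz_from_divider(uint32_t crystal_hz, uint32_t divisor)
{
    assert(divisor > 0);
    return crystal_hz / divisor;
}

/* Bus rate for a CPU that multiplies a slow crystal with a PLL. */
uint32_t bus_hz_from_pll(uint32_t crystal_hz, uint32_t multiplier)
{
    return crystal_hz * multiplier;
}

/* 32768 is 2^15, so shifting a free-running tick count right by 15
   bits yields whole seconds with no division at all. */
uint32_t seconds_from_watch_ticks(uint32_t ticks)
{
    return ticks >> 15;
}
```

A 40 MHz crystal through a divide-by-two gives 20 MHz, and 32768 Hz times the assumed 512 gives 16,777,216 Hz. The last function shows why the watch crystal suits a real time clock: the tick rate is an exact power of two.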
Count on continued
confusion when comparing raw bus rates. Only the newest and fastest parts
do anything useful in a single clock cycle. The 68332, for instance, needs
3 clocks to read anything, including an instruction, from memory. The 186
needs 4. Zilog's Z180 is a bigger/faster/better version of the Z80; it uses
a 3 T-state bus - one less than the Z80 - immediately cutting bus cycle time
by 25%.
One of the promises
of RISC is to simplify instruction sets to the point where each can run in
one cycle. CISC machines are evolving in this direction as well, as we see
with the Pentium and other high-end speed demons.
A one T-state
processor running at 50 MHz needs only 20 nsec to read memory, assuming there
are no setup/hold time requirements. More common embedded CPUs require 2,
3 or 4 T-states, increasing basic machine timing to 40 to 80 nsec. Clearly,
when the holy grail of performance is the only consideration, a one T-state
machine is the way to go.
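A short C sketch makes the T-state comparison concrete. It ignores setup and hold times, as the paragraph above does, and the function names are illustrative only.

```c
#include <assert.h>

/* Time for one memory read in nanoseconds: each T-state lasts one
   clock period, which is 1000/clock_mhz nsec. */
long memory_cycle_ns(long clock_mhz, long t_states)
{
    assert(clock_mhz > 0 && t_states > 0);
    return t_states * 1000L / clock_mhz;
}

/* Percentage of bus cycle time saved by moving from old_t to new_t
   T-states at the same clock rate. */
long cycle_saving_percent(long old_t, long new_t)
{
    assert(old_t > 0 && new_t > 0 && new_t <= old_t);
    return (old_t - new_t) * 100L / old_t;
}
```

At 50 MHz this gives 20 nsec for one T-state, 40 nsec for two, and 80 nsec for four. Dropping from a 4 T-state bus to a 3 T-state bus at the same clock saves 25% of every bus cycle.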
There's a bit
of a dirty secret though: few of us can afford large amounts of zero wait
state memory when the entire machine cycle races by in 20 nsec (or 10 nsec,
at 100 MHz). The solution is cache, which is a bit of very expensive but very
fast RAM. All off-cache accesses use one or more wait states, greatly slowing
the system down (a single wait state on a 1 T-state machine halves system
throughput on that cycle).
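The cost of wait states is easy to estimate. The sketch below is an assumption-laden illustration: the function names are made up, and the 90% cache hit rate in the usage note is a number chosen for the example, not a measurement.

```c
#include <assert.h>

/* Time for one access in nanoseconds when wait states stretch the
   cycle: each wait state adds one more clock period. */
long access_ns(long clock_mhz, long t_states, long wait_states)
{
    assert(clock_mhz > 0 && t_states > 0 && wait_states >= 0);
    return (t_states + wait_states) * 1000L / clock_mhz;
}

/* Average access time in nanoseconds given the percentage of accesses
   that hit the cache and the times for a hit and for a miss. */
long average_access_ns(long hit_percent, long hit_ns, long miss_ns)
{
    assert(hit_percent >= 0 && hit_percent <= 100);
    return (hit_percent * hit_ns + (100L - hit_percent) * miss_ns) / 100L;
}
```

On a 100 MHz one T-state machine a cache hit takes 10 nsec and a miss with a single wait state takes 20 nsec, twice as long. If 90% of accesses hit the cache, the average is 11 nsec; every point of hit rate lost pushes that figure toward 20.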
Cost sensitive
embedded systems simply must minimize memory expenses. Those extra T-states
start to look pretty attractive when you've got a tiny budget for computer
hardware. Thus, memory cost has been a limiting factor on the speed of most
embedded systems.
AMD pulled a
neat trick with their Am186EM, which will give zero wait state accesses to
70 nsec RAMs at CPU bus speeds of 40 MHz. It's not the fastest part in town,
but the complete system cost sure is attractive.
Speed Cometh
Despite the memory
problem, the very fastest embedded systems are now using, or will soon use,
single T-state processors at outrageous clock rates. The applications listed
above all need the best performance, and demand low costs to boot. Once I
would have said these were the pathological cases, the ones with little impact
on what most of the industry is doing. Perhaps this is changing.
Very high speed,
very high volume, embedded systems are now becoming possible due to the excess
fab capacity of the chip vendors. Many are actively designing CPUs that will
be used only by a single customer - say, a laser printer sold in enormous
quantities via the discount mail order computer houses. Speed, as a replacement
for complicated electronics or memory, in high volume applications is starting
to make economic sense.
It's not entirely
clear how we'll deal with expensive memory. Clearly on-board cache will be
ever more common. Just as clearly, memory costs do follow a descending curve,
though it always seems to lag the needs of the speediest processors. Try buying
very fast static RAM today - it's practically all on allocation due to the
enormous demands for fast external cache for the PC industry.
Since the rest
of the industry seems to live with the spin-offs (or, perhaps the cast-offs)
of the racehorses, I suspect that as time goes by more and more embedded systems,
even those produced in medium to low volumes, will use these very fast parts
now being developed for single applications.
"Wait a
minute," you exclaim, "4 and 8 bit CPUs account for the lion's share
of the embedded processors sold each year. Most run at pathetically slow speeds."
True. But take
a closer look at that 4 bit market. Most are custom or semi-custom parts tuned
to a particular application - an appliance or TV remote control. These parts
already herald some of the problems we're bound to see. Though speed isn't
an issue, tools certainly are. There cannot be a healthy, competitive tool
market for a processor used by a single customer. Engineers are developing
at the low end of the market using heroic efforts, not the latest in technology
aids.
Yes - the 8051,
Z80, 6805, and other workhorses of the 8 bit arena will never clock at hundred
MHz rates, so a significant chunk of low end systems will always be slow.
I simply contend that time will create more applications for embedded systems;
that these will require ever more horsepower; and that many will run at breakneck
speeds or with custom parts that have no decent tool support.
As designers,
and as an industry, it's time to start coming up with development strategies
for the future. We're fast approaching the end of our ability to tune and
drag current development techniques along with the evolving direction of modern
microprocessors.
Spoiled Rotten?
Maybe we've been
spoiled. Real time trace, hardware performance analysis, complex breakpoints
- all contribute to easing the pain of development. All help us get our product
to market faster.
Yet, very few
programmers use these sorts of tools. Most developers write non-embedded code
that runs on PCs and workstations. Somehow they manage without the cool stuff
we demand. Though most of these applications don't service interrupts and
respond to real time events, some do. This, folks, is the future of our industry.
Don't get me
wrong: I believe that programmers are a very expensive resource; time to market
is critical. Any tool that increases efficiency is worth its weight in gold.
Too many companies short-change developers with lousy equipment and noisy
cubicles while doling out millions in salaries.
Raw CPU speed
alone will make traditional embedded tools an impossibility. On-board cache,
semi-custom ASICs with on-board processors, and superscalar designs will seal
their fate.
Motorola's Background
Debug Mode (BDM) is a glimpse into the future. All of their more recent parts
include a special serial port used only for debugging. Transistors are cheap
- it makes sense to integrate a few extra onto the processor as a special debug
port.
Similarly, many
other vendors are putting variants of JTAG (IEEE 1149.1) ports onboard their
fastest CPUs. Like the BDM these are all special serial interfaces just for
the sake of debugging code (and, perhaps, in-circuit test of production boards).
A debugger on-board
the chip eliminates all speed issues. It's cache-independent. Even when the
CPU is hidden in a huge ASIC, if just a few pins come out for the serial debugger,
then designers will have some ability to troubleshoot their code.
JTAG/BDM lets
you set simple breakpoints, single step, dump and change memory and I/O...
in short, everything you can do with a normal PC design environment, like
Microsoft's Visual C++.
The downside
is what we'll lose. Real time trace is all but out - you'll never find a chip
with megabytes of fast on-board RAM just to make debugging easier. Yes, some
chip vendors are experimenting with small on-board trace memories and other
clever approaches to give some sort of real time visibility to the code, but
all of these at best are compromises. Real performance analysis, overlay RAM,
complex breakpoints, and all the rest will be history.
And so, I predict
that high speed embedded design will get even harder, limited by physical
properties of the chips. We'll have all of the debug capability used by non-embedded
programmers, but will lose the neat stuff we rely on for real time design.
The "good"
news (for troops in the trenches) - increased job security. Development will
be harder and take longer, requiring more skilled people.
The real battles
will heat up in software tools. Source debuggers that drive the JTAG port
will be our primary tool. New, cool ways of tracking real time events will
be invented. Perhaps we'll instrument our code with calls to track execution
time. Simulation may finally come into its own as an adjunct to conventional
debugging.
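Instrumenting code to track execution time might look something like the C sketch below. Everything here is hypothetical: the `probe_t` type and the function names are invented for illustration, and the standard library's `clock()` stands in for whatever free-running hardware timer a real target would provide.

```c
#include <assert.h>
#include <time.h>

/* One probe per routine being measured. */
typedef struct {
    clock_t start;        /* timestamp taken on entry */
    clock_t total;        /* ticks accumulated over all calls */
    unsigned long calls;  /* number of completed measurements */
} probe_t;

/* Call at the top of the routine being timed. */
void probe_enter(probe_t *p)
{
    assert(p != 0);
    p->start = clock();   /* on a target: read a hardware timer here */
}

/* Call just before the routine returns. */
void probe_exit(probe_t *p)
{
    assert(p != 0);
    p->total += clock() - p->start;
    p->calls++;
}

/* Mean time per call in seconds, or 0.0 if nothing was measured. */
double probe_mean_seconds(const probe_t *p)
{
    assert(p != 0);
    if (p->calls == 0)
        return 0.0;
    return ((double)p->total / CLOCKS_PER_SEC) / (double)p->calls;
}
```

Bracket a routine with `probe_enter()` and `probe_exit()`, then read the results back through the debug port once the system stops. The probes themselves cost time, so the numbers are an approximation of the uninstrumented code, which is exactly the kind of compromise the loss of real time trace forces on us.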
Stay tuned. Watch
the market; things will continue to change. The cannon folks are pulling ahead
of the armorers.