Perform
or Perish
Copyright 1991,
Jack G. Ganssle
Abstract
How do you deal
with software performance problems?
Published in
Embedded Systems Programming, March 1991
What processor
should you use in your next product? High tech marketers push dozens of common
CPU families, some with yet more dozens of
variants within the family. Do you need 8, 16, or 32 bits? How big of an address
space? How much on-board I/O? It's hard enough to pick a
processor for a simple controller when cost is the only concern. What about
more complex applications that make substantially more demands
on the computer's raw computational capability?
Processor performance
seems to be the Holy Grail of our business. The silicon wizards give us ever
more MIPs for less and less money. But
just how much is enough for the project at hand? Who can afford to select
a chip, design hardware and tens of thousands of lines of code,
and then discover that the system just can't keep up with real time events?
I don't know
how to prove that a particular processor will be adequate for a non-trivial
task. It's awfully hard to equate a less-than-
adequate design specification with a MIP rating. This is even tougher when
programming in a high level language, or when using a canned
real time operating system.
In real life
it seems that the hardware designers often pick the CPU based on cost and
peripheral compatability, and then tell the
programmers to "make it work". Somehow. Or, when software
folks get involved in the decision, the choice sometimes winds up
being made purely on the basis of experience. "The 6501 is in all
of our products; surely we'll have no trouble with it in this
one".
In fact, a lot
of projects eventually get into trouble by overloading the processor. This
is always discovered late in the development,
during debugging or final integration, when the cost of correcting the problem
is at the maximum. Then a mad scramble to remove machine
cycles begins.
We all know the
old adage that 90% of the processor burden lies in 10% of the code. It's important
to find and optimize that 10%, not some
other section that will have little impact on the system's overall performance.
Nothing is worse than spending a week optimizing the wrong
routine!
Fortunately,
performance analysis tools and techniques abound. These range from a head-scratching
session to $20k+ special purpose
instruments. Do use the right tools (whether brainpower or technology) to
find speed problems, and then optimize only where absolutely
necessary. Otherwise, you'll be like a blind man hunting for a contact lens.
Performance Tools
Dozens of vendors
stand ready to supply you with hardware-based performance analysis tools.
Some come as add-ons to emulators; others are
stand-alone products dedicated to the sole function of identifying performance
bottlenecks. All offer easy, real time capture of your
program's activity. The only (!) drawback is cost, which ranges from a few
thousands dollars for an ICE add-on to over $20k for a
dedicated instrument.
Hardware performance
analyzers come in two very different flavors. Statistical units sample the
target system's address lines every so
often, logging the address found at each sample. Presumably, after some number
of iterations a representative picture of the target's
operation will emerge. Non-statistical instruments capture every single machine
cycle issued by your program. These are more expensive
than their statistical counterparts, but give an accurate execution profile
of even single-shot events (like an infrequently-invoked
subroutine).
Any of these
performance analyzers are particularly good at prying unexpected information
from your program. For example, many 8 bit C
compilers generate vast quantities of code to handle automatic variables,
since the stack capabilities of the CPUs leave lots to be
desired. A hardware performance analyzer will instantly show up this overzealous
stack use as a peak in processor use.
Of course, not
everyone can afford the cost of such a special purpose instrument; it's surprising
just how many programmers still make do
with primitive "burn and hope" strategies, eschewing all
development aids. Since most people spend relatively little time
solving performance problems, it often makes sense to use cheaper, more readily
available tools.
Several companies
promote software based performance analysis. Paradigm's Inside and Borland's
Turbo Profiler are possibly the best known.
Both operate by instrumenting compiled code to track its execution.
Software based
performance analyzers are a great compromise between price and capability.
They offer high accuracy and simple operation at
an absurdly low cost. Unfortunately, the products mentioned operate only on
80x88 series processors, and both expect DOS services support.
This does limit their usefulness. Fortunately more and more embedded systems
use Intel-series CPUs or their NEC near equivalents. The wide
availability of inexpensive high quality compilers and assemblers are certainly
a compelling argument in their favor.
You can often
take advantage of software performance analyzers even if your embedded environment
does not have DOS resources. Most of the
code of any system is tied up in mathematical algorithms, screen handlers,
and the like. It makes sense to prototype and measure the
performance of all of these under MS-DOS before switching to the less stable
development hardware.
Any software
performance analyzer will interfere to some extent with the code's speed.
It's a little like Heisenberg's Uncertainty
Principle, except that instead of not knowing both the mass and momentum of
a particle, the measurement of routine speed skews the
system's real time behavior. Interrupt response, in particular, will be degraded.
Still, you can generally measure enough of a system's
response to get a pretty good idea where the biggest problems are. Tricks
and Techniques
Part of the fun
of embedded programming is creating clever windows into a system's operation.
It's awfully hard to extract any sort of
information from most embedded systems; performance data is especially challenging.
Even if you have
access to the latest and most expensive marvel of performance measurement
it makes sense to understand and use other,
perhaps less sophisticated, ways to get the same or similar data. Always strive
to expand your arsenal of tools!
Pure software
types always think I'm a little daft whenever I say this, but the oscilloscope,
colloquially known as the scope, is one of
the most useful debugging tools. After all, in the embedded business our code
ultimately just pushes bits out I/O ports; scopes are a
natural choice for measuring the timing of these signals. A scope is the ideal
low-cost performance measurement tool.
Any decent scope
shows the state of two independent signals simultaneously, as a graph of amplitude
(i.e., voltage) versus time. Software
people don't really care about the vertical axis calibration since any signal,
regardless of voltage, is (for our purposes) as good as any
other. The horizontal axis shows time, the crucial component needed in effective
performance measurements.
Suppose you wrote
an interrupt service routine in C. If more than a handful of lines, you'll
probably have no idea how long it runs. Yet,
an ISR that is too slow will never keep up with real time events. Modify the
routine to set a spare parallel bit high when invoked, and
drive it low just before exiting. Then, connect a scope to that bit and measure
the duration of the pulse. With only a few seconds work
you'll know just how long the ISR executes.
Also measure
the time between pulses. This is the time available for all of the other code.
If the ISR runs for 50 out of every 100
milliseconds, then it uses 50% of all CPU resources, a rather scary amount.
If, in this case,
you happen to know the interrupt occurs every 25 milliseconds, then this simple
measurement immediately indicates that
the ISR takes so long to run that the system is missing interrupts. This is
profound, as a missed interrupt usually spells disaster in
real time systems. A scope gives the information in a heartbeat; few other
tools will.
You can time
any routine this way. Of course, if (gasp!) your routines use multiple entry
and return points then you'll have to add a lot
more bit setting and clearing statements. This is yet one more argument in
favor of a single return at the end of each routine.
A variation on
this theme is using the scope to measure "delta time" -
the period between two events. Suppose you are concerned
about how much time elapses between an interrupt being asserted and some other
action happening. For example, if the interrupt signals the
receipt of a "conversion complete" signal from and A/D converter,
how long does it take the code to read data from the A/D? If
the analog input varies continuously, fast A/D reads are probably crucial.
Put one scope channel on the interrupt signal and the other on
the A/D chip select pin. Read the difference in time between the two scope
traces.
These techniques
measure execution times of individual routines, or the latency of a single
interrupt. With only two channels and no
memory, the scope doesn't identify the time spent in individual lines of C
within a routine. Be clever! Once again, add code to drive a
bit high only during the code fragment's execution. Connect one scope channel
to the bit, and use the other to probe addresses generated
by the CPU. Look at one address bit at a time, making a paper record of when
addresses change relative to the routine's start pulse. Given
a short subroutine, it's not too hard to make a map of its execution.
Logic analyzers
are almost as useful for performance measurements. Suppose the system runs
a "main loop" in 100 milleseconds or
so. Connect the analyzer to the system's address bus and command it to acquire
one sample every 100 microseconds. 1000 acquisitions after
starting, the logic analyzer's internal memory will contain a sort of statistical
profile of where the code runs. It might miss very short
routines, but it will give you a high level picture very quickly.
You can get a
more refined look at the operation of any section of code by telling the analyzer
to "trigger", or start
collecting, when a certain address is sensed. Crank up the acquisition rate
for finer details. If the analyzer can store 1000 cycles, then
with a 10 microsecond rate you'll get a pretty good profile over 10 milliseconds
of execution. In the example of monitoring an interrupt
service routine, trigger collection on the ISR's start address.
If a scope or
logic analyzer isn't your cup of tea, consider connecting a PC to your system
via a fast serial or parallel port. You can
get some crude idea of performance levels by having every routine transmit
a unique code when invoked. Then, write a program to sift
through the data and extract performance parameters. Real time response might
be sacrificed, but again, you'll get an idea of what the
system is doing. What Then?
Usually performance
problems fall into one of three categories: excessive compiler overhead, crummy
code, or slow algorithms.
Modern C compilers
generate high quality code. Most overhead problems come from weaknesses in
the processor's architecture, like poor
stack handling abilities. Recently I saw this graphically when comparing real
time trace data on a Z80 to the original C source: ++I (with
I a 16 bit signed integer) generated about 25 instructions. Changing I from
an automatic to a static variable cut out 75% of the overhead.
Crummy code problems
come mostly from programming without thinking. Don't force the compiler to
optimize your program for you. Fold
constants yourself, remove unneeded code from loops, and minimize conversions
between types. This is just good programming style. Keep
algorithms and routines to a single page, so you can understand what is really
going on.
Given reasonable
code, don't expect huge performance improvements from tuning a line here or
there. Consider replacing a slow algorithm
with one better suited to the application. For example, never use a Taylor
series to compute transcendental functions. Many reference
books (such as Computer Approximations by Hart) contain much faster converging
series expansions. Or, when precision is not a problem but
speed is, do a fast table lookup. Close the Loop
How do you know
if a particular processor will be adequate for a project? Many of us work
for companies that provide a standard product.
New versions often use the old code, or are based on algorithms and techniques
developed on a current product. Consider instrumenting the
current code to measure exactly how much free processor time exists. Use a
real performance analyzer or one of the methods I've described.
Closing the feedback
loop! When any project is complete, spent a day learning about what you did.
Measure the performance of the system to
see just how accurate your processor utilization estimates were. The results
are always interesting and sometimes terrifying. If, as is
often the case, the numbers bear little resemblance to the original goals,
then figure out what happened, and use this information to
improve your estimating ability. Without feedback, you work forever in the
dark. Strive to learn from your successes as well as your
failures.
Track your system's
performance all during the project's development, so you're not presented
with a disaster two weeks before the
scheduled delivery. It's not a bad idea to assign CPU utilization specifications
to major routines during overall design, and then track
these targets like you do the schedule. Avoid surprises with careful planning.