The Zen of Diagnostics
Copyright 1990, Jack G. Ganssle
Abstract
This is the first of a two-part series about adding diagnostics to your programs.
Published in Embedded Systems Programming, June 1990
For some inexplicable
reason most of us embedded programmers rarely concern ourselves with making
a product manufacturable. Sure, we work
like slaves meeting performance goals, but the product must do more than function
correctly - it must be designed to be producible. While
our hardware brethren tune their designs to meet cost, manufacturing, and
performance goals, we work in relative isolation, a sort of
black hole where few dare to tread.
In high school
I worked for a while in a machine shop, serving as the lowest sort of helper
to highly skilled machinists making parts for
the space program. When they discovered I intended to go to college to become
an engineer, one grizzled veteran warned me not to
"design something we can't make". Now, twenty years later, I have
to admit that these words, so casually and freely given, have
been more important than most of the EE courses I struggled through later.
Designing the most fantastic widget ever conceived is suicidal
if it cannot be made and marketed at a profit.
In a typical
manufacturing operation, boards are stuffed, assembled into units, tested
and perhaps repaired. While the code doesn't impact
board stuffing and assembly operations, it can strongly influence product
test and repair. Smart designers will produce a product that
easily fits into the company's manufacturing operation; software engineers
can contribute by writing code that speeds the daily grind of
production test.
All employees
are hopefully working towards common corporate goals, yet each has a different
vision of the company's needs and problems.
To a programmer the word "testing" conjures images of correctness
proofs, exhaustive software trials, and code coverage
analysis. A production person probably has never heard of any of these concepts;
he looks at testing as the daily routine of ensuring each
and every unit works correctly before being shipped.
Conventional
software test is a one-time event; once the product is complete it is over,
forever (well, would you believe...). Product
test goes on every day. Very complex products are tested and repaired by technicians
with little formal computer training. The best are
usually culled and assigned to work in engineering support, leaving production
with workers who may be skilled but who are certainly not
rocket scientists. As software engineers it is our responsibility to the company
to give the techs the tools they need to ship product. As
software managers, it is our responsibility to convince management that this
is an important and desirable goal.
Internal Diagnostics
Quite a few embedded
systems include diagnostics as part of the product's ROM to give a sort of
"go/no-go" indication without
using other test equipment. The unit's own display or status lamps show test
results.
Internal diagnostics
are worthwhile because they do give the test technician some ability to track
down problems. They're also an
effective marketing tool, giving the customer a (possibly false) feeling of
confidence in the integrity of the product each time he turns
it on.
Though internal
diagnostics are often viewed as a universal solution to finding system problems,
their value lies more in giving a crude
test of limited system functions. Not that this isn't valuable. Internal diagnostics
can test quite a bit of the unit's I/O and some of
the "kernel", or the CPU, RAM, and ROM areas.
The computer's
kernel frequently defies standalone testing, since so much of it must be functional
for the diagnostics to run at all. Most
systems couple at least the main ROM and RAM closely to the processor. The
result - a single address, data, or control line short prevents
the program from running at all.
It's easy to
waste a lot of time coding internal diagnostics that will never provide useful
information. They may satisfy vague marketing
promises of self-testing capability, but why write dishonest code? Realize
that internal diagnostics have intrinsic limitations, but that,
if carefully designed, they can yield some valuable information. Apply your engineering
expertise to the diagnostic problem; carefully analyze the
tradeoffs and identify reasonable ways to find at least some of the common
hardware failures. The first step is to separate the tests into
kernel (CPU, RAM, ROM, and decoders) and beyond-kernel (I/O) tests. Then consider
the most likely failure modes; try to design tests that
will first survive the failure, and second will identify and report it. Yes,
the kernel tests will never be very robust since lots of
hardware glitches will prevent the program from running at all. But, if carefully
designed, they'll really help your buddies in
production.
I/O tests can
run the gamut from a simple LED blinking routine to A/D and D/A loopbacks that
check converter linearity. I/O is just too big
a subject to address in a short article. Today's range of peripherals is
so mind-boggling that several large books might not adequately
cover the subject of testing even the most common devices. I won't attempt
to delve into a discussion of I/O tests here.
What portions
of the kernel should be tested? Some programmers have a tendency to test the
CPU chip itself, running through a sequence of
instructions "guaranteed" to prove that this essential chip is operating
properly. Witness the ubiquitous PC's BIOS CPU tests. I
wonder just how often failures are detected. On the PC an instruction test
failure makes the code execute a HALT, causing the CPU to look
just as dead as if it never started. More extensive error reporting with
a defective CPU is a fool's dream. Instruction tests stem
from minicomputer days, where hundreds of discrete ICs were needed to implement
the CPU; a single chip failure might only shut down a
small portion of the processor. Today's highly integrated parts tend to either
work or not; partial failures are pretty rare.
Similarly, memory
tests rely on operating memory to run - a high tech oxymoron that makes one
question their value. Obviously, if the ROM
is not functioning, then the program won't start and the diagnostic will not
be invoked. Or, if the diagnostic is coded with subroutine
calls, a RAM (read "stack") failure will prematurely crash the test,
before providing any useful information.
The moral is
to design diagnostics so that each test uses as few unproven components
as possible.
A carefully engineered
RAM test can be quite valuable. In a multi-ROM system, it (and all the diagnostics)
should be stored in the boot
ROM, preferably near the reset address. The low order address lines must operate
to run even a trivial program, and enough of the upper
ones must work to select the boot ROM; particularly in a small system with
undecoded address lines, try to write test routines that rely
on a minimal number of working address wires.
RAM tests commonly
write out a pattern of alternating ones and zeroes, read the data back, and
repeat the test using data that is the
complement of the first set. This amateurish approach reflects poor analysis
of the problem. Before writing code, put on your hardware hat
(or get some help from another member of the design team) and consider the
most likely failure modes. Tailor the test to identify all or
most of these problems. A typical list includes:
Address/data line shorts - Nine times out of ten these problems will crash the diagnostic. Sometimes RAM is isolated from the kernel by buffers; in this case post-buffer problems can be found.
No chip select - One or more signals may be needed to turn each device on. The chip select complexity can range from a simple undecoded address line to a nightmarish spaghetti of PALs and logic. Regardless, every RAM must receive a proper chip select to function.
Pin not inserted - Socketed RAM devices may not be properly seated. Sometimes a pin bends under the device.
Bad device - The semiconductor vendors do a wonderful job of delivering functioning chips. Rarely, though, a bad one may slip through (or, through mishandling, one may "toggle to the bad state"). Device geometry is now so small that it is unusual to see the pattern sensitivity that once plagued DRAMs. Usually the entire chip is just plain bad, making it a lot easier to identify problems.
Multiple addressing - This is a variant of the chip select problem. If more than one memory device is used, several can sometimes be turned on at once.
Refresh - Dynamic RAMs require periodic accesses to keep the memory alive. This Refresh signal is generated by external logic, or by the CPU itself. Sometimes the refresh circuitry fails. The entire memory array must be refreshed every few milliseconds to stay within the chips' specifications, but it's surprising just how long DRAMs can remember their data after losing refresh. One to two seconds seems to be the extreme limit.
With the above
in mind, we can design a routine to test RAMs. The first criterion is that
the test itself certainly cannot make use of RAM!
Several of the
failure modes manifest themselves by the inability to store data. For example,
data bus problems, a bad device, chip select
failures, or an incorrectly inserted pin will usually exhibit a simple read/write
error. The traditional write and read of a 55, followed
by an AA, will find these problems quickly.
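The logic of the 55/AA check is simple enough to sketch in a few lines of C. This is only a host-side illustration: the function name and the return convention are my own assumptions, and on a real target the routine would have to be hand-coded so that it lives entirely in CPU registers, since C compilers assume a working stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a 55/AA data test (hypothetical name and interface).
   Returns the offset of the first byte that fails to read back,
   or len if every byte holds both patterns. */
size_t ram_pattern_test(volatile uint8_t *base, size_t len)
{
    static const uint8_t patterns[] = { 0x55, 0xAA };

    assert(base != NULL);

    for (size_t p = 0; p < sizeof patterns; p++) {
        /* Fill the whole array with the pattern first... */
        for (size_t i = 0; i < len; i++)
            base[i] = patterns[p];

        /* ...then read every byte back and compare. */
        for (size_t i = 0; i < len; i++)
            if (base[i] != patterns[p])
                return i;
    }
    return len;
}
```

Because 55 and AA are complements, the two passes together drive every bit of every byte to both a one and a zero.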
Writing a pattern
of 55s and AAs tests the ability of the devices to hold data, but it doesn't
ensure that the RAMs are being addressed
correctly. Examples of failures that could pass this simple test are: a post-buffer
address short, an open address line (say, from a pin
not being inserted properly), or chip select failures causing multiple addressing.
It's important to run a second routine that isolates
these not-uncommon problems.
An addressing
test works by writing a unique data value to each location, and then reading
the memory to see that that value is still
stored. An easy-to-compute pattern is the low order address of the location;
at 100 store 00, at 101 store a 01, etc. This isn't really
unique, since an 8 bit location can only store 256 different values. If we
repeat the test, using the locations' high order address bytes
as the data pattern, then (for memory sizes to 64k) after two passes the entire
array will be tested uniquely. The first test ensures that
address lines A0 to A7 function correctly; the second checks lines A8 to A15
and also the chip select logic.
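To see why both passes matter, it helps to try the idea against a simulated memory. The sketch below is a host-side model, not target code: the sim_ram structure, its addr_mask field, and all function names are hypothetical, and an open address line is assumed to read as a constant 0 (a real floating line may behave less politely).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_SIZE 0x10000u  /* 64K: the most two byte-wide passes can cover */

/* Hypothetical simulated RAM. Bits cleared in addr_mask model address
   lines that never reach the chip (assumed here to read as 0). */
typedef struct {
    uint8_t  cells[RAM_SIZE];
    uint16_t addr_mask;    /* 0xFFFF means all 16 lines work */
} sim_ram;

static void ram_write(sim_ram *r, uint32_t addr, uint8_t value)
{
    r->cells[addr & r->addr_mask] = value;
}

static uint8_t ram_read(const sim_ram *r, uint32_t addr)
{
    return r->cells[addr & r->addr_mask];
}

/* 55/AA fill, for comparison: checks data storage only. 1 = pass. */
int sim_pattern_test(sim_ram *r)
{
    static const uint8_t patterns[] = { 0x55, 0xAA };

    assert(r != NULL);
    for (size_t p = 0; p < sizeof patterns; p++) {
        for (uint32_t a = 0; a < RAM_SIZE; a++)
            ram_write(r, a, patterns[p]);
        for (uint32_t a = 0; a < RAM_SIZE; a++)
            if (ram_read(r, a) != patterns[p])
                return 0;
    }
    return 1;
}

/* One addressing pass. shift = 0 stores the low address byte (A0-A7),
   shift = 8 stores the high address byte (A8-A15). 1 = pass. */
int ram_address_pass(sim_ram *r, unsigned shift)
{
    assert(r != NULL);
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        ram_write(r, a, (uint8_t)(a >> shift));
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        if (ram_read(r, a) != (uint8_t)(a >> shift))
            return 0;
    return 1;
}

/* Both passes together cover all 16 address lines. */
int ram_address_test(sim_ram *r)
{
    return ram_address_pass(r, 0) && ram_address_pass(r, 8);
}
```

With addr_mask set to 0xFFF7 (A3 open) the 55/AA fill still passes, but the low-byte pass fails. With 0xF7FF (A11 open) both the 55/AA fill and the low-byte pass succeed, and only the high-byte pass catches the fault.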
Since this test
also ensures that the RAM can store data, why do the 55, AA check? Consider
any individual address. At location 0, the
addressing diagnostic will write 0 (the low address) and later another 0 (the
high address). While addressability will be confirmed, some
doubt remains about its ability to store data. The 55, AA check tests every
bit.
The peril of
these diagnostics is that the address lines cycle throughout all of memory
as the test proceeds. If the refresh circuit has
failed, most likely the test itself will keep DRAMs alive. This is the worst
possible situation; the process of testing camouflages
failures. A simple solution is to add a long delay after writing a pattern
and before doing the read-back. This delay should be on the
order of several seconds. It is also important to constrain the test code
to a small area, so the CPU's instruction fetches don't create
artificial refreshes.
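The write, wait, read-back sequence might be sketched as follows. Everything named here is an assumption made for illustration: the delay is passed in as a function pointer so the logic can be exercised on a host, and the two stand-in delay routines and the demo_ram array exist only to simulate a healthy and a failed refresh circuit. On the target the delay would be a tight register-only loop lasting several seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*delay_fn)(void);   /* hypothetical delay hook */

/* Write a pattern, leave the array untouched during the delay, then
   read it back. Returns the offset of the first decayed byte, or len
   if the data survived. */
size_t ram_retention_test(volatile uint8_t *base, size_t len,
                          uint8_t pattern, delay_fn wait)
{
    assert(base != NULL && wait != NULL);

    for (size_t i = 0; i < len; i++)
        base[i] = pattern;

    wait();   /* on the target: several seconds with no array accesses */

    for (size_t i = 0; i < len; i++)
        if (base[i] != pattern)
            return i;
    return len;
}

/* Host-side stand-ins, for demonstration only. */
static uint8_t demo_ram[256];

static void delay_refresh_ok(void)
{
    /* Refresh is working: nothing happens to the data. */
}

static void delay_refresh_dead(void)
{
    demo_ram[10] = 0;   /* simulate one cell decaying while we wait */
}
```

Calling the test on demo_ram with delay_refresh_ok passes; with delay_refresh_dead it reports offset 10. What C cannot show is the other half of the advice: keeping the test code itself in a small address range so instruction fetches don't refresh the array.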
In the bad old
days of small DRAMs, manufacturing defects and alpha particles caused some memories
to exhibit pattern sensitivity problems;
selected cells would not hold a particular byte if a nearby cell held
another specific byte. Elaborate tests were devised to isolate
these problems. The "Walking Ones" test, in particular, burned an
enormous amount of computer time and could find really complex
pattern failures. Fortunately these sorts of problems just don't show up anymore.
Figure 1 is a
routine that performs all of these tests. It is cumbersomely coded in 8088
assembly language, using no RAM at all. Simpler,
prettier code using CALLs and RETs just will not be dependable, since it would
rely on the very RAM we're testing.
While it is easy
to see some justification for testing the product's RAM, ROM tests are perhaps
not as obviously valuable. If the ROM is
not working, how can it test itself? Physician - heal thyself (but not if
you're in a coma). As always, a completely dead kernel, one that
just doesn't even boot, cannot run diagnostics. If the boot ROM does at least
partially work, then some testing is valuable.
In the boot ROM
itself we can realistically expect to detect only a simple failure, like a
partially programmed device, although with some
luck it might be possible to find a shorted or open high order address line.
Luck, because if the line floats or is tied to the wrong
level, then the diagnostic code will not start.
ROMs are tedious
to program - sometimes technicians will unknowingly, in their impatience,
remove the chip before the programmer is
completely done. If you do elect to include a ROM test, be sure to locate
it early in the code so it stands a chance of executing even if
the ROM is not entirely programmed. It's easier to make an argument for ROM
testing in multiple ROM systems. If the boot ROM starts, then
diagnostics located in it can test all of the others.
While the memories
can fail in a number of ways, probably the most common is a mis-inserted pin.
If you've spent time troubleshooting
electronics, you'll know that it can be awfully hard to tell if all pins are
in the sockets. Other problems cover the usual range of
broken circuit board tracks (i.e., address, data, control lines), misprogrammed
devices, and non-functioning chip select lines.
One way to test
ROMs is to read the devices and compare each byte to a known value. Since
such redundancy is impractical, most programmers
simply compute an 8 or 16 bit sum of the data in the ROMs and compare it to
a known value. Usually this is adequate, but a number of
pathological cases will report incorrect results. For example, a long string
of zeroes will always checksum to zero, regardless of the
number of items summed.
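A quick sketch makes the weakness concrete. The function name is hypothetical and an 8 bit additive sum is assumed; a 16 bit sum behaves the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simple 8 bit additive checksum (hypothetical name). Overflow past
   8 bits is deliberately discarded. */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;

    assert(data != NULL);
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + data[i]);
    return sum;
}
```

Sixteen zero bytes and 256 zero bytes both sum to 0, so a ROM that reads back short, or reads all zeroes, can look fine. Addition also ignores order: the bytes 1, 2, 3 and 3, 2, 1 produce the same sum, so swapped data goes unnoticed.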
A much better
approach than a simple checksum is the Cyclic Redundancy Check (CRC). The
CRC is computed with a generator polynomial: a register is seeded, typically with
FFFF, and the polynomial is then divided into the input data (in this case the ROM data) a byte
at a time, using the remainder at each step as the new seed.
While mathematically complicated, the CRC is pretty easy to implement. Its
great virtue is that each byte of repeated strings (say,
zeroes) will yield a different CRC. The CRC is a bit harder to code than a
simple checksum, but the code listed in figure 2 is a cookbook
solution. Once again, it is written in 8088 assembly language to ensure it
uses the minimal number of as-yet-untested CPU resources.
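For readers who would rather see the arithmetic in a high-level language, here is a minimal C sketch of a 16 bit CRC seeded with FFFF. The article does not name a polynomial, so the common CCITT polynomial 0x1021 (processed most significant bit first) is an assumption on my part, as are the function names. It is not a translation of the assembly routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC16_POLY 0x1021u   /* assumed: CCITT polynomial, MSB first */
#define CRC16_SEED 0xFFFFu   /* the usual seed mentioned in the text */

/* Fold one byte into the running CRC, one bit at a time. */
uint16_t crc16_update(uint16_t crc, uint8_t byte)
{
    crc ^= (uint16_t)((uint16_t)byte << 8);
    for (int bit = 0; bit < 8; bit++) {
        if (crc & 0x8000u)
            crc = (uint16_t)((crc << 1) ^ CRC16_POLY);
        else
            crc = (uint16_t)(crc << 1);
    }
    return crc;
}

/* CRC a block of data; each byte's result seeds the next step. */
uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = CRC16_SEED;

    assert(data != NULL);
    for (size_t i = 0; i < len; i++)
        crc = crc16_update(crc, data[i]);
    return crc;
}
```

With these parameters the ASCII string "123456789" gives 0x29B1, the published check value for this CRC variant, and runs of zeroes of different lengths give different results, which is exactly the property the simple checksum lacks.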
A CRC or checksum
test is easy to code and yields useful information, but is sometimes a nuisance
to implement because the correct value
must be computed and stored in ROM. No assembler or compiler has a built-in
function to build this automatically, so generally you must
run the program under an emulator, record the CRC or checksum the routine
computes, and then manually patch the resulting value into an
absolute location in ROM. The only way to automate this is to write a short
program that CRCs the linker output file and patches the
result into the ROM file.
In Conclusion
It is best to
address the diagnostics problem using the engineering thought processes our
"significant others" have learned to
hate. Examine the system dispassionately and analytically, looking for all
possible failure modes. Study the tradeoffs. Look for alternate
approaches. Then, implement the best possible solution that does a reasonable
job of solving the problem yet doesn't take too much
programming time. No one said it would be easy.
Next month I'll
look at another aspect of diagnostics - how do you report an error? We'll
also look at external diagnostics that help out
the technician when the system won't even boot.



