Built-in Diagnostics
Copyright 1996, Jack G. Ganssle
Abstract
No system is useful unless it can be built in production. Add simple diagnostics.
Published in Embedded Systems Programming, April, 1996
I love small
companies. Nimble operations running with a lean staff operate less by email
and memo than by yelling across the room: "Joe, what version of the compiler
are you using?" Workers in the smallest of outfits wear many hats, from
that of software developer to production supervisor to chief maintenance engineer.
Compare this
to the corporate behemoth. Employees packed like sardines into a sea of cubicles
barely know their neighbors. Each is focused on one part of the company's
activities - perhaps getting one particular subroutine to run. Few engineers
see their product over its entire development cycle. Fewer still have the
opportunity to see the widget in production, or to work with the technicians
and assemblers who sweat 8 hours a day building the engineers' wonderful creation.
All employees
are hopefully working towards common corporate goals, yet each has a different
vision of the company's needs and problems. Too many developers, never exposed
to the harsh realities of the production floor, fail to understand that small
changes in the system's design can greatly reduce the time required to build,
test, or repair a product. Engineering is only done once (well, OK, we do
tend to iterate, sometimes forever, getting bugs out, but you get the idea),
yet production goes on day after day, for the life of the product. Remove
a few minutes of hassle from manufacturing and you've leveraged that few minutes
by the number of units built.
Hardware designers
know that pushing timing margins or ignoring electrical specs may result in
a system that works on the bench, but that will be unreliable in production
- and thus expensive to manufacture. In high volume situations they know how
to bring test points out or otherwise expose the circuits so automated test
systems can exercise every node.
Firmware folks
largely haven't learned this lesson.
To a programmer
the word "testing" conjures images of correctness proofs, exhaustive
software trials, and code coverage analysis. A production person probably
has never heard of any of these concepts; he looks at testing as the daily
routine of ensuring each and every unit works correctly before being shipped.
Very complex products are tested and repaired by technicians with little formal
computer training. The best are usually culled and assigned to work in engineering
support, leaving production with workers who may be skilled but who are certainly
not rocket scientists. As software engineers it is our responsibility to the
company to give the techs the tools they need to ship product. As software
managers, it is our responsibility to convince management that this is an
important and desirable goal.
Peripherals
I wrote about embedded monitors recently (February, 1996). If you've got a
spare serial port available, by all means include a simple monitor in your
code. By its nature firmware is an invisible black hole, barely understandable
or even visible except to the developer. A monitor is one very cheap, very
simple way to leave a back door in the code, to give future engineers and
production people access to the code itself and to the CPU and its resources.
You can buy,
borrow, or steal a monitor from numerous sources - the costs approach zero.
Even a simple monitor lets you change and examine memory and I/O. Giving the
hardware troubleshooter access to I/O can save him hours of work - entering
an input command to see what a port does is much simpler than trying to capture
the event on a logic analyzer. If you feel really generous with your time,
display the status of all system I/O in a table, converting cryptic hex statuses
to meaningful keywords. "Data ready" is a lot easier to understand
than "02".
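As a minimal sketch of that keyword table, the routine below turns a status byte into readable text. The bit assignments, the table name uart_status_bits, and the function decode_status are all hypothetical; substitute the bit definitions from your own hardware documentation.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bit assignments for an imaginary UART status port.
   Replace these with the names from your own hardware documentation. */
struct bit_name {
    uint8_t mask;
    const char *name;
};

static const struct bit_name uart_status_bits[] = {
    { 0x01, "transmitter empty" },
    { 0x02, "data ready" },
    { 0x04, "overrun error" },
    { 0x08, "framing error" },
};

/* Write a comma-separated list of the set bits into buf.
   Returns the number of named bits that were reported. */
int decode_status(uint8_t value, char *buf, size_t buflen)
{
    int count = 0;
    size_t used = 0;
    size_t nbits = sizeof uart_status_bits / sizeof uart_status_bits[0];

    if (buflen == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < nbits; i++) {
        if (value & uart_status_bits[i].mask) {
            int n = snprintf(buf + used, buflen - used, "%s%s",
                             count ? ", " : "", uart_status_bits[i].name);
            if (n < 0 || (size_t)n >= buflen - used) {
                buf[used] = '\0';   /* out of room: drop the partial name */
                break;
            }
            used += (size_t)n;
            count++;
        }
    }
    return count;
}
```

A monitor's "show status" command could call decode_status once per port and print the result next to the raw hex value, so the technician sees both.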
A monitor by
itself, however, is almost worthless without backup documentation. In these
days of high integration ASICs, PALs, FPGAs and the like, I/O is often buried
inside an impossibly complex circuit whose operation is far from obvious.
Don't toss in the monitor and tell the user "The I command reads a port."
You who have written this firmware must surely know, and must surely have
documented somewhere, what each port is and what each bit does. Pass this
information to the poor technician.
I have written
Visual Basic applications that drive a monitor and display the values
of ports in plain text. For example, Intel's 188 processor has dozens of internal
I/O ports. A bit of Basic code lets the novice technician select "UART
Status Port", and then see the setting of each bit in English (e.g.,
"data ready set" instead of "01"). If the production people
have access to a PC a simple application like this can shield them from the
bits and bytes of the machine, yet tell them the status of every peripheral
in a system.
Bear in mind
that the technicians will use a scope to isolate most problems. Design your
monitor to ease problem diagnosis with this tool. You need to do two things
to make scoping easy: allow every monitor command to be run repeatedly (scopes
are particularly good at looking at repetitive signals), and generate a trigger
pulse that syncs the scope to the command. This is a bit you toggle
simultaneously with, say, the Input and Output commands.
With these two
resources any competent engineer or technician can find most common board
problems by exercising I/O ports and tracking the signals throughout the board
with the scope.
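One way the repeating Input command with its trigger pulse might look is sketched below. It assumes memory-mapped ports reached through pointers; the structure scope_loop and the function input_repeat are invented for illustration, and on a port-mapped CPU you would substitute the appropriate input and output instructions.

```c
#include <stdint.h>

/* Hypothetical memory-mapped ports; names and layout are assumptions. */
struct scope_loop {
    volatile uint8_t *data_port;     /* port the technician wants to watch */
    volatile uint8_t *trigger_port;  /* output port holding the trigger bit */
    uint8_t trigger_mask;            /* bit wired to the scope's trigger input */
};

/* Read the data port `repeat` times. The trigger bit goes high just before
   each access and low just after it, so the scope can sync on every pass.
   Returns the last value read (0 if repeat is 0). */
uint8_t input_repeat(const struct scope_loop *s, unsigned repeat)
{
    uint8_t value = 0;

    while (repeat-- > 0) {
        *s->trigger_port |= s->trigger_mask;            /* trigger high */
        value = *s->data_port;                          /* the access to scope */
        *s->trigger_port &= (uint8_t)~s->trigger_mask;  /* trigger low */
    }
    return value;
}
```

The other bits of the trigger port are left alone, so a spare bit on an existing output port is enough.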
Power On Self Tests
Some systems include Power On Self Tests (POSTs) as part of the product's
ROM to give a "go/no-go" indication without using other test equipment.
The unit's own display or status lamps show test results. On the PC we see
a RAM test at boot time, and a sequence of beeps that tell us nothing, as
every vendor uses their own non-standard codes which are documented in that
manual we lost exactly a year before the silly thing broke.
Internal diagnostics
are worthwhile, though, because they do give the test technician some ability
to track down problems. They're also an effective marketing tool, giving the
customer a (possibly false) feeling of confidence in the integrity of the
product each time he turns it on.
Though internal
diagnostics are often viewed as a universal solution to finding system problems,
their value lies more in giving a crude test of limited system functions.
Not that this isn't valuable. Internal diagnostics can test quite a bit of
the unit's I/O and some of the "kernel", or the CPU, RAM, and ROM
areas.
The computer's
kernel frequently defies standalone testing, since so much of it must be functional
for the diagnostics to run at all. Most systems couple at least the main ROM
and RAM closely to the processor. The result - a single address, data, or
control line short prevents the program from running at all.
It's easy to
waste a lot of time coding internal diagnostics that will never provide useful
information. They may satisfy vague marketing promises of self-testing capability,
but why write dishonest code? Realize that internal diagnostics have intrinsic
limitations, but if carefully designed can yield some valuable information.
Apply your engineering expertise to the diagnostic problem; carefully analyze
the tradeoffs and identify reasonable ways to find at least some of the common
hardware failures.
What portions
of the kernel should be tested? Some programmers have a tendency to test the
CPU chip itself, running through a sequence of instructions "guaranteed"
to prove that this essential chip is operating properly. Witness the ubiquitous
PC's BIOS CPU tests. I wonder just how often failures are detected. On the
PC an instruction test failure makes the code execute a HALT, causing the
CPU to look just as dead as if it never started. More extensive error
reporting with a defective CPU is a fool's dream. Instruction tests stem from
minicomputer days, where hundreds of discrete ICs were needed to implement
the CPU; a single chip failure might only shut down a small portion of the
processor. Today's highly integrated parts tend to either work or not; partial
failures are pretty rare.
Similarly, memory
tests rely on operating memory to run - a high tech oxymoron that makes one
question their value. Obviously, if the ROM is not functioning, then the program
won't start and the diagnostic will not be invoked. Or, if the diagnostic
is coded with subroutine calls, a RAM (read "stack") failure will
prematurely crash the test, before providing any useful information.
The moral is
to design diagnostics so that each test uses as few unproven components
as possible.
Do test things
that may reasonably fail, yet that will not crash the CPU. Running an RTOS?
Some sort of timer will generate ticks for context switching. If the timer
is an external component write a bit of code that checks the interrupt it
generates. Does an external controller (like an 8259) sequence other interrupts?
Certainly test the device - few technicians have the knowledge needed to diagnose
a problem in such a complex part of the circuit.
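A timer check can be as simple as confirming that the tick counter moves. The sketch below assumes the timer's interrupt handler increments a counter; tick_count, timer_isr, and timer_is_ticking are hypothetical names, and the optional poll callback exists only so the routine can be tried on a host where no real interrupt fires.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Incremented by the (hypothetical) timer interrupt service routine. */
volatile uint32_t tick_count;

void timer_isr(void)
{
    tick_count++;
}

/* Spin for at most max_spins passes waiting for the tick count to change.
   poll may be NULL; if given, it is called once per pass (for example to
   kick a watchdog, or to simulate the interrupt in a host build).
   Returns true if a tick arrived, false if the timer looks dead. */
bool timer_is_ticking(uint32_t max_spins, void (*poll)(void))
{
    uint32_t start = tick_count;

    while (max_spins-- > 0) {
        if (poll != NULL)
            poll();
        if (tick_count != start)
            return true;
    }
    return false;
}
```

Choose max_spins to cover comfortably more than one tick period on your hardware; the right number depends on the CPU clock and is not something this sketch can know.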
Do check out
my August, 1995 column for ideas about testing ROM and RAM.
Seeding Memory
Few embedded systems use every last byte of ROM and RAM. Unused areas are,
well, unused - who cares what is burned into the last few bytes of a ROM?
You should. Unless
you're sure your code is perfect - absolutely, positively bug free - use these
empty areas wisely. Has your code ever wandered into an unused section of memory?
Has a hardware fault (e.g., bad address line) caused the code to crash after
executing a handful of instructions? Seed unused ROM with instructions that
permit you to trap these faults. You'll save yourself some time by catching
the software bugs, and the production folks will love the robust, fault-tolerant
nature of the code.
On the Z180/Z80
processors the RST7 instruction (a one byte call to location 0038) is cleverly
encoded as 0xff. Unburned ROM defaults to just this value. Write an error
trap handler at 0038 that toggles an LED or otherwise signals the failure.
Wandering code, for whatever reason, will flash the LED and immediately indicate
something is wrong.
Most Z180 systems
with an electronics failure that disables ROM will go into an infinite loop
executing 0xff instructions, which is instantly visible on a scope. The characteristic
double writes from the RST7's pushes tell an experienced scoper in a second
what is going on. Teach your production folks this simple trick.
The x86 family
has a similar single byte call - INT3 - which vectors through an interrupt
vector at 000C. Again, use this instruction as a default fill value for ROM,
and write an error processor.
Every embedded
system starts with a loop to set up initialized values in RAM. I like to precede
this loop with one that sets all of RAM to the INT3, RST7, or whatever. Then,
your odds of having a wandering program encounter the instruction are much
higher.
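The seeding itself is a one-line fill. The sketch below assumes the code sits in one contiguous block at the start of the image and that everything after it is unused; seed_unused and the two opcode macros are names made up for this example. The same routine serves for padding a ROM image on the development host or for filling RAM at startup.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define Z80_RST7  0xFF   /* one-byte call to 0038 on the Z80/Z180 */
#define X86_INT3  0xCC   /* one-byte breakpoint interrupt on the x86 */

/* Fill everything past the first code_size bytes of the image with the
   trap opcode. Assumes (for this sketch) that the used bytes are one
   contiguous block at the start of the image. */
void seed_unused(uint8_t *image, size_t image_size,
                 size_t code_size, uint8_t trap_opcode)
{
    if (code_size < image_size)
        memset(image + code_size, trap_opcode, image_size - code_size);
}
```

If you seed RAM this way, do it before the loop that sets up initialized variables, or the seed will overwrite them.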
Some of the hardest
problems to diagnose, either in development or in production test, are erratic
interrupts. If an interrupt controller fails or is misprogrammed it may assert
an incorrect vector on the bus. Again, preempt the problem by using a complete
interrupt table with entries for all possible interrupts, not just the ones
you are using. Vector unused, unexpected interrupts to an error handler.
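A complete table might be built along these lines. This is a sketch with a software table of function pointers; the table size, vector_table, unexpected_isr, init_vectors, and install_isr are all assumptions, and a real system would place the table wherever its CPU and interrupt controller expect to find it.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 256   /* assumed table size; match your CPU */

typedef void (*isr_t)(void);

/* Count of interrupts that arrived on vectors nobody claimed. */
volatile uint32_t unexpected_interrupts;

/* Default handler: record the event. A real one might also light an LED
   or save the vector number for the technician to read later. */
void unexpected_isr(void)
{
    unexpected_interrupts++;
}

isr_t vector_table[NUM_VECTORS];

/* Point every entry at the error handler before any real handler
   is installed, so no vector is ever left empty. */
void init_vectors(void)
{
    for (size_t i = 0; i < NUM_VECTORS; i++)
        vector_table[i] = unexpected_isr;
}

/* Install a real handler. Returns 0 on success, -1 on a bad argument. */
int install_isr(unsigned vector, isr_t handler)
{
    if (vector >= NUM_VECTORS || handler == NULL)
        return -1;
    vector_table[vector] = handler;
    return 0;
}
```

Call init_vectors first and install_isr afterwards for the handful of interrupts you actually use; everything else lands in the error handler.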
Obviously there's
no guarantee that wandering code will hit these unused locations. However,
seeding the code in this manner is a nice way to arm yourself to possibly
catch a latent problem. Teach your production people what to expect if such
an error takes place.
Watchdog timers
are another type of preventative medicine we often employ to detect failures.
Too many, though, are designed for the convenience of the programmer or customer,
with no thought to the poor electronics technician trying to repair things.
A watchdog timer that resets and restarts the code invisibly is a nightmare
- how can you tell if a system is operating marginally?
At the very least
be sure the watchdog toggles an external bit that a technician can scope.
Give the poor guy a chance to see that the system crashes occasionally!
Of course, a
better solution is to include the ROM monitor discussed above, with an error
word that logs watchdog timeouts and other problems. The repair folks can
then log onto the monitor and read a record of problems.
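The error word can be very small. In the sketch below the flag values, the structure error_log, and the function log_error are all hypothetical. It also assumes the log lives in RAM that survives a watchdog reset and that the startup code leaves alone, which rules out clearing or seeding that particular region.

```c
#include <stdint.h>

/* Hypothetical error flags; define whatever your system can detect. */
#define ERR_WATCHDOG    0x0001u
#define ERR_BAD_VECTOR  0x0002u
#define ERR_WILD_JUMP   0x0004u

struct error_log {
    uint16_t flags;           /* OR of every error seen so far */
    uint16_t watchdog_count;  /* number of watchdog timeouts, saturating */
};

/* Record an error. Watchdog timeouts are also counted, because a system
   that resets twice a day tells a different story than one that has
   reset once in a year. */
void log_error(struct error_log *log, uint16_t code)
{
    log->flags |= code;
    if ((code & ERR_WATCHDOG) && log->watchdog_count < UINT16_MAX)
        log->watchdog_count++;
}
```

A monitor command that prints both fields, with the flags decoded into words, gives the repair folks the record of problems in a form they can read.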
Modern processors
often have various sleep modes that power down all or part of the CPU at various
intervals. Neat idea, but take the time to educate your people that the processor
may be idle at different intervals. Just last week I spent an entire day trying
to find a problem with a system that seemed to crash erratically. Totally
baffled, I started timing the interval between crashes and found it to be
exactly 127 seconds, every time. The chip was powering itself down, as the
code never disabled the default sleep mode.
Conclusion
As a company, we're all in this together - right? Use your expert knowledge,
and your knowledge of everyone else's job (after all, we all should strive
to be high tech Renaissance Persons), to make the job of the techs in production
test and repair simply possible.
Be creative.
I wonder if there's a way we can put a TCP/IP link to a modem in our embedded
systems, to allow technicians a thousand miles from the customers' sites to
diagnose problems via the Internet.