The Tao of Diagnostics
Copyright 1990, Jack G. Ganssle
Abstract
Part 2 of a series about embedded diagnostics.
Published in Embedded Systems Programming, July 1990
A few weeks ago
I had our broken microwave apart in my workshop. It says something about our
business that one of my greatest fears is repairing a microprocessor-based
product; a simple chip failure often consigns an appliance to the landfill.
Fortunately, this was a simple case of the door microswitches not engaging
properly. After a bit of study, I discovered that they had to close in one
particular processor-monitored sequence, no doubt to prevent backyard mechanics
from bypassing them and getting fried. The correct sequence was (of course)
undocumented and difficult to adjust.
This is an all
too common example of poor embedded design. Every system should have some
provision for in-the-field repair. Software engineers have a responsibility
to make these adjustments easier. Why didn't the designers include a little
code to show what sequence the switches engaged in?
As I mentioned
last month, it's really impossible to say intelligent things about the huge
range of I/O used in embedded systems. But, for God's sake, let's use our
brains! Be sympathetic to the user's needs. Remember that the system will
fail, either in the field or in production test. Make it easy to isolate the
problem.
Like the microwave
oven example, most embedded systems interface to mechanical and electronic
sensors and actuators. Certainly the mechanical portions are prone to failure;
just as certainly analog I/O is subject to drift, noise, and other effects
that we digital people hate to acknowledge. The tests described last month
(and those you've invented in your never-ending quest to build a reliable
product) will help get the system to boot. The next step is to give the test
technician and end user a "back door" into a diagnostics suite.
Consider the
system's analog circuits. These components all exhibit slightly different
characteristics, so potentiometers are used to tune offsets and gains. Sometimes,
lots of pots are used. It's interesting to watch a test group calibrate these
sorts of instruments; frequently special test equipment is needed to monitor
the voltages during pot twiddling. Without this equipment these adjustments
simply cannot be made in the field. In most cases a bit of clever software
can take advantage of a panel display to replace the test gear. Write a bit
of code to show raw voltage, or whatever is being monitored, on the system's
own output device. Certainly you've already written low level routines to
get the data (for use in the main program); spend an afternoon writing a simple
diagnostic that calls this subroutine and formats the output.
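As a minimal sketch of such a diagnostic, the C fragment below converts a raw converter count to millivolts and formats one line for the panel. The 10-bit converter, the 5.000 V reference, and the names adc_counts_to_mv and format_adc_line are all assumptions made for illustration; your own low level routine would supply the raw count.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed hardware: a 10-bit converter with a 5.000 V reference. */
#define ADC_FULL_SCALE 1023u
#define ADC_REF_MV     5000u

/* Convert a raw converter count to millivolts, rounded, integer math only. */
uint32_t adc_counts_to_mv(uint16_t raw)
{
    if (raw > ADC_FULL_SCALE)
        raw = ADC_FULL_SCALE;
    return ((uint32_t)raw * ADC_REF_MV + ADC_FULL_SCALE / 2u) / ADC_FULL_SCALE;
}

/* Format one channel's reading for the panel: channel, raw count, volts.
 * Returns the number of characters snprintf wanted to write. */
int format_adc_line(char *buf, size_t len, unsigned channel, uint16_t raw)
{
    uint32_t mv = adc_counts_to_mv(raw);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "CH%u %04u %lu.%03luV",
                    channel, (unsigned)raw,
                    (unsigned long)(mv / 1000u), (unsigned long)(mv % 1000u));
}
```

With these assumptions a raw count of 512 on channel 2 formats as "CH2 0512 2.502V". Showing the raw count next to the scaled value lets the technician see whether a bad reading comes from the converter or from the scaling.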
Pots are a continuous
source of frustration to users. Think - can you come up with a better, self-calibrating
design? Try writing code that removes offset, gain, and other errors mathematically.
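One common way to remove offset and gain errors in software is a two-point calibration: read the input with two known references applied, store the readings, and map every later reading onto the straight line through those two points. The sketch below assumes that approach; the cal_t structure, the field names, and the millivolt units are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Calibration constants captured once, for example at production test. */
typedef struct {
    int32_t raw_lo;   /* raw reading with the low reference applied  */
    int32_t raw_hi;   /* raw reading with the high reference applied */
    int32_t true_lo;  /* actual value of the low reference, in mV    */
    int32_t true_hi;  /* actual value of the high reference, in mV   */
} cal_t;

/* Map a raw reading onto the line through the two calibration points.
 * This removes offset and gain error together; it does not correct
 * nonlinearity. */
int32_t cal_apply(const cal_t *c, int32_t raw)
{
    int32_t span_raw = c->raw_hi - c->raw_lo;
    int64_t num;

    assert(span_raw != 0);  /* the references must give different readings */
    num = (int64_t)(raw - c->raw_lo) * (c->true_hi - c->true_lo);
    return c->true_lo + (int32_t)(num / span_raw);
}
```

If the unit reads 12 counts at 0 mV and 830 counts at 4000 mV, a raw reading of 421 corrects to 2000 mV, with no pot in sight.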
The Scope
When you design
diagnostics for field or in-house use, be sure to bear in mind the sorts of
tools users will have available. One of the most useful troubleshooting tools
is the venerable oscilloscope. Most test and repair technicians rely on the
scope almost exclusively. Logic analyzers, emulators, and the other tools
used in engineering are not nearly so ubiquitous in the test environment.
Remember this when writing diagnostic code.
Yin and yang.
While the scope is the universally accepted troubleshooting tool, computer-based
systems are not really well suited to scope diagnosis. Digital events tend
to be wide (requiring many channels - like for addresses and data), or very
intermittent (a 1 microsecond event once per second). Even the most sophisticated
scope can't capture these signals without some help from the code. The solution?
Write diagnostics that run in repetitive loops, and be sure to toggle a bit
(say, an I/O port) at the start of each loop. The technician can trigger the
scope's sweep (i.e., start the trace at the left side of the screen) each
time the bit is asserted. This "scope trigger point" gives an essential
reference to the sequencing of events, in many cases making the scope as useful
as a logic analyzer.
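A looping diagnostic with a scope trigger point might be sketched in C as follows. The choice of bit 0 as the trigger, the function name scope_loop, and the host-side stand-ins sim_port and example_body are assumptions; on a real target the pointer would address an actual output port and the loop would normally run until reset.

```c
#include <stdint.h>

#define TRIGGER_BIT 0x01u   /* assumed: bit 0 is brought out to a test point */

/* Run a test body repeatedly, pulsing the trigger bit at the top of
 * every pass so the scope can start its sweep at a known point. */
void scope_loop(volatile uint8_t *port, void (*test_body)(void),
                unsigned long passes)
{
    while (passes--) {
        *port |= TRIGGER_BIT;            /* rising edge: trigger the sweep  */
        *port &= (uint8_t)~TRIGGER_BIT;  /* drop it so the next pass makes
                                            a fresh edge                    */
        test_body();                     /* the events the tech looks at    */
    }
}

/* Host-side stand-ins so the sketch can be exercised off the target. */
volatile uint8_t sim_port;
unsigned long body_calls;
void example_body(void) { body_calls++; }
```

The other bits of the port are left alone, so the trigger can share a port with real outputs.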
The best software
engineers regularly make use of scopes during initial code debugging. It's
amazing just how much information you can extract from the code by watching
event synchronization, port assertions, or even chip selects on a scope's
display. If you are not familiar with this valuable tool, have a hardware
guru give you a lesson in its use. Debugging embedded code is hard - take
advantage of every tool you can find.
Reporting Failures
I've always hated
the annoying beep my Macintosh makes on reset. Until this year, that is, when
the computer died with a dramatic belch of smoke. Where, exactly, was the
failure? The screen went blank - was the CPU dead? Could the power supply
have failed? But wait - on reset the computer still beeps! Power must be OK,
and the CPU is probably working. Indeed, it turned out the problem was localized
to the video circuits, and a $100 mail order board brought the computer back
to life. The once annoying beep saved an expensive trip to the Mac man.
Years ago Computer
Automation installed Go/Nogo LEDs on every board in their "Naked Mini"
(their name, not mine) computers. Like the Mac's
beep, these simple indicators save users a lot of grief. Nothing this simple
is foolproof, but even an 80% success rate is worthwhile.
Certainly systems
with CRTs or other alphanumeric displays can easily show lots of useful error
information. Working in C makes formatting output especially easy. Use these
resources, but don't depend on them. An awful lot of hardware and software
must work before even a single character can be displayed on a CRT; self test
routines should depend on the absolute minimum of functioning hardware.
Learn from the
automotive companies. Cars have a lot of sensors, all wired to an under-hood
computer. Dozens (at least) of potential failure nodes exist. Ford, GM, and
others let the mechanic put the computer into a self-test mode, and flag errors
by toggling one bit very slowly. The engineers cleverly realized that a voltmeter
is about all you can count on a mechanic having and understanding, so their
software drives the bit up and down so slowly that even a meter needle can
show the transitions. Error 51 might mean "failed PCV valve", and
is indicated by 5 needle deflections, a pause, followed by one more. What
could be simpler?
A LED is just
as effective and even easier to use. If the product is too cost sensitive
to include even a 50 cent LED, provide a place to
clip one on.
If you use a
LED rather than a voltmeter, then the flashes can be quite a bit faster. A
subroutine to show one digit of a code is simple, and typically takes the
following form:
Pseudocode:
    Set COUNT = number of flashes wanted
LOOP:
    Turn LED on
    Delay for 1/4 second
    Turn LED off
    Delay for 1/2 second
    COUNT = COUNT - 1
    Go to LOOP as long as COUNT is non-zero
Avoid using zeroes
as part of an error code. While zero might correspond to "no flash",
it is visually very confusing.
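In C, the same digit routine, extended to a two-digit code such as the 51 mentioned earlier, might look like the sketch below. The flash_io_t hooks, the 1.5 second pause between digits, and the sim_ stand-ins are assumptions; the 1/4 and 1/2 second timings come from the pseudocode.

```c
#include <assert.h>

/* Hardware hooks supplied by the target (hypothetical interface). */
typedef struct {
    void (*led)(int on);            /* drive the LED: nonzero = lit */
    void (*delay_ms)(unsigned ms);  /* busy-wait or timer delay     */
} flash_io_t;

/* Flash one digit: 1/4 second on, 1/2 second off, `digit` times. */
void flash_digit(const flash_io_t *io, unsigned digit)
{
    assert(digit >= 1 && digit <= 9);  /* zero would show as "no flash" */
    while (digit--) {
        io->led(1);
        io->delay_ms(250);
        io->led(0);
        io->delay_ms(500);
    }
}

/* Flash a two-digit code: tens digit, a longer pause, then units. */
void flash_code(const flash_io_t *io, unsigned code)
{
    flash_digit(io, code / 10);
    io->delay_ms(1500);                /* assumed pause between digits */
    flash_digit(io, code % 10);
}

/* Host-side stand-ins that just count what happened. */
unsigned sim_flashes, sim_elapsed_ms;
void sim_led(int on)        { if (on) sim_flashes++; }
void sim_delay(unsigned ms) { sim_elapsed_ms += ms; }
```

The assert enforces the no-zeroes rule: codes like 50 or 7 are rejected during development rather than shown ambiguously in the field.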
Showing error
codes to a single LED is arguably better than showing the complete code in
a conventional 7 segment or ASCII display. The single bit approach is more
robust; not much hardware support is needed. If the system has a number of
LEDs, consider sending the same pattern to all of them. A single LED (or port)
failure will then be obvious, and the remaining LEDs will still show the error
code.
ROM Monitors
Let's not forget
the sophisticated troubleshooter. We've all had the unpleasant experience
of being called in to find and fix design flaws. Build in tools to make this
sort of work easier for you and your associates.
If the embedded
system includes some sort of terminal interface, then including a monitor
(or "remote debugger") is a nice way to give the high-end user access
to the system's internals. A ROM monitor may not be as powerful as an emulator
or logic analyzer, but it is easy to invoke. A built-in monitor is like a
sleeping giant, dormant, waiting to be called into action by entering a secret
command. But be careful - I once failed to check for keyboard overflow in
a product, and a user called to complain about the weird mode (the monitor)
that the product entered when his cat sat on the keyboard.
Even a simple
monitor lets you change and examine memory and I/O. Giving the hardware troubleshooter
access to I/O can save him hours of work - entering an input command to see
what a port does is much simpler than trying to capture the event on a logic
analyzer. If you feel really generous with your time, display the status of
all system I/O in a table, converting cryptic hex statuses to meaningful keywords.
"Data ready" is a lot easier to understand than "02".
A disassembler,
assembler, and simple breakpoints are a lot more work to add, but if you go
through the trouble you can then patch small test routines into the product's
RAM. At the very least have a GO command that starts a program at any address.
Then, you can patch in instruction hex codes and start simple test loops that
perhaps cycle a particular port. The scope-happy technicians will love you
for it. Is a port very occasionally intermittent? A few bytes of code can
monitor this much more effectively than any other means.
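The examine and modify commands are the heart of even the simplest monitor, and they are what make patching hex codes into RAM possible. The sketch below assumes a made-up command syntax ("E addr" and "M addr value", both in hex) and runs against a small simulated memory array, sim_mem, rather than real target memory. A GO command is left out because jumping to an address is target specific.

```c
#include <stdint.h>
#include <stdio.h>

#define MEM_SIZE 256u
static uint8_t sim_mem[MEM_SIZE];  /* stand-in for target memory or I/O */

/* Execute one monitor command line.
 *   "E addr"        examine a byte
 *   "M addr value"  modify a byte
 * Addresses and values are hex. Returns the byte read or written,
 * or -1 if the command is malformed or out of range. */
int monitor_exec(const char *line)
{
    char cmd;
    unsigned addr, value;
    int n = sscanf(line, " %c %x %x", &cmd, &addr, &value);

    if (n < 2 || addr >= MEM_SIZE)
        return -1;
    if ((cmd == 'E' || cmd == 'e') && n == 2)
        return sim_mem[addr];
    if ((cmd == 'M' || cmd == 'm') && n == 3 && value <= 0xFFu) {
        sim_mem[addr] = (uint8_t)value;
        return sim_mem[addr];
    }
    return -1;
}
```

Rejecting anything that does not parse cleanly is deliberate; it is cheap insurance against the cat-on-the-keyboard problem.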
A monitor can
serve as a diagnostics platform. It is an easy way to invoke complex test
routines, and gives the basis of a nice interface for communicating test results.
Like Microsoft's new Programmer's Work Bench, it is a sort of software bus
to hang diagnostics and other utilities from.
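One way for a monitor to report results in keywords rather than hex, as suggested earlier for the I/O status table, is a small table-driven decoder. In the sketch below only the 02 = "Data ready" pairing comes from the text; the other bit assignments, the port name, and the function print_status are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t     mask;  /* bit within the status byte */
    const char *name;  /* keyword shown to the user  */
} status_bit_t;

/* Hypothetical bit assignments for one status port. */
const status_bit_t uart_status_bits[] = {
    { 0x01, "Transmitter empty" },
    { 0x02, "Data ready" },
    { 0x10, "Overrun error" },
    { 0x20, "Framing error" },
};

/* Print the keyword for every set bit; return how many were printed. */
int print_status(FILE *out, const char *port_name, uint8_t value,
                 const status_bit_t *bits, size_t nbits)
{
    int shown = 0;

    fprintf(out, "%-8s %02X:", port_name, value);
    for (size_t i = 0; i < nbits; i++) {
        if (value & bits[i].mask) {
            fprintf(out, "%s %s", shown ? "," : "", bits[i].name);
            shown++;
        }
    }
    fprintf(out, "%s\n", shown ? "" : " (none)");
    return shown;
}
```

Adding another port to the display is then a matter of adding another table, not more code.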
All of my company's
products include such a monitor. Our customers are not aware of it, but in
our lab we regularly invoke it to diagnose all sorts of problems.
A number of companies
sell commercial ROM monitors. First Systems, Microtec, and Intermetrics all
provide quite sophisticated products that can be included in a design.
Diagnostics Tricks
I could go on
at great length about using powerful troubleshooting aids like emulators and
Fluke's Microsystem Troubleshooter. These sorts of tools quickly find bus
shorts and other problems that prevent the computer from coming up at all.
If it doesn't boot, then all the internal diagnostics in the world are useless.
If the techs don't have decent tools, then they will be reduced to "shotgunning"
- replacing components at random and hoping for success.
You can make
their job a bit easier during the product's hardware design. (Yes, programmers
should be involved in hardware design, at least to the extent of contributing
their expert knowledge to make the system as close to perfect as possible).
A nice way of finding bus shorts, memory failures, and the like is to execute
a looping program, letting the technician examine each address and data line
with a scope to find the source of the trouble. Of course, if the memories
don't work, or if the address bus is shorted, how can we run a program?
On the Z80 and
8085 family the RST 7 instruction is a one byte CALL to location 38 hex. Was Intel
clairvoyant, or was it just luck that caused them to use opcode FF for this
instruction? As a result, if you add pullup resistors to the bus, then simply
removing all memory chips will make the processor execute CALLs to 38 all
day long. The stack pointer will decrement through the processor's entire
address space, so the technician can look at address lines and check that
they cycle properly. The data bus will show return addresses after each RST
7 executes; since the stack pointer decrements, these addresses will change
as well. This trivial test gives the repetitive signal needed to effectively
use a scope to check out the hardest parts of the system.
Other CPUs usually
have a similar instruction. On the 8088 family the INT 3 instruction is a
similar one byte opcode. A one byte PUSH might even be better. Since these
instructions are not FF opcodes, pull up the bus and add a jumper field so
the technician can set the proper opcode.
Be sure that
at least some of the diagnostics can run with the absolute minimum amount
of the system working, and a minimal number of boards
plugged in. Think about the example set by the Naked Mini - diagnostics were
limited to each card, reducing potential for interaction between system components.
OK, so you say
this is a one-off unit that will never be reproduced, and that has a design
life of only a few weeks. Why spend time writing diagnostics? This is a valid
point, but even in these extreme cases be sure the system has at least an
"easy mode". That is, be sure that on power up (or by installing
a jumper or setting a switch) a dramatic event occurs - say, a lamp lights.
This way you can tell in a second if the computer is running and power is
applied. You don't want to spend time chasing timing problems in a complex
system when the computer hasn't even started.
As a company,
we're all in this together - right? Use your expert knowledge, and your knowledge
of everyone else's job (after all, we all should strive to be high tech Renaissance
Persons), to make the job of the techs in production test and repair easier
(or even possible).
The microwave
oven is fixed, but my workbench is still littered with broken electronics.
Now, if I could only get my car radio's FM section to work...