Troubleshooting 101
Copyright 1996, Jack G. Ganssle
Abstract
Troubleshooting is more art than science. Here are a few ideas.
Published in EDN, November 1995
I've worked with
a lot of engineers over the years. Most have a single area of expertise: design
of complex high-speed systems, firmware wizardry, or even troubleshooting genius.
A few, the very best, are adept at every area of embedded design. Surely you've
met that solitary genius who quietly and competently creates a paper design,
guides it through prototyping, develops a bit of test or application code,
and somehow, without fuss, just makes it work.
This industry
continues to evolve in fascinating ways. Only a generation ago computer design
was about as complex an art as existed. With the invention of the microprocessor
this changed. Slap a micro, a couple of memory chips, and some standard peripherals
together and voila! You've got an embedded system. Most of the complexity
lived in the software.
Now we're oozing
back to complexity in hardware design, spurred on by new developments - like
FPGAs and complex PLDs - that resemble software in their use.
A wag commented
that being a programmer means never having to say you are done. Modern technology
has put us hardware folks in the same unhappy situation. "Oh, don't worry
- it's just a simple FPGA change" is the new mantra. Worse: "well,
just ship it and we'll get a better set of equations out later".
Complex hardware
design implies tough troubleshooting problems in bringing up prototype units.
High density programmable chips make life much harder as pin limitations yield
little insight into the internals of a 10,000 gate part.
We need to elevate
efficient troubleshooting from its current status as an art to that of a science.
Too many engineers, particularly young ones just out of school, are left adrift
with no idea where to turn when the damn thing doesn't work.
Speed Up by Slowing Down
The Jedi master
engages an opponent by clearing his mind and calling on the Force. Hey, troubleshooting
is hard, so call on anything you can! At the very least follow the Jedi's
example by starting with a clear mind, a clean bench, and an organized set
of tools.
Too many designers
jump into a problem without getting ready to do battle. You see them, empty
junk food bags piled atop the poorly maintained test equipment, scattered
debris from a dozen other troubleshooting contests buried under the latest
set of schematics.
Clean up. Get
rid of all of those short-producing solder splashes and old resistor leads.
Consider mounting stand-offs on PCBs so they don't lie in the bench debris.
Sort out your tools. Make sure you have enough outlets at hand to avoid power
plug mania. Get a pile of clip leads.
Is your lab notebook
open and ready for action? What! You don't use one? Where do you record the
things you learn (like mods needed to the board) - on an easily-lost scrap
of paper? Get a bound notebook that is always at hand for your lab work. Use
it daily. Record poetry, love notes, or ideas for science fiction stories...
but log engineering details, experimental setups, and the latest neat idea
you'll try first thing in the morning.
Never just do
something - automate it. Build batch files to download your code and initialize
the tools. Program the logic analyzer setup and save it to disk. Your employer
is paying you to think; repetitive tasks that you could have automated could
be done by a monkey.
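As a rough sketch of the automation described above, a short Python script can stand in for the batch file. The tool names (flash_tool, analyzer_cli) and file names here are hypothetical placeholders, not real utilities; substitute whatever your programmer and logic analyzer actually provide.

```python
import subprocess

# Hypothetical bring-up steps; replace with your own tools' real commands.
BRINGUP_STEPS = [
    ["flash_tool", "--download", "firmware.hex"],
    ["analyzer_cli", "--load-setup", "bus_capture.cfg"],
]


def run_bringup(steps, dry_run=True):
    """Run each step in order, stopping at the first failure.

    With dry_run=True the commands are only printed, so the script can be
    checked without any hardware attached. Returns the commands handled.
    """
    handled = []
    for cmd in steps:
        print("->", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the tool fails.
            subprocess.run(cmd, check=True)
        handled.append(" ".join(cmd))
    return handled


if __name__ == "__main__":
    run_bringup(BRINGUP_STEPS, dry_run=True)
```

The dry-run default is a design choice for the sketch: it lets you confirm the sequence before letting the script touch the target.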
I have a love-hate
relationship with the logic analyzer. It's a fantastic tool that yields information
obtainable no other way with wonderfully precise timing resolution. It's just
such a pain to connect 50 or 100 leads to run an experiment. In digital systems
most of the analyzer's leads will go to the address and data buses. Build
a standard connector you attach these to. We buy extra analyzer pod-ends we
can permanently connect to a "standard" internal connector, greatly
speeding the process of connecting the instrument.
Avoid wire-wrapped
prototypes. Digital designs are simply too fast now. Rapid turn PCB vendors
(look at the ads in this magazine) will produce a 10 layer board in a week
for a quite reasonable fee. The PCB will eliminate all of the noise uncertainty
inherent in a wire-wrapped design. As an engineering manager, I'm always terrified
by that oh-so-common statement "well, this doesn't really work, but the
PCB layout probably will." Prove it. Go with PCBs from the outset.
Assumptions
A misspent youth of blaring rock 'n roll left my hearing somewhat impaired,
but helped formulate, of all things, my philosophy of troubleshooting digital
systems. The title of the Firesign Theatre's "Everything You Know is
Wrong" album should be our modern anthem for making progress in the lab.
I hate getting
called into a troubleshooting session and finding that the engineer "knows"
that x, y, and z are not part of the problem at hand. Everything you know
is wrong! Is that 5 V supply really 5 V at the PCB? What makes you think ground
goes to the chips? When a single part has 5 or 10 ground connections, make
sure all of them are connected. Could the system be dead because there's no
clock signal? Are you sure the design isn't really working - could your experiment
be flawed?
Assume nothing.
Test everything. The PCB may have manufacturing errors on internal layers.
Power and ground may not be on the pins you expect - particularly on newer
high density SMT parts. Signals labeled without an inversion bar may actually
be active low. You might have ROMs mixed up. Perhaps someone loaded the wrong
parts on the board.
Never blindly
trust your test equipment - know how each instrument works and what its limitations
are. If two signals seem impossibly skewed by 15 nsec on the logic analyzer,
make sure this is not an artifact of setting it to sample too slowly. When
your 100 MHz scope shows a perfectly clean logic level, remember that undetected
but virulent strains of 1 nsec glitches can still be running merrily around
your circuit.
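Two quick numbers help decide whether a reading is believable. The sketch below rests on two assumptions that are not from the article: a sampling logic analyzer locates each edge only to within one sample period, and a scope's rise time is roughly 0.35 divided by its bandwidth (a common rule of thumb). The function names are illustrative.

```python
def analyzer_skew_uncertainty_ns(sample_rate_mhz):
    """Worst-case error in a measured skew between two edges, in ns.

    Each edge is only located to within one sample period, so the
    difference between two edges can be off by up to one period either way.
    """
    sample_period_ns = 1000.0 / sample_rate_mhz
    return sample_period_ns


def scope_rise_time_ns(bandwidth_mhz):
    """Approximate 10-90% rise time from the 0.35/bandwidth rule of thumb."""
    return 0.35 / bandwidth_mhz * 1000.0


if __name__ == "__main__":
    print(f"50 MHz sampling: +/-{analyzer_skew_uncertainty_ns(50):.0f} ns")
    print(f"100 MHz scope rise time: ~{scope_rise_time_ns(100):.1f} ns")
```

Sampling at 50 MHz gives a 20 nsec period, so an apparent 15 nsec skew is inside the measurement error. A 100 MHz scope has a rise time of about 3.5 nsec, far too slow to show a 1 nsec glitch faithfully.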
When you do see
a glitch, one that seems impossible given the circuit design, remember that
manufacturing shorts can do strange things to signals. Is the part hot? A
simple finger test may be a good short indicator.
Learn to Estimate
At the peril of sounding like one of the ancients, I do miss the culture of
the slide rule. Though accurate answers might have been elusive, we did learn
to estimate the answer for every problem before attempting a solution. Alas,
it's a skill that is fading away.
Calculator abuse
- computing without thinking - is now too ingrained in our society to waste
effort fighting. Bummer. Other instruments, though, also tempt us to mentally
coast, to do things without thinking. Take the scope: I can't count the times
an engineer mentioned that he sees the signal... but has no idea, when I ask,
about the width of the pulse. Is it 1 nsec? 1 usec? Perhaps a second wide?
Timing is critical
in computers, yet too many of us use the scope as a sort of logic probe. "Hey,
the signal is there!" Which signal? If you expect a 10 usec pulse every
msec, then any deviation from that norm is simply wrong. Know what to expect,
and then ensure the waveforms are approximately correct. A misused scope will
generate a morass of misinformation.
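The habit of comparing a reading against an expectation can be written down in a few lines. This is only a sketch: the 10% tolerance is an assumed figure, and the function names are made up for illustration. The defaults use the example of a 10 usec pulse every msec.

```python
def within_tolerance(measured, expected, tolerance=0.10):
    """True if measured is within +/- tolerance (a fraction) of expected."""
    return abs(measured - expected) <= tolerance * expected


def pulse_looks_right(width_us, period_us,
                      expected_width_us=10.0, expected_period_us=1000.0):
    """Compare a scope reading with the expected pulse width and period."""
    return (within_tolerance(width_us, expected_width_us)
            and within_tolerance(period_us, expected_period_us))
```

A reading of 10.4 usec every 995 usec passes. A 1 usec pulse at the right period fails, even though "the signal is there."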
Estimate the
performance of firmware before writing it. Sure, it's tough to know how many
microseconds an as-yet-unwritten function will chew up, but you can use your
general knowledge of systems to make some ballpark estimates about where problems
will occur.
For example,
a fast serial link might overrun a busy CPU. Estimate! 38,400 baud is about
4000 characters/second, or one character per 250 usec. 250 usec is not a lot
of time for any CPU, particularly the typical embedded 8 bitter. Your processor
will be pretty busy servicing the data. If polled, then only heroic efforts
will keep you within the 250 usec timing margin.
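The arithmetic behind that estimate can be sketched as follows. The 10 bits per character (one start bit, eight data bits, one stop bit) is an assumption about the link format, not something stated above.

```python
def char_time_us(baud, bits_per_char=10):
    """Microseconds between back-to-back characters on an async link.

    bits_per_char=10 assumes 1 start bit, 8 data bits, and 1 stop bit.
    """
    chars_per_second = baud / bits_per_char
    return 1_000_000 / chars_per_second


if __name__ == "__main__":
    print(f"{char_time_us(38400):.0f} usec per character at 38,400 baud")
```

With these assumptions 38,400 baud works out to 3840 characters/second, or about 260 usec per character, which agrees with the rounded figures of 4000 characters/second and 250 usec.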
Suppose you chose
to implement the serial receive routine as an ISR - what is the overhead?
An assembly routine to queue incoming data will need a dozen or two instructions,
each of which will no doubt burn up two or three machine cycles. Surely you
know roughly how long a machine cycle takes (including wait states) for your
system... don't you? Given this information you can get a reasonable timing
estimate before writing a line of code.
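A minimal sketch of that estimate, with made-up numbers: 24 instructions at 3 machine cycles each, and a 0.5 usec machine cycle. The cycle time is purely an assumption for illustration; plug in the figure for your own system, wait states included.

```python
def isr_time_us(instructions, cycles_per_instruction, cycle_time_us):
    """Rough ISR execution time: instructions x cycles x cycle time."""
    return instructions * cycles_per_instruction * cycle_time_us


def cpu_load_fraction(isr_us, char_period_us):
    """Fraction of CPU time spent in the ISR at one interrupt per character."""
    return isr_us / char_period_us


if __name__ == "__main__":
    t = isr_time_us(24, 3, 0.5)           # assumed: 0.5 usec machine cycle
    load = cpu_load_fraction(t, 250.0)    # 250 usec per character
    print(f"ISR ~{t:.0f} usec, ~{load:.0%} of the CPU")
```

Under these assumptions the ISR takes about 36 usec of every 250 usec, roughly 14% of the processor, before counting interrupt latency or context save.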
Recently an engineer
told me that "that initialization loop is clearly the problem."
Oh yeah? He was looking for something burning up almost a second of time,
when clearly, regardless of processor, 1000h memory zeroing iterations will
run in a few milliseconds. Use your tools, one of which is your brain, to
make sure you are addressing the real problems.
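The same back-of-the-envelope check works here. The sketch assumes 4 machine cycles per loop iteration and a 0.5 usec machine cycle; both numbers are illustrative guesses, not measurements from the system in question.

```python
def loop_time_ms(iterations, cycles_per_iteration, cycle_time_us):
    """Rough run time of a simple loop, in milliseconds."""
    return iterations * cycles_per_iteration * cycle_time_us / 1000.0


if __name__ == "__main__":
    # 1000h = 4096 iterations; cycle counts are assumed for illustration.
    print(f"~{loop_time_ms(0x1000, 4, 0.5):.1f} msec")
```

That comes to roughly 8 msec. Even if the guesses are off by a factor of ten, the loop is nowhere near a second.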
Common Sense
Think, don't
do. Recently I saw a technician troubleshooting a board that exhibited multiple
problems. One chip was hot enough to fry eggs, yet he chose to work on another,
"unrelated" symptom. Dumb move - the part was surely ready to
self-destruct, which would create yet more grief for the poor tech.
When starting
out debugging a very fast system, crank the clock rate down to absurdly low
levels. Fix the easy stuff - logic errors and the like - before tackling high
speed timing. Why deal with a vast ocean of troubles simultaneously?
When you do find
the problem, and then make a change, sometimes the modification won't help.
Before doing anything else, double-check the change. Did you solder the wire to
the right pin? The right IC? We tend to program ourselves to look for hard
problems instead of the all-too-common simple mistakes.
Plan ahead. Think
before doing. Don't try something without knowing what the possible outcomes
are... and without having some idea what you'll do for any of those outcomes.
You may find that the next step will be the same regardless of the results
of the experiment. In this case, save time and do something else.
The best troubleshooters
are closet chess grandmasters. They think many steps ahead.