Debugging ISRs - Part 2
Copyright 1996, Jack G. Ganssle
Abstract
This is part 2 of a two part series on debugging interrupt service routines.
Published in Embedded Systems Programming, June, 1996
Last month I
ambitiously attacked the subject of debugging Interrupt Service Routines (ISRs).
This infinitely deep subject is worthy of a book; maybe a Britannica-sized
20-volume set would be appropriate. Still, here's another stab at covering
some of the more common problems.
Debugging INT/INTA Cycles
Before you can debug your ISR, the processor must accept the interrupt
and properly vector to the handler. Most processors service an interrupt with
the following steps:
1. Your hardware generates the interrupt pulse.
2. The interrupt controller (if any) prioritizes multiple simultaneous requests, and issues a single interrupt to the processor.
3. The CPU responds with an interrupt acknowledge cycle.
4. The controller drops an interrupt vector on the databus.
5. The CPU reads the vector, and computes the address of the user-stored vector in memory. It then fetches this value.
6. The CPU pushes the current context, disables interrupts, and jumps to the ISR.
Interrupts from
internal peripherals (those on the CPU itself) will generally not generate
an external interrupt acknowledge cycle. The vectoring is handled internally
and invisibly to the wary programmer, tools in hand, trying to discover his
system's faults.
A generation
of structured programming advocates has caused many of us to completely design
the system and write all of the code before debugging. Though this is certainly
a nice goal, it's a mistake for the low level drivers in embedded systems.
I believe in an early wrestling match with the system's hardware. Connect
an emulator, and exercise the I/O ports. They never behave quite how you expected.
Bits might be inverted or transposed, or maybe there are a dozen complex configuration
registers needing setup. Work with your system, understand its quirks, and
develop notes about how to drive each I/O device. Use these notes to write
your code.
Similarly, start
prototyping your interrupt handlers with a hollow shell of an ISR. You've
got to get a lot of things right just to get the ISR to start. Don't worry
about what the handler should do until you have it at least being called properly.
Set a breakpoint
on the ISR. If your shell ISR never gets called, and the system doesn't crash
and burn, most likely the interrupt never makes it to the CPU. If you were
clever enough to fill the vector table's unused entries with pointers to a
null routine, watch for a breakpoint on that function. You may have misprogrammed
the table entry or the interrupt controller, which would then supply a wrong
vector to the CPU.
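The shell ISR and the null routine for unused vectors can be sketched in C. This is a minimal host-side sketch, not any particular CPU's mechanism: vector_table, NUM_VECTORS, null_isr, shell_isr, and vectors_dispatch are hypothetical names, and the array only stands in for the real table, whose location and entry format come from your databook.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VECTORS 256u            /* assumption: table size varies by CPU */

typedef void (*isr_t)(void);

static isr_t vector_table[NUM_VECTORS];          /* stand-in for the real table */
static volatile unsigned long spurious_count;    /* hits on unused vectors */
static volatile unsigned long shell_count;       /* hits on the shell ISR */

/* Null routine: set a breakpoint here to catch unexpected vectors. */
void null_isr(void) { spurious_count++; }

/* Hollow shell of an ISR: proves the vectoring works before real code exists. */
void shell_isr(void) { shell_count++; }

/* Point every entry at the null routine so no vector is left undefined. */
void vectors_init(void)
{
    for (size_t i = 0; i < NUM_VECTORS; i++) {
        vector_table[i] = null_isr;
    }
}

/* Install a handler for one vector number. */
void vectors_install(unsigned vector, isr_t handler)
{
    assert(vector < NUM_VECTORS && handler != NULL);
    vector_table[vector] = handler;
}

/* Host-only simulation of the CPU fetching a vector and jumping through it. */
void vectors_dispatch(unsigned vector)
{
    assert(vector < NUM_VECTORS);
    vector_table[vector]();
}
```

After vectors_init() and a single vectors_install() call, dispatching the installed vector reaches shell_isr, while any other vector lands in null_isr, which is where the breakpoint would fire.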
If the program
vectors to the wrong address, then use a logic analyzer or emulator's trace
to watch how the CPU services the interrupt. Trigger collection on the interrupt
itself, or on any read from the vector table in RAM. You should see the interrupt
controller drop a vector on the bus. Is it the right one? Maybe the interrupt
controller is misprogrammed.
Within a few
instructions (if interrupts are on) look for the read from the vector table.
Does it read from the right table address? If not, and if the vector was correct,
then you are either looking at the wrong system interrupt, or there's a timing
problem in the interrupt acknowledge cycle. Break out the logic analyzer and
check this carefully.
Hit the databooks
and check the format of the table's entries. On an x86-style processor, four
bytes represent the ISR's offset and segment address. If these are in the
wrong order -- and they often are -- there's no chance your ISR will execute.
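To make the byte order concrete, here is a small sketch that assumes real-mode 8086/186-style vectoring with the table starting at address 0: each entry holds the offset word first and the segment word second, each stored low byte first. The function names ivt_entry_address and ivt_encode are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each real-mode vector occupies four bytes starting at address vector * 4. */
uint32_t ivt_entry_address(uint8_t vector)
{
    return (uint32_t)vector * 4u;
}

/* Encode one entry: offset word first, then segment word, low byte first. */
void ivt_encode(uint8_t entry[4], uint16_t segment, uint16_t offset)
{
    assert(entry != NULL);
    entry[0] = (uint8_t)(offset & 0xFFu);
    entry[1] = (uint8_t)(offset >> 8);
    entry[2] = (uint8_t)(segment & 0xFFu);
    entry[3] = (uint8_t)(segment >> 8);
}
```

For an ISR at F000:1234 the four bytes come out as 34 12 00 F0. Comparing that pattern against a memory dump of the table is a quick way to spot a swapped offset and segment.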
Frustratingly
often the vector is fine; the interrupt just does not occur. Depending on
the processor and peripheral mix, only a handful of things could be wrong:
- Did you enable interrupts in the main routine? Without an EI instruction, no interrupt will ever occur. One way of detecting this is to sense the CPU's INTR input pin. If it's asserted all of the time, then generally the chip has all interrupts disabled.
- Does your I/O device generate an interrupt? It's easy to check this with external peripherals.
- Have you programmed the device to allow interrupt generation? Most CPUs with internal peripherals allow you to selectively disable each device's interrupt generation; quite often you can even disable parts of this (like, allow interrupts on "received data" but not on "data transmitted").
Modern peripherals
are often incredibly complex. Motorola's TPU, for example, has an entire book
dedicated to its use. You could teach an entire one semester college course
about this part! Set one bit in one register to the wrong value, and it won't
generate the interrupt you are looking for.
It's not uncommon
to see an interrupt work perfectly once, and then never work again. The only
general advice is to be sure your ISR re-enables interrupts before returning.
Then look into the details of your processor and peripherals.
Some, like the
Z80, have an external interrupt daisy chain that serves as a priority encoder.
Look at these lines with a scope. If you see the daisy chain set to a zero,
it's a sure indication that one device did not see the end-of-interrupt sequence.
On the Z80 and Z180 processors this is provided by executing the RETI instruction.
A simple RET, mixed with use of the daisy chain, will block off that interrupt
after it happens once.
Intel's x86 family
is often used with an 8259 interrupt controller. Some of the embedded CPUs
in this family have 8259-like controllers built into the processor. If you
forget to issue an EOI (end of interrupt) command to the 8259 when the ISR
is complete, you'll get that one interrupt only.
You may need
to service the peripherals as well before another interrupt comes along. Depending
on the part, you may have to read registers in the peripheral to clear the
interrupt condition. UARTs and Timers usually require this. Some have peculiar
requirements for clearing the interrupt condition, so be sure to dig deeply
into the databook.
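The end-of-ISR housekeeping can be sketched as follows. This is a host-side illustration under stated assumptions: the UART, its status bit, and its read-to-clear behavior are hypothetical, port_write stands in for a real port output instruction, and the command port address 0x20 is the PC convention, which your board may not share. The value 0x20 written as the command is the 8259's non-specific EOI.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_CMD_PORT   0x20u   /* assumption: PC-style address; differs per board */
#define PIC_EOI        0x20u   /* 8259 non-specific end-of-interrupt command */
#define UART_RX_READY  0x01u   /* hypothetical status bit */

/* Host-side stand-ins for the hardware. */
static uint8_t sim_uart_status;   /* pending-interrupt flags in the "UART" */
static unsigned eoi_count;        /* EOI commands seen by the "8259" */
static unsigned rx_count;         /* receive events handled */

static void port_write(uint16_t port, uint8_t value)
{
    if (port == PIC_CMD_PORT && value == PIC_EOI) {
        eoi_count++;
    }
}

/* On this hypothetical UART, reading the status register clears the request. */
static uint8_t uart_read_status(void)
{
    uint8_t status = sim_uart_status;
    sim_uart_status = 0u;
    return status;
}

void uart_isr(void)
{
    uint8_t status = uart_read_status();   /* 1: clear the interrupt source */
    if (status & UART_RX_READY) {
        rx_count++;                        /* 2: placeholder for the real work */
    }
    /* The source must be clear before EOI, or the interrupt simply re-fires. */
    assert(sim_uart_status == 0u);
    port_write(PIC_CMD_PORT, PIC_EOI);     /* 3: tell the controller we are done */
}
```

Leaving out step 1 or step 3 produces the "works once, then never again" symptom, which is why both belong in even the emptiest shell ISR.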
Debugging Speed Problems
If the ISR is not fast enough your system will fail. Unfortunately,
few of the developers I talk to have any idea what "fast enough"
means. Unless you generate the interrupt map I've discussed, only random luck
will save you from speed problems.
When designing
the system answer two questions: how fast is fast enough? How will you know
if you've reached this goal?
Some people are
born lucky. Not me. I've learned that nature is perverse, and will get me
if it can. Call it high tech paranoia. Plan for problems, and develop solutions
for those problems before they occur. Assume each ISR will be too slow, and
plan accordingly.
A performance
analyzer will instantly show the minimum, maximum, and average execution time
required by your code, including your ISRs. There's no better tool for finding
real time speed issues.
Not everyone
has an analyzer. You can instrument your code to make it "scopeable".
Set a bit to a one when the ISR starts, and set it to zero when it completes.
Connect a scope and measure how long the bit stays up. If the routine can run
for varying lengths of time, use a digital scope set to accumulate sweeps,
and watch for the longest iteration.
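Making an ISR "scopeable" takes only two lines. In this sketch debug_port is an ordinary variable standing in for a real output port, and ISR_ACTIVE_BIT, timer_isr, and timer_work are hypothetical names; seen_high exists only so the behavior can be checked on a host.

```c
#include <assert.h>
#include <stdint.h>

#define ISR_ACTIVE_BIT 0x01u          /* hypothetical spare output bit */

static volatile uint8_t debug_port;   /* stand-in for a real output port */
static volatile uint8_t seen_high;    /* host-only: bit state during the work */

static void timer_work(void)
{
    seen_high = (uint8_t)(debug_port & ISR_ACTIVE_BIT);
}

void timer_isr(void)
{
    /* Bit already high on entry would mean this ISR was re-entered. */
    assert((debug_port & ISR_ACTIVE_BIT) == 0u);

    debug_port |= ISR_ACTIVE_BIT;               /* rising edge: ISR started */
    timer_work();
    debug_port &= (uint8_t)~ISR_ACTIVE_BIT;     /* falling edge: ISR done */
}
```

The pulse width on the scope then includes the handler body but not the CPU's own vectoring and context-save time, so treat the reading as a lower bound.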
It's important
to look at total interrupt overhead in a system as well. If your ISR runs
in 100 microseconds, but gets invoked 10,000 times/second, it needs a full second
of CPU time every second, so there's serious trouble brewing. Watch how long the bit stays asserted over long periods of
time - a second or more - and make sure it's not eating most of the CPU resources.
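The arithmetic is worth writing down. A minimal sketch, with isr_load_percent as a made-up helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of CPU time spent in an ISR that runs for isr_us microseconds
   and is invoked rate_hz times per second. */
uint32_t isr_load_percent(uint32_t isr_us, uint32_t rate_hz)
{
    uint64_t busy_us_per_second = (uint64_t)isr_us * rate_hz;
    uint64_t percent = busy_us_per_second / 10000u;   /* 1,000,000 us == 100% */

    assert(percent <= UINT32_MAX);
    return (uint32_t)percent;
}
```

With the figures above, isr_load_percent(100, 10000) yields 100: the handler alone would need the whole processor. Dropping the rate to 1,000 interrupts per second brings it down to 10.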
Set and reset
this bit in all of the ISRs to see total interrupt overhead. It's sometimes
frightening to see just how close to the wire some systems run!
Too many developers
fall into the serendipity school of debugging. They feel that if the system
works and meets external specifications, it's ready to ship. Wrong. Hardware
engineers stress their creations by running them over a temperature range.
We should do the same, instrumenting our code or otherwise using performance-measuring
tools, to be quite sure the system has sufficient margins designed in.
Debugging Missing Interrupts
A device that parses a stream of incoming characters will probably
crash quite obviously if the code misses an interrupt or two. One that counts
interrupts from an encoder to measure position may only exhibit small precision
errors, a tough thing to find and troubleshoot.
Having worked
on a number of systems using encoders as position sensors, I've developed
a few tricks over the years to find these missing pulses. It's never easy.
You can build
a little circuit using a single up/down counter that counts up on every interrupt
and counts down on each interrupt acknowledge. If the counter
always shows a value of zero or one, everything is fine.
Most engineering
labs have counters - test equipment that just accumulates pulse counts. We
have a scope that includes a counter. Use two of these, one on the interrupt
pin and another on the interrupt acknowledge pin. The counts should always
be the same.
You can build
a counter by instrumenting the ISR to increment a variable each time it starts.
Either show this value on a display, or probe the variable using your debugger.
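A software counter of this kind might look like the sketch below. All names are hypothetical, and check_counts assumes you have some way to get the externally counted pulse total into the program, which many systems will not; without that, compare the variable against the bench counter by hand.

```c
#include <assert.h>
#include <stdint.h>

static volatile uint16_t encoder_isr_count;   /* probe this with the debugger */

void encoder_isr(void)
{
    encoder_isr_count++;      /* first thing: count the entry */
    /* ... real encoder handling would follow ... */
}

/* Pulses that were generated but never reached the ISR.
   Unsigned subtraction stays correct across 16-bit wraparound. */
uint16_t missed_interrupts(uint16_t pulses_generated, uint16_t isr_count)
{
    return (uint16_t)(pulses_generated - isr_count);
}

/* Debug-build check: stops as soon as a pulse goes missing. */
void check_counts(uint16_t pulses_generated)
{
    assert(missed_interrupts(pulses_generated, encoder_isr_count) == 0u);
}
```

The counter is 16 bits wide so that a 16-bit CPU can read it in a single access.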
If you know the
maximum interrupt rate, use a performance analyzer to measure the maximum
time in the ISR. If this exceeds the shortest interval between interrupts, there's very likely
a latent problem waiting to pounce.
Most of these
sorts of difficulties stem from slow ISRs, or from code that leaves interrupts
off for too long. Be wary of any code that executes a disable-interrupt instruction.
There's rarely a good reason for it; this is usually an indication of sloppy
code.
It's rather difficult
to find a chunk of code that leaves interrupts off. The ancient 8080 had a
wonderful pin that showed interrupt state all of the time. It was easy to
watch this on the scope and look for interrupts that came while they were disabled.
Now, having advanced so far, we have no such easy troubleshooting aids. About
the best one can do is watch the INTR pin. If it stays asserted for long periods
of time, and if it's properly designed (i.e., stays asserted till INTA), then
interrupts are certainly off.
Be sure to re-enable
interrupts in your ISRs at the earliest safe spot.
Debugging Reentrancy Problems
Well designed interrupt handlers are largely reentrant. Reentrant
functions, AKA "pure code", are often falsely thought to be any
code that does not modify itself. Too many programmers feel if they simply
avoid self-modifying code, then their routines are guaranteed to be reentrant,
and thus interrupt-safe. Nothing could be further from the truth.
A function is
reentrant if, while it is being executed, it can be re-invoked by itself,
or by any other routine.
Suppose your
main line routine and the ISRs are all coded in C. The compiler will certainly
invoke runtime functions to support floating point math, I/O, string manipulations,
etc. If the runtime package is only partially reentrant, then your ISRs may
very well corrupt the execution of the main line code. This problem is common,
but is virtually impossible to troubleshoot since symptoms result only occasionally
and erratically. Can you imagine the difficulty of isolating a bug which manifests
itself only occasionally, and with totally different characteristics each
time?
Now, sometimes
we're tempted to cheat and write a nearly-pure routine. If your ISR merely
increments a global 32 bit value, say, to maintain time, it would seem legal
to produce code that does nothing more than a quick and dirty increment. Beware!
Especially when writing code on an 8 or 16 bit processor, remember that the
C compiler will surely generate several instructions to do the deed. On a
186, the construct ++j might produce:
    mov ax,[j]
    add ax,1      ; increment low part of j
    mov [j],ax
    mov ax,[j+2]
    adc ax,0      ; prop carry to high part of j
    mov [j+2],ax
An interrupt
in the middle of this code will leave j just partially changed; if the ISR
is reincarnated with j in transition, its value will surely be corrupt.
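The same hazard bites main-line code that reads a multi-word value while an ISR is updating it. One common workaround, which is not from this article, is to read the high word, then the low word, then the high word again, and retry if the high word changed. The sketch below models the 32-bit count as two 16-bit words; tick_lo, tick_hi, read_ticks, and the between_reads_hook test hook are all hypothetical, and the hook exists only so a host build can fire an "interrupt" at the worst moment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 32-bit tick count held as two 16-bit words, as a 16-bit CPU sees it. */
static volatile uint16_t tick_lo;
static volatile uint16_t tick_hi;

/* Host-only hook that lets a test fire an "interrupt" between the reads. */
static void (*between_reads_hook)(void);

void tick_isr(void)
{
    uint16_t lo = (uint16_t)(tick_lo + 1u);
    tick_lo = lo;
    if (lo == 0u) {
        tick_hi = (uint16_t)(tick_hi + 1u);   /* propagate the carry */
    }
}

/* Read high, low, high again; retry if the high word moved in between. */
uint32_t read_ticks(void)
{
    uint16_t hi, lo, hi_again;
    do {
        hi = tick_hi;
        if (between_reads_hook != NULL) {
            between_reads_hook();
        }
        lo = tick_lo;
        hi_again = tick_hi;
    } while (hi != hi_again);
    return ((uint32_t)hi << 16) | lo;
}

/* Test helper: one tick arrives mid-read, then the hook removes itself. */
void fire_one_tick(void)
{
    between_reads_hook = NULL;
    tick_isr();
}
```

Starting from 0x0000FFFF with one tick arriving mid-read, a naive reader would pair the old high word with the new low word and report 0; the retry loop reports 0x00010000 instead. This protects readers only. It does nothing for an ISR that is itself re-entered part way through the increment.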
Watch out for
noise on the NMI line. NMI is usually an edge-triggered signal. Any bit of
noise or glitching will cause perhaps hundreds of interrupts. Since it cannot
be masked, you'll almost certainly cause a reentrancy problem. This is yet
another reason to avoid NMI for anything other than a catastrophic failure.
Even the perfectly
coded reentrant ISR leads to problems. If such a routine runs so slowly that
interrupts keep giving birth to additional copies of it, eventually the stack
will fill. Once the stack bangs into your variables the program is on its
way to oblivion. You must ensure that the average interrupt rate is such that
the routine will return more often than it is invoked.
Debugging Stack Problems
Any of a number of problems can cause the stack to grow to the point
where the entire system crashes. It's tough to go back and analyze the failure
after the crash, as the program will often write all over itself or the variables,
removing all clues.
The best defense
is a strong offense. Build a stack monitor into your code.
A stack monitor
is just a few lines of assembly language that compares the stack pointer to
some limit you've set. Estimate the total stack use, and then double or triple
the size. Use this as the limit.
Put the stack
monitor into one or more frequently called ISRs. Jump to a null routine, where
a breakpoint is set, when the stack grows too much.
Be sure that
the compare is "fuzzy". The stack pointer will never exactly match
the limit.
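The monitor described above is normally a few lines of assembly that read the stack pointer directly. As a rough C approximation, the address of a local variable can stand in for the stack pointer. The sketch assumes a stack that grows toward lower addresses, and stack_limit, stack_exceeded, stack_monitor, and stack_overflow_trap are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

static uintptr_t stack_limit;                 /* lowest address the stack may reach */
static volatile unsigned long overflow_hits;  /* times the trap was taken */

/* Null routine: set a breakpoint here. */
void stack_overflow_trap(void) { overflow_hits++; }

/* "Fuzzy" compare: at or beyond the limit counts, not just an exact match.
   Assumes the stack grows toward lower addresses. */
int stack_exceeded(uintptr_t sp, uintptr_t limit)
{
    return sp <= limit;
}

/* Call from one or more frequently invoked ISRs. */
void stack_monitor(void)
{
    volatile char marker = 0;               /* lives on the current stack */
    uintptr_t sp = (uintptr_t)&marker;      /* approximates the stack pointer */

    if (stack_exceeded(sp, stack_limit)) {
        stack_overflow_trap();
    }
}
```

Set stack_limit once at startup, from the estimated stack use doubled or tripled, and leave the breakpoint on stack_overflow_trap for the life of the project.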
By catching the
problem before a complete crash, you can analyze the stack's contents to see
what led up to the problem. You may see an ISR being interrupted constantly
(that is, a lot of the stack's addresses belong to the ISR). This is a sure
indication of code that's too slow to keep up with the interrupt rate. You
can't simply leave interrupts disabled longer as the system will start missing
them. Optimize the algorithm and the code in that ISR.
Conclusion
I've
made a number of recommendations, most of which fall into a philosophy of
debugging: plan for bugs, instrument your code to find them, and buy the right
tools.
Someday we'll
all write bug-free code. Till then, debug proactively. Anticipate the problems,
and design in test code and solutions from the outset.