Understand Your User's Needs
Copyright 1991, Jack G. Ganssle
Abstract
Understand your user's needs; only then can you be sure the code is useful, as well as correct.
Published in Embedded Systems Programming, November 1991
Call me Ishmael. I'm writing this in mid-Atlantic, bound from Baltimore to
Plymouth, England aboard my 35 foot sailboat. Like Melville, I find relief
from the pressures of modern life by chasing adventure at sea.
Amber II, though
30 years old, nevertheless is a child of the microprocessor revolution. Over
the years I've added a lot of electronics to make sailing easier and safer;
each addition brings yet one more embedded system aboard. With the exception
of the beautifully designed digital VHF radio, every microprocessor-based
product on the boat suffers from one or more design defects that erode the
equipment's usefulness just a bit.
The environment
on a small boat far offshore couldn't be worse for electrical items. Salt
spray and high humidity insidiously find their way inside even the best enclosures,
rapidly corroding every soldered connection. Switch contacts are the first
to go, followed by connectors and the tracks on PC boards. A partial solution
is to use only the finest gold contacts, an obvious approach all too few vendors
employ.
Regular readers
of this column know I've long been an advocate of using software to solve
system-wide and application-wide problems. While
the marine environment is perhaps a bit extreme, every system is subject to
mechanical and electrical failures. After all, even in the most benign laboratory
conditions contacts get dirty. It makes sense to design code that will work
in at least some fashion if, say, a switch fails. In situations where failures
are likely or inevitable, a wise designer will devise software solutions,
even if the system cannot continue to run with complete functionality.
Given that dirty
or corroded contacts are a perennial source of trouble, embedded code that
relies on functional switches should always check input bits for validity.
Obviously, if only a single switch should ever be pressed at a time, then
by all means don't accept conditions where several are asserted. But be kind
to your users - a switch failure may erroneously create this condition. Can
the system ignore the extra bit (perhaps by seeing it always asserted), and
carry on?
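A minimal sketch of this kind of input checking follows. It assumes a hypothetical layout of one bit per switch (1 = pressed) and a made-up threshold, STUCK_SCANS, for deciding that a bit has been asserted too long to be a real key press. None of these names come from any real autopilot; they only illustrate the idea of masking a stuck bit and rejecting multiple simultaneous presses.

```c
#include <stdint.h>

/* Assumption: a real press never lasts this many consecutive scans
 * (for example, about 5 seconds at 100 scans per second). */
#define STUCK_SCANS 500u

typedef struct {
    uint8_t  stuck_mask;     /* bits currently judged to be stuck on */
    uint16_t held_count[8];  /* consecutive scans each bit has been asserted */
} switch_filter_t;

/* Feed one raw scan (bit i set = switch i reads as pressed).
 * Returns the index 0..7 of the single believable switch, or -1 when
 * nothing usable is pressed or the reading is ambiguous. */
int switch_filter_update(switch_filter_t *f, uint8_t raw)
{
    for (int i = 0; i < 8; i++) {
        uint8_t bit = (uint8_t)(1u << i);
        if (raw & bit) {
            if (f->held_count[i] < STUCK_SCANS)
                f->held_count[i]++;
            if (f->held_count[i] >= STUCK_SCANS)
                f->stuck_mask |= bit;            /* asserted too long: ignore it */
        } else {
            f->held_count[i] = 0;
            f->stuck_mask &= (uint8_t)~bit;      /* input recovered: trust it again */
        }
    }

    uint8_t usable = (uint8_t)(raw & (uint8_t)~f->stuck_mask);
    if (usable == 0)
        return -1;                               /* nothing believable pressed */
    if (usable & (usable - 1))
        return -1;                               /* more than one bit: impossible input */

    int index = 0;
    while (!(usable & 1u)) {
        usable >>= 1;
        index++;
    }
    return index;
}
```

With a filter like this, a corroded contact that reads permanently "on" costs the user one switch rather than the whole front panel.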
A lot of systems
use a debouncing algorithm that will loop forever if an input is shorted.
Don't let a simple failure shut down the entire product! Assume default inputs
that make some sense where possible.
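One way to keep a wait-for-release loop from hanging is simply to bound it. The sketch below assumes a hypothetical callback, key_down, that reads the input, and a poll limit chosen by the caller. The two sim_ functions are stand-ins for real hardware so the routine can be exercised on a bench.

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool (*read_pin_fn)(void *ctx);   /* returns true while the key reads as down */

typedef enum { KEY_RELEASED, KEY_STUCK } key_wait_t;

/* Wait for the key to be released, but never for more than max_polls reads.
 * A shorted input produces KEY_STUCK instead of an endless loop, and the
 * caller can then fall back to a sensible default. */
key_wait_t wait_for_release(read_pin_fn key_down, void *ctx, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (!key_down(ctx))
            return KEY_RELEASED;
    }
    return KEY_STUCK;
}

/* Simulated input: a shorted switch that always reads as pressed. */
bool sim_always_down(void *ctx)
{
    (void)ctx;
    return true;
}

/* Simulated input: reads as pressed for *ctx more polls, then released. */
bool sim_down_for(void *ctx)
{
    unsigned *remaining = ctx;
    if (*remaining == 0)
        return false;
    (*remaining)--;
    return true;
}
```

On KEY_STUCK the application might log the fault, ignore that input from then on, and continue with whatever default makes sense for the product.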
For example,
on this trip while still 1000 miles from England Amber's digital autopilot
went insane. God knows how long we went in circles till I woke up and realized
there was a problem. After an entire day of tracing the circuit and looking
for the source of the trouble I found that the front panel switches were wired
in parallel with an unused external connector. All were arranged in a matrix
scanned by the unit's 8051. These course setting switches are used one at
a time - never should more than one be pressed, and a user would never hold
a switch in for more than a few seconds. The relentless sea found its way
into the O-ring sealed computer module and created a high-impedance short
between scan lines. The code was too simple-minded to reject the impossible
signals it received, and bizarrely steered us in circles.
Such poor code
is inexcusable, as autopilots are famous for suffering corrosion problems.
Wealthier sailors usually carry three or four units in hopes that one will
survive a trip. Smarter software could help keep customers a lot happier.
The engineering costs will be a bit higher, but the extra software costs nothing
in production if ROM space is available. If it isn't, the company must weigh
the cost of unhappy customers against a microcontroller with more program
space.
Still, the designers
had addressed a similar problem, though perhaps more to satisfy their own
internal production requirements than to deal with frantic mid-ocean repairs.
While troubleshooting the scan line short I disassembled the unit at Amber's
chart table, removed the circuit board, and clipped power to the computer
so I could trace out problems with a voltmeter. Unfortunately, with the board
removed an important rotary switch could not be connected into the system.
I feared that the autopilot's firmware, presented with no rotary switch input
(another "impossible" condition), would go haywire, making tabletop
diagnosis difficult. In fact it worked even without this input, which suggests
the designers realized that the unit's mechanical construction would leave
this input disconnected during repair. The code must have assumed some
reasonable default value instead of looping while waiting for an input.
Sure, in real
life most embedded systems don't have to run partially dismantled. Always
remember that during production test and repair, to say nothing of field repair,
your carefully engineered package might be violated. Technicians needing access
to the components will try to run it in pieces. If the system runs when opened
up, they'll have a much easier time probing with scopes and meters to find
faults.
Can your system
run with important cables removed? What happens if a cable is not connected
when power is applied? Technicians will try to
connect as little as possible when troubleshooting failed components. If the
code won't run without a cable, they might have to build extension wiring
harnesses just to gain access to the circuit boards. Certainly no one in the
field will have these harnesses. Where possible, make sure the code continues
to run in some fashion with some or all of the cables removed.
Informative Beeps
I've written
extensively about software diagnostics in the past. In the case of Amber's
autopilot, an off-course alarm beeped incessantly till I woke up and realized
something was wrong. The unit gave no help in figuring out just what the problem
was, though a trivial amount of code could have produced beep codes indicating
which switches seemed to be on. As it was, I spent most of a day isolating
the problem, not much fun in 10 foot seas.
With no feedback
from the microcontroller, it's awfully hard to differentiate between switch,
electronics, actuator, or flux-gate compass failures. In the July 1990 issue
of Embedded Systems Programming I wrote about using an LED to blink error
codes. Your high tech Ford with an under-dash computer has such a self-test
mode: short two wires together and it will produce a two digit code indicating
what sort of failures are where. This is embedded systems programming with
style!
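A two digit beep or blink code takes very little code. The sketch below builds the pattern as a string so it is easy to test. The encoding is my own assumption ('*' for one beep, '-' for the pause between digits, and a zero digit sent as ten beeps so it can still be heard), not the scheme any particular vendor uses. A real product would drive a beeper or LED from the same pattern.

```c
#include <stddef.h>

/* Encode a fault code 0..99 as: tens beeps, a pause, ones beeps.
 * For example, code 23 becomes "**-***".
 * Returns the pattern length, or 0 if the code is out of range or the
 * buffer is too small (buffer contents are unspecified in that case).
 * A 24-byte buffer is always large enough. */
size_t beep_pattern(unsigned code, char *out, size_t out_size)
{
    if (code > 99u)
        return 0;

    unsigned digits[2] = { code / 10u, code % 10u };
    size_t len = 0;

    for (int d = 0; d < 2; d++) {
        unsigned beeps = digits[d] ? digits[d] : 10u;  /* zero is sent as ten beeps */
        size_t need = beeps + (d == 0 ? 1u : 0u);      /* beeps plus the digit gap */
        if (len + need + 1 > out_size)                 /* +1 for the terminator */
            return 0;
        for (unsigned i = 0; i < beeps; i++)
            out[len++] = '*';
        if (d == 0)
            out[len++] = '-';
    }
    out[len] = '\0';
    return len;
}
```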
Sure, sometimes
embedded systems are essentially disposable in the event of failure. Mission-critical
applications must be repairable, and demand firmware that helps the user even
when things fail. It's important to sit in your customer's shoes when deciding
what is truly mission critical. If we couldn't fix Amber's autopilot the two
of us aboard would have had to steer, 24 hours a day, for almost two weeks!
Software Failures
Similarly, never
assume that the software is entirely glitch-free. Yes, even your meticulously
maintained and painfully debugged code could very well harbor a latent problem.
Even small embedded systems are now getting frightfully complicated, making
proving software correctness all but impossible. After fixing a hundred bugs,
are you really sure there's not one or two obscure ones still left?
It would be nice
to write code that can survive any sort of software bug but surely this is
impossible. However, with a little forethought you can usually craft firmware
that, by its design, is robust enough to handle many sorts of faults.
Always write
exception handlers. The 80x88 traps on divide overflows, yet a staggering
number of DOS applications exit to the operating system on a divide fault.
Can you really guarantee that your application will never do a divide by zero?
Spend the few minutes needed to write a short routine to gracefully recover
from unanticipated division problems. Other processors trap on other sorts
of errors. Always fill these trap vectors with some sort of recovery routine.
If the error
is truly impossible (like a memory error) it might make sense to report the
problem and at least restart the code. Any sort of service is better than
leaving the vector unused, a sure way to turn a little software bug into a
dramatic crash.
Fill unused ROM
and even RAM locations with a single byte opcode that traps to a particular
address, and then put a handler there. For instance, the Z80/64180 goes to
location 0038h when executing the RST7 (FFh) opcode; the 80x88 picks up a vector
at 000Ch after executing INT3 (CCh). The handler should try to recover gracefully,
perhaps by re-entering the program's main loop or even by restarting the code.
This approach gives the code a prayer of recovering despite momentary hardware
or software glitches that make the firmware "wander off".
Wandering code will likely wind up in the middle of data or even in the middle
of a multi-byte opcode. There's not much we can do about this, but filling
ROM and RAM with a one-byte trap will improve the recovery odds quite a bit.
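The fill itself is usually done at build time, on the ROM image, before it is programmed into the part. A minimal sketch, assuming the image sits in a buffer and that everything past used_bytes is unused (both assumptions for illustration; a real image may have several unused gaps, and many linkers and programmers offer a fill option that does the same job):

```c
#include <stddef.h>
#include <stdint.h>

/* Z80 RST 7 is the single byte 0xFF, which calls location 0038h. */
#define TRAP_OPCODE_Z80_RST7 0xFFu

/* Fill the unused tail of a ROM image with a one-byte trap opcode.
 * image_size is the size of the whole ROM; used_bytes is how much of it
 * the program actually occupies. Nothing below used_bytes is touched. */
void fill_unused(uint8_t *image, size_t image_size, size_t used_bytes,
                 uint8_t trap_opcode)
{
    for (size_t i = used_bytes; i < image_size; i++)
        image[i] = trap_opcode;
}
```

The handler at the trap address still has to be written, of course; the fill only guarantees that wandering code has a good chance of reaching it.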
Be sure you can
disable this extra-robust code during debugging. You don't want these routines
to mask real problems. Use a conditional compile or runtime switch to vector
error conditions to a breakpoint.
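A conditional compile for this might look like the sketch below. The macro DEBUG_TRAP_FAULTS, the fault names, and the restart flag are all hypothetical. On a real target each trap vector would call fault_handler() with the appropriate code, and the main loop would act on the restart request.

```c
#include <stdbool.h>

typedef enum {
    FAULT_NONE,
    FAULT_DIVIDE,          /* divide overflow or divide by zero trap */
    FAULT_BAD_OPCODE,      /* trap opcode executed from unused memory */
    FAULT_UNUSED_VECTOR    /* an interrupt nobody expected */
} fault_t;

volatile fault_t last_fault = FAULT_NONE;
volatile bool restart_requested = false;

/* Common entry point for every otherwise-unused trap vector. */
void fault_handler(fault_t fault)
{
    last_fault = fault;    /* remember what happened for later diagnosis */

#ifdef DEBUG_TRAP_FAULTS
    /* Debug build: stop right here so an emulator breakpoint on this loop
     * catches the problem instead of the recovery code hiding it. */
    for (;;) {
    }
#else
    /* Production build: ask the main loop to restart cleanly. */
    restart_requested = true;
#endif
}
```

Building with DEBUG_TRAP_FAULTS defined turns every recovered fault back into a hard stop, so the robust code cannot mask a real bug during development.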
Similarly, during
debugging always set your emulator, simulator, or whatever to break on any
access to unused locations. Otherwise, how can you be sure the code isn't
banging on locations it shouldn't be? This is always a sign of a latent problem.
I often hear from folks whose software runs fine from system ROM but not from
emulator RAM, a sure sign of rogue code that is writing into code space.
Really complex
loops always hold potential for locking up a system. The world is indeed growing
ever more complex, and our embedded systems reflect this. Some equipment solves
torturously difficult series of equations before producing a result. Often,
iterative rather than direct (closed-form) algorithms are used to reduce matrices
or converge a series or integral. For example, Newton's method involves solving
the same equation repeatedly using the answer from step "n"
as the input to step "n+1", continuing until the errors
are below some arbitrary value. What if the input data is such that a solution
cannot be found within specified precision? Sometimes iterative solutions
can actually start to diverge, rather than converge, making a solution impossible.
Iterative algorithms are fine as
long as the software is smart enough to detect that a solution is unlikely,
and then give the user some options. Locking up into an infinite loop is always
unacceptable.
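As a sketch of what "smart enough" might mean, here is Newton's method with an iteration cap and a crude divergence check. The limit DIVERGENCE_LIMIT, the status names, and the two example functions are assumptions chosen for illustration; a real instrument would pick limits that suit its own equations.

```c
typedef enum { SOLVE_OK, SOLVE_NO_CONVERGENCE, SOLVE_DIVERGED } solve_status_t;

typedef double (*fn_t)(double);

/* Assumption: any estimate larger than this is treated as running away. */
#define DIVERGENCE_LIMIT 1.0e12

static double magnitude(double v)
{
    return v < 0.0 ? -v : v;
}

/* Newton's method that always returns: it gives up after max_iter steps,
 * or as soon as the step is undefined or the estimate blows up.
 * *root is written only when SOLVE_OK is returned. */
solve_status_t newton_bounded(fn_t f, fn_t dfdx, double x0, double tol,
                              int max_iter, double *root)
{
    double x = x0;
    for (int i = 0; i < max_iter; i++) {
        double fx = f(x);
        if (magnitude(fx) < tol) {
            *root = x;
            return SOLVE_OK;
        }
        double slope = dfdx(x);
        if (slope == 0.0)
            return SOLVE_DIVERGED;            /* flat spot: next step is undefined */
        x -= fx / slope;
        if (x != x || magnitude(x) > DIVERGENCE_LIMIT)
            return SOLVE_DIVERGED;            /* NaN or runaway estimate */
    }
    return SOLVE_NO_CONVERGENCE;              /* out of iterations: tell the user */
}

/* Example with a root: x*x - 2 = 0. */
double f_sqrt2(double x)  { return x * x - 2.0; }
double df_sqrt2(double x) { return 2.0 * x; }

/* Example with no real root: x*x + 1 = 0. Newton's method can never succeed. */
double f_noroot(double x)  { return x * x + 1.0; }
double df_noroot(double x) { return 2.0 * x; }
```

Anything other than SOLVE_OK is the cue to show a message and offer the user a choice, rather than spinning forever.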
On this transatlantic
voyage our GPS (Global Positioning System - a satellite navigator) hung several
times trying to reduce crummy data from weak signals or marginal satellite
geometry. Worse, even the software-controlled power switch wouldn't work when
stuck in this loop. The designers left no option but to remove the unit's
batteries, wait 30 minutes (!) and then restart it from scratch. Of course,
after a half hour without batteries we had to reload dozens of setup parameters.
Ironically, the restart required us to figure our position with the centuries
old method of celestial navigation and preload the position into the GPS.
A much better design would make the iterative loop read the keypad and exit
when a key is pressed.
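One way to structure such a loop, assuming the long computation can be broken into small steps, is sketched below. The callback names and the sim_ helpers are hypothetical; on real hardware key_pressed would read the keypad, and step would perform one pass of the position solution.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { RUN_DONE, RUN_CANCELLED, RUN_GAVE_UP } run_status_t;

typedef bool (*step_fn)(void *state);       /* one slice of work; true when finished */
typedef bool (*key_pressed_fn)(void *ctx);  /* true if the user pressed a key */

/* Run a long computation one step at a time, checking the keypad before
 * every step, and give up after max_steps so the loop always exits. */
run_status_t run_until_done_or_key(step_fn step, void *state,
                                   key_pressed_fn key_pressed, void *key_ctx,
                                   unsigned long max_steps)
{
    for (unsigned long i = 0; i < max_steps; i++) {
        if (key_pressed(key_ctx))
            return RUN_CANCELLED;
        if (step(state))
            return RUN_DONE;
    }
    return RUN_GAVE_UP;
}

/* Simulated work: finishes after *state more steps. */
bool sim_step_countdown(void *state)
{
    unsigned *left = state;
    if (*left > 0)
        (*left)--;
    return *left == 0;
}

/* Simulated work that never converges. */
bool sim_step_never(void *state) { (void)state; return false; }

/* Simulated keypad: a key arrives after *ctx polls. */
bool sim_key_after(void *ctx)
{
    unsigned *polls = ctx;
    if (*polls == 0)
        return true;
    (*polls)--;
    return false;
}

/* Simulated keypad that is never touched. */
bool sim_key_never(void *ctx) { (void)ctx; return false; }
```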
An even better
approach might have been to use a real time operating system, with one task
always reading keys in the background. An OS that runs some sort of keypad
task will inherently prevent well-behaved code from getting into un-exitable
infinite loops.
Far too many
years ago I worked on an 8008 based instrument that used a Gauss-Seidel iteration
to produce an answer. We programmed it to escape the loop if the iteration
proceeded for 20 minutes without a solution (computers were a lot slower then).
In this case 7 segment LEDs displayed "HELP" to let the
user know no solution was possible. Years passed and the code was obsoleted
by an algorithm that converged quickly, every time. Memories of the earlier
version faded. One day an ashen faced technician came to me and explained
that he was repairing a very old unit. While fiddling with it, it started
flashing "HELP HELP", confirming his long-held belief in
the supernatural.
Never, never
shut the user down. He bought your product to do something. Try to keep the
widget at least partially operational no matter what might go wrong.
Brownouts
Embedded systems
often quietly compute in the background, day in and day out. You might be
willing to re-setup a lab instrument if a power outage caused the unit to
reset, but this just is not acceptable in a lot of other applications. I often
wonder why we put up with resetting every digital clock in the house after
even a 1 second power failure - in this day and age of CMOS there is no technical
reason why they shouldn't keep track of time for at least a few minutes.
With the grid
getting ever more overloaded we must expect line power based equipment to
have to deal with regular power shortages. While it might be unreasonable
to expect an embedded system to continue operating without power, I do feel
that some equipment should at least reset to a reasonable mode when power
is re-applied. For example, a remote data acquisition site should start acquiring
data as soon as power is restored, rather than enter some sort of setup mode.
There may be no user available to press the "start" key.
Can your critical
equipment come back up without human intervention? If this is an important
design criterion, be sure the code recognizes that the unit was at one point
alive. If important variables are protected in battery-backed RAM, then in
most cases it's easy to resume operation automatically. Be sure to maintain
a checksum of the really important parameters so the code knows if the machine's
data is intact.
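A minimal sketch of the checksum idea, assuming a small block of parameters kept in battery-backed RAM. The field names are invented for illustration, and the simple additive checksum is only the easiest thing to show; a CRC would catch more kinds of corruption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameters that live in battery-backed RAM. */
typedef struct {
    uint16_t course_deg;       /* course the user last entered */
    uint32_t log_distance_m;   /* distance accumulated so far */
    uint8_t  was_running;      /* nonzero if the unit was operating */
    uint16_t checksum;         /* covers the three fields above */
} saved_state_t;

/* Add the low nbytes bytes of value into a running 16-bit sum. */
static uint16_t add_bytes(uint16_t sum, uint32_t value, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        sum = (uint16_t)(sum + ((value >> (8 * i)) & 0xFFu));
    return sum;
}

/* Checksum computed field by field, so struct padding never matters.
 * The nonzero seed makes freshly cleared (all-zero) RAM fail the check. */
uint16_t state_checksum(const saved_state_t *s)
{
    uint16_t sum = 0xA55Au;
    sum = add_bytes(sum, s->course_deg, 2);
    sum = add_bytes(sum, s->log_distance_m, 4);
    sum = add_bytes(sum, s->was_running, 1);
    return sum;
}

/* Call after every change to the saved parameters. */
void state_save(saved_state_t *s)
{
    s->checksum = state_checksum(s);
}

bool state_is_intact(const saved_state_t *s)
{
    return s->checksum == state_checksum(s);
}

/* At reset: resume on our own only if the data survived and the unit
 * had been running; otherwise fall back to normal setup. */
bool should_resume(const saved_state_t *s)
{
    return state_is_intact(s) && s->was_running != 0;
}
```

At power-up the firmware would call should_resume() before anything else and, if it returns true, pick up where it left off without waiting for a user.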
On our sail we
ran all of the boat's equipment from a pair of 12 volt batteries. Once a day
we'd fire up the diesel to recharge the cells. If we weren't careful to switch
a full battery on-line before cranking the engine, then the tremendous amount
of current needed by the starter motor dragged the entire 12V system down
to 8V or so, which reset every piece of electronics with an embedded computer.
None of the equipment was smart enough to carry on without our help. We would
be forced to reenter a course into the autopilot, restart the radar, etc.
The Loran was especially frustrating, as sometimes we had to re-enter our
initial position. In short, no piece of embedded electronics was smart enough
to remember that it had been on after a brownout (a not unusual condition
on a cruising sailboat).
Especially frustrating
was the digital log, which uses a tiny paddlewheel to track roughly how many
miles we've sailed. Even the shortest power glitch made this unit reset distance
traveled to zero, which really interfered with navigation. I got in the habit
of writing down its distance reading before starting the engine, and then
manually accumulating these offsets. Someday I'll put a diode and big capacitor
in its power line, but a little better design would have given this vendor
a much happier customer.
An old business
adage advises one to "stick to one's knitting" - develop
and sell products to markets you truly understand. If you don't comprehend
what your user expects, and haven't got lots of experience operating in the
industry, then you cannot make a product that will really satisfy him. Make
sure your widget is designed to satisfy the user's real needs, in his real
operating environment.
Non-embedded Blues
Despite various
frustrations with the boat's processors, the only true tragedy struck our
single non-embedded system, a DOS laptop that suffered a fatal attack of salt
water corrosion early on. After years of leaning on a word processor crutch,
it was a shock to revert to archaic pen and paper. They say Mozart wrote his
music essentially without corrections - I wonder how true this would have
been if he worked on a MIDI machine with graphic editing.
But wait! The
white whale's to windward! No more of this dull plodding - helm alee!