Thanks for the Memories
Copyright 1995, Jack G. Ganssle
Abstract
Here's some advice about testing RAM and ROMs in your embedded system.
Published in Embedded Systems Programming, August, 1995
It doesn't take
much to make at least the kernel of an embedded system run. With a working
CPU chip, memories that do their thing, perhaps
a dash of decoder logic, you can count on the code starting off... perhaps
not crashing until running into a problem with I/O.
Though the kernel
may be relatively simple, with the exception of the system's power supply
it's by far the most intolerant portion of an
embedded system to any sort of failure. The tiniest glitch, a single bit failure
in a huge memory array, or any problem with the processor
pretty much guarantees that nothing in the system stands a chance of running.
Non-kernel failures
may not be so devastating. Some I/O troubles will cause just part of the system
to degrade, leaving much of the rest
up. My car's black box seems to have forgotten how to run the cruise control,
yet it still keeps the fuel injection and other systems
running.
In the minicomputer
era most booted with a CPU test that checked each instruction. That level
of paranoia is no longer appropriate, as a
highly integrated CPU will generally fail disastrously. If the processor can
execute any sort of a self test, it's pretty much guaranteed
to be intact.
Dead decoder
logic is just as catastrophic. No code will execute if the ROMs can't be selected.
A smart technician
can spot a dead decoder in a heartbeat using not much more than a scope. He
can make a pretty good guess that the
processor is history by looking for bizarre outputs (no clock-out; no read/write;
tristated address lines right after reset), or by
"shotgunning"; replacing the chip with a known-good one
and seeing if the problems disappear.
Large memory
arrays, though, can suffer from partial failures that are just about impossible
to troubleshoot. A defective RAM is tough to
find by any method other than shotgunning. A handful of bad locations in ROM
are equally difficult to detect.
Lots of designers
realize that memories are a potential source of trouble, so include diagnostics
in the firmware. Good idea! Given that
there's no realistic way for a technician to find a memory problem, a little
software designed to pick up these will sure make you friends
in the test department.
Testing ROM
If your boot
ROM is totally misprogrammed or otherwise non-functional, then there's no
way a ROM test will do anything other than crash.
The value of a ROM test is limited to dealing with partially programmed devices
(due, perhaps, to incomplete erasure, or inadvertently
removing the device before completion of programming).
There's a small
chance that ROM tests will pick up an addressing problem, if you're lucky
enough to have a failure that leaves the boot
and ROM test working. The odds are against it, and somehow Mother Nature tends
to be very perverse.
Some developers
feel that a ROM checksum makes sense to ensure the correct device is inserted.
This works best only if the checksum is
stored outside of the ROM under test. Otherwise, inserting a device with the
wrong code version will not show an error, as presumably the
code will match the (also obsolete) checksum.
In multiple-ROM
systems a checksum test can indeed detect misprogrammed devices, assuming
the test code lives in the boot ROM. If this one
device functions, and you write the code so that it runs without relying on
any other ROM, then the test will pick up many errors.
Checksums, though,
are passé. It's pretty easy for a couple of errors to cancel each
other out. Compute a CRC (Cyclic Redundancy
Check), a polynomial with terms fed back at various stages. CRCs are notoriously
misunderstood but are really quite easy to implement. The
best reference I have seen to date is "A Painless Guide to CRC Error
Detection Algorithms", by Ross Williams. It's available via
anonymous FTP from ftp.adelaide.edu.au/pub/rocksoft/crc_v3.txt.
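As a rough illustration of the difference, here is a minimal C sketch that puts a plain 8 bit additive checksum next to a bitwise CRC-16. The function names are made up for this example, and the CRC parameters (polynomial 0x1021, initial value 0xFFFF, no bit reflection) are just one common choice, not something the article prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain additive checksum: the sum of all bytes, modulo 256. */
uint8_t rom_checksum8(const uint8_t *rom, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + rom[i]);
    return sum;
}

/* Bitwise CRC-16: polynomial 0x1021, initial value 0xFFFF, MSB first.
 * These parameters are an assumption chosen for the example. */
uint16_t rom_crc16(const uint8_t *rom, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)rom[i] << 8);   /* fold in next byte */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

Feed both functions the bytes 0x10, 0x20, 0x30, 0x40 and then 0x11, 0x1F, 0x30, 0x40: one byte went up by one and its neighbor went down by one, so the additive checksum is identical for both, while the CRC values differ.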
It's not a bad
idea to add death traps to your ROM. On a Z80, 0xff is RST 38h, a one-byte call to location 0x0038.
Conveniently, unprogrammed areas of ROMs are
usually just this value. Tell your linker to set all unused areas to 0xff;
then, if an address problem shows up, the system will generate
lots of spurious calls. Sure, it'll trash the stack, but since the system
is seriously dead anyway, who cares? Technicians can see the
characteristic double write from the call, and can infer pretty quickly that
the ROM is not working.
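Most linkers or ROM-image tools can do the fill for you. If yours can't, the idea is simple enough to sketch in C as a host-side build step. The function name and the Z80_RST38 macro below are hypothetical, made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define Z80_RST38 0xFF  /* Z80 opcode RST 38h: one-byte call to 0x0038 */

/* Build a ROM image: fill the whole device with the trap opcode, then
 * copy the real code over the start of it.
 * Returns 0 on success, -1 if the code does not fit. */
int build_rom_image(uint8_t *image, size_t image_len,
                    const uint8_t *code, size_t code_len)
{
    if (code_len > image_len)
        return -1;
    memset(image, Z80_RST38, image_len);   /* every unused byte is a trap */
    memcpy(image, code, code_len);
    return 0;
}
```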
Other CPUs have
similar instructions. Browse the op code list with a creative mind.
Testing RAM
The days of erratic
single bit RAM failures are thankfully gone. Once DRAMs were subject to cosmic
ray and even alpha particle problems,
so designers came up with exhaustive tests that ensured no bit could interact
with any other bit within each chip.
New packaging
materials cured these problems once the chip vendors discovered that the plastic
material used to encapsulate the silicon
was one of the biggest sources of alpha particles. Now it seems most RAM failures
stem from good old-fashioned electrical and logic
problems.
RAMs fail outright,
just as any other part does. Rarely is a single bit bad; generally the entire
device, or at least some number of rows
or columns, die. (All memories are organized as matrices; each row and column
includes a driver and a sense amplifier that converts the
minuscule voltage from memory cells into conventional logic signals. These
amplifiers do fail, causing complete loss of data from that row
or column.)
Decoders die,
preventing the selection of entire RAM devices. Address and data lines may
not make it to the chips, or the write signal may
just peter out on its way across the board.
All of these
problems result in fairly massive access problems. An effective RAM test need
not check every possible state of the array, as
long as it tests pretty much every location. This simplification results in
a huge decrease in the time a RAM test will take to run.
Clearly, any
such test cannot require working RAM. In the worst case, where none of the
memory works at all, a test that uses CALLs and
RETURNs will simply crash horribly at the first RETURN. This has several implications:
You cannot code
the test in C. The code produced by your compiler is difficult to control,
and will doubtless use plenty of CALLs,
RETURNs, PUSHes, and POPs. The RAM test code must be very early in the program
- before any more complex activity that requires a
functioning stack. Interrupts always make use of the stack, so be sure these
are disabled! The test itself cannot use subroutines,
variables not in registers, or the stack.
These restrictions
induce many to use only the simplest of tests. It's common to write 0x55 to
each location, read and check the result,
and then repeat the process using 0xaa. These two values are each other's complement,
so at least every bit gets tested.
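For illustration only, here is the logic of that simple test as a C sketch. As noted above, the real power-on version belongs in assembly, working purely from registers; this host-side model just shows the sequence. The function names are invented, and filling the whole array before verifying it (rather than checking each location right after writing it) is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill the array with one pattern, then verify it.
 * Returns the index of the first bad location, or len if all pass. */
size_t ram_test_pattern(volatile uint8_t *ram, size_t len, uint8_t pattern)
{
    for (size_t i = 0; i < len; i++)
        ram[i] = pattern;
    for (size_t i = 0; i < len; i++)
        if (ram[i] != pattern)
            return i;
    return len;
}

/* 0x55 then 0xAA: between them, every bit is driven to both 0 and 1. */
size_t ram_test_55aa(volatile uint8_t *ram, size_t len)
{
    size_t fail = ram_test_pattern(ram, len, 0x55);
    if (fail != len)
        return fail;
    return ram_test_pattern(ram, len, 0xAA);
}
```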
If any or all
of the address lines in the system are hosed this test will pass. Bummer,
that, but since every value in RAM is set to the
same value, you'll never know if you are reading location 0100 instead of
0000.
An alternative
is to follow the 0x55, 0xaa test with something that picks up address problems.
Try writing the low part of the address to
each RAM location, over the entire array, and then reading the memory to test
for correctness. For example, write 00 to 0000, 01 to 0001,
02 to 0002, etc. The address, or at least part of it, is encoded into the
data, so you can be pretty sure that the RAMs decode properly.
On an 8 bit computer
each location is byte-addressable, so at location 0100 the pattern restarts
at 00. That is, the test writes the same
data to 0000, 0100, 0200, etc. Upper address line shorts may not be detected.
Again, add another
test. Write the upper part of the address to each RAM location. 00 goes to
0000 through 00ff. Put 01 in 0100 to 01ff,
and 02 in 0200 to 02ff.
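The two address tests can be sketched together, again as a host-side C model rather than the register-only assembly a real boot test needs. Everything prefixed sim_ is a made-up simulation of a 64k RAM with an optional stuck-low address line, included only so the effect of a fault is visible; ram_test_address is likewise a hypothetical name.

```c
#include <stdint.h>

#define RAM_SIZE 0x10000u            /* 64k array, as in the text */

/* Simulated RAM. Address bits set in sim_stuck_low are forced to 0,
 * modelling an address line shorted to ground. */
static uint8_t  sim_cells[RAM_SIZE];
static uint32_t sim_stuck_low;

static void ram_write(uint32_t addr, uint8_t value)
{
    sim_cells[addr & ~sim_stuck_low & (RAM_SIZE - 1)] = value;
}

static uint8_t ram_read(uint32_t addr)
{
    return sim_cells[addr & ~sim_stuck_low & (RAM_SIZE - 1)];
}

/* shift = 0 stores the low byte of each address in that location;
 * shift = 8 stores the high byte.
 * Returns the first failing address, or RAM_SIZE if all pass. */
uint32_t ram_test_address(unsigned shift)
{
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        ram_write(a, (uint8_t)(a >> shift));
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        if (ram_read(a) != (uint8_t)(a >> shift))
            return a;
    return RAM_SIZE;
}
```

With sim_stuck_low set to 0x0100 (A8 stuck low), ram_test_address(0) passes even though the array is broken, while ram_test_address(8) fails at the very first location. That is why both tests are needed.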
For arrays up
to 64k in length, then, running these four tests ensures that each bit works,
and each cell addresses properly. The code is
quite simple and easily written without using intermediate variables or the
stack. The only downside is that testing large arrays can take
a long time: the code writes to every location 4 times, and then reads each
4 times. Even on a lousy 64k RAM this is a half million
accesses, each one burdened with all of the housekeeping code needed to sequence
the comparisons.
A faster test
will write and read the array just once. Given that we don't expect single
bit errors, there's no need to make sure we put a
0 and a 1 in each location as we did with the 0x55 and 0xaa tests.
A fast test must
send a reasonable set of different values to memory to make sure that the
array is really writable. It must be clever
enough to detect addressing problems, a common source of trouble due to the vast
number of address lines running over the circuit board, and
the likelihood that one or more may be corrupt in some manner.
A very fast,
very simple solution is to create a short string of almost random bytes that
you repeatedly send to the array until all of
memory is written. Then, read the array and compare against the original string.
I use the phrase
"almost random" facetiously, but in fact it little matters
what the string is, as long as it contains a variety
of values. It's best to include the pathological cases, like 00, 0xaa,
0x55, and 0xff. The string is something you pick when writing
the code, so it is truly not random, but other than these four specific values
you fill the rest of it with nearly any set of values,
since we're just checking basic write/read functions (remember: memory tends
to fail in fairly dramatic ways). I like to use very
orthogonal values - those with lots of bits changing between successive string
members - to create big noise spikes on the data lines.
To make sure
this test picks up addressing problems, ensure the string's length is not
a factor of the length of the memory array. In
other words, you don't want the string to be aligned on the same low-order
addresses, which might cause an address error to go undetected.
For 64k of RAM,
a string 257 bytes long is perfect. 257 is prime, and its square is greater
than the size of the RAM array. Each instance
of the string will start on a different low order address.
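A quick check of that arithmetic, where k is simply a label assumed here for the k-th copy of the string, counting from 0:

```latex
\[
  257k \bmod 256 = (256k + k) \bmod 256 = k, \qquad 0 \le k \le 255,
\]
\[
  65536 = 255 \cdot 257 + 1 .
\]
```

So copy k starts at an address whose low byte is k. A 64k array holds 255 complete copies, plus the first byte of one more copy at location 0xFFFF.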
257 has another
special magic: you can include every byte value (00 to 0xff) in the string
without effort. You can skip the actual
creation of a string in ROM by producing the values as needed, incrementing
a counter that overflows at 8 bits.
To summarize
this algorithm: set an 8 bit counter to 0, and the start address to the beginning
of RAM. Write the counter's value to RAM.
Increment it, and repeat until 257 locations have been written. Now reset the counter
to 0 and iterate until all of RAM is done.
Reset the counter
to 0 and the address to the start of RAM and repeat, this time reading instead
of writing, checking each memory location
against the counter value.
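The two passes just described can be sketched in C as follows. This is only a host-side model of the logic (a real power-on version would be hand-written assembly that keeps the counter and address in registers). The sim_ items are an invented simulation of a 64k RAM with an optional stuck-low address line, and ram_test_257 is a hypothetical name.

```c
#include <stdint.h>

#define RAM_SIZE   0x10000u          /* 64k array, as in the text */
#define STRING_LEN 257u              /* prime length of the test string */

/* Simulated RAM. Address bits set in sim_stuck_low are forced to 0. */
static uint8_t  sim_cells[RAM_SIZE];
static uint32_t sim_stuck_low;

static void ram_write(uint32_t addr, uint8_t value)
{
    sim_cells[addr & ~sim_stuck_low & (RAM_SIZE - 1)] = value;
}

static uint8_t ram_read(uint32_t addr)
{
    return sim_cells[addr & ~sim_stuck_low & (RAM_SIZE - 1)];
}

/* Returns the first failing address, or RAM_SIZE if all pass. */
uint32_t ram_test_257(void)
{
    uint8_t  counter = 0;            /* overflows naturally at 8 bits */
    uint16_t run = 0;                /* position within the current string */

    /* Pass 1: write 0, 1, ..., 255, 0 and then start over, repeatedly. */
    for (uint32_t a = 0; a < RAM_SIZE; a++) {
        ram_write(a, counter);
        counter++;
        if (++run == STRING_LEN) { run = 0; counter = 0; }
    }

    /* Pass 2: regenerate the same sequence and compare. */
    counter = 0;
    run = 0;
    for (uint32_t a = 0; a < RAM_SIZE; a++) {
        if (ram_read(a) != counter)
            return a;
        counter++;
        if (++run == STRING_LEN) { run = 0; counter = 0; }
    }
    return RAM_SIZE;
}
```

In this model a stuck A8 (sim_stuck_low = 0x0100) is caught at address 0x0001, and a stuck A15 (sim_stuck_low = 0x8000) at address 0x0000, each in a single write pass and a single read pass.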
Some folks skip
the read and compare step, instead reading and checksumming or CRCing the
data. This may be marginally faster, but you
cannot tell where the failure occurred unless you stop CRCing at the end of
each 257 byte block, and make the comparison there.
When speed is
a major concern modify the algorithm by skipping most of memory. Instead of
incrementing the address at each step, add a
small prime number to the address. You'll test a lot less of the RAM so may
potentially miss some failures, but if the prime number
address offset is much smaller than the row and column sizes of the RAM chips
then you'll surely pick up most common problems.
Other Gotchas
DRAMs have memories
rather like mine - after 2 to 4 milliseconds go by they will probably forget
unless external circuitry nudges them
with a gentle reminder. This is known as "refreshing" the
devices, and is a critical part of every DRAM-based circuit extant.
More and more
processors include built-in refresh generators, but plenty of others still
rely on rather complex external circuitry. Any
failure in the refresh system is a disaster.
Any RAM test
should pick up a refresh fault - shouldn't it? After all, it will surely take
a lot longer than 2-4 msec to write out all of
the test values to even a 64k array.
Unfortunately,
refresh is basically the process of cycling address lines to the DRAMs. A
completely dead refresh system won't show up with
the test indicated, since the processor will be merrily cycling address lines
like crazy as it writes and reads the devices. There's no
chance the test will find the problem. This is the worst possible situation:
the process of running the test camouflages the failure!
The solution
is simple: after writing to all of memory, just stop toggling those pesky
address lines for a while. Run a tight do-nothing
loop for a while (very tight.... the more instructions you execute per iteration,
the more address lines will toggle), and only then do
the read test. Reads will fail if the refresh logic isn't doing its thing.
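The write, wait, read sequence looks roughly like this in C. It is a sketch under stated assumptions: the idle_fn hook stands in for the tight do-nothing loop (which on real hardware you would write in assembly and size for your clock rate), and the sim_ items exist only so the sketch can be exercised on a host, where "decay" is faked by clearing the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook: wait without touching the DRAM array. */
typedef void (*idle_fn)(void);

/* Write all of memory, sit idle, and only then read it back.
 * Returns the index of the first bad location, or len if all pass. */
size_t dram_refresh_test(volatile uint8_t *ram, size_t len, idle_fn idle)
{
    for (size_t i = 0; i < len; i++)
        ram[i] = (uint8_t)(i % 257);     /* same 257-byte string as above */

    idle();                              /* no RAM accesses during the wait */

    for (size_t i = 0; i < len; i++)
        if (ram[i] != (uint8_t)(i % 257))
            return i;
    return len;
}

/* Host-side simulation only: a small "DRAM" and two idle routines. */
static uint8_t sim_dram[1024];

static void sim_idle_refresh_ok(void)
{
    /* Refresh works: the data survives the wait. */
}

static void sim_idle_refresh_dead(void)
{
    memset(sim_dram, 0, sizeof sim_dram);   /* cells leak away to zero */
}
```

Called with sim_idle_refresh_ok the test passes; with sim_idle_refresh_dead it fails at index 1, the first location whose expected value is not zero.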
Though DRAMs
are typically spec'ed at a 2-4 msec maximum refresh interval, some hold their
data for surprisingly long times. When memories
were smaller and cells larger, each had so much capacitance you could sometimes
go for dozens of seconds without losing a bit. Today's
smaller cells are less tolerant of refresh problems, so a 1 to 2 second delay
is probably adequate.
Capacitance causes
another insidious problem that is easy to deal with: the read that follows
a write to a location that doesn't exist
(perhaps due to a completely dead RAM) will often return correct data! Follow
the algorithm above and write all of memory before starting
the read - capacitance can remember but a single value, not the complex sequence
you've written.
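To see why the ordering matters, here is a small C model of a data bus with no RAM behind it. It is a deliberate simplification, assumed for this sketch: the bus is treated as a single latch that returns whatever byte was driven last. All names are invented for the example.

```c
#include <stdint.h>

/* Simulated dead RAM: nothing stores data, but bus capacitance "remembers"
 * the last byte written, whatever the address. */
static uint8_t sim_bus_latch;

static void dead_write(uint32_t addr, uint8_t value)
{
    (void)addr;
    sim_bus_latch = value;
}

static uint8_t dead_read(uint32_t addr)
{
    (void)addr;
    return sim_bus_latch;
}

/* Naive test: read each location immediately after writing it.
 * Returns the first failing address, or len if all pass. */
uint32_t test_write_then_read(uint32_t len)
{
    for (uint32_t a = 0; a < len; a++) {
        dead_write(a, (uint8_t)(a % 257));
        if (dead_read(a) != (uint8_t)(a % 257))
            return a;
    }
    return len;
}

/* Better test: write everything first, then read everything. */
uint32_t test_write_all_then_read(uint32_t len)
{
    for (uint32_t a = 0; a < len; a++)
        dead_write(a, (uint8_t)(a % 257));
    for (uint32_t a = 0; a < len; a++)
        if (dead_read(a) != (uint8_t)(a % 257))
            return a;
    return len;
}
```

Against this nonexistent memory the naive version reports success for every location, while the write-all-first version fails at address 0.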
Including system
tests is a good idea if, and only if, the test has more meaning than just
adding an "Includes Full Diagnostics"
line to the marketing blurbs. Good algorithms are as easy to implement as
poor ones - just think the failure modes through carefully,
before writing a lot of useless code.