Computing
CRCs in Parallel
Copyright 1991,
Jack G. Ganssle
Abstract
The CRC is the
best way to guage data comm. Here's how to build a parallel hardware CRC generator.
The CRC (Cyclic
Redundancy Check) is a sophisticated checksum that has long been the most
common means of testing data for correctness. Every sector on a disk has a
CRC of the data stored, so the operating system will be aware of data dropouts.
A CRC doesn't identify which byte is in error, but it pretty much guarantees
that you'll be alerted to at least single bit errors.
All CRCs are
binary polynomials that are divided into the data stream. Although the definition
and selection of the CRC is quite complex, its use is not.
The most common
CRC polynomial is the CCITT form used by IBM's SDLC protocol. It is of the
form:
X**16 + X**12
+ X**5 + 1
This means that
the input data stream (say, the data being written to the disk drive) is exclusive
ORed into a 16 bit shift register that has feedback terms at the bit locations
with coefficients in the formula, one bit at a time. Each register bit is
a function of the CRC to that point, the input data, and the feedback taps.
In most systems
data is transferred as a serial bit stream. Floppy and hard disks, as well
as the newer optical disks, all write a single bit at a time to the medium.
Modem applications are also bit oriented. This is fortunate, since the CRC
is particularly well suited to serial data transfers. Serial shift registers,
even with the feedback terms, are easy to implement in silicon. Fairchild's
74F401 chip is a 14 pin package that will compute CRCs on serial data using
any of eight different polynomials. Many floppy disk controllers automatically
append a CRC to the data stream. With this technology CRCs are almost trivial
to add to a system.
If the peripheral
is a parallel device, the problem becomes much more complex. Think about it:
the CRC is computed by eight shifts per byte, each shift resulting in the
exclusive ORing of a number of terms within the byte. After 8 of these operations
the output is far from obvious. How do you compute a CRC on 8 bits at once?
The quest for
speed is bringing more parallel devices into the mainstream. An obvious example
is a large RAM disk. Most implementations make the hardware look just like
a hard disk, so the operating system can handle the device without special
drivers. Other peripherals may be connected using SCSI, which almost always
is designed to emulate a disk. Whenever the operating expects to see a disk,
it will also expect a CRC.
How do you implement
a CRC in a purely parallel interface? An obvious approach is to convert the
data to serial, compute the CRC, and convert it back to parallel. Although
conceptually easy, fast data transfers will require a mind-boggling clock
rate (at least 8 times the data rate), and a lot of hardware.
The new CRC (after
a byte is shifted in) is a function of the input data and the old CRC. Why
not derive formulas for each of the 16 bits of the CRC after the 8 new bits
are shifted in? Then all 16 CRC bits can be computed in a single clock cycle.
Deriving the
formulas is conceptually easy but rather tedious. If the input data is b0,
b1, ... b7, and the current CRC is a0, a1, ... a15, then compute the new CRC
after one bit arrives. Iterate seven more times, using new values of a0, a1,
... a15 each time. You'll get 16 new equations each time a new bit is shifted
in. The last set of 16 is the result; these define the values of each bit
after an entire byte is CRCed. Apply these 16 equations to a byte of data
to compute a new CRC in a single cycle.
It turns out
that a number of terms are common to many of the 16 equations. To simplify
the algorithm the common terms are identified as I, J, K, L, M, N, O, and
P. Figure 1 shows a program written in Turbo-C to compute the CRC using the
derived formulas.
This program
looks horrible! It is far more cumbersome and complicated than a loop to do
the normal serial CRC calculation. Why go through all of this grief for so
little reward?
The form of the
program closely resembles that of the equations needed to define a Programmable
Logic Device (PLD). In other words, there is a very close equivalence between
the code and a hardware implementation of the algorithm. Fortunately, the
PLD version can operate in a single clock cycle, and is much more ascetically
appealing than its awkward software cousin. The PLD
For the uninitiated,
a PLD is a sort of "super PAL". Based on EPROM technology, the PLD
is a device that can be programmed with a user's equations. Like an EPROM,
a fairly simple device programmer is used to load the formulas.
PLDs are defined
in terms of "macrocells". Every macrocell is a multiple-input OR
gate optionally connected to a flip-flop. Each of the OR gate inputs is an
extremely wide AND gate; any or all of the PLD pins (and their inversions)
can be connected to the AND gates.
You can specify
the type of flip-flop; typically D, T, JK, and SR versions are all available.
Or, you can disable the register altogether, so the macrocell becomes a big
combinatorial gate structure.
A PLD is therefore
a very large collection of AND and OR gates with some internal registers also
thrown in. The connection of these components is up to the user; you write
a set of boolean equations that makes the PLD implement the a circuit you
need.
Intel's 5C090
(equivalent to Altera's EP900) is a 40 pin PLD with 24 macrocells. It contains
enough logic to completely implement the CRC algorithm in parallel. Putting
the CRC in a PLD
Examining the
program, it quickly becomes apparent that the intermediate I through P terms
must be computed before any of the a0 - a15 outputs, but once I-P are known,
then all 16 a0 - a15 terms could be computed simultaneously. We should therefore
assign the I - P terms to combinatorial outputs in the PLD. The a0 to a15
terms (the CRC itself) are assigned to registered outputs. In this case the
current CRC is needed as part of the new one; therefore the flip-flops' output
will be fed back into the equation matrix.
It's pretty easy
to translate the Turbo-C program into PLD equations. This was my intention
when writing the code; the real purpose of the program is to provide a simulation
of the CRC function. Figure 2 shows the file that defines the PLD.
For those unfamiliar
with PLD design, the NETWORK section of the PLD file defines the nature of
each of the device's pins. b0 to b7 are input bits. I to P are combinatorial
outputs. The a0 to a15 (CRC) terms are defined as RORF (Registered Output,
Registered Feedback). The terms are latched in D flip-flops, but the current
value (before the device is clocked) goes back into the equation matrix.
Figure 3 shows
the connection of the PLD to a computer's data bus. Most data communications
processors manipulate 8 bit data, so an 8 bit data bus is shown. The input
data is a byte, but the output CRC is 16 bits. This PLD is designed to let
us multiplex the CRC onto the bus as two individual bytes. The input bits
come from the same data bus. b0 is thus tied to a0 and a8, b1 to a1 and a9,
etc.
The LOCRC and
HICRC inputs are used to dump the upper or lower CRC byte onto the bus. Once
the calculation is complete, the processor asserts these inputs low (one at
a time) and reads the two bytes. Typically these inputs are connected as input
port selects.
The PRESET input
is used to initialize the CRC before any data is transferred. The CCITT specification
requires that the CRC be initialized to all ones. When asserted, PRESET will
load a one into all of the a0-a15 terms after the next clock cycle.
Be warned! Your
interface circuit must assert clock when preset is also low. Clock is high-edge
triggered.
To use the device,
first initialize the CRC by asserting PRESET low and driving clock high. Then
transfer data one byte at a time. For each byte, drive clock high once the
input data is stable, and after the data has been present long enough to let
the I to P terms propagate (typically 100 nsec). The PLD will compute a new
value for the CRC and load it in the a0-a15 registers when clock goes high.
After a block is transferred, the CRC can then be read by the computer.
This is a pretty
complex series of equations. How can you be sure the circuit works correctly?
One way is to compare the CRC to known good values after specific data is
transferred. Table 1 shows CRC values for an input data stream of successive
zeroes. You can check the CRC after each clock against this table.
The CCITT polynomial
has a self-checking property. Once a stream of data has gone through the CRC
calculator, you can supply the one's complement of the resulting CRC to the
PLD and always get F0B8 as the new value. Always. This serves as a sanity
check when developing hardware, and as a quick way of verifying data against
a previously computed CRC during read of a device. Be sure to transfer the
low part of the CRC first, and then the high byte when making this test. Summary
The PLD lets
you compute a CRC in parallel using only a single IC package. It can support
a data rate of at least 10 Mhz, even using relatively slow devices. Although
the equations are messy and tedious, you can use the ones presented here as
a "cookbook" solution.
