Reliably detects hardware failures in
safety-critical applications.
Our vision is to enable the use of cost-effective but unreliable
commodity hardware in safety-critical systems. To achieve this vision,
we extend the limited failure detection capabilities of commodity hardware
with the help of software. On top of this more sophisticated
failure detection, system architects can apply well-known fault-tolerance
approaches to mask silent data corruptions (SDCs). Our approach works well with retries,
fail-over, and graceful degradation.
The SIListra technology developed in the Systems
Engineering research group uses arithmetic codes to
recognize erroneous program executions. For detecting
errors, processed data is encoded using an arithmetic code.
These codes facilitate detection of errors during data
storage, transport, and processing.
Software Coded Processing
The goal of Software Coded Processing is to enable applications to
detect execution errors (such as bit flips in memory or in the CPU).
To achieve this goal, a transformer tool (for instance our
SIListra Safety Transformer)
generates a Coded Program from an original program.
The Coded Program (see graphics below) behaves just like the original program.
It consumes input data and produces output data.
In addition to the output, however, it produces Signatures.
The Signatures have two important properties:
- They only depend on execution errors,
but not on the input.
- They can be pre-calculated (e.g., by the transformer)
for the case that no execution errors happen.
To detect execution errors, a verifier or watchdog only needs to
compare the pre-calculated Signatures with the actual
Signatures produced by the Coded Program.
If these Signatures are not equal, an execution error has occurred
and the output cannot be trusted.
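A minimal sketch of the verifier's comparison, assuming Signatures are plain 64-bit integers handed over as arrays. The type signature_t and the function signatures_match are hypothetical names chosen for illustration; the source does not specify the actual Signature format or the verifier interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical Signature representation (an assumption, not SIListra's format). */
typedef uint64_t signature_t;

/* Returns true only if every Signature produced by the Coded Program
 * equals its pre-calculated counterpart. */
bool signatures_match(const signature_t *expected,
                      const signature_t *actual,
                      size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (expected[i] != actual[i]) {
            return false; /* execution error: the output must not be trusted */
        }
    }
    return true;
}
```

Because the expected values do not depend on the input, such a verifier needs no knowledge of the application logic and could run on a separate, simpler component.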
The reaction to a detected execution error is highly application
dependent. Some examples:
- retry (if time constraints allow it)
- go into a safe state (if such a safe state exists)
- fail over to a backup system (if fail-operational)
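One possible combination of these reactions, retrying a bounded number of times and then entering a safe state, might look as follows. This is only a sketch: reaction_t, coded_run_fn, run_with_retry, and the stub flaky_run are hypothetical names, and the callback is assumed to run the Coded Program once and report whether its Signatures matched.

```c
#include <stdbool.h>

/* Outcome of the reaction policy (hypothetical names). */
typedef enum { REACTION_CONTINUE, REACTION_SAFE_STATE } reaction_t;

/* Assumed callback: executes the Coded Program once and returns true
 * if the produced Signatures matched the pre-calculated ones. */
typedef bool (*coded_run_fn)(void *ctx);

/* Retries up to max_attempts times; falls back to the safe state
 * if no attempt passes the Signature check. */
reaction_t run_with_retry(coded_run_fn run, void *ctx, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (run(ctx)) {
            return REACTION_CONTINUE; /* output can be used */
        }
    }
    return REACTION_SAFE_STATE; /* retry budget exhausted */
}

/* Illustration stub: fails as many times as the counter in ctx says,
 * then succeeds. Stands in for a transient hardware fault. */
bool flaky_run(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        --*remaining_failures;
        return false;
    }
    return true;
}
```

The retry budget would have to be derived from the application's deadline; a fail-operational system would switch to the backup instead of returning REACTION_SAFE_STATE.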
The main advantage is that less redundancy is needed.
To detect one execution error, a single channel is sufficient (and not
two as in traditional approaches).
To mask one execution error, two channels are sufficient (in comparison
to three with triple modular redundancy).
Arithmetic Codes are the basis that our Transformer uses to transform
the original program into a Coded Program and to pre-calculate
the Signatures.
Arithmetic Codes
Arithmetic Codes add redundancy to every value that is part of an application’s state, i.e., to all data words.
This redundancy enables us to detect erroneous executions.
It partitions the domain of possible data words into valid and invalid data words.
Valid data words are only a small subset of all possible data words.
An application encoded by our SIListra Safety Transformer
solely processes encoded data, i.e., all inputs
are valid code words of an arithmetic code and all computations use and produce encoded
data. Thus, we have to use solely operations that preserve the code in the error-free case.
The figure shows the relation between valid data words – also called valid code words – and
all possible data words. Correctly executed arithmetic operations preserve the arithmetic
code used for adding the redundancy. Thus, given valid code words as input, the output
of a correctly executed arithmetic operation is also a valid code word. A faulty arithmetic
operation or an operation called with non-code words produces a result that is an invalid
code word with high probability. Furthermore, arithmetic codes also detect errors
modifying data during storage or transport because such errors most likely transform a
valid code word into a non-code word.
AN Codes
For an AN-code, the encoded version xc of a variable x is obtained by multiplying
its original functional value xf by a constant A. To check the code, we compute
xc modulo A, which is zero for a valid code word. As described above, all variables
in a program are replaced with their encoded versions, i.e., multiples of A.
An AN-code can detect faulty operations, i.e., incorrectly executed operations, and
modified operands, i.e., data that is for example hit by a bit flip. These errors are detected
because they result with high probability in a data word that is not a multiple of A.
The probability that such an error results in a valid code word is approximately 1/A.
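The encode, check, and decode steps can be sketched in C as follows. The constant AN_A = 58659 is an illustrative choice, not necessarily the value SIListra uses, and the helper names an_encode, an_is_valid, an_decode, and flip_bit are assumptions. Encoded values are held in 64 bits so that a 32-bit functional value multiplied by A cannot overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative odd constant A (an assumption, not taken from the source). */
#define AN_A INT64_C(58659)

/* Encoded integer type: wide enough for any int32_t multiplied by AN_A. */
typedef int64_t intc;

/* xc = A * xf */
intc an_encode(int32_t xf) { return (intc)xf * AN_A; }

/* A valid code word is a multiple of A. */
bool an_is_valid(intc xc) { return xc % AN_A == 0; }

/* xf = xc / A; only meaningful if an_is_valid(xc) holds. */
int32_t an_decode(intc xc) { return (int32_t)(xc / AN_A); }

/* Fault injection helper: flips one bit (bit < 63) of an encoded word. */
intc flip_bit(intc xc, unsigned bit)
{
    return (intc)((uint64_t)xc ^ (UINT64_C(1) << bit));
}
```

A single bit flip changes the word by plus or minus a power of two, which is never a multiple of an odd A greater than 1, so an_is_valid rejects every such corruption. The 1/A estimate applies to errors that alter the word in an arbitrary way.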
Consider the following unencoded C code:
int calc(int x, int y, int z) {
    int u = x + y;
    int v = z - u;
    return v;
}
Its AN-encoded version uses solely AN-encoded data:
intc calc_encoded(intc xc, intc yc, intc zc) {
    intc uc = xc + yc; // uc = A * x + A * y
                       //    = A * (x + y)
    intc vc = zc - uc; // vc = A * z - A * (x + y)
                       //    = A * (z - (x + y))
    return vc;         // expected: vc % A == 0
}
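To show how an encoded function might be called and how a corrupted operand surfaces at the output, here is a self-contained sketch. It mirrors the unencoded calc, i.e., it computes z - (x + y). The constant AN_A, the 64-bit intc typedef, and the wrapper calc_checked are assumptions for illustration; overflow of the functional result beyond 32 bits is not handled.

```c
#include <stdbool.h>
#include <stdint.h>

#define AN_A INT64_C(58659) /* illustrative constant A (an assumption) */
typedef int64_t intc;       /* assumed representation of encoded ints */

/* Encoded counterpart of calc: operates on multiples of A only. */
intc calc_encoded(intc xc, intc yc, intc zc)
{
    intc uc = xc + yc; /* A * (x + y) */
    intc vc = zc - uc; /* A * (z - (x + y)) */
    return vc;
}

/* Hypothetical wrapper: encodes the inputs, runs the encoded function,
 * and decodes the result only if it is a valid code word. */
bool calc_checked(int32_t x, int32_t y, int32_t z, int32_t *out)
{
    intc vc = calc_encoded(x * AN_A, y * AN_A, z * AN_A);
    if (vc % AN_A != 0) {
        return false; /* invalid code word: execution error detected */
    }
    *out = (int32_t)(vc / AN_A);
    return true;
}
```

If one operand is corrupted before the call, for example by flipping bit 2 of xc, the error propagates through the addition and subtraction and the result is no longer a multiple of A, so the final check fails.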
More information
Whitepaper about technology behind SIListra: Whitepaper PDF