Reliably detects hardware failures in safety-critical applications.

Our vision is to enable the use of cost-effective but unreliable commodity hardware in safety-critical systems. To achieve this, we extend the limited failure detection capabilities of commodity hardware in software. On top of this more sophisticated failure detection, system architects can apply well-known fault-tolerance approaches to mask silent data corruptions (SDCs). Our approach works well with retries, fail-over, and graceful degradation.
The SIListra technology, developed in the Systems Engineering research group, uses arithmetic codes to recognize erroneous program executions. To detect errors, all data the program processes is encoded using an arithmetic code. These codes facilitate the detection of errors during data storage, transport, and processing.

Software Coded Processing

The goal of Software Coded Processing is to enable applications to detect execution errors (such as bit flips in memory or in the CPU). To achieve this goal, a transformer tool (for instance our SIListra Safety Transformer) generates a Coded Program from an original program. The Coded Program (see the figure below) behaves just like the original program: it consumes input data and produces output data. In addition to the output, however, it produces Signatures:

[Figure: Software Coded Processing]

The Signatures have two important properties:

  1. They depend only on execution errors, not on the input.
  2. They can be pre-calculated (e.g., by the transformer) for the error-free case.

To detect execution errors, a verifier or watchdog only has to compare the pre-calculated Signatures with the actual Signatures produced by the Coded Program. If they differ, an execution error has occurred and the output cannot be trusted.
The reaction to a detected execution error is highly application dependent. Some examples:

  • retry (if time constraints allow it)
  • go into a safe state (if such a safe state exists)
  • fail over to a backup system (if fail-operational)
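
The check and the reactions listed above could be wired together roughly as follows. This is only a minimal sketch: the function-pointer interface, the names `run_checked`, `run_fault_free`, and `run_faulty`, and the constant `EXPECTED_SIGNATURE` are hypothetical and not part of SIListra; the two stub programs merely stand in for a Coded Program with and without injected faults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the SIListra API. */
typedef enum { RESULT_OK, RESULT_SAFE_STATE } result_t;

/* A coded program run: writes its output and returns the signature it produced. */
typedef uint32_t (*coded_run_fn)(int32_t input, int32_t *output);

/* Stand-in for the signature pre-calculated for the error-free case. */
#define EXPECTED_SIGNATURE 0xC0DEu

/* Stub: a fault-free run always yields the expected signature. */
uint32_t run_fault_free(int32_t input, int32_t *output) {
    *output = input * 2;
    return EXPECTED_SIGNATURE;
}

/* Stub: the first `injected_faults` runs yield a wrong output and signature. */
unsigned injected_faults = 0;
uint32_t run_faulty(int32_t input, int32_t *output) {
    if (injected_faults > 0) {
        --injected_faults;
        *output = input * 2 + 1;
        return EXPECTED_SIGNATURE ^ 1u;
    }
    return run_fault_free(input, output);
}

/* Verifier: accept the output only if the signatures match.
 * Retry up to max_retries times, then request the safe state. */
result_t run_checked(coded_run_fn run, int32_t input, uint32_t expected_sig,
                     unsigned max_retries, int32_t *output) {
    assert(run != NULL && output != NULL);
    for (unsigned attempt = 0; attempt <= max_retries; ++attempt) {
        int32_t candidate = 0;
        uint32_t actual_sig = run(input, &candidate);
        if (actual_sig == expected_sig) {
            *output = candidate;   /* signature matches: output is trusted */
            return RESULT_OK;
        }
        /* signature mismatch: discard the candidate and retry */
    }
    return RESULT_SAFE_STATE;      /* retries exhausted: output stays untouched */
}
```

Note that the verifier never inspects the output itself; it only compares signatures, which is what lets a single channel detect the error. A fail-over variant would replace the final `RESULT_SAFE_STATE` with a hand-off to a backup system.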

The main advantage is reduced redundancy. A single channel is sufficient to detect one execution error (traditional approaches need two), and two channels are sufficient to mask one execution error (triple modular redundancy needs three).
Arithmetic Codes are the basis on which our Transformer transforms the original program into a Coded Program and pre-calculates the Signatures.


Arithmetic Codes

Arithmetic Codes add redundancy to every value that is part of an application’s state, i.e., to all data words. This redundancy divides the domain of possible data words into valid and invalid data words, where the valid data words are only a small subset of all possible data words. This is what enables us to detect erroneous executions.
An application encoded by our SIListra Safety Transformer processes encoded data only, i.e., all inputs are valid code words of an arithmetic code and all computations consume and produce encoded data. Consequently, we may only use operations that preserve the code in the error-free case.

[Figure: Arithmetic Codes overview]

The figure shows the relation between valid data words (also called valid code words) and all possible data words. Correctly executed arithmetic operations preserve the arithmetic code used for adding the redundancy. Thus, given valid code words as input, the output of a correctly executed arithmetic operation is also a valid code word. A faulty arithmetic operation, or an operation called with non-code words, produces a non-code word with high probability. Furthermore, arithmetic codes also detect errors that modify data during storage or transport, because such errors most likely transform a valid code word into a non-code word.


AN Codes

For an AN-code, the encoded version xc of a variable x is obtained by multiplying its original functional value xf with a constant A, i.e., xc = A * xf. To check the code, we compute xc modulo A, which is zero for a valid code word. As described above, all variables in a program are replaced with their encoded versions, i.e., multiples of A. An AN-code can detect faulty operations, i.e., incorrectly executed operations, and modified operands, i.e., data that has been hit by a bit flip, for example. These errors are detected because, with high probability, they result in a data word that is not a multiple of A. The probability that such an error results in a valid code word is approximately 1/A.
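
To make encoding, checking, and decoding concrete, here is a minimal sketch in C. The helper names and the value of A are assumptions made for illustration: the text does not name a concrete A, so an arbitrary odd constant is used, and code words are stored as unsigned 64-bit integers so that A times a 32-bit value cannot overflow. The last two helpers illustrate two claims from the text: corrupted words are very likely not multiples of A (for a single bit flip this is certain when A is odd and greater than 1, because the flip changes the value by a power of two), and valid code words are only a small fraction, roughly 1/A, of all data words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: an arbitrary odd constant chosen for this sketch;
 * the source text does not specify a value for A. */
#define AN_A UINT64_C(58659)

/* Encode: xc = A * xf. A 32-bit value times A (< 2^32) fits in 64 bits. */
uint64_t an_encode(uint32_t xf) {
    return AN_A * (uint64_t)xf;
}

/* Check: valid code words are exactly the multiples of A. */
bool an_is_valid(uint64_t xc) {
    return xc % AN_A == 0;
}

/* Decode: recover the functional value from a valid code word. */
uint32_t an_decode(uint64_t xc) {
    assert(an_is_valid(xc));   /* never decode an unchecked word */
    return (uint32_t)(xc / AN_A);
}

/* Flip each of the 64 bits of xc in turn and count the flips
 * that still yield a multiple of A, i.e., that go undetected. */
unsigned an_undetected_single_bit_flips(uint64_t xc) {
    unsigned undetected = 0;
    for (unsigned bit = 0; bit < 64; ++bit) {
        uint64_t corrupted = xc ^ (UINT64_C(1) << bit);
        if (an_is_valid(corrupted)) {
            ++undetected;
        }
    }
    return undetected;
}

/* Count the valid code words among the data words 0 .. limit-1. */
uint64_t an_count_valid_words(uint64_t limit) {
    uint64_t valid = 0;
    for (uint64_t word = 0; word < limit; ++word) {
        if (an_is_valid(word)) {
            ++valid;
        }
    }
    return valid;
}
```

With this A, only 18 of the first 1,000,000 data words are valid code words, which is close to 1,000,000 / A. Multi-bit errors can still map one code word onto another, which is where the residual probability of about 1/A comes from.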

Consider the following unencoded C code:

int calc(int x, int y, int z) {
    int u = x + y;
    int v = z - u;
    return v;
}
        

Its AN-encoded version uses solely AN-encoded data:

intc calc_encoded(intc xc, intc yc, intc zc) {
    intc uc = xc + yc;   // uc = A * x + A * y
                        //    = A * (x + y)
    intc vc = zc - uc;   // vc = A * z - A * (x + y)
                        //    = A * (z - (x + y))
    return vc;          // expected: vc % A == 0
}
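
The encoded function can be exercised with a few helper functions, sketched below under stated assumptions: `intc` is taken to be a 64-bit signed integer, A is an arbitrary odd constant, and `encode`, `is_valid`, `decode`, and `flip_bit` are hypothetical helpers that are not part of the original listing. `calc_encoded` is repeated, computing `z - (x + y)` like the unencoded `calc`, so that the block is self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for this sketch: intc is a 64-bit signed integer and
 * A is an arbitrary odd constant; neither is fixed by the text. */
typedef int64_t intc;
static const intc A = 58659;

/* xc = A * xf; cannot overflow 64 bits for 32-bit functional values. */
intc encode(int32_t xf) {
    return A * (intc)xf;
}

/* A code word is valid if and only if it is a multiple of A. */
bool is_valid(intc xc) {
    return xc % A == 0;
}

/* Recover the functional value from a valid code word. */
int64_t decode(intc xc) {
    assert(is_valid(xc));   /* never decode an unchecked word */
    return xc / A;
}

/* AN-encoded version of calc: computes A * (z - (x + y)). */
intc calc_encoded(intc xc, intc yc, intc zc) {
    intc uc = xc + yc;
    intc vc = zc - uc;
    return vc;
}

/* Simulate a bit flip in a stored operand (bits 0..62 only,
 * to stay clear of the sign bit). */
intc flip_bit(intc xc, unsigned bit) {
    assert(bit < 63);
    return xc ^ (intc)(INT64_C(1) << bit);
}
```

For example, `decode(calc_encoded(encode(3), encode(4), encode(10)))` yields 3, while `calc_encoded(flip_bit(encode(3), 5), encode(4), encode(10))` is no longer a multiple of A, so `is_valid` rejects it. The check is deliberately left to the caller: the corrupted operand propagates through both operations and is still detectable at the output.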
        

More information

Whitepaper about the technology behind SIListra: Whitepaper PDF