Fault-tolerant computer systems

Fault-tolerant computer systems are systems designed around the concepts of fault tolerance. In essence, they must be able to continue working to a level of satisfaction in the presence of faults.

Fault tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links which are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity, and only reduce throughput by a proportional amount.

Recovery from errors in fault-tolerant systems can be characterised as either 'roll-forward' or 'roll-back'. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it, to be able to move forward. Roll-back recovery reverts the system state back to some earlier, correct version, for example using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or different parts of one error.

Most fault-tolerant computer systems are designed to handle several possible failures, including hardware-related faults such as hard disk failures, input or output device failures, or other temporary or permanent failures; software bugs and errors; interface errors between the hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences or installing unexpected software and physical damage or other flaws introduced to the system from an outside source.

Hardware fault-tolerance is the most common application of these systems, designed to prevent failures due to hardware components. Most basically, this is provided by redundancy, particularly dual modular redundancy. Typically, components have multiple backups and are separated into smaller "segments" that act to contain a fault, and extra redundancy is built into all physical connectors, power supplies, fans, etc. There are special software and instrumentation packages designed to detect failures, such as fault masking, which is a way to ignore faults by seamlessly preparing a backup component to execute something as soon as the instruction is sent, using a sort of voting protocol where if the main and backups don't give the same results, the flawed output is ignored.

...
Wikipedia