
Reliability, availability and serviceability


Reliability, availability and serviceability (RAS) is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

Computers designed with higher levels of RAS have many features that protect data integrity and help them stay available for long periods without failure. This data integrity and uptime are a particular selling point for mainframes and fault-tolerant systems.

While RAS originated as a hardware-oriented term, systems thinking has extended the concept of reliability-availability-serviceability to systems in general, including software.

Note the distinction between reliability and availability: reliability measures the ability of a system to function correctly, including avoiding data corruption, whereas availability measures how often the system is available for use, even though it may not be functioning correctly. For example, a server may run forever and so have ideal availability, but may be unreliable, with frequent data corruption.
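
The distinction can be made concrete with a small sketch. The `ServiceLog` record, its field names, and the sample figures are hypothetical, and `correctness_rate` is only a simplified stand-in for reliability (which is more formally a probability of failure-free operation over time).

```python
from dataclasses import dataclass


@dataclass
class ServiceLog:
    """Hypothetical observation window for one server (illustrative only)."""
    uptime_hours: float
    downtime_hours: float
    requests_served: int
    requests_corrupted: int


def availability(log: ServiceLog) -> float:
    """Fraction of the window during which the system was usable."""
    total = log.uptime_hours + log.downtime_hours
    return log.uptime_hours / total


def correctness_rate(log: ServiceLog) -> float:
    """Fraction of served requests that returned uncorrupted data.

    A simplified proxy for reliability, assumed for this example.
    """
    return 1 - log.requests_corrupted / log.requests_served


# A server that never goes down but corrupts 5% of its responses:
# ideal availability, poor reliability.
always_up = ServiceLog(uptime_hours=720, downtime_hours=0,
                       requests_served=10_000, requests_corrupted=500)

print(availability(always_up))
print(correctness_rate(always_up))
```

In this made-up window the server is usable the whole time, yet one response in twenty is corrupted, which is the situation the paragraph above describes.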

Physical faults can be temporary (transient or intermittent) or permanent.

Transient and intermittent faults can typically be handled by detection and correction, e.g., by error-correcting codes (ECC) or instruction replay (see below). Permanent faults lead to uncorrectable errors, which can be handled by switching to duplicate hardware, e.g., processor sparing, or by passing the uncorrectable error to high-level recovery mechanisms. A successfully corrected intermittent fault can also be reported to the operating system (OS) to provide information for predictive failure analysis.
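
As an illustration of detection and correction, the following is a minimal sketch of a Hamming(7,4) code, a textbook single-error-correcting code. It is not the scheme used by any particular memory controller (real ECC memory typically uses wider codes), and the function names `encode` and `decode` are chosen here for illustration. The decoder also returns the position of the corrected bit, which is the kind of information that could be passed on for predictive failure analysis.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Bit positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 if no error was detected, otherwise the
    1-based position of the single bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], syndrome


word = encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a transient single-bit fault at position 5
data, position = decode(word)
print(data, position)
```

A code of this kind corrects any single flipped bit per codeword. Two flipped bits in the same codeword would be miscorrected, which is why practical designs add an extra parity bit to detect double errors.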

Example hardware features for improving RAS include the following, listed by subsystem:


...
Wikipedia

...