High availability software

High availability software is software used to ensure that systems are running (available) most of the time. High availability is a characteristic of a system, measured as the percentage of time that the system is functioning: availability = (1 - (downtime / total time)) × 100%. Although the minimum required availability varies by task, systems typically attempt to achieve 99.999% (“five nines”) availability, which allows roughly 5.26 minutes of downtime per year. This characteristic is weaker than fault tolerance, which typically seeks to provide 100% availability, albeit with significant price and performance penalties.
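As a worked illustration of the formula above, the following minimal Python sketch computes availability from downtime and, in the other direction, the downtime budget implied by a target. The function names and the choice of a 365-day year are assumptions made for this example and do not come from any particular product.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def availability_percent(downtime, total_time):
    """Return availability as a percentage: (1 - downtime / total_time) * 100."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    if not 0 <= downtime <= total_time:
        raise ValueError("downtime must lie between 0 and total_time")
    return (1 - downtime / total_time) * 100


def allowed_downtime_minutes(target_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget (in minutes) for a target availability."""
    return period_minutes * (1 - target_percent / 100)


if __name__ == "__main__":
    # Print the yearly downtime budget for three, four, and five nines.
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {budget:.2f} minutes per year")
```

Each additional “nine” shrinks the downtime budget by a factor of ten, which is why the step from four to five nines is usually far more expensive than the step from two to three.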

High availability software is measured by its performance when a subsystem fails, its ability to resume service in a state close to the state of the system at the time of the original failure, and its ability to perform other service-affecting tasks (such as software upgrades or configuration changes) in a manner that eliminates or minimizes downtime. All faults that affect availability (hardware, software, and configuration) need to be addressed by high availability software to maximize availability.

Typical high availability software provides features that:

Enable hardware and software redundancy: These features include:

A service is not available if it cannot service all the requests being placed on it. The “scale-out” property of a system refers to the ability to create multiple copies of a subsystem to address increasing demand, and to efficiently distribute incoming work to these copies (load balancing), preferably without shutting down the system. High availability software should enable scale-out without interrupting service.
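The scale-out idea can be sketched with a simple round-robin dispatcher to which new copies are added while it keeps serving requests. This is only an illustration: the class name, method names, and worker labels are hypothetical, and real load balancers also handle health checks, worker removal, and concurrency, all of which are omitted here.

```python
class RoundRobinBalancer:
    """Distribute requests across worker copies in rotating order."""

    def __init__(self, workers):
        self._workers = list(workers)
        if not self._workers:
            raise ValueError("at least one worker is required")
        self._next = 0  # index of the worker that receives the next request

    def add_worker(self, worker):
        """Scale out: register another copy without stopping dispatch."""
        self._workers.append(worker)

    def dispatch(self, request):
        """Return the (worker, request) pair chosen for this request."""
        worker = self._workers[self._next]
        self._next = (self._next + 1) % len(self._workers)
        return worker, request


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["copy-1", "copy-2"])
    for i in range(4):
        print(balancer.dispatch(f"request-{i}"))
    balancer.add_worker("copy-3")  # demand grew; add a copy while running
    for i in range(4, 7):
        print(balancer.dispatch(f"request-{i}"))
```

Because the new copy is only appended to the rotation, requests already dispatched are unaffected and later requests simply start reaching the additional copy.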

Enable active/standby communication (notably checkpointing): Active subsystems need to communicate with standby subsystems to ensure that the standby is ready to take over where the active left off. High availability software can provide communication abstractions, such as redundant message and event queues, to help active subsystems in this task. Additionally, an important concept called “checkpointing” is exclusive to highly available software. In a checkpointed system, the active subsystem identifies all of its critical state and periodically updates the standby with any changes to this state. This idea is commonly abstracted as a distributed hash table: the active writes key/value records into the table, and both the active and standby subsystems read from it. Unlike a “cloud” distributed hash table (Chord, Kademlia, etc.), a checkpoint is fully replicated. That is, all records in the “checkpoint” hash table are readable so long as one copy is running. Another technique, called application checkpointing, periodically saves the entire state of a program.
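The checkpointing pattern described above can be sketched as a fully replicated key/value table held in memory. This is a toy model under stated assumptions: the class `CheckpointTable`, its methods, and the node names are hypothetical, replication is simulated inside one process rather than over a network, and recovery of a failed replica is not modelled.

```python
class CheckpointTable:
    """Toy fully replicated key/value table: every running replica holds every record."""

    def __init__(self, replica_names):
        self._replicas = {name: {} for name in replica_names}
        self._running = set(replica_names)

    def write(self, key, value):
        """The active subsystem records a change to its critical state on all replicas."""
        for name in self._running:
            self._replicas[name][key] = value

    def read(self, key, default=None):
        """Read from any running replica; one surviving copy is enough."""
        for name in self._running:
            return self._replicas[name].get(key, default)
        raise RuntimeError("no replica is running")

    def fail(self, name):
        """Simulate the loss of one replica."""
        self._running.discard(name)


if __name__ == "__main__":
    table = CheckpointTable(["node-a", "node-b"])

    # The active subsystem (on node-a) checkpoints its progress after each item.
    processed = 0
    for _ in range(3):
        processed += 1
        table.write("processed", processed)

    table.fail("node-a")  # the active node is lost

    # The standby (on node-b) resumes from the last checkpointed state.
    resumed_from = table.read("processed", default=0)
    print("standby resumes after item", resumed_from)
```

Writing to every replica on each update is what distinguishes this from a partitioned hash table: no record lives on only one node, so the standby can read the full critical state after any single failure.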

...
Wikipedia