Checkpointing

Checkpointing is a technique to add fault tolerance into computing systems. It basically consists of saving a snapshot of the application's state, so that it can restart from that point in case of failure. This is particularly important for long running applications that are executed in failure-prone computing systems.

In distributed computing, checkpointing is a technique that helps tolerate failures that otherwise would force long-running application to restart from the beginning. The most basic way to implement checkpointing, is to stop the application, copy all the required data from the memory to reliable storage (e.g., Parallel file system) and then continue with the execution. In case of failure, when the application restarts, it does not need to start from scratch. Rather, it will read the latest state ("the checkpoint") from the stable storage and execute from that.

There are two main approaches for checkpointing in such systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of algorithm. In uncoordinated checkpointing, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. The need for establishing a consistent state (i.e., no missing messages or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints, which in the most extreme case may mean that the only consistent state found is the initial state (the so-called domino effect).

One of the original and now most common means of application checkpointing was a "save state" feature in interactive applications, in which the user of the application could save the state of all variables and other data to a storage medium at the time they were using it and either continue working, or exit the application and at a later time, restart the application and restore the saved state. This was implemented through a "save" command or menu option in the application. In many cases it became standard practice to ask the user if they had unsaved work when exiting the application if they wanted to save their work before doing so.

...
Wikipedia