Chaos Monkey is a tool created by Netflix in 2011 to test the resilience of its IT infrastructure. The tool works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army, designed to simulate and test responses to various system failures and edge cases.
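The core mechanism, picking a running machine at random and disabling it, can be illustrated with a minimal Python sketch. This is not Netflix's implementation: the function names, the `terminate` callback, and the weekday 09:00 to 17:00 window are all hypothetical choices made for illustration, and a real tool would call a cloud provider's API rather than a local callback.

```python
import random
from datetime import datetime, time

def within_business_hours(now, start=time(9, 0), end=time(17, 0)):
    """Return True on weekdays between start and end (an assumed window)."""
    return now.weekday() < 5 and start <= now.time() < end

def run_chaos_round(instances, terminate, now, rng=None):
    """Disable one randomly chosen instance, but only during business hours.

    `instances` is a list of instance identifiers and `terminate` is a
    caller-supplied function that disables one of them (both hypothetical).
    Returns the chosen instance, or None if nothing was disabled.
    """
    if rng is None:
        rng = random.Random()
    if not instances or not within_business_hours(now):
        return None
    victim = rng.choice(instances)  # uniform random choice
    terminate(victim)
    return victim

# Usage: record the "terminated" instance instead of touching real servers.
killed = []
run_chaos_round(["web-1", "web-2", "web-3"], killed.append,
                now=datetime(2024, 1, 3, 10, 30))  # a Wednesday morning
print(killed)  # one of the three names, chosen at random
```

Restricting the run to working hours, as this sketch assumes, means that engineers are on hand to observe and respond when an instance disappears.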
In software development, the ability of software to tolerate failures, to be resilient, and to ensure optimal quality of service is often treated as a non-functional requirement. However, due to factors like short deadlines or lack of knowledge of the field, development teams often skip or overlook these requirements.
In 2011, Greg Orzell, while working on Netflix's cloud migration, had the idea of changing this paradigm by setting up a tool that would cause breakdowns in the production environment, the environment used by Netflix customers. This proposal changed the underlying assumption in software development from a model in which breakdowns would not occur to one in which breakdowns were certain, so that built-in resilience was no longer optional but mandatory:
"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."
By regularly "killing" random instances, it is possible to test a redundant architecture and verify that a server failure does not noticeably impact end users. The name Chaos Monkey is explained in the book "Chaos Monkeys" by Antonio García Martínez:
"Imagine a monkey entering a 'data center', one of those 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips out cables, destroys devices, and overturns everything within reach. The challenge for IT managers is to design the information system they are responsible for so that it keeps working despite these monkeys, since no one ever knows when they will arrive or what they will destroy."