What is Chaos Engineering? : The Ultimate Guide | Principles of Chaos Engineering [ OverView ]
Last updated on 25th Dec 2021, Blog, General
Chaos Engineering is the discipline of experimenting on a distributed system by inducing artificial failures, in order to build confidence in the system’s capability to withstand turbulent conditions in production.
- What is Chaos engineering?
- The concepts behind Chaos engineering
- Advanced principles of Chaos engineering
- Chaos engineering best practices
- Examples of Chaos engineering
- Chaos engineering tools
- What’s the role of Chaos Engineering in distributed systems?
- Benefits of Chaos Engineering?
What is Chaos engineering?
Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions. It relies on the underlying concepts of Chaos theory, which focus on random and unpredictable behaviour. The goal of Chaos engineering is to identify weaknesses in a system through controlled experiments that introduce random and unpredictable behaviour.
One of the main advantages of Chaos engineering is that organisations can use it to identify vulnerabilities before an attacker exploits them or before systems fail. Changes made as a result of Chaos engineering testing increase trust in the organisation’s systems.
Some IT groups organise Chaos engineering game days where teams attempt to break or breach systems. They use failure mode and effects analysis (FMEA) or other tactics to gain insight into potential points of failure in their organisation’s systems.
As more companies move towards the cloud or the enterprise edge, their systems are becoming more distributed and complex. The same can be said of software development methods that emphasise continuous delivery: those processes are also becoming increasingly complex. As an organisation’s infrastructure and the processes for working within it grow more complex, so does the need to adapt to the resulting chaos.
The concepts behind Chaos engineering:
The main concept behind Chaos engineering is to break a system on purpose in order to gather information that will help improve its resilience. Chaos engineering is an approach to software testing and quality assurance that suits modern distributed systems and processes.
Chaos engineering is specifically applied to distributed computing environments. A distributed computing system is a group of computers that are connected by a network and share resources. These systems can break down when unforeseen circumstances arise. In large distributed systems, components often have complex and unpredictable dependencies, so it is difficult to troubleshoot errors or predict when an error will occur.
There are many ways a distributed system can fail. Size and complexity can lead to seemingly random events, and the more complex the system, the more unpredictable and chaotic its behaviour is.
Chaos engineering experiments deliberately create turbulent conditions in a distributed system to test the system and detect vulnerabilities. Some examples of problems that may be uncovered by a Chaos experiment include:
- Blind spots. Locations where monitoring software cannot collect enough data.
- Hidden bugs. Glitches or other issues that can cause software to malfunction.
- Performance bottlenecks. Situations where efficiency and performance can be improved.
Advanced principles of Chaos engineering:
L. Peter Deutsch, a computer scientist at Sun Microsystems, and his colleagues developed a list of eight misconceptions of distributed computing, commonly known as the fallacies of distributed computing. These are false assumptions that programmers and engineers often make about distributed systems. They are a good starting point when applying Chaos engineering to a problem. The eight misconceptions are:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology never changes.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
It is debated whether these misconceptions still apply, but Chaos engineers continue to use them as core principles for understanding system and network problems. Their underlying theme is that systems and networks are never perfect or 100% reliable. This is why the concept of “five nines” exists for highly available systems: instead of striving for 100% availability, the closest engineers can realistically get is 99.999%.
These false assumptions are easy to make in distributed computing environments, and they are the basis for seemingly random problems arising from complex distributed systems.
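To make the “five nines” figure concrete, the short sketch below converts an availability target into the downtime it allows per year. It is only an illustration: the function name `downtime_per_year` and the choice of a 365-day year are assumptions made here, not part of any standard tool.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day (non-leap) year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_per_year(availability_percent: float) -> float:
    """Return the allowed downtime in seconds per year."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return SECONDS_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_per_year(target) / 60
    print(f"{target}% availability allows {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% availability leaves a budget of roughly 5.26 minutes of downtime per year, which is why even short outages matter for highly available systems.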
Chaos engineering best practices:
Chaos engineering is complicated. Following these best practices can help avoid the problems that stem from the misconceptions listed above:
Understand the general behaviour of the system. Having a solid understanding of the system when it is healthy will help diagnose problems.
Simulate realistic scenarios. Focus on injecting failures and bugs that are likely to occur. For example, if latency has been a problem in the past, inject a fault that causes latency.
Test using real-world conditions. This gives the most accurate results. Chaos engineering is often done in production environments, especially when it is too cumbersome or expensive to replicate a large, distributed system for testing purposes.
Reduce the blast radius. Chaos engineering can be highly disruptive. Success demands coordination among IT staff, developers and business units. Experiments are rarely run at peak times in a production environment, and ideally, no one using the system will be able to tell that Chaos experiments are taking place. There should be redundancies to ensure that services remain available if experiments cause problems.
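The practices above can be sketched as a tiny simulated experiment: measure the healthy steady state, inject latency into only a small slice of traffic (the blast radius), and check whether the system still meets its target. Everything here is hypothetical and in-memory: the timeout, the injected delay, the 90% success threshold and the function names are assumptions chosen for illustration, not values from a real tool.

```python
import random

TIMEOUT_MS = 100          # hypothetical client timeout
INJECTED_DELAY_MS = 200   # hypothetical fault: extra latency per affected request


def call_service(rng: random.Random, inject_latency: bool) -> float:
    """Simulate one request and return its latency in milliseconds."""
    latency_ms = rng.uniform(20, 40)  # healthy baseline, made-up numbers
    if inject_latency:
        latency_ms += INJECTED_DELAY_MS
    return latency_ms


def success_rate(rng: random.Random, requests: int, blast_radius: float) -> float:
    """Fraction of requests that finish within the timeout."""
    ok = 0
    for _ in range(requests):
        affected = rng.random() < blast_radius  # only a slice of traffic is hit
        if call_service(rng, affected) <= TIMEOUT_MS:
            ok += 1
    return ok / requests


def run_experiment(blast_radius: float, min_success: float = 0.90,
                   requests: int = 1000, seed: int = 7) -> dict:
    """Compare the steady state with behaviour under injected latency."""
    rng = random.Random(seed)
    steady_state = success_rate(rng, requests, blast_radius=0.0)
    under_fault = success_rate(rng, requests, blast_radius=blast_radius)
    return {
        "steady_state": steady_state,
        "under_fault": under_fault,
        "hypothesis_holds": under_fault >= min_success,
    }


print(run_experiment(blast_radius=0.05))  # small blast radius
print(run_experiment(blast_radius=0.50))  # half of all traffic affected
```

Keeping the blast radius as an explicit parameter makes it easy to start small and widen the experiment only after the hypothesis has held at a lower setting.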
Examples of Chaos engineering:
- Imagine a distributed system that can handle a certain number of transactions per second. Chaos engineering testing can be used to find out how software will respond when that transaction limit is reached. Will performance suffer or will the system crash?
- Chaos engineering can also be used to test how a distributed system behaves when it experiences a lack of resources or a single point of failure. If the system fails, developers can implement design changes. Once the changes are made, the test is repeated to verify the desired results.
- One notable real-world system failure had a Chaos engineering connection. In 2015, Amazon’s DynamoDB experienced an availability problem in one of its regions. That lapse caused more than 20 Amazon Web Services offerings that relied on DynamoDB to fail in that region. Sites using those services, including Netflix, were down for several hours. However, Netflix experienced fewer failures than other sites because it had built and used a Chaos engineering tool called Chaos Kong to prepare for such a scenario.
- Chaos Kong simulates the loss of an entire AWS Region, a geographic area served by a group of AWS data centres. By using the tool, Netflix had gained experience in responding to regional outages like the one the DynamoDB problem caused. The company’s ability to deal with outages is often cited in explaining the importance of Chaos engineering.
Chaos engineering tools:
Netflix was a notable pioneer of Chaos engineering and one of the first to use it in production systems. Netflix designed and open-sourced a set of Chaos test automation tools collectively dubbed the Simian Army.
The Simian Army suite includes a number of tools, including:
- Chaos Kong. Simulates the loss of an entire AWS Region.
- Chaos Monkey. Randomly disables production instances to cause failures, but is designed not to impact customer activity.
- Chaos Gorilla. Like Chaos Monkey but on a larger scale: it simulates the outage of an entire AWS Availability Zone.
- Latency Monkey. Introduces latency to simulate network outages and degradation.
Figure: Chaos Monkey terminating a service instance.
Netflix’s Simian Army continues to grow as more Chaos-inducing programs are created to test the streaming service’s capabilities.
Some other Chaos engineering tools include:
- Simoorg. An open source failure-inducing program. LinkedIn uses this program to conduct Chaos engineering experiments.
- Monkey-Ops. An open source tool implemented in Go and built to test and terminate random components and deployment configurations.
- Gremlin. A Chaos engineering program that works with AWS and Kubernetes and focuses on the retail and finance sectors. It comes with built-in safeguards that stop experiments from posing a threat to the system.
- AWS Fault Injection Simulator. Contains fault templates that AWS can inject into production instances. The platform has built-in protective measures to keep fault injection tests from causing system problems.
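The instance-termination idea behind Chaos Monkey, described above, can be sketched with a toy in-memory model. This is not Netflix's implementation or API: the `Instance` and `Service` classes and the `unleash_monkey` function are hypothetical names invented for this illustration, and no real cloud resources are involved.

```python
import random


class Instance:
    """A hypothetical server instance that is either running or terminated."""

    def __init__(self, name: str):
        self.name = name
        self.running = True


class Service:
    """A hypothetical service backed by several redundant instances."""

    def __init__(self, instance_count: int):
        self.instances = [Instance(f"web-{i}") for i in range(instance_count)]

    def healthy_instances(self):
        return [inst for inst in self.instances if inst.running]

    def handle_request(self) -> str:
        healthy = self.healthy_instances()
        if not healthy:
            raise RuntimeError("no healthy instances left")
        return f"served by {healthy[0].name}"


def unleash_monkey(service: Service, rng: random.Random):
    """Terminate one randomly chosen running instance.

    Returns the name of the terminated instance, or None if nothing is running.
    """
    candidates = service.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim.name


service = Service(instance_count=3)
terminated = unleash_monkey(service, random.Random())
print(f"terminated {terminated}; request still {service.handle_request()}")
```

With three redundant instances, losing one at random should not affect requests. Repeating the termination until requests fail shows how much redundancy the service really has.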
What’s the role of Chaos Engineering in distributed systems?
Distributed systems are inherently more complex than monolithic systems, so it is hard to predict all the ways they can fail. The eight misconceptions of distributed systems, formulated by L. Peter Deutsch and others at Sun Microsystems, describe the false assumptions that programmers new to distributed applications often make.
Misconceptions of Distributed Systems:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Many of these misconceptions drive the design of Chaos Engineering experiments such as “packet-loss attacks” and “latency attacks”. For example, network outages can cause a variety of failures for applications that severely impact customers. Applications may stall while they wait endlessly for a packet. Applications may permanently consume memory or other Linux system resources. And even after a network outage has passed, applications may fail to retry paused operations, or retry too aggressively. The application may also require a manual restart. Each of these failure modes needs to be tested and prepared for.
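One common defence against the retry problems described above is to bound the number of attempts and back off between them. The sketch below pairs a simulated packet-loss fault with such a retry loop. The names `PacketLost`, `make_lossy_sender` and `send_with_retry`, and the delay values, are assumptions for illustration only; a real client would also set a timeout on each network call so that it never waits endlessly.

```python
import time


class PacketLost(Exception):
    """Raised by the simulated network when a packet is dropped."""


def make_lossy_sender(drop_first: int):
    """Return a send function that drops its first `drop_first` packets."""
    state = {"calls": 0}

    def send() -> str:
        state["calls"] += 1
        if state["calls"] <= drop_first:
            raise PacketLost(f"packet {state['calls']} dropped")
        return "ack"

    return send


def send_with_retry(send, max_attempts=5, base_delay=0.01, max_delay=0.1,
                    sleep=time.sleep):
    """Retry a failing send with capped exponential backoff.

    Gives up and re-raises after `max_attempts`, so the caller never
    retries forever or hammers a recovering network.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except PacketLost:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)


# Simulated packet-loss experiment: the first two packets are dropped.
delays = []
result = send_with_retry(make_lossy_sender(drop_first=2), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter lets the experiment record the backoff delays instead of actually waiting, which keeps the check fast and repeatable.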
Benefits of Chaos Engineering?
Customers: The increased availability and durability of the service means that fewer outages disrupt their day-to-day lives.
Business: Chaos Engineering can help prevent large losses in revenue and maintenance costs, produce happier and more engaged engineers, improve on-call training for engineering teams, and improve the SEV (incident) management programme for the company as a whole.
Technical: Insights from Chaos experiments can mean fewer incidents, a lighter on-call burden, a better understanding of system failure modes, better system design, a faster mean time to detect SEVs, and fewer repeated SEVs.
Several engineering organisations, including Netflix and Stitch Fix, have dedicated Chaos engineering teams. These teams are often small, consisting of 2-5 engineers. The Chaos Engineering team owns and advocates Chaos Engineering throughout the organisation. However, they are not the only engineers doing day-to-day Chaos Engineering; they empower other teams in their engineering organisation to use it.
The following service teams are often the first to practise and promote Chaos Engineering within a company:
- Traffic team (e.g. Nginx, Apache, DNS)
- Streaming team (e.g. Kafka)
- Storage team (e.g. S3)
- Data team (e.g. Hadoop/HDFS)
- Database team (e.g. MySQL, Amazon RDS, PostgreSQL)
Some companies, such as Remind, are integrating Chaos Engineering into their normal release cycle, alongside other best-practice tests, to ensure that reliability is baked into every feature.
As web systems have become much more complex with the rise of distributed systems and microservices, it has become difficult to predict system failures. So to prevent failures from happening, we all need to be proactive in our efforts to learn from failure.
Chaos Engineering is a tool to make your job easier. By continually testing and verifying your system’s failure modes, you’ll reduce your operational burden, increase your availability, and sleep better at night.