How can a development team manage the chaos of creating and deploying software? Some think the answer is to embrace it, curate it, and then unleash it occasionally—sometimes at planned moments, and sometimes by surprise. From there, everyone can watch the effects and see if their codebase is ready to withstand these sudden shocks.
This unleashing of chaos can take many forms. Fifty years ago, it might have been as simple as flipping the power switch off and on again—and then checking for problems. Today, developers and companies inject chaos using a variety of tools and products, some proprietary and some open source. Some of these tools are sophisticated, targeting the insides of runtime mechanisms such as the Java Virtual Machine. Others act more crudely—simply reaching into the cloud and shutting down either individual instances or entire clusters.
In any case, the goal is to find a good way for developers to simulate any kind of chaos before it happens in real life. They explore various resources and then search for a way to fiddle with those resources to test how the code reacts when a resource disappears or becomes severely limited.
Chaos in the Cloud
Netflix, for instance, manages a cloud of machines that must rapidly start up in the evenings when viewing peaks—and then shut down later when people start to go to sleep. Its cloud is constantly changing. The company has to make sure that these comings and goings don’t trigger widespread failure and outages.
The developers at Netflix set out to build resilience into their stack by creating tools such as Chaos Monkey and Simian Army to dispense the kind of chaos that keeps developers aware of how things may fail. Chaos Monkey, for example, will reach into the stack and shut down individual instances. Simian Army is the original name for a tool that organizes Chaos Monkey and some other tools including Janitor Monkey, Swabbie, and Conformity Monkey. Together, these tools generate chaos and watch for code that doesn’t recover correctly.
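The instance-killing idea can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the `Cluster` class, the instance names, and the `unleash_monkey` helper are hypothetical and exist purely for illustration. It does not use Chaos Monkey or any cloud API.

```python
import random


class Cluster:
    """Toy in-memory stand-in for a pool of service instances (hypothetical)."""

    def __init__(self, names):
        self.running = set(names)

    def terminate(self, name):
        self.running.discard(name)

    def handle_request(self):
        # The service stays available as long as at least one instance is up.
        if not self.running:
            raise RuntimeError("no healthy instances")
        return "ok"


def unleash_monkey(cluster, rng):
    """Terminate one randomly chosen running instance and return its name."""
    victim = rng.choice(sorted(cluster.running))
    cluster.terminate(victim)
    return victim


rng = random.Random(7)  # seeded so a run can be replayed
cluster = Cluster(["web-1", "web-2", "web-3"])
victim = unleash_monkey(cluster, rng)
survived = cluster.handle_request() == "ok"
print(f"terminated {victim}; service still answering: {survived}")
```

In a real deployment the termination would go through the cloud provider's API and the health check would be an actual request, but the shape is the same: pick a victim at random, remove it, and confirm the service still answers.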
As these tools find homes in multiple clouds, the cloud providers themselves have started supporting and deploying them. Azure users can turn to Proofdock’s Chaos Platform. Google shares a version that it has optimized for its own platform, Google Cloud Chaos Monkey. Amazon’s Fault Injection Service offers a general tool for testing AWS instances with a variety of potential problems. IBM Cloud users can turn to Gremlin, another tool for orchestrating the kind of problems that bedevil cloud applications.
“Not everyone is Netflix or such, so it takes time,” explained Sylvain Hellegouarch, a developer at Reliably, an SRE-software firm. “A lot of people start with the simplest problem, which is resource exhaustion in some fashion, or networking problems, especially like latency. Anything related to networking is good.”
Overprovisioned, or Underprovisioned?
Hellegouarch’s job title at Reliably is distinctive: chaos engineer of complex distributed microservices systems. The company offers a number of services, including support for using Chaos Toolkit, an open-source collection of software packages designed for managing chaos.
Chaos Toolkit can orchestrate a number of modules or extensions that inject some form of problem or resource constraint. First, the chaos engineer using the toolkit creates a hypothesis and defines a successful, steady state for the running software. Then, the toolkit executes some code (often written in Python) and monitors the running software for anomalies or crashes.
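The hypothesis-then-inject workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not the Chaos Toolkit API: the `Experiment` class, the `run` function, and the toy `replicas` system are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical experiment shape; not the Chaos Toolkit API."""
    title: str
    steady_state: Callable[[], bool]   # probe: is the system healthy?
    method: List[Callable[[], None]]   # actions that inject the fault
    rollbacks: List[Callable[[], None]] = field(default_factory=list)


def run(experiment):
    """Check steady state, inject faults, check again, then roll back."""
    if not experiment.steady_state():
        return "aborted: system was not in steady state before the experiment"
    try:
        for action in experiment.method:
            action()
        held = experiment.steady_state()
    finally:
        # Rollbacks run even if an action raised.
        for undo in experiment.rollbacks:
            undo()
    return "hypothesis held" if held else "deviation found"


# Toy system: two replicas; healthy while at least one is up.
replicas = {"a": True, "b": True}


def healthy():
    return any(replicas.values())


def kill_a():
    replicas["a"] = False


def restore_a():
    replicas["a"] = True


result = run(Experiment("survives loss of replica a", healthy, [kill_a], [restore_a]))
print(result)
```

Checking the steady state before injecting anything matters: if the system is already unhealthy, a failed check afterward says nothing about the fault that was injected.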
“People move to a microservice architecture or a distributed anything and they want to see what happens when they lose the network between two machines,” said Hellegouarch. “Then they watch the CPU because nobody really knows how to provision correctly. So either you underprovision your system and you reach a CPU limit very quickly, or you overprovision and you’re paying a big bill for nothing really.”
Chaos Toolkit itself is distributed with an Apache license. Often, using it means no more than creating new configuration files with YAML. Moreover, the toolkit is designed to run in a containerized environment to simplify testing modern microservice architectures with multiple instances running in pods.
Chaos in Code
Chaos is not limited to protocols or resources, however. Some chaos engineers are exploring injecting faults or exceptions directly into the running code. ChaosMachine, a research tool developed at the KTH Royal Institute of Technology in Sweden, works with Java bytecode. It can inject exceptions and watch how the software behaves, producing what its creators believe is a steady stream of errors that can stress the Java code at the level of the try-catch blocks.
Two options for other languages are Pythonfuzz and Google’s OSS-Fuzz. They test code through a variety of techniques, such as calling functions with random and extreme parameters.
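To make the idea of exception injection concrete, here is a rough Python analogue. It is not ChaosMachine, which operates on Java bytecode; the `inject_faults` decorator, the `fetch_price` function, and the fault rate are hypothetical choices made for illustration.

```python
import functools
import random


def inject_faults(rate, exc_type, rng):
    """Decorator: raise exc_type on roughly `rate` of calls (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                raise exc_type("injected fault")
            return func(*args, **kwargs)
        return wrapper
    return decorator


rng = random.Random(42)  # seeded so a run can be replayed


@inject_faults(rate=0.3, exc_type=ConnectionError, rng=rng)
def fetch_price(item):
    return {"apple": 3, "pear": 5}[item]


def price_with_fallback(item):
    """Code under test: its except block should absorb the injected error."""
    try:
        return fetch_price(item)
    except ConnectionError:
        return None  # degrade gracefully instead of crashing


results = [price_with_fallback("apple") for _ in range(1000)]
fallbacks = results.count(None)
print(f"{fallbacks} of {len(results)} calls hit the except block")
```

If the `except` clause were missing or caught the wrong exception type, the loop would crash on the first injected fault, which is exactly the kind of weakness this style of testing is meant to expose.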
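A naive version of calling a function with random and extreme parameters might look like the sketch below. Real fuzzers are considerably more sophisticated, so this shows only the basic loop. The function under test, `parse_percentage`, and the helper names are hypothetical.

```python
import random


def parse_percentage(text):
    """Function under test (hypothetical): '42%' -> 0.42."""
    return int(text.rstrip("%")) / 100


# Hand-picked edge cases that are always tried first.
EXTREMES = ["", "%", "0%", "-1%", " 7 %", "9" * 400 + "%", "1e3%", "\x00"]


def generate_inputs(rng, runs):
    """Yield every extreme case, then `runs` random printable strings."""
    yield from EXTREMES
    for _ in range(runs):
        length = rng.randint(0, 12)
        yield "".join(chr(rng.randint(32, 126)) for _ in range(length))


def fuzz(func, runs, seed):
    """Call func with generated inputs; collect exceptions other than ValueError."""
    rng = random.Random(seed)
    unexpected = []
    for data in generate_inputs(rng, runs):
        try:
            func(data)
        except ValueError:
            pass  # treated here as the accepted failure mode for malformed input
        except Exception as exc:  # anything else counts as a finding
            unexpected.append((data, type(exc).__name__))
    return unexpected


findings = fuzz(parse_percentage, runs=200, seed=1)
for data, error in findings:
    print(f"{error} on input of length {len(data)}")
```

Here the 400-digit input parses as an integer but is too large to divide into a float, so it surfaces an `OverflowError` that the function never anticipated. Treating `ValueError` as acceptable is an assumption of this sketch; a real harness would encode whatever contract the function actually documents.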
Some developers are also creating tools that go beyond general software, focusing on chaos associated with specific niche applications and environments. Cryptofuzz, for instance, is designed to test cryptographic libraries and the various protocols they support. It pays particular attention to crashes, memory leaks, buffer overflows, and uninitialized variables.
All of these approaches are just a beginning. The field is still evolving rapidly, as developers, DevOps engineers, and architects look at their codebases and start to ask “what if” questions. Once they start thinking along these lines, they realize that they can build tests; they can control and bottle the chaos just enough to build it into their quality-assurance pipeline.
“The problem we have as developers is we look at the happy path too much, and we don’t think about the other potential problems,” said Hellegouarch. “We can’t control the chaos, but simply being attuned to it means we can address it.”
Publisher: TechBeacon
Author: Peter Wayner