DevSecOps and the importance of Fire-Drills

The usual experience with fire-drills is that they are a noisy, brief, and vaguely annoying disruption to the work day. But in DevSecOps, fire-drills take on a new and important meaning: routine and random testing of the failure modes of what you build, and of the processes you use to support and secure it. Deliberately disrupt part of a functioning system to see whether, and how, it recovers. Do it randomly. Do it often.

The idea feels revolutionary, but it is actually evolutionary. It is informed by the rich body of knowledge in the Business Continuity & Disaster Recovery Planning field, by recent innovations such as Netflix's Chaos Monkey and Security Monkey, by the Rugged Ops movement, and by Dr Werner Vogels' observation that "everything fails all the time", which follows from the statistical certainty of realizing MTBF when operating at cloud scale.

What is fire-drilling in the DevSecOps context? A fire-drill simply exercises one or more failure modes to discover whether a system is resilient to them or, more likely, whether a small failure in one place cascades into larger failures elsewhere. The timing of the fire-drill is unannounced, and the subject might be random (as with Chaos Monkey), or it might be carefully selected to test a hypothesis about how the system fails, degrades, or recovers.
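
To make the idea of a cascading failure concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a real tool: the service names, the DEPENDS_ON map, and the simplifying rule that a service fails as soon as any one of its dependencies fails. It shows how disrupting a single component can be traced to a "blast radius", and how the drill subject can be either random or chosen to test a hypothesis.

```python
import random

# Hypothetical service graph (illustrative names only):
# each service maps to the services it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
    "reports": ["db"],
    "cdn": [],
}


def blast_radius(disrupted, depends_on):
    """Return the set of services that fail when `disrupted` goes down.

    Simplifying assumption: a service has no fallback, so it fails as soon
    as any one of its dependencies has failed.
    """
    failed = {disrupted}
    changed = True
    while changed:  # keep propagating until no new failures appear
        changed = False
        for service, deps in depends_on.items():
            if service not in failed and any(d in failed for d in deps):
                failed.add(service)
                changed = True
    return failed


def pick_target(depends_on, hypothesis_target=None, rng=random):
    """Choose the drill subject: a selected target when testing a
    hypothesis, otherwise a random service."""
    if hypothesis_target is not None:
        return hypothesis_target
    return rng.choice(sorted(depends_on))


if __name__ == "__main__":
    target = pick_target(DEPENDS_ON)
    print(target, "->", sorted(blast_radius(target, DEPENDS_ON)))
```

In this toy model, taking down "cdn" affects nothing else, while taking down "db" brings down every service that depends on it, directly or indirectly. A real drill would compare a prediction like this against what the system actually does.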

But isn't this incredibly disruptive and risky? Executed poorly, yes. Executed carefully, fire-drills will reveal critical weaknesses, build operational and response competence, and help you get ahead of attackers: if you find the weakness before an attacker does, you can implement safeguards and engineer greater resilience. And that is the objective: to find failure modes that create weaknesses attackers can exploit, and to guard against or eliminate them. It is far better to discover that a critical transaction processing system fails ungracefully during an internal test, where you control the timing, than to discover it during a critical business window because of unexpected legitimate load or an external attack. In addition, fire-drills can reveal points of process frailty, such as run-books that are incomplete or unclear, or a deployment pipeline that is circumvented by ‘hand-jamming’. This matters because process and procedure gaps can be especially hard to recover from without ‘heroics’ on the part of the one or two team members who are in the know. Knowledge silos such as these are destructive to resilience and are easily discovered with thoughtful application of fire-drills.

The real art to the fire-drill is actually science: study the system to develop a model of how it works. Then develop a set of testable hypotheses about how the system will fail if specific components are disrupted. Risk-rate the expected failure mode and the worst-case scenario for each hypothesis, and use this information to determine the timing and pace of the fire-drills. As you build confidence, competence, rigor, and trust, you can move toward a random selection model more akin to Netflix's Chaos Monkey. This is ultimately a great place to get to, because the expectation of resilient design becomes "baked in" to the operating model.
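
One way to picture the hypothesis-and-risk-rating step is as a small register that turns each hypothesis into a drill cadence. The Python sketch below is only an illustration under stated assumptions: the 1-to-5 scales, the likelihood-times-impact score, the thresholds, and the cadence labels are all hypothetical choices, not an established standard, and the example components are invented.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A testable claim about how the system fails when a component is disrupted."""
    component: str
    expected_failure: str
    likelihood: int         # assumed scale: 1 (rare) to 5 (frequent)
    expected_impact: int    # assumed scale: 1 (minor) to 5 (severe)
    worst_case_impact: int  # assumed scale: 1 (minor) to 5 (severe)

    def risk_score(self):
        # Assumed scoring rule: likelihood times the larger of the two impacts,
        # so the worst case drives how cautiously the drill is run.
        return self.likelihood * max(self.expected_impact, self.worst_case_impact)


def drill_plan(hypotheses):
    """Order drills from lowest to highest risk and attach a cadence.

    The thresholds and cadence labels are illustrative assumptions.
    """
    schedule = []
    for h in sorted(hypotheses, key=lambda h: h.risk_score()):
        score = h.risk_score()
        if score <= 6:
            cadence = "weekly, unannounced"
        elif score <= 15:
            cadence = "monthly, off-peak"
        else:
            cadence = "quarterly, supervised"
        schedule.append((h.component, score, cadence))
    return schedule


if __name__ == "__main__":
    register = [
        Hypothesis("primary database", "fails over to a replica", 4, 3, 5),
        Hypothesis("cache node", "requests fall back to the database", 3, 1, 2),
        Hypothesis("message queue", "orders are delayed but not lost", 2, 3, 5),
    ]
    for component, score, cadence in drill_plan(register):
        print(f"{component}: risk {score}, {cadence}")
```

Starting with the low-risk entries and running them often is one way to build the confidence and trust needed before moving to fully random selection.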

Initially, the team experiencing the fire-drills is likely to be frustrated, but with good timing and follow-through on lessons learned, the fire-drill will become a valued part of operational rigor and even a point of team pride.