Emergency response drills
In the previous post we discussed our Incident Command System (ICS) at Bitstamp. Our current framework is the result of many Failure Friday sessions, referred to internally as Emergency Response (ER) drills.
Admittedly it took us a while to get into this practice, as we needed to clean up a lot of legacy infrastructure before we could be confident that this kind of training would yield results. We practice our ER drills on our staging systems. This goes against the modern chaos engineering (Chaos Monkey) approach of testing in production, but we have opted for it because:
- our infrastructure as code (IaC) and application deployment mimic our production environment to the letter, and
- it allows us to destroy and restore mission-critical systems without impacting internal and external customers.
This setup creates a safe, fun and engaging environment where we can train our engineers and write up system recovery guides for the scenarios we practice.
A few days before the ER drill the "chaos engineers" grab a drink and devise scenarios for it. The main idea is always to expose engineering to known hotspots, uncover unknowns in the process and familiarize application owners with how their code can fail. By now we have a library of scenarios that ranges from infrastructure and application failures to security breaches, and we often reuse scenarios that affect critical components of the trading system just to remain vigilant. Examples of such scenarios include matching engine failures, shared storage failures, third-party service failures, active penetration tests that can take down systems, and so on. We also have very simple scenarios where the point is not really to solve the technical problem but to practice the ICS.
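As a rough illustration, a scenario library like the one described above could be kept as plain data. The sketch below is hypothetical: the class names, the categories, and the choice of which scenarios count as critical are all assumptions made for the example, not a description of our actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Illustrative categories based on the kinds of scenarios mentioned in the text.
    INFRASTRUCTURE = "infrastructure"
    APPLICATION = "application"
    SECURITY = "security"
    ICS_PRACTICE = "ics_practice"  # technical problem is secondary, ICS is the goal


@dataclass(frozen=True)
class Scenario:
    name: str
    category: Category
    critical: bool = False  # assumed flag: affects a critical trading component


# Hypothetical library; which entries are marked critical is an assumption.
LIBRARY = [
    Scenario("matching engine failure", Category.APPLICATION, critical=True),
    Scenario("shared storage failure", Category.INFRASTRUCTURE, critical=True),
    Scenario("third-party service failure", Category.INFRASTRUCTURE),
    Scenario("active penetration test", Category.SECURITY),
    Scenario("simple ICS walkthrough", Category.ICS_PRACTICE),
]


def candidates(library, already_run):
    """Scenarios eligible for the next drill.

    Anything not yet run is eligible; critical scenarios stay eligible
    even after they have been run, because they are deliberately repeated.
    """
    return [s for s in library if s.critical or s.name not in already_run]
```

Calling `candidates(LIBRARY, {"matching engine failure", "active penetration test"})` would keep the matching engine scenario in the pool, because it is marked critical, and drop the penetration test until the rest of the library has been covered.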
Every other Friday we reserve a few hours before lunch for our ER drills. Because the drills are preplanned, we have to mimic real-life situations somehow, especially the way system anomalies are detected. It helps a lot that our staging monitoring is also a copy of production, so all the same dashboards are there. We notify and reserve responder teams beforehand, but they do not know the exact time the scenario starts and have zero knowledge of the scenario in play; they need to rely on component monitoring to detect it. Once the scenario starts, its creators become passive observers. Having outside observers in drills is a must! Their notes are a gold mine of information for post-drill discussions: a detached perspective uncovers flaws that are not obvious to active responders, who are too involved in problem solving. As an observer it is your duty to write all these flaws down and use them to improve either the framework or the processes involved.
At this point the scenario resolution is taking place, and it ends in one of two ways: a successful resolution or a failure, for example because you ran out of time or the recovery procedures were not defined. Both outcomes are OK, and personally I prefer the second one because we have just uncovered something we are not prepared for. We take all this information and use it for our after action reports, which mimic post mortems but in a slightly more relaxed fashion.
After Action Reports - AARs
These should be done immediately after the drill and focus on the usual key points: what we did well, what we did badly, what needs improving, and what we liked / did not like. The last point is a bit different, as we allow responders to get more subjective. We use it to discuss soft skill problems such as intra-team communication, time pressure, the Incident Commander's delegation of command, and so on. Admittedly, gathering this information is perhaps the most difficult part because it can be personal, but it gives great feedback for the ICS framework and for team leads as well, because it exposes new opportunities to train and educate our colleagues and help them become better.
For the usual points we follow the well-defined pattern of post mortems: celebrate what was good, improve and iterate on what was bad. If a system recovery was unsuccessful for a given scenario, we usually allow 14 days to implement fixes and write up procedures, and then we go through the same scenario again. We are firm believers in incremental improvements, so we practice the same scenario until we are satisfied with the outcome. Even then we come back to it later, as we are aware that there is always room for improvement.
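The re-run rule is simple enough to sketch in a few lines. This is only an illustration under assumed names (`DrillResult`, `rerun_date`, `REMEDIATION_WINDOW`) and an arbitrary example date; it is not tooling we actually describe in this post.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# 14 days to implement fixes and write up procedures, as described in the text.
REMEDIATION_WINDOW = timedelta(days=14)


@dataclass
class DrillResult:
    """Hypothetical record of a single drill outcome."""
    scenario: str
    drill_date: date
    recovered: bool  # True if responders restored the system during the drill


def rerun_date(result: DrillResult) -> Optional[date]:
    """Return the earliest date to repeat a failed scenario, or None if it succeeded."""
    if result.recovered:
        return None
    return result.drill_date + REMEDIATION_WINDOW


# Example with an arbitrary Friday: a failed drill is repeated two weeks later.
failed = DrillResult("shared storage failure", date(2024, 3, 1), recovered=False)
print(rerun_date(failed))  # 2024-03-15
```

Because the window is exactly two weeks, a failed scenario naturally lands on the next drill Friday of an every-other-Friday schedule.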
There are always dragons
While this all sounds really simple and straightforward on paper, it is anything but. Our first ER drill (matching engine failure) was a complete disaster. Responders eventually restored the engine, but some restoration procedures were found to be outdated or did not work at all, and the overall atmosphere of the exercise was abysmal. Our spirits were down for days after the exercise, and admittedly it was one of my big failures. An exercise that was supposed to be engaging, fun and productive turned into a nightmare.
The problem was not of a technical nature, as we have great engineers who can tackle most technical issues. The problem was that the teams were not familiar enough with our ICS. Our first drill was an uncoordinated mess of communication and often yelling, with zero problem compartmentalization, no internal or external reporting and no common documentation sets. We probably ticked all the "don't" boxes at some point. It was also a big revelation for me: for diverse teams to function cohesively, it is imperative that we train them on common frameworks (such as our ICS), with the end goal being not to solve problems but to become familiar with the framework. We now have scenarios where the technical problem to be solved is inconsequential and the main goal is to educate and train responders to use the ICS and improve upon it. We have incorporated these drills into our onboarding procedures, so all our newly hired engineers are exposed to them. In our view this brings new hires up to speed much faster and gives them hands-on experience with systems during a failure scenario.
How to conquer these dragons:
- training responders how to handle incidents is extremely important. Train and use your ICS or other response framework regularly.
- practice communication channels during drills - it is often the missing piece to successful problem resolution. Bonus challenge: involve C-level management in one of your drills. If they do not know what the status of the drill is, then your communication roles need work.
- practice the what we liked / did not like questions on your AARs. They will expose soft skill issues. Use the findings to educate responders further.
- always iterate on your response framework - it should be fluid and evolve over time to fit your needs.
- involve new hires / juniors in these drills. They can assume scribe / liaison roles as a start.
Some last words
Such drills are now a fundamental part of Engineering at Bitstamp. We train our employees to be active responders in emergency situations and to react to such events in an organized and methodical way. Having recovery guides, runbooks and redundancies is great, but I would argue that having trained employees who know how to form ad-hoc resolver teams and collectively solve difficult problems is even more important. If you wish to implement such practices at your company, take a look at our ICS and schedule your next Failure Friday.