Chaos Engineering
Break something on purpose to validate your assumptions/hypotheses!
- Resilience testing = Chaos testing
- Improve resilience and performance
- Uncover hidden issues
- Expose blind spots
Phases
- Nothing
- Learning
- Rehearse
- Experimentation
  - Inject failure
  - Failure mode analysis: Define hypothesis
  - Validate the assertions
- Validation tests
  - Assert the hypotheses
- Continuous fault injection
  - Experiment in a loop
Game Days
- It's a kind of chaos testing, but focused on the people components (operational procedures)
- Did everyone know what to do?
- Were runbooks correct and complete?
- Did the system recover within the expected time?
- It works like a training mechanism! (Train on-calls)
Types of validation tests
- Dependency isolation
- Recovery actions
- Alarm sensitivity
- Sufficient capacity
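As an illustration of the first type, here is a minimal sketch of a dependency isolation test. Everything in it is hypothetical (the `get_recommendations` function, the fallback value, and the two fake dependencies are assumptions made for the example, not part of any real service). The idea is only to show the shape of such a test: inject a failure into a dependency and assert that the caller still returns a usable answer.

```python
from __future__ import annotations

from typing import Callable


def get_recommendations(fetch: Callable[[], list[str]]) -> list[str]:
    """Return recommendations, falling back to a static default if the dependency fails."""
    try:
        return fetch()
    except Exception:
        # Dependency isolation: the failure is contained here,
        # so the caller still gets a (degraded) answer.
        return ["bestsellers"]


def healthy_dependency() -> list[str]:
    # Hypothetical dependency behaving normally.
    return ["item-1", "item-2"]


def failing_dependency() -> list[str]:
    # Injected fault: simulate the dependency being unavailable.
    raise ConnectionError("injected fault: dependency unavailable")


def test_dependency_isolation() -> bool:
    """Hypothesis: the service still answers when the dependency is down."""
    normal = get_recommendations(healthy_dependency)
    degraded = get_recommendations(failing_dependency)
    return normal == ["item-1", "item-2"] and degraded == ["bestsellers"]
```

The other types follow the same pattern with a different assertion: a recovery action completes, an alarm fires, or capacity stays sufficient while the fault is active.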
Process
Chaos engineering starts by identifying the steady state of a system (its stable, normal behavior). Each experiment then follows these steps:
- Form a hypothesis about how the system behaves under a failure
- Run experiment using fault injection
- Verify results
- Improve the system
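The steps above can be sketched as a small experiment runner. This is a toy model under stated assumptions, not a real chaos tool: `ToyService`, its replica counts, and the `run_experiment` helper are all hypothetical names invented for the example. The hypothesis being tested is "the system stays in its steady state while one replica is down".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool    # was the system healthy before the fault?
    hypothesis_held: bool  # did it stay in steady state during the fault?
    steady_after: bool     # did it return to steady state after cleanup?


def run_experiment(
    check_steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
) -> ExperimentResult:
    steady_before = check_steady_state()
    if not steady_before:
        # Never inject a fault into a system that is already unhealthy.
        return ExperimentResult(False, False, False)
    inject_fault()
    try:
        hypothesis_held = check_steady_state()
    finally:
        # Always clean up the fault, even if the check raises.
        remove_fault()
    return ExperimentResult(True, hypothesis_held, check_steady_state())


class ToyService:
    """Hypothetical service: steady state means enough replicas are running."""

    def __init__(self, replicas: int = 3, min_replicas: int = 2) -> None:
        self.replicas = replicas
        self.min_replicas = min_replicas

    def is_steady(self) -> bool:
        return self.replicas >= self.min_replicas

    def kill_replica(self) -> None:
        self.replicas -= 1

    def restore_replica(self) -> None:
        self.replicas += 1


service = ToyService(replicas=3)
result = run_experiment(service.is_steady, service.kill_replica, service.restore_replica)
```

With three replicas the hypothesis holds. Starting the same toy service with `replicas=2` makes it fail during the fault, which is the signal to go back and improve the system.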