Resilience Lifecycle

  • Set objectives -> Design & implement -> Evaluate & test -> Operate -> Respond & learn

Failure Modes

  • Gray failures
  • Database corruption
  • Overload

  • Solutions: multi-Region and multi-AZ deployments, backups

High Availability

  • Run the app so that it has no single point of failure
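
A minimal sketch of why removing a single point of failure helps, assuming replicas fail independently (a strong assumption that shared fate can violate). The function name and the 0.99 figure are illustrative, not taken from any AWS tool or SLA.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group of redundant replicas.

    Assumes failures are independent and that one healthy replica is
    enough to serve traffic. Both are simplifying assumptions.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s): {parallel_availability(0.99, n):.6f}")
```

With correlated failures (same AZ, same deployment, same bad config) the real figure is lower than this formula suggests, which is why fault isolation appears alongside redundancy in these notes.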

Disaster Recovery

  • RPO (Recovery Point Objective) / RTO (Recovery Time Objective)
  • Create bounded recovery times, recover fast!
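
A rough sketch of checking a backup schedule against RPO and RTO targets. It assumes periodic backups, so worst-case data loss is approximated by the interval between backups, and it assumes the recovery time was measured in a drill. The function name and all figures are hypothetical.

```python
from datetime import timedelta


def meets_objectives(
    backup_interval: timedelta,
    measured_recovery: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> dict:
    """Compare a backup schedule and a measured recovery time to RPO/RTO.

    Assumption: worst-case data loss is roughly one backup interval
    (a failure just before the next backup loses everything since the last).
    """
    return {
        "rpo_met": backup_interval <= rpo,   # how much data can be lost
        "rto_met": measured_recovery <= rto,  # how long recovery may take
    }


if __name__ == "__main__":
    result = meets_objectives(
        backup_interval=timedelta(hours=1),
        measured_recovery=timedelta(minutes=45),
        rpo=timedelta(minutes=15),
        rto=timedelta(hours=1),
    )
    print(result)
```

In this example the hourly backup misses a 15-minute RPO even though the 45-minute recovery fits a 1-hour RTO, so the two objectives need to be checked separately.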

Resilience Properties

  • Fault isolation
  • Sufficient capacity
  • Timely output
  • Correct output
  • Redundancy

Categories of failures (SEEMS)

  • Single points of failure
  • Excessive load
  • Excessive latency
  • Misconfiguration and bugs
  • Shared fate: violating intended fault isolation

FMEA (Failure Mode and Effects Analysis)? GMUD (change management)?

Resilience Trade-offs

  • Cost & effort
  • Complexity
  • Operational burden
  • Consistency & latency