Skip to content

Fault Tolerance

  • Enables the system to remain operational despite failure within one or multiple of its components

Failure prevention

Replication & Redundancy

  • Active-Active architecture
  • Multiple database replicas actively receive workload
  • The multiple database sync between each other
  • If a database replica goes down, the others same its load
  • Pros: load is distributed. Cons: synchronization is very complex
  • Active-Passive architecture
  • One primary (master/leader) replica that receives all the workload and multiple passive/follower replicas
  • The other passive replicas take periodic snapshots of the master replica
  • Pros: implementation is easier with snapshots. Cons: won't distribute the load

Failure detection & isolation

  • An auxiliary Monitoring Service that checks the health of the instances by:
  • Sending periodic health-check requests to the services
  • Receiving period heart-beat requests from the services
  • Whenever an unhealthy service is detected it is then cut off from the user facing

  • The monitoring service can be even more smart by:

  • Monitoring the error rate in the hosts
  • Monitoring the response time on the hosts ... and detect if it's in an unhealthy state

Recovery

  • Mitigation
  • Cut off traffic to the unhealthy host
  • Rollback to a previous code version
  • Rollback to a previous database state
  • Processes
  • Automatic alerts to engineers
  • Predefined handbooks/playbooks on what to do in certain situations
  • Mechanisms
  • Automatic failovers, restarts, rollbacks, auto-scaling policies