Skip to content

Quality attributes

  1. Scalability

    • LB
    • DB Sharding
    • API gateway (organization scalability/flexibility - decouple the frontend from the internal structure)
  2. Performance

    • CDN
    • DB Indexing (partitioning and sorting)
    • API gateway caching (availability over consistency)
    • Buffering (message broker)
    • Regionally infrastructure (multi data center deployment)
  3. Availability (Fault tolerance)

    • Database replication and redundancy (durability)
    • Failover to other region
    • Availability over Consistency (AP over CP)
    • Snapshots & Backups
    • Postmortems
  4. Testability

  5. Deployability
  6. Maintainability
  7. Portability
  8. Security
  9. Observability
  10. Consistency
  11. Efficiency
  12. Usability

Scalability

  • It's the measure of a system's ability to handle a growing amount of work by adding resources to the system
  • The resources need to be well managed in order to be used effectively and avoid overheads

  • Vertical Scalability (scale up)

  • Add or upgrade existing resources on a single computer
  • E.g., more memory, cpu
  • Any application can benefit from it (node code changes required)
  • Horizontal Scalability (scale out)
  • Add more resources in form of new instances/machines
  • No limit on scalability
  • The system should be designed to scale horizontally (sometimes not easily refactored, e.g., concurrency issues)
  • Increased complexity, coordination overhead
  • Team/Org Scalability
  • More engineers
  • Better to have smaller groups of engineers each working on a module/service of the system

Performance

Response Time

  • Also known as End-to-End latency
  • Composed of the processing time + waiting time
  • A common mistake is to consider the e2e latency to be only the processing time

  • The metrics used is the xth percentile

  • It's the value below which x% of the occurrences can be found
  • E.g., 300 ms of 99% latency percentile: 99% of the requests are responded in under 300 ms
  • The other portion with the worst metrics (e.g., the other 1%) is the tail latency
  • We want the tail latency to be as short as possible, e.g., 99% (1% tail latency)

Throughput

  • Amount of work performed by the system per unit of time
  • When the throughput is so high that the system can't handle, a performance degradation starts whole system perform in a degraded state. This point is called degradation point

Availability

$$Availability = \frac{Uptime}{Uptime+Downtime}$$

Other metrics:

  • MTBF: Mean Time Between Failures
  • Average time the system is operational
  • Because the uptime
  • MTTR: Mean Time to Recovery
  • Average time to detect and recover from a failure
  • Basically the downtime

$$Availability = \frac{MTBF}{MTBF+MTTR}$$

Availability Daily Downtime Weekly Downtime Monthly Downtime
99% (2 nines) 14 min 25 s 1 h 40 min 7 h 18 m
99.9% (3 nines) 1 min 26 s 43 m 12 s 8 h 45 min
  • Usual causes of unavailability
  • Human Error: faulty code
  • Software Errors: long garbage collections, out-of-memory exceptions
  • Hardware Failures: network failures, servers, routers, power outages

  • Fault Tolerance is a way to achieve a higher availability despite failures