Quality attributes
-
Scalability
- LB
- DB Sharding
- API gateway (organization scalability/flexibility - decouple the frontend from the internal structure)
-
Performance
- CDN
- DB Indexing (partitioning and sorting)
- API gateway caching (availability over consistency)
- Buffering (message broker)
- Regionally infrastructure (multi data center deployment)
-
Availability
(Fault tolerance)- Database replication and redundancy (durability)
- Failover to other region
- Availability over Consistency (AP over CP)
- Snapshots & Backups
- Postmortems
-
Testability
- Deployability
- Maintainability
- Portability
- Security
- Observability
- Consistency
- Efficiency
- Usability
Scalability
- It's the measure of a system's ability to
handle a growing amount of work
by adding resources to the system -
The resources need to be well managed in order to be used effectively and avoid overheads
-
Vertical Scalability (scale up)
- Add or upgrade existing resources on a single computer
- E.g., more memory, cpu
- Any application can benefit from it (node code changes required)
- Horizontal Scalability (scale out)
- Add more resources in form of new instances/machines
- No limit on scalability
- The system should be designed to scale horizontally (sometimes not easily refactored, e.g., concurrency issues)
- Increased complexity, coordination overhead
- Team/Org Scalability
- More engineers
- Better to have smaller groups of engineers each working on a module/service of the system
Performance
Response Time
- Also known as End-to-End latency
- Composed of the
processing time
+waiting time
-
A common mistake is to consider the e2e latency to be only the processing time
-
The metrics used is the
xth percentile
- It's the value below which x% of the occurrences can be found
- E.g., 300 ms of 99% latency percentile: 99% of the requests are responded in under 300 ms
- The other portion with the worst metrics (e.g., the other 1%) is the
tail latency
- We want the tail latency to be as short as possible, e.g., 99% (1% tail latency)
Throughput
- Amount of work performed by the system per unit of time
- When the throughput is so high that the system can't handle, a
performance degradation
starts whole system perform in a degraded state. This point is calleddegradation point
Availability
$$Availability = \frac{Uptime}{Uptime+Downtime}$$
Other metrics:
MTBF
: Mean Time Between Failures- Average time the system is operational
- Because the uptime
MTTR
: Mean Time to Recovery- Average time to detect and recover from a failure
- Basically the downtime
$$Availability = \frac{MTBF}{MTBF+MTTR}$$
Availability | Daily Downtime | Weekly Downtime | Monthly Downtime |
---|---|---|---|
99% (2 nines) | 14 min 25 s | 1 h 40 min | 7 h 18 m |
99.9% (3 nines) | 1 min 26 s | 43 m 12 s | 8 h 45 min |
- Usual causes of unavailability
Human Error
: faulty codeSoftware Errors
: long garbage collections, out-of-memory exceptions-
Hardware Failures
: network failures, servers, routers, power outages -
Fault Tolerance is a way to achieve a higher availability despite failures