This is your DevOps Reliability Cheat Sheet.
1️⃣ High Availability (HA)
Definition
System remains operational even if some components fail.
Analogy
Hospital with multiple doctors — if one is absent, others handle patients.
AWS Example
EC2 instances in two Availability Zones (AZs) behind an Application Load Balancer (ALB).
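A rough way to see why a second AZ helps: if each copy is available a fraction `a` of the time and failures are independent, at least one of `n` copies is up with probability 1 - (1 - a)^n. The sketch below assumes independence, which real AZs only approximate, and the function name is illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


two_az = combined_availability(0.99, 2)
print(f"{two_az:.4%}")
```

Two copies at 99% each give roughly 99.99% under this assumption; correlated failures (shared dependencies, bad deploys) lower that figure in practice.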
2️⃣ Fault Tolerance (FT)
Definition
System continues operating with no interruption when a component fails.
Analogy
Airplane with two engines — one fails, flight continues without disruption.
Example
Active-Active multi-region architecture.
Now the Important DevOps Reliability Terms
3️⃣ SLA – Service Level Agreement
Definition
Formal contract between provider and customer.
Defines:
- Availability %
- Support response time
- Compensation if broken
Example
Amazon EC2 SLA: 99.99% monthly uptime at the Region level (instances spread across two or more AZs).
Analogy
Restaurant promises food within 20 minutes or free.
4️⃣ SLO – Service Level Objective
Definition
Internal reliability target.
Example:
“We aim for 99.9% uptime.”
Not a customer-facing contract.
Analogy
Your personal fitness goal: “I will run 5km daily.”
5️⃣ SLI – Service Level Indicator
Definition
Metric that measures how the service is actually performing, usually compared against an SLO.
Examples:
- Uptime %
- Error rate
- Latency
- Request success rate
Analogy
Weight machine measuring your fitness progress.
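As an illustration, a request success rate SLI can be computed from two counters and compared with an SLO target. This is a minimal sketch; the counts and the 99.9% target are made-up example values.

```python
def success_rate(total_requests: int, failed_requests: int) -> float:
    """Request success rate SLI as a fraction between 0 and 1."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return (total_requests - failed_requests) / total_requests


SLO_TARGET = 0.999  # example internal target (99.9%)

sli = success_rate(total_requests=1_000_000, failed_requests=700)
meets_slo = sli >= SLO_TARGET
print(f"SLI = {sli:.4%}, meets SLO: {meets_slo}")
```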
⏱ 6️⃣ MTTR – Mean Time To Recovery
Definition
Average time to restore service after failure.
Formula:
Total downtime / Number of incidents
Analogy
How fast a mechanic fixes your car.
Lower MTTR = better reliability.
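The formula can be sketched directly in code. The incident durations below are invented sample data, and `mttr` is just an illustrative helper name.

```python
from datetime import timedelta


def mttr(downtimes: list[timedelta]) -> timedelta:
    """Mean time to recovery: total downtime divided by the number of incidents."""
    if not downtimes:
        raise ValueError("at least one incident is required")
    return sum(downtimes, timedelta()) / len(downtimes)


# Three example incidents: 12, 45 and 33 minutes of downtime (90 minutes total).
incidents = [timedelta(minutes=12), timedelta(minutes=45), timedelta(minutes=33)]
print(mttr(incidents))  # 0:30:00
```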
7️⃣ MTBF – Mean Time Between Failures
Definition
Average time a system runs between one failure and the next.
Formula:
Total uptime / Number of failures
Higher MTBF = better stability.
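MTBF is commonly computed as total uptime divided by the number of failures, and together with MTTR it gives a steady-state availability estimate of MTBF / (MTBF + MTTR). The sketch below uses hypothetical numbers and helper names.

```python
def mtbf(total_uptime_hours: float, failures: int) -> float:
    """Mean time between failures, in hours."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return total_uptime_hours / failures


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability estimate: uptime share of each failure cycle."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Example: 1,998 hours of uptime with 2 failures, each taking 1 hour to fix.
mean_between = mtbf(1998.0, 2)
estimate = availability(mean_between, mttr_hours=1.0)
print(f"MTBF = {mean_between} h, availability = {estimate:.3%}")
```

This shows why both numbers matter: availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR).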
⚡ 8️⃣ RTO – Recovery Time Objective
Definition
Maximum acceptable downtime after a failure before service must be restored.
Example:
“We must restore service within 30 minutes.”
9️⃣ RPO – Recovery Point Objective
Definition
Maximum acceptable data loss, measured as the time between the last recoverable copy and the failure.
Example:
“We can only lose 5 minutes of data.”
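One way to make RTO and RPO concrete is to check them against a recovery drill and the backup schedule. This sketch assumes worst-case data loss equals the interval between backups (ignoring how long a backup takes); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, expressed as time


def check_objectives(
    objectives: RecoveryObjectives,
    measured_recovery: timedelta,
    backup_interval: timedelta,
) -> dict[str, bool]:
    """Compare a measured recovery time and backup interval with the objectives."""
    return {
        "rto_met": measured_recovery <= objectives.rto,
        # Assumption: worst-case data loss is one full backup interval.
        "rpo_met": backup_interval <= objectives.rpo,
    }


objectives = RecoveryObjectives(rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
result = check_objectives(
    objectives,
    measured_recovery=timedelta(minutes=22),  # example drill result
    backup_interval=timedelta(minutes=15),    # example backup schedule
)
print(result)  # {'rto_met': True, 'rpo_met': False}
```

Here the drill meets the 30-minute RTO, but backups every 15 minutes cannot satisfy a 5-minute RPO.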
1️⃣0️⃣ Failover
Automatic switch to backup system when primary fails.
1️⃣1️⃣ Failback
Switching back to primary system after recovery.
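Failover and failback can be pictured as a small decision function driven by health checks. This is a simplified sketch with hypothetical names; it fails back as soon as the primary is healthy, whereas real systems often wait for a stability period first.

```python
def next_active(current: str, primary_healthy: bool, standby_healthy: bool) -> str:
    """Return which system should serve traffic next: "primary" or "standby"."""
    if current == "primary" and not primary_healthy and standby_healthy:
        return "standby"  # failover: primary is down, standby can take over
    if current == "standby" and primary_healthy:
        return "primary"  # failback: primary has recovered
    return current  # no change (including the case where nothing is healthy)


active = "primary"
active = next_active(active, primary_healthy=False, standby_healthy=True)
print(active)  # standby
active = next_active(active, primary_healthy=True, standby_healthy=True)
print(active)  # primary
```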
1️⃣2️⃣ Multi-AZ
Deploy across multiple availability zones within a region.
1️⃣3️⃣ Multi-Region
Deploy across multiple regions for disaster recovery.
1️⃣4️⃣ Blast Radius
Scope of failure impact.
Good architecture minimizes blast radius.
1️⃣5️⃣ Error Budget
Error budget = 100% − SLO: the amount of unreliability the SLO permits over a time window.
Example:
99.9% uptime over a 30-day month → 0.1% × 43,200 minutes ≈ 43 minutes of downtime.
Error budget ≈ 43 minutes.
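The arithmetic behind the example can be sketched as follows, assuming a 30-day window; the helper names and the 12 minutes of consumed downtime are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an SLO allows over the window (slo is a fraction, e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already incurred; negative means the SLO is breached."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)
left = budget_remaining(0.999, downtime_minutes=12.0)
print(f"budget = {budget:.1f} min, remaining = {left:.1f} min")
```

A team that has used 12 of its roughly 43 minutes still has about 31 minutes of budget for the rest of the month.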
1️⃣6️⃣ Availability %
Common Levels:
| Level | Downtime per year |
|---|---|
| 99% | 3.65 days |
| 99.9% | 8.76 hours |
| 99.99% | 52.6 minutes |
| 99.999% | 5.26 minutes |
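The table values follow from simple arithmetic on a 365-day year. A minimal sketch that recomputes them in minutes:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for level in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.1f} minutes ({minutes / 60:.2f} hours)")
```

Each extra "nine" cuts the allowed downtime by a factor of ten.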
1️⃣7️⃣ Resilience
System’s ability to absorb failures and recover from them.
1️⃣8️⃣ Scalability
Ability to handle increased load.
- Vertical scaling (bigger server)
- Horizontal scaling (more servers)
1️⃣9️⃣ Observability
Ability to understand system health.
Includes:
- Metrics
- Logs
- Traces
Tools:
Prometheus, Grafana, CloudWatch
2️⃣0️⃣ Incident Management
Process to handle outages, in order:
- Detect
- Mitigate
- Root cause analysis
- Postmortem
Full Relationship Map (Very Important)
SLA → external contract
SLO → internal goal
SLI → measurement
Error Budget → allowed failure
MTTR → recovery speed
MTBF → stability measure
RTO/RPO → disaster recovery limits