This is your DevOps Reliability Cheat Sheet.

🔥 1️⃣ High Availability (HA)

Definition

System remains operational, with at most a brief interruption, even if some components fail.

Analogy

Hospital with multiple doctors — if one is absent, others handle patients.

AWS Example

EC2 instances across 2 Availability Zones (AZs) behind an Application Load Balancer (ALB).
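To see why a second AZ helps, here is a minimal sketch of the usual redundancy arithmetic. It assumes AZ failures are independent and ignores the load balancer's own availability; the function name and the 99% per-AZ figure are illustrative, not AWS numbers.

```python
def combined_availability(per_replica_availability: float, replica_count: int) -> float:
    """Availability of a service that stays up while at least one replica is up.

    Assumes replica failures are independent, which is an idealization.
    """
    if not 0.0 <= per_replica_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replica_count < 1:
        raise ValueError("need at least one replica")
    # The service is down only when every replica is down at the same time.
    probability_all_down = (1.0 - per_replica_availability) ** replica_count
    return 1.0 - probability_all_down


if __name__ == "__main__":
    # Hypothetical figure: each AZ deployment is available 99% of the time.
    print(f"1 AZ:  {combined_availability(0.99, 1):.4%}")
    print(f"2 AZs: {combined_availability(0.99, 2):.4%}")
```

Under these assumptions, two 99% replicas behave like a single 99.99% system. Correlated failures (a bad deploy, a regional outage) reduce that gain in practice.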

🔥 2️⃣ Fault Tolerance (FT)

Definition

System continues operating with zero interruption during failure.

Analogy

Airplane with two engines — one fails, flight continues without disruption.

Example

Active-Active multi-region architecture.

🛠 Now the Important DevOps Reliability Terms

📊 3️⃣ SLA – Service Level Agreement

Definition

Formal contract between provider and customer.

Defines:

Availability %
Support response time
Compensation if broken

Example

AWS EC2 Region-level SLA = 99.99% monthly uptime (for instances deployed across two or more AZs in a region)

Analogy

Restaurant promises food within 20 minutes or free.

📈 4️⃣ SLO – Service Level Objective

Definition

Internal reliability target.

Example:
“We aim for 99.9% uptime.”

Not a customer-facing contract. Usually set stricter than the SLA to leave a safety margin.

Analogy

Your personal fitness goal: “I will run 5km daily.”

📏 5️⃣ SLI – Service Level Indicator

Definition

Metric used to measure performance.

Examples:

Uptime %
Error rate
Latency
Request success rate

Analogy

Weighing scale measuring your fitness progress.
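As a rough illustration of turning raw counts into an SLI, the sketch below computes a request success rate. The function name and the "no traffic counts as 100%" convention are assumptions made for this example, not a standard.

```python
def success_rate_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded (0.0 to 1.0)."""
    if successful_requests < 0 or total_requests < 0:
        raise ValueError("counts must be non-negative")
    if successful_requests > total_requests:
        raise ValueError("successful requests cannot exceed total requests")
    if total_requests == 0:
        # Assumed convention: with no traffic there were no failures.
        return 1.0
    return successful_requests / total_requests


if __name__ == "__main__":
    # Hypothetical counts for one measurement window.
    sli = success_rate_sli(successful_requests=9_990, total_requests=10_000)
    print(f"Request success rate: {sli:.3%}")
```

The resulting number is what gets compared against the SLO, for example a 99.9% success-rate target.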

⏱ 6️⃣ MTTR – Mean Time To Recovery

Definition

Average time to restore service after failure.

Formula:
Total downtime / Number of incidents

Analogy

How fast a mechanic fixes your car.

Lower MTTR = better reliability.
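The formula above can be sketched in a few lines. The incident durations are made-up sample data, and the function name is only illustrative.

```python
from datetime import timedelta


def mean_time_to_recovery(downtimes: list[timedelta]) -> timedelta:
    """MTTR = total downtime / number of incidents."""
    if not downtimes:
        raise ValueError("need at least one incident to compute MTTR")
    total_downtime = sum(downtimes, timedelta())
    return total_downtime / len(downtimes)


if __name__ == "__main__":
    # Hypothetical incidents: 10, 20, and 60 minutes of downtime.
    incidents = [timedelta(minutes=10), timedelta(minutes=20), timedelta(minutes=60)]
    print(f"MTTR: {mean_time_to_recovery(incidents)}")
```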

💥 7️⃣ MTBF – Mean Time Between Failures

Definition

Average time system runs between one failure and the next.

Formula:
Total uptime / Number of failures

Higher MTBF = better stability.
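A small sketch of how MTBF is commonly computed and combined with MTTR to estimate availability. It assumes the common convention MTBF = total uptime / number of failures and the steady-state relation availability = MTBF / (MTBF + MTTR). Some teams define MTBF differently, so treat the names and numbers as illustrative.

```python
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    """MTBF = total uptime / number of failures (one common convention)."""
    if failure_count <= 0:
        raise ValueError("need at least one failure to compute MTBF")
    return total_uptime_hours / failure_count


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Estimated availability from MTBF and MTTR, both in the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    # Hypothetical month: 720 hours of uptime, 3 failures, 1 hour MTTR.
    mtbf = mtbf_hours(total_uptime_hours=720, failure_count=3)
    print(f"MTBF: {mtbf:.1f} hours")
    print(f"Estimated availability: {steady_state_availability(mtbf, mttr=1.0):.3%}")
```

This shows why both numbers matter: availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR).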

⚡ 8️⃣ RTO – Recovery Time Objective

Definition

Maximum acceptable downtime.

Example:
“We must restore service within 30 minutes.”

💾 9️⃣ RPO – Recovery Point Objective

Definition

Maximum acceptable data loss.

Example:
“We can only lose 5 minutes of data.”
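To make the difference between RTO and RPO concrete, here is a hedged sketch that checks one recovery against both objectives. The timestamps and the function name are hypothetical. RTO is compared with how long the service was down, RPO with how old the last usable backup was at the moment of failure.

```python
from datetime import datetime, timedelta


def check_recovery_objectives(
    failure_at: datetime,
    last_backup_at: datetime,
    restored_at: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> dict[str, bool]:
    """Report whether a single recovery met its RTO and RPO."""
    downtime = restored_at - failure_at          # measured against RTO
    data_loss_window = failure_at - last_backup_at  # measured against RPO
    return {
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


if __name__ == "__main__":
    # Hypothetical incident: backup at 11:57, failure at 12:00, restored at 12:25.
    result = check_recovery_objectives(
        failure_at=datetime(2026, 2, 15, 12, 0),
        last_backup_at=datetime(2026, 2, 15, 11, 57),
        restored_at=datetime(2026, 2, 15, 12, 25),
        rto=timedelta(minutes=30),
        rpo=timedelta(minutes=5),
    )
    print(result)
```

In practice, RPO mostly drives backup or replication frequency, while RTO drives how automated the restore path needs to be.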

🔄 1️⃣0️⃣ Failover

Switch to backup system, usually automatic, when primary fails.

🔁 1️⃣1️⃣ Failback

Switching back to primary system after recovery.
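Failover and failback can be pictured as a tiny state machine driven by health checks. This is a simplified sketch with hypothetical names (`FailoverRouter`, `primary-db`, `backup-db`). Real systems usually require several consecutive healthy checks before failing back, to avoid flapping.

```python
class FailoverRouter:
    """Tracks which system should receive traffic, based on primary health."""

    def __init__(self, primary: str, backup: str) -> None:
        self.primary = primary
        self.backup = backup
        self.active = primary

    def update(self, primary_healthy: bool) -> str:
        """Apply one health-check result and return the active system."""
        if not primary_healthy and self.active == self.primary:
            # Failover: primary is down, send traffic to the backup.
            self.active = self.backup
        elif primary_healthy and self.active == self.backup:
            # Failback: primary has recovered, return traffic to it.
            self.active = self.primary
        return self.active


if __name__ == "__main__":
    router = FailoverRouter(primary="primary-db", backup="backup-db")
    # Hypothetical sequence of health-check results for the primary.
    for healthy in (True, False, False, True):
        print(f"primary healthy={healthy} -> active={router.update(healthy)}")
```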

🌍 1️⃣2️⃣ Multi-AZ

Deploy across multiple availability zones within a region.

🌎 1️⃣3️⃣ Multi-Region

Deploy across multiple regions for disaster recovery.

🧨 1️⃣4️⃣ Blast Radius

Scope of failure impact: how many users, services, or resources a single failure affects.

Good architecture minimizes blast radius.

📊 1️⃣5️⃣ Error Budget

Allowed unreliability based on SLO (100% minus the SLO target).

Example:
99.9% uptime → about 43 minutes downtime per 30-day month.

Error budget = 0.1% of the month ≈ 43 minutes.
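The arithmetic behind that figure, plus a way to track how much budget is left, can be sketched as follows. The 30-day window and the function names are assumptions chosen for this example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an SLO given as a fraction (0.999 = 99.9%)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining_minutes(
    slo: float, downtime_so_far_minutes: float, window_days: int = 30
) -> float:
    """Budget left in the window; a negative value means the SLO is breached."""
    return error_budget_minutes(slo, window_days) - downtime_so_far_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"Monthly budget at 99.9%: {budget:.1f} minutes")
    # Hypothetical: 10 minutes of downtime already used this month.
    print(f"Remaining: {budget_remaining_minutes(0.999, 10.0):.1f} minutes")
```

A common use of the remaining budget is as a release gate: when it runs low, teams slow down risky changes until the window resets.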

🚦 1️⃣6️⃣ Availability %

Common Levels:

Availability	Downtime per year
99%	3.65 days
99.9%	8.76 hours
99.99%	52.6 minutes
99.999%	5.26 minutes
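The table values follow from one multiplication. Here is a short sketch that reproduces them, assuming a 365-day year (the function name is illustrative).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime in minutes for a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(level)
        print(f"{level}%\t{minutes:.2f} minutes ({minutes / 60:.2f} hours)")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why every step up tends to cost noticeably more to achieve.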

🛡 1️⃣7️⃣ Resilience

System’s ability to withstand failure and recover from it.

🔁 1️⃣8️⃣ Scalability

Ability to handle increased load.

Vertical scaling (bigger server)
Horizontal scaling (more servers)

📉 1️⃣9️⃣ Observability

Ability to understand system health and internal state from the data it emits.

Includes:

Metrics
Logs
Traces

Tools:
Prometheus, Grafana, CloudWatch

🚨 2️⃣0️⃣ Incident Management

Process to handle outages:

Detect
Mitigate
Root cause analysis
Postmortem

🧠 Full Relationship Map (Very Important)

SLA → external contract
SLO → internal goal
SLI → measurement
Error Budget → allowed failure
MTTR → recovery speed
MTBF → stability measure
RTO/RPO → disaster recovery limits

Sunday, February 15, 2026

DevOps Reliability Cheat Sheet
