Sunday, February 15, 2026

DevOps Reliability Cheat Sheet

 This is your DevOps Reliability Cheat Sheet.


🔥 1️⃣ High Availability (HA)

Definition

System remains operational even if some components fail.

Analogy

Hospital with multiple doctors — if one is absent, others handle patients.

AWS Example

EC2 instances spread across two Availability Zones (AZs) behind an Application Load Balancer (ALB).


🔥 2️⃣ Fault Tolerance (FT)

Definition

System continues operating with zero interruption during failure.

Analogy

Airplane with two engines — one fails, flight continues without disruption.

Example

Active-Active multi-region architecture.


🛠 Now the Important DevOps Reliability Terms


📊 3️⃣ SLA – Service Level Agreement

Definition

Formal contract between provider and customer.

Defines:

  • Availability %

  • Support response time

  • Compensation if broken

Example

AWS EC2 SLA = 99.99% monthly uptime at the Region level, with service credits if the commitment is missed.

Analogy

Restaurant promises food within 20 minutes or free.


📈 4️⃣ SLO – Service Level Objective

Definition

Internal reliability target.

Example:
“We aim for 99.9% uptime.”

Not a customer-facing contract.

Analogy

Your personal fitness goal: “I will run 5km daily.”


5️⃣ SLI – Service Level Indicator

Definition

Metric used to measure performance.

Examples:

  • Uptime %

  • Error rate

  • Latency

  • Request success rate

Analogy

Weight machine measuring your fitness progress.
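
To make the SLI idea concrete, here is a minimal sketch that turns raw request counts into two of the indicators listed above (request success rate and error rate). The `RequestWindow` type and the sample numbers are hypothetical, and treating an empty window as 100% successful is an assumption you may want to change.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    failed: int


def success_rate(window: RequestWindow) -> float:
    """Percentage of requests in the window that succeeded."""
    if window.total == 0:
        # Assumption: no traffic counts as fully successful.
        return 100.0
    return 100.0 * (window.total - window.failed) / window.total


def error_rate(window: RequestWindow) -> float:
    """Percentage of requests in the window that failed."""
    return 100.0 - success_rate(window)


window = RequestWindow(total=10_000, failed=25)
print(f"success rate: {success_rate(window):.2f}%")
print(f"error rate:   {error_rate(window):.2f}%")
```

Comparing `success_rate(window)` against the SLO target is what tells you whether the objective is being met.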


⏱ 6️⃣ MTTR – Mean Time To Recovery

Definition

Average time to restore service after failure.

Formula:
Total downtime / Number of incidents

Analogy

How fast a mechanic fixes your car.

Lower MTTR = better reliability.
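
The MTTR formula above can be sketched in a few lines. The incident durations are made-up sample data, and the function name is only illustrative.

```python
from datetime import timedelta


def mean_time_to_recovery(downtimes: list[timedelta]) -> timedelta:
    """MTTR = total downtime / number of incidents."""
    if not downtimes:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(downtimes, timedelta()) / len(downtimes)


# Sample data: three incidents lasting 30, 45 and 15 minutes.
incidents = [timedelta(minutes=30), timedelta(minutes=45), timedelta(minutes=15)]
print(mean_time_to_recovery(incidents))  # prints 0:30:00
```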


💥 7️⃣ MTBF – Mean Time Between Failures

Definition

Average time a system runs between failures.

Higher MTBF = better stability.
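
The post gives no formula for MTBF. A commonly used one is total operating time divided by the number of failures. The sketch below assumes that formula, and the numbers are sample data.

```python
def mean_time_between_failures(uptime_hours: float, failures: int) -> float:
    """MTBF = total operating (up) time / number of failures (assumed formula)."""
    if failures <= 0:
        raise ValueError("need at least one failure to compute MTBF")
    return uptime_hours / failures


# Sample data: a 30-day month (720 h) with 3 failures and 3 h of total downtime.
mtbf = mean_time_between_failures(uptime_hours=720 - 3, failures=3)
print(f"MTBF: {mtbf:.1f} hours")
```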


⚡ 8️⃣ RTO – Recovery Time Objective

Definition

Maximum acceptable downtime.

Example:
“We must restore service within 30 minutes.”


💾 9️⃣ RPO – Recovery Point Objective

Definition

Maximum acceptable data loss.

Example:
“We can only lose 5 minutes of data.”
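
RTO and RPO are both time limits, so checking them after an incident is simple timestamp arithmetic. This sketch uses the 30-minute and 5-minute examples from the two sections above. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

RTO = timedelta(minutes=30)  # maximum acceptable downtime
RPO = timedelta(minutes=5)   # maximum acceptable data loss


def meets_rto(outage_start: datetime, restored_at: datetime, rto: timedelta = RTO) -> bool:
    """True if service came back within the recovery time objective."""
    return restored_at - outage_start <= rto


def meets_rpo(last_backup: datetime, outage_start: datetime, rpo: timedelta = RPO) -> bool:
    """True if the data written since the last backup fits within the RPO."""
    return outage_start - last_backup <= rpo


# Sample incident: last backup 11:57, outage 12:00, service restored 12:25.
last_backup = datetime(2026, 2, 15, 11, 57)
outage_start = datetime(2026, 2, 15, 12, 0)
restored_at = datetime(2026, 2, 15, 12, 25)

print("RTO met:", meets_rto(outage_start, restored_at))
print("RPO met:", meets_rpo(last_backup, outage_start))
```

In practice the RPO drives how often you back up or replicate, and the RTO drives how automated the recovery has to be.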


🔄 1️⃣0️⃣ Failover

Automatic switch to backup system when primary fails.


1️⃣1️⃣ Failback

Switching back to primary system after recovery.
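
Failover and failback can both be illustrated with one priority-ordered lookup: always route to the first healthy endpoint. This is only a toy sketch. The endpoint names are hypothetical, and the `health` dictionary stands in for real health checks.

```python
from typing import Callable, Sequence


def pick_endpoint(endpoints: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy endpoint; list order encodes priority (primary first)."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


# Hypothetical endpoints; the dict fakes the result of health checks.
ENDPOINTS = ["primary.example.internal", "standby.example.internal"]
health = {"primary.example.internal": False, "standby.example.internal": True}

# Primary is down, so traffic fails over to the standby.
during_outage = pick_endpoint(ENDPOINTS, lambda e: health[e])

# Primary recovers, so traffic fails back to it.
health["primary.example.internal"] = True
after_recovery = pick_endpoint(ENDPOINTS, lambda e: health[e])

print(during_outage, "->", after_recovery)
```

Real systems usually add a delay or manual approval before failback so that a flapping primary does not bounce traffic back and forth.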


🌍 1️⃣2️⃣ Multi-AZ

Deploy across multiple availability zones within a region.


🌎 1️⃣3️⃣ Multi-Region

Deploy across multiple regions for disaster recovery.


🧨 1️⃣4️⃣ Blast Radius

Scope of failure impact.

Good architecture minimizes blast radius.


📊 1️⃣5️⃣ Error Budget

Allowed downtime based on SLO.

Example:
99.9% uptime → about 43 minutes of downtime per 30-day month (0.1% of 43,200 minutes).

Error budget = 43 minutes.
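
The 43-minute figure comes from multiplying the length of the period by the allowed failure fraction. Here is a minimal sketch of that arithmetic, assuming a 30-day month. The function name and the "downtime used so far" value are illustrative.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - slo_percent) / 100.0


budget = error_budget_minutes(99.9)
downtime_so_far = 12.5  # sample value, in minutes
remaining = budget - downtime_so_far

print(f"budget:    {budget:.1f} min")
print(f"remaining: {remaining:.1f} min")
```

When the remaining budget approaches zero, teams typically slow down risky releases until the window resets.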


🚦 1️⃣6️⃣ Availability %

Common Levels:

Level      Downtime per year
99%        3.65 days
99.9%      8.76 hours
99.99%     52.6 minutes
99.999%    5.26 minutes
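
The downtime figures for each availability level can be reproduced with a short calculation, assuming a 365-day year. The function name is illustrative.

```python
def downtime_hours_per_year(availability_percent: float, days_per_year: int = 365) -> float:
    """Hours of allowed downtime per year at the given availability level."""
    hours_per_year = days_per_year * 24
    return hours_per_year * (100.0 - availability_percent) / 100.0


for level in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(level)
    print(f"{level}% -> {hours:.3f} hours ({hours * 60:.1f} minutes)")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why every step up tends to cost much more than the previous one.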

🛡 1️⃣7️⃣ Resilience

System’s ability to recover from failure.


1️⃣8️⃣ Scalability

Ability to handle increased load.

  • Vertical scaling (bigger server)

  • Horizontal scaling (more servers)


📉 1️⃣9️⃣ Observability

Ability to understand system health.

Includes:

  • Metrics

  • Logs

  • Traces

Tools:
Prometheus, Grafana, CloudWatch


🚨 2️⃣0️⃣ Incident Management

Process to handle outages:

  • Detect

  • Mitigate

  • Root cause analysis

  • Postmortem


🧠 Full Relationship Map (Very Important)

SLA → external contract
SLO → internal goal
SLI → measurement
Error Budget → allowed failure
MTTR → recovery speed
MTBF → stability measure
RTO/RPO → disaster recovery limits
