Monday, February 9, 2026

RPO and RTO

 

RPO — Recovery Point Objective

“How much data can we afford to lose?”

Simple definition

RPO defines the maximum acceptable data loss, measured in time.

If a system fails, RPO answers:
“Up to what point in time must data be recovered?”


Example

  • RPO = 15 minutes

  • Disaster happens at 10:00 AM

  • You must be able to restore data up to 9:45 AM

  • Data created between 9:45–10:00 can be lost

📌 Smaller RPO = more frequent backups / replication
📌 Lower RPO = higher cost & complexity
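To make the arithmetic concrete, here is a small sketch (the backup schedule and times are illustrative) that finds the last restorable point for a given backup interval:

```python
from datetime import datetime, timedelta

def last_backup_before(disaster: datetime, interval: timedelta,
                       schedule_start: datetime) -> datetime:
    """Last completed backup strictly before the disaster.

    Backups run at schedule_start, schedule_start + interval, ...
    A backup starting exactly at the disaster instant is assumed lost.
    """
    n = (disaster - schedule_start) // interval
    last = schedule_start + n * interval
    if last >= disaster:
        last -= interval
    return last

# RPO = 15 minutes, disaster at 10:00 AM -> restorable up to 9:45 AM
disaster = datetime(2026, 2, 9, 10, 0)
point = last_backup_before(disaster, timedelta(minutes=15),
                           datetime(2026, 2, 9, 0, 0))
print(point.strftime("%H:%M"))  # 09:45
```

The worst-case data loss equals the backup interval, which is why a smaller RPO forces more frequent backups or continuous replication.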


Real-world RPO examples

  • Banking / Trading: seconds to minutes

  • E-commerce orders: < 5 minutes

  • Internal tools: hours

  • Logs / Analytics: 24 hours

How RPO is achieved

  • Backup frequency (hourly, daily)

  • Database replication

  • Snapshot schedules

  • Cross-region replication


RTO — Recovery Time Objective

“How fast must the system be back online?”

Simple definition

RTO defines the maximum acceptable downtime after a failure.

If a system goes down, RTO answers:
“How quickly must service be restored?”


Example

  • RTO = 30 minutes

  • System crashes at 10:00 AM

  • Service must be fully operational by 10:30 AM

📌 Lower RTO = faster recovery systems
📌 Lower RTO = higher cost


Real-world RTO examples

  • Payments: minutes

  • Customer-facing apps: < 1 hour

  • Internal dashboards: several hours

  • Batch processing: 1 day

How RTO is achieved

  • Standby infrastructure (warm / hot)

  • Automated failover

  • Load balancers + health checks

  • Infrastructure as Code (fast rebuild)

  • Pre-tested runbooks
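A quick way to sanity-check an RTO target is to add up the recovery steps in the runbook; the step names and timings below are hypothetical:

```python
def meets_rto(step_minutes: dict, rto_minutes: int) -> bool:
    """True if the summed recovery steps fit within the RTO."""
    return sum(step_minutes.values()) <= rto_minutes

# Hypothetical runbook timings for a warm-standby failover
runbook = {
    "detect and declare incident": 5,
    "promote standby": 10,
    "switch DNS / load balancer": 10,
    "smoke tests": 5,
}
print(meets_rto(runbook, 30))  # 30 minutes of steps fit a 30-minute RTO
```

If the sum exceeds the RTO, you either automate slow steps or move to a warmer DR strategy.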


RPO vs RTO (Easy Comparison)

  • Focus: RPO = data loss; RTO = downtime

  • Measured in: time (both)

  • Question: RPO asks "How much data can we lose?"; RTO asks "How fast must we recover?"

  • Controlled by: RPO via backup & replication; RTO via failover & automation

One-liner to remember (🔥 interview gold)

RPO is about data loss, RTO is about downtime.

Or even shorter:

RPO = how far back,
RTO = how fast forward.


Mapping to DR strategies (very important)

  • Backup & Restore: high RPO, high RTO

  • Pilot Light: medium RPO, medium RTO

  • Warm Standby: low RPO, low RTO

  • Active-Active: near-zero RPO, near-zero RTO


Quick Memory Trick 🔥

  • RPO → Point in time (data loss)

  • RTO → Time to recover (downtime)

Develop and maintain Infrastructure as Code using Terraform, Ansible, and cloud-native IaC tools

 

1) “Develop and maintain Infrastructure as Code using Terraform, Ansible, and cloud-native IaC tools.”

What it means (in simple words)

Instead of clicking in cloud portals, you define infrastructure in code and deploy it consistently:

  • Terraform → provisions cloud resources (VPC/VNet, subnets, EKS/AKS, IAM, RDS, etc.)

  • Ansible → configures servers/apps after provisioning (packages, config files, users, services)

  • Cloud-native IaC → AWS CloudFormation, Azure Bicep/ARM, GCP Deployment Manager (or Pulumi)

Why it’s helpful in development

  • Repeatable environments: Dev/Test/Prod become identical (less “works on my machine”)

  • Faster onboarding: New env in minutes

  • Safer changes: PR review + plan output

  • Audit trail: Git history shows who changed infra and why

  • Disaster recovery: You can recreate infra from code

How to implement (real-world approach)

A. Terraform structure (common enterprise pattern)

  • Create modules: network, eks/aks, rds, iam, monitoring

  • Separate environments using workspaces or folders:

    • environments/dev, environments/stage, environments/prod

  • Use remote state:

    • S3 + DynamoDB lock (AWS) / Azure Storage + locking / GCS (GCP)

B. Terraform workflow

  1. Dev opens PR with Terraform changes

  2. CI runs:

    • terraform fmt (format)

    • terraform validate

    • tflint (lint)

    • checkov/tfsec (security scans)

    • terraform plan and posts output to PR

  3. After approval:

    • terraform apply runs via pipeline (not a laptop)

C. Where Ansible fits
Use Ansible when you have:

  • VMs, legacy apps, OS-level configs

  • A need to install agents (Datadog, Splunk), configure Nginx, or apply patching and hardening

Implementation:

  • Terraform creates the VM + network + security groups

  • Ansible configures the VM (idempotent playbooks)

  • Use Ansible Vault/Secrets manager for credentials

Rule of thumb: Terraform creates infrastructure; Ansible configures software and OS.
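As a sketch of the Ansible side, a minimal idempotent playbook (the `web` host group is a hypothetical inventory group) that installs and starts Nginx on Debian/Ubuntu hosts:

```yaml
- hosts: web               # hypothetical inventory group
  become: true
  tasks:
    - name: Install Nginx
      ansible.builtin.apt:
        name: nginx
        state: present
    - name: Ensure Nginx is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```

Because both tasks are idempotent, re-running the playbook on an already-configured VM changes nothing.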


2) “Implement HA/DR strategies, load balancing, and hybrid networking solutions.”

What it means

You design systems that:

  • Stay up when components fail (HA: High Availability)

  • Recover from disasters (DR: Disaster Recovery)

  • Distribute traffic safely (load balancing)

  • Connect on-prem + cloud securely (hybrid networking)

Why it’s helpful in development

  • Your app won’t go down during deployments, node failures, AZ outages

  • You meet enterprise requirements (uptime/SLA)

  • You avoid “big bang outages” by designing for failure

How to implement HA (practical patterns)

Compute HA

  • Run services across multiple Availability Zones

  • Kubernetes:

    • replicas across AZs

    • PodDisruptionBudgets

    • readiness/liveness probes

    • autoscaling (HPA + Cluster Autoscaler)
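As one concrete piece of the compute-HA list above, a minimal HorizontalPodAutoscaler sketch (the Deployment name and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api            # hypothetical Deployment name
  minReplicas: 3         # keep baseline capacity across AZs
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Pair this with topology spread constraints or anti-affinity so the replicas actually land in different Availability Zones.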

Database HA

  • Managed DB Multi-AZ (RDS Multi-AZ / Azure SQL HA / Cloud SQL HA)

  • Read replicas for scaling reads

  • Backup + point-in-time recovery

Load Balancing

  • L7 LB (HTTP/HTTPS): Ingress controller / ALB / App Gateway

  • L4 LB (TCP): NLB / Azure Load Balancer

Implement:

  • TLS termination

  • health checks

  • weighted routing / blue-green / canary (advanced)

How to implement DR (real DR)

Define:

  • RPO (how much data loss is acceptable)

  • RTO (how fast must you recover)

Common DR strategies:

  • Backup & Restore (cheap, slower RTO)

  • Pilot Light (core running, scale on disaster)

  • Warm Standby (smaller version running)

  • Active-Active (best RTO, complex & expensive)

Implementation checklist:

  • Automated backups + tested restores

  • Infrastructure reproducible with IaC

  • Replicated data (cross-region where required)

  • Runbooks + game days (DR drills)

Hybrid networking (on-prem + cloud)

Typical solutions:

  • Site-to-Site VPN (quick, cheaper)

  • Dedicated private link: AWS Direct Connect / Azure ExpressRoute (more stable)

  • DNS strategy: split-horizon DNS, private endpoints

  • Routing: BGP, route tables, NAT, firewall rules

  • Security: least privilege, segmentation, inspection (firewalls), zero trust


3) “Integrate monitoring, SLOs, and quality gates into CI/CD pipelines for resilient delivery.”

What it means

You don’t just deploy fast — you deploy safely:

  • Monitoring tells you if systems are healthy

  • SLOs define reliability targets

  • Quality gates block bad changes before production

Why it’s helpful

  • Stops broken builds and insecure deployments early

  • Prevents outages by using measurable “go/no-go” checks

  • Makes release quality consistent across teams

How to implement Monitoring (standard stack)

  • Metrics: Prometheus / CloudWatch / Azure Monitor

  • Logs: ELK / Loki / Splunk

  • Traces: OpenTelemetry + Jaeger/Tempo

  • Alerts: PagerDuty / Opsgenie / Teams/Slack

Minimum monitoring baseline:

  • Golden signals: latency, traffic, errors, saturation

  • Dashboards per service + per dependency

  • Alerting tied to customer impact (not noisy)

What are SLOs (in practical terms)

  • SLA = promise to customer (contract)

  • SLO = internal target (engineering goal)

Example:

  • SLO: 99.9% successful requests per 30 days

  • Error budget = allowed failure (0.1%) → guides release speed
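The error-budget arithmetic can be sketched as follows (the traffic volume is illustrative):

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Allowed failed requests over a window for a given SLO."""
    return round(total_requests * (1 - slo))

# 99.9% SLO over 30 days at ~100 requests/second
total = 100 * 60 * 60 * 24 * 30    # 259,200,000 requests
print(error_budget(0.999, total))  # 259,200 failures allowed
```

Once the budget for the window is spent, releases slow down or stop until reliability recovers.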

How to integrate “quality gates” in CI/CD

Think of gates as checkpoints:

CI gates (before merge)

  • Unit tests + coverage threshold

  • Linting + formatting

  • SAST (code security)

  • Dependency scan (SCA)

  • IaC scan (tfsec/checkov)

  • Container scan (Trivy)

  • Policy-as-code (OPA/Conftest)

CD gates (before production)

  • Manual approval for prod (optional)

  • Smoke tests on staging

  • Deployment strategy checks:

    • canary/blue-green

    • automatic rollback if health degrades

  • SLO-based gate:

    • “If error rate > X% or latency > Yms → block / rollback”
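The SLO-based gate above can be sketched as a simple check (the thresholds are illustrative, not standard values):

```python
def slo_gate(error_rate: float, p99_latency_ms: float,
             max_error_rate: float = 0.01,
             max_latency_ms: float = 500.0) -> str:
    """Block / roll back the release when canary metrics breach targets."""
    if error_rate > max_error_rate or p99_latency_ms > max_latency_ms:
        return "rollback"
    return "promote"

print(slo_gate(0.002, 320))  # healthy canary -> promote
print(slo_gate(0.050, 320))  # error rate breached -> rollback
```

In practice tools like Argo Rollouts or Flagger evaluate exactly this kind of rule against live metrics during a canary.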

Post-deploy gates (progressive delivery)

  • Argo Rollouts / Flagger

  • Metrics analysis during rollout

  • Automated rollback on bad metrics

✅ The “resilient delivery” part comes from progressive rollouts + SLO-based automated rollback.

CPU / GPU / TPU / NPU in Kubernetes & MLOps

 




1️⃣ CPU / GPU / TPU / NPU in Kubernetes & MLOps

🧠 CPU — Control Plane & Orchestration

In Kubernetes:

  • Runs:

    • API servers

    • Controllers

    • CI/CD tools

    • FastAPI / backend services

  • Handles:

    • Request routing

    • Feature engineering

    • Pre/post-processing

📌 In MLOps

  • Data ingestion

  • Model orchestration

  • Pipeline coordination (Airflow, Kubeflow)

  • Lightweight inference

👉 CPUs coordinate, not accelerate.


🟢 GPU — Training & High-Throughput Inference

In Kubernetes:

  • GPUs are exposed as extended resources

  • Pods explicitly request them

  • Managed via NVIDIA device plugin

📌 In MLOps

  • Model training

  • Batch inference

  • LLM serving

  • Embedding generation

👉 GPUs do heavy parallel math.


🟠 TPU — Large-Scale AI Training

In Kubernetes:

  • Mostly used in Google Cloud

  • Integrated via specialised runtimes

📌 In MLOps

  • Training massive neural networks

  • Research-scale workloads

👉 TPUs are specialised accelerators, not general compute.


🟣 NPU — Edge & Low-Latency AI

In Kubernetes:

  • Rare in standard clusters

  • Used in:

    • Edge Kubernetes

    • AI PCs

    • Embedded systems

📌 In MLOps

  • On-device inference

  • Privacy-preserving AI

  • Real-time decisions

👉 NPUs run AI close to the user.


2️⃣ 🎯 GPU Scheduling in Kubernetes (Interview Gold)

How Kubernetes schedules GPUs

  • GPUs are not shared by default

  • You request them explicitly:

resources:
  limits:
    nvidia.com/gpu: 1

What happens:

  • Scheduler places the pod on a node with a free GPU

  • GPU is exclusively assigned to that pod
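Putting the request and node placement together, a minimal Pod spec might look like this (the image name and node label are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  nodeSelector:
    accelerator: nvidia              # hypothetical label on GPU nodes
  containers:
  - name: model-server
    image: my-inference-image:latest # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1            # exclusive GPU assignment
```

A taint on the GPU nodes plus a matching toleration on GPU pods keeps CPU-only workloads off the expensive hardware.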


Advanced GPU Scheduling Concepts

  • Node labels & taints
    → ensure only GPU workloads land on GPU nodes

  • GPU pools
    → separate training vs inference clusters

  • Autoscaling
    → scale GPU nodes based on demand

  • Cost optimisation
    → avoid idle GPUs (very expensive!)

👉 Senior takeaway:

“GPUs must be carefully scheduled to balance performance, cost, and availability.”


3️⃣ 🧪 Real Cloud Instance Examples (Practical)

AWS

  • CPU: t3, m5

  • GPU: g5, p4d

  • Used for:

    • LLM inference

    • Model training

Azure

  • CPU: D-series

  • GPU: NC, ND, NV

  • Used for:

    • Enterprise AI

    • Regulated workloads

GCP

  • CPU: n2

  • GPU: A100, L4

  • TPU: v4, v5

  • Used for:

    • Large-scale training

    • Research & AI platforms

👉 In interviews:

“We typically separate CPU workloads from GPU workloads using dedicated node pools.”


4️⃣ 🧠 Connecting This to LLMs & RAG Systems

Typical LLM + RAG Architecture

User Request
    ↓
CPU (FastAPI / Gateway)
    ↓
GPU (Embedding Model)
    ↓
Vector DB (CPU)
    ↓
GPU (LLM Inference)
    ↓
CPU (Post-processing)

Where each processor fits

  • CPU

    • API layer

    • Prompt orchestration

    • Retrieval logic

  • GPU

    • Embeddings

    • LLM inference

  • TPU

    • Model training (if applicable)

  • NPU

    • Edge inference (future AI PCs)

👉 Key insight:

CPUs orchestrate the flow, GPUs do the intelligence.


5️⃣ 🔥 LinkedIn-Friendly Post (Copy–Paste Ready)

🚀 CPU, GPU, TPU, NPU — How Modern AI Platforms Really Work

In real-world AI and MLOps platforms, not all compute is equal.

🧠 CPUs orchestrate pipelines, APIs, and control logic.
GPUs power model training and high-throughput inference.
🧪 TPUs accelerate large-scale AI training.
📱 NPUs enable low-power, real-time AI on edge devices.

In Kubernetes, GPUs are scheduled explicitly and used only where needed—because idle GPUs are expensive.

In LLM and RAG systems, CPUs handle orchestration and retrieval, while GPUs perform embeddings and inference. Choosing the right compute at each stage is key to performance, cost optimisation, and scalability.

🔑 Takeaway: Successful MLOps isn’t just about models—it’s about matching the right workload to the right compute.

#Kubernetes #MLOps #LLM #RAG #CloudComputing #DevOps #AIInfrastructure

Automated Reconciliation

 

What “Automated Reconciliation” means (deep, but simple)

1) Two states exist all the time

  • Desired state (Source of Truth): what you declared in Git
    (Helm values, Kustomize overlays, Kubernetes YAMLs, etc.)

  • Actual state: what is really running in the Kubernetes cluster
    (Deployments/Services/Ingress/ConfigMaps as they currently exist)

A GitOps tool/controller (Argo CD / Flux) continuously compares these.


2) Continuous comparison (the control loop)

Think of a controller doing this loop forever:

  1. Fetch desired state from Git (and render it if needed: Helm/Kustomize).

  2. Read live state from the cluster (via Kubernetes API).

  3. Diff them (desired vs live).

  4. If different → mark as OutOfSync (drift detected or change pending).

This is exactly like a thermostat:

  • Desired temp = Git

  • Current temp = Cluster

  • Thermostat checks repeatedly and decides if action is needed
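The control loop above can be sketched in a few lines of Python (a toy model, not Argo CD's actual implementation):

```python
def reconcile(desired: dict, live: dict, self_heal: bool = True):
    """One iteration of a GitOps-style control loop (toy model)."""
    diff = {k: v for k, v in desired.items() if live.get(k) != v}
    if not diff:
        return "Synced", live          # nothing to do
    if not self_heal:
        return "OutOfSync", live       # drift detected, fix not applied
    return "Synced", {**live, **diff}  # apply Git's desired state

git_state = {"image": "app:v1.4.2", "replicas": 3}
cluster = {"image": "app:v1.4.2", "replicas": 10}  # manual kubectl scale drift
status, cluster = reconcile(git_state, cluster)
print(status, cluster["replicas"])  # drift corrected back to 3
```

The real controllers run this loop continuously, which is what turns drift detection into self-healing.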


3) Drift: how it happens (two common cases)

A) Intentional change in Git

  • You update Git: image 1.4.1 → 1.4.2

  • Cluster still running 1.4.1

  • Tool detects diff → OutOfSync

  • Then it applies and becomes Synced

B) Unwanted manual change (real drift)

  • Someone runs: kubectl scale deployment api --replicas=10

  • Git still says replicas=3

  • Tool detects drift → OutOfSync

  • Reconciliation will bring it back to 3 (if auto-sync/self-heal enabled)

So drift can be either:

  • “pending deployment” (Git ahead of cluster)

  • “configuration drift” (cluster changed without Git)


4) “Auto-fix drift” (Self-heal)

When auto-sync + self-heal is enabled:

  • The tool automatically applies Git’s desired state back onto the cluster.

  • That “fix” is usually done by kubectl apply-like patching (server-side apply / strategic merge / replace depending on tool settings).

So reconciliation is:

  • Detect drift

  • Correct drift

  • Repeat forever
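In Argo CD, auto-sync and self-heal are enabled on the Application's syncPolicy; the repo URL, paths, and names below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-repo.git  # placeholder repo
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```

With selfHeal enabled, a manual `kubectl scale` is reverted on the next reconciliation rather than merely flagged as OutOfSync.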

GitOps – Core Principles

 

GitOps – Core Principles (Important Points)

  • Git as Single Source of Truth

    • All infrastructure and application definitions live in Git

    • Git reflects the desired state

  • Declarative Configuration

    • Define what the system should look like, not how to do it

    • Typically Kubernetes YAML, Helm, Kustomize

  • Automated Reconciliation

    • Continuous comparison of desired state (Git) vs actual state (cluster)

    • Auto-fix drift if they differ

  • Pull-Based Model

    • Cluster pulls changes from Git

    • No direct kubectl apply to production

  • Change via Pull Requests

    • Review, approve, audit before deployment

    • Safer and more controlled

  • Versioning & Rollback

    • Git history = audit log

    • Easy rollback to last known good commit

  • Security & Access Control

    • No human access to production clusters

    • Git permissions define who can change what

  • Collaboration

    • Developers, Ops, SREs collaborate via Git workflows


Imperative vs Declarative Approach (Argo CD Context)

Imperative Approach

  • You tell the system step by step

  • Example:

    • kubectl create

    • kubectl scale

    • kubectl delete

  • Manual and command-driven

  • State lives in the cluster, not Git

  • Harder to track, audit, and rollback

Declarative Approach (GitOps Way)

  • You define the final desired state

  • Example:

    • replicas: 3

    • image: app:v1.4.2

  • Argo CD figures out how to reach that state

  • Git is the source of truth

  • Easy rollback, audit, automation

Argo CD is fully declarative by design


Reconciliation – Meaning (Very Important Concept)

  • Reconciliation = Desired State vs Actual State

  • Argo CD continuously:

    1. Reads desired state from Git

    2. Reads actual state from Kubernetes

    3. Compares both

    4. Fixes differences automatically or flags them

Why it matters

  • Prevents configuration drift

  • Recovers from manual or accidental changes

  • Keeps environments consistent

  • Enables self-healing systems


GitOps Feature Set (Exam / Interview Points)

  • Git as single source of truth

  • Automated reconciliation

  • Pull-based deployments

  • Declarative infrastructure

  • Version-controlled changes

  • Easy rollback

  • Drift detection

  • Strong audit trail

  • Separation of CI and CD

  • Self-service deployments

  • Scalable for large environments

  • Works across multi-cloud and on-prem

  • Enables continuous delivery


Automated Tools Used in GitOps (Key Ones)

  • Argo CD

    • Kubernetes-native GitOps CD

    • Continuous reconciliation

    • Drift detection and auto-sync

  • Flux

    • GitOps operator for Kubernetes

    • Pulls changes from Git automatically

  • Jenkins X

    • GitOps-driven CI/CD for Kubernetes

    • Environment promotion via Git

  • Terraform (with GitOps)

    • Infrastructure as Code

    • Git stores desired infra state

    • Often paired with GitOps workflows

Tuesday, February 23, 2021

Jenkins Pipeline

 




What is Jenkins pipeline and types of Jenkins pipeline?

A pipeline is a set of interlinked jobs executed in a defined order. Jenkins Pipeline provides a suite of plugins for integrating and implementing continuous delivery (CI/CD) pipelines, with the instructions to be performed expressed as code. Jenkins supports two syntaxes for defining a pipeline:

  • Scripted pipeline

  • Declarative pipeline

The only difference between a Scripted and a Declarative pipeline is the syntactic approach.

A Declarative pipeline always starts with the pipeline block, while a Scripted pipeline starts with node. Declarative pipelines break work down into individual stages, each containing one or more steps. Scripted pipelines use Groovy code and the Jenkins pipeline DSL directly within stage elements, without the need for a steps block.
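As a minimal illustration of the Declarative syntax (the build command is a placeholder):

```groovy
// Declarative pipeline: starts with the pipeline block
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'make build'   // placeholder build command
            }
        }
    }
}
```

The Scripted equivalent wraps the same sh call in node { stage('Build') { ... } } using Groovy directly, with no steps block.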




 

  RPO — Recovery Point Objective “How much data can we afford to lose?” Simple definition RPO defines the maximum acceptable data loss , ...