Monday, February 9, 2026

RPO and RTO

 

RPO — Recovery Point Objective

“How much data can we afford to lose?”

Simple definition

RPO defines the maximum acceptable data loss, measured in time.

If a system fails, RPO answers:
“Up to what point in time must data be recovered?”


Example

  • RPO = 15 minutes

  • Disaster happens at 10:00 AM

  • You must be able to restore data up to 9:45 AM

  • Data created between 9:45–10:00 can be lost

📌 Smaller RPO = more frequent backups / replication
📌 Lower RPO = higher cost & complexity
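To make the arithmetic concrete, here is a small sketch (the backup schedule and times are illustrative) that finds the last restorable point for a given backup interval:

```python
from datetime import datetime, timedelta

def last_backup_before(disaster: datetime, interval: timedelta,
                       schedule_start: datetime) -> datetime:
    """Last completed backup strictly before the disaster.

    Backups run at schedule_start, schedule_start + interval, ...
    A backup starting exactly at the disaster instant is assumed lost.
    """
    n = (disaster - schedule_start) // interval
    last = schedule_start + n * interval
    if last >= disaster:
        last -= interval
    return last

# RPO = 15 minutes, disaster at 10:00 AM -> restorable up to 9:45 AM
disaster = datetime(2026, 2, 9, 10, 0)
point = last_backup_before(disaster, timedelta(minutes=15),
                           datetime(2026, 2, 9, 0, 0))
print(point.strftime("%H:%M"))  # 09:45
```

The worst-case data loss equals the backup interval, which is why a smaller RPO forces more frequent backups or continuous replication.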


Real-world RPO examples

  • Banking / Trading: seconds to minutes

  • E-commerce orders: < 5 minutes

  • Internal tools: hours

  • Logs / Analytics: 24 hours

How RPO is achieved

  • Backup frequency (hourly, daily)

  • Database replication

  • Snapshot schedules

  • Cross-region replication


RTO — Recovery Time Objective

“How fast must the system be back online?”

Simple definition

RTO defines the maximum acceptable downtime after a failure.

If a system goes down, RTO answers:
“How quickly must service be restored?”


Example

  • RTO = 30 minutes

  • System crashes at 10:00 AM

  • Service must be fully operational by 10:30 AM

📌 Lower RTO = faster recovery systems
📌 Lower RTO = higher cost


Real-world RTO examples

  • Payments: minutes

  • Customer-facing apps: < 1 hour

  • Internal dashboards: several hours

  • Batch processing: 1 day

How RTO is achieved

  • Standby infrastructure (warm / hot)

  • Automated failover

  • Load balancers + health checks

  • Infrastructure as Code (fast rebuild)

  • Pre-tested runbooks
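A quick way to sanity-check an RTO target is to add up the recovery steps in the runbook; the step names and timings below are hypothetical:

```python
def meets_rto(step_minutes: dict, rto_minutes: int) -> bool:
    """True if the summed recovery steps fit within the RTO."""
    return sum(step_minutes.values()) <= rto_minutes

# Hypothetical runbook timings for a warm-standby failover
runbook = {
    "detect and declare incident": 5,
    "promote standby": 10,
    "switch DNS / load balancer": 10,
    "smoke tests": 5,
}
print(meets_rto(runbook, 30))  # 30 minutes of steps fit a 30-minute RTO
```

If the sum exceeds the RTO, you either automate slow steps or move to a warmer DR strategy.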


RPO vs RTO (Easy Comparison)

  • Focus: RPO = data loss; RTO = downtime

  • Measured in: time (both)

  • Question: RPO asks "How much data can we lose?"; RTO asks "How fast must we recover?"

  • Controlled by: RPO via backup & replication; RTO via failover & automation

One-liner to remember (🔥 interview gold)

RPO is about data loss, RTO is about downtime.

Or even shorter:

RPO = how far back,
RTO = how fast forward.


Mapping to DR strategies (very important)

  • Backup & Restore: high RPO, high RTO

  • Pilot Light: medium RPO, medium RTO

  • Warm Standby: low RPO, low RTO

  • Active-Active: near-zero RPO, near-zero RTO


Quick Memory Trick 🔥

  • RPO → Point in time (data loss)

  • RTO → Time to recover (downtime)

Develop and maintain Infrastructure as Code using Terraform, Ansible, and cloud-native IaC tools

 

1) “Develop and maintain Infrastructure as Code using Terraform, Ansible, and cloud-native IaC tools.”

What it means (in simple words)

Instead of clicking in cloud portals, you define infrastructure in code and deploy it consistently:

  • Terraform → provisions cloud resources (VPC/VNet, subnets, EKS/AKS, IAM, RDS, etc.)

  • Ansible → configures servers/apps after provisioning (packages, config files, users, services)

  • Cloud-native IaC → AWS CloudFormation, Azure Bicep/ARM, GCP Deployment Manager (or Pulumi)

Why it’s helpful in development

  • Repeatable environments: Dev/Test/Prod become identical (less “works on my machine”)

  • Faster onboarding: New env in minutes

  • Safer changes: PR review + plan output

  • Audit trail: Git history shows who changed infra and why

  • Disaster recovery: You can recreate infra from code

How to implement (real-world approach)

A. Terraform structure (common enterprise pattern)

  • Create modules: network, eks/aks, rds, iam, monitoring

  • Separate environments using workspaces or folders:

    • environments/dev, environments/stage, environments/prod

  • Use remote state:

    • S3 + DynamoDB lock (AWS) / Azure Storage + locking / GCS (GCP)

B. Terraform workflow

  1. Dev opens PR with Terraform changes

  2. CI runs:

    • terraform fmt (format)

    • terraform validate

    • tflint (lint)

    • checkov/tfsec (security scans)

    • terraform plan and posts output to PR

  3. After approval:

    • terraform apply runs via pipeline (not a laptop)

C. Where Ansible fits
Use Ansible when you have:

  • VMs, legacy apps, OS-level configs

  • A need to install agents (Datadog, Splunk), configure Nginx, or apply patching and hardening

Implementation:

  • Terraform creates the VM + network + security groups

  • Ansible configures the VM (idempotent playbooks)

  • Use Ansible Vault/Secrets manager for credentials

Rule of thumb: Terraform creates infrastructure; Ansible configures software and OS.
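As a sketch of the Ansible side, a minimal idempotent playbook (the `web` host group is a hypothetical inventory group) that installs and starts Nginx on Debian/Ubuntu hosts:

```yaml
- hosts: web               # hypothetical inventory group
  become: true
  tasks:
    - name: Install Nginx
      ansible.builtin.apt:
        name: nginx
        state: present
    - name: Ensure Nginx is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```

Because both tasks are idempotent, re-running the playbook on an already-configured VM changes nothing.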


2) “Implement HA/DR strategies, load balancing, and hybrid networking solutions.”

What it means

You design systems that:

  • Stay up when components fail (HA: High Availability)

  • Recover from disasters (DR: Disaster Recovery)

  • Distribute traffic safely (load balancing)

  • Connect on-prem + cloud securely (hybrid networking)

Why it’s helpful in development

  • Your app won’t go down during deployments, node failures, AZ outages

  • You meet enterprise requirements (uptime/SLA)

  • You avoid “big bang outages” by designing for failure

How to implement HA (practical patterns)

Compute HA

  • Run services across multiple Availability Zones

  • Kubernetes:

    • replicas across AZs

    • PodDisruptionBudgets

    • readiness/liveness probes

    • autoscaling (HPA + Cluster Autoscaler)
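As one concrete piece of the compute-HA list above, a minimal HorizontalPodAutoscaler sketch (the Deployment name and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api            # hypothetical Deployment name
  minReplicas: 3         # keep baseline capacity across AZs
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Pair this with topology spread constraints or anti-affinity so the replicas actually land in different Availability Zones.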

Database HA

  • Managed DB Multi-AZ (RDS Multi-AZ / Azure SQL HA / Cloud SQL HA)

  • Read replicas for scaling reads

  • Backup + point-in-time recovery

Load Balancing

  • L7 LB (HTTP/HTTPS): Ingress controller / ALB / App Gateway

  • L4 LB (TCP): NLB / Azure Load Balancer

Implement:

  • TLS termination

  • health checks

  • weighted routing / blue-green / canary (advanced)

How to implement DR (real DR)

Define:

  • RPO (how much data loss is acceptable)

  • RTO (how fast must you recover)

Common DR strategies:

  • Backup & Restore (cheap, slower RTO)

  • Pilot Light (core running, scale on disaster)

  • Warm Standby (smaller version running)

  • Active-Active (best RTO, complex & expensive)

Implementation checklist:

  • Automated backups + tested restores

  • Infrastructure reproducible with IaC

  • Replicated data (cross-region where required)

  • Runbooks + game days (DR drills)

Hybrid networking (on-prem + cloud)

Typical solutions:

  • Site-to-Site VPN (quick, cheaper)

  • Dedicated private link: AWS Direct Connect / Azure ExpressRoute (more stable)

  • DNS strategy: split-horizon DNS, private endpoints

  • Routing: BGP, route tables, NAT, firewall rules

  • Security: least privilege, segmentation, inspection (firewalls), zero trust


3) “Integrate monitoring, SLOs, and quality gates into CI/CD pipelines for resilient delivery.”

What it means

You don’t just deploy fast — you deploy safely:

  • Monitoring tells you if systems are healthy

  • SLOs define reliability targets

  • Quality gates block bad changes before production

Why it’s helpful

  • Stops broken builds and insecure deployments early

  • Prevents outages by using measurable “go/no-go” checks

  • Makes release quality consistent across teams

How to implement Monitoring (standard stack)

  • Metrics: Prometheus / CloudWatch / Azure Monitor

  • Logs: ELK / Loki / Splunk

  • Traces: OpenTelemetry + Jaeger/Tempo

  • Alerts: PagerDuty / Opsgenie / Teams/Slack

Minimum monitoring baseline:

  • Golden signals: latency, traffic, errors, saturation

  • Dashboards per service + per dependency

  • Alerting tied to customer impact (not noisy)

What are SLOs (in practical terms)

  • SLA = promise to customer (contract)

  • SLO = internal target (engineering goal)

Example:

  • SLO: 99.9% successful requests per 30 days

  • Error budget = allowed failure (0.1%) → guides release speed
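The error-budget arithmetic can be sketched as follows (the traffic volume is illustrative):

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Allowed failed requests over a window for a given SLO."""
    return round(total_requests * (1 - slo))

# 99.9% SLO over 30 days at ~100 requests/second
total = 100 * 60 * 60 * 24 * 30    # 259,200,000 requests
print(error_budget(0.999, total))  # 259,200 failures allowed
```

Once the budget for the window is spent, releases slow down or stop until reliability recovers.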

How to integrate “quality gates” in CI/CD

Think of gates as checkpoints:

CI gates (before merge)

  • Unit tests + coverage threshold

  • Linting + formatting

  • SAST (code security)

  • Dependency scan (SCA)

  • IaC scan (tfsec/checkov)

  • Container scan (Trivy)

  • Policy-as-code (OPA/Conftest)

CD gates (before production)

  • Manual approval for prod (optional)

  • Smoke tests on staging

  • Deployment strategy checks:

    • canary/blue-green

    • automatic rollback if health degrades

  • SLO-based gate:

    • “If error rate > X% or latency > Yms → block / rollback”
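The SLO-based gate above can be sketched as a simple check (the thresholds are illustrative, not standard values):

```python
def slo_gate(error_rate: float, p99_latency_ms: float,
             max_error_rate: float = 0.01,
             max_latency_ms: float = 500.0) -> str:
    """Block / roll back the release when canary metrics breach targets."""
    if error_rate > max_error_rate or p99_latency_ms > max_latency_ms:
        return "rollback"
    return "promote"

print(slo_gate(0.002, 320))  # healthy canary -> promote
print(slo_gate(0.050, 320))  # error rate breached -> rollback
```

In practice tools like Argo Rollouts or Flagger evaluate exactly this kind of rule against live metrics during a canary.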

Post-deploy gates (progressive delivery)

  • Argo Rollouts / Flagger

  • Metrics analysis during rollout

  • Automated rollback on bad metrics

✅ The “resilient delivery” part comes from progressive rollouts + SLO-based automated rollback.

CPU / GPU / TPU / NPU in Kubernetes & MLOps

 




1️⃣ CPU / GPU / TPU / NPU in Kubernetes & MLOps

🧠 CPU — Control Plane & Orchestration

In Kubernetes:

  • Runs:

    • API servers

    • Controllers

    • CI/CD tools

    • FastAPI / backend services

  • Handles:

    • Request routing

    • Feature engineering

    • Pre/post-processing

📌 In MLOps

  • Data ingestion

  • Model orchestration

  • Pipeline coordination (Airflow, Kubeflow)

  • Lightweight inference

👉 CPUs coordinate, not accelerate.


🟢 GPU — Training & High-Throughput Inference

In Kubernetes:

  • GPUs are exposed as extended resources

  • Pods explicitly request them

  • Managed via NVIDIA device plugin

📌 In MLOps

  • Model training

  • Batch inference

  • LLM serving

  • Embedding generation

👉 GPUs do heavy parallel math.


🟠 TPU — Large-Scale AI Training

In Kubernetes:

  • Mostly used in Google Cloud

  • Integrated via specialised runtimes

📌 In MLOps

  • Training massive neural networks

  • Research-scale workloads

👉 TPUs are specialised accelerators, not general compute.


🟣 NPU — Edge & Low-Latency AI

In Kubernetes:

  • Rare in standard clusters

  • Used in:

    • Edge Kubernetes

    • AI PCs

    • Embedded systems

📌 In MLOps

  • On-device inference

  • Privacy-preserving AI

  • Real-time decisions

👉 NPUs run AI close to the user.


2️⃣ 🎯 GPU Scheduling in Kubernetes (Interview Gold)

How Kubernetes schedules GPUs

  • GPUs are not shared by default

  • You request them explicitly:

resources:
  limits:
    nvidia.com/gpu: 1

What happens:

  • Scheduler places the pod on a node with a free GPU

  • GPU is exclusively assigned to that pod
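Putting the request and node placement together, a minimal Pod spec might look like this (the image name and node label are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  nodeSelector:
    accelerator: nvidia              # hypothetical label on GPU nodes
  containers:
  - name: model-server
    image: my-inference-image:latest # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1            # exclusive GPU assignment
```

A taint on the GPU nodes plus a matching toleration on GPU pods keeps CPU-only workloads off the expensive hardware.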


Advanced GPU Scheduling Concepts

  • Node labels & taints
    → ensure only GPU workloads land on GPU nodes

  • GPU pools
    → separate training vs inference clusters

  • Autoscaling
    → scale GPU nodes based on demand

  • Cost optimisation
    → avoid idle GPUs (very expensive!)

👉 Senior takeaway:

“GPUs must be carefully scheduled to balance performance, cost, and availability.”


3️⃣ 🧪 Real Cloud Instance Examples (Practical)

AWS

  • CPU: t3, m5

  • GPU: g5, p4d

  • Used for:

    • LLM inference

    • Model training

Azure

  • CPU: D-series

  • GPU: NC, ND, NV

  • Used for:

    • Enterprise AI

    • Regulated workloads

GCP

  • CPU: n2

  • GPU: A100, L4

  • TPU: v4, v5

  • Used for:

    • Large-scale training

    • Research & AI platforms

👉 In interviews:

“We typically separate CPU workloads from GPU workloads using dedicated node pools.”


4️⃣ 🧠 Connecting This to LLMs & RAG Systems

Typical LLM + RAG Architecture

User Request
    ↓
CPU (FastAPI / Gateway)
    ↓
GPU (Embedding Model)
    ↓
Vector DB (CPU)
    ↓
GPU (LLM Inference)
    ↓
CPU (Post-processing)

Where each processor fits

  • CPU

    • API layer

    • Prompt orchestration

    • Retrieval logic

  • GPU

    • Embeddings

    • LLM inference

  • TPU

    • Model training (if applicable)

  • NPU

    • Edge inference (future AI PCs)

👉 Key insight:

CPUs orchestrate the flow, GPUs do the intelligence.


5️⃣ 🔥 LinkedIn-Friendly Post (Copy–Paste Ready)

🚀 CPU, GPU, TPU, NPU — How Modern AI Platforms Really Work

In real-world AI and MLOps platforms, not all compute is equal.

🧠 CPUs orchestrate pipelines, APIs, and control logic.
GPUs power model training and high-throughput inference.
🧪 TPUs accelerate large-scale AI training.
📱 NPUs enable low-power, real-time AI on edge devices.

In Kubernetes, GPUs are scheduled explicitly and used only where needed—because idle GPUs are expensive.

In LLM and RAG systems, CPUs handle orchestration and retrieval, while GPUs perform embeddings and inference. Choosing the right compute at each stage is key to performance, cost optimisation, and scalability.

🔑 Takeaway: Successful MLOps isn’t just about models—it’s about matching the right workload to the right compute.

#Kubernetes #MLOps #LLM #RAG #CloudComputing #DevOps #AIInfrastructure

Automated Reconciliation

 

What “Automated Reconciliation” means (deep, but simple)

1) Two states exist all the time

  • Desired state (Source of Truth): what you declared in Git
    (Helm values, Kustomize overlays, Kubernetes YAMLs, etc.)

  • Actual state: what is really running in the Kubernetes cluster
    (Deployments/Services/Ingress/ConfigMaps as they currently exist)

A GitOps tool/controller (Argo CD / Flux) continuously compares these.


2) Continuous comparison (the control loop)

Think of a controller doing this loop forever:

  1. Fetch desired state from Git (and render it if needed: Helm/Kustomize).

  2. Read live state from the cluster (via Kubernetes API).

  3. Diff them (desired vs live).

  4. If different → mark as OutOfSync (drift detected or change pending).

This is exactly like a thermostat:

  • Desired temp = Git

  • Current temp = Cluster

  • Thermostat checks repeatedly and decides if action is needed
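The control loop above can be sketched in a few lines of Python (a toy model, not Argo CD's actual implementation):

```python
def reconcile(desired: dict, live: dict, self_heal: bool = True):
    """One iteration of a GitOps-style control loop (toy model)."""
    diff = {k: v for k, v in desired.items() if live.get(k) != v}
    if not diff:
        return "Synced", live          # nothing to do
    if not self_heal:
        return "OutOfSync", live       # drift detected, fix not applied
    return "Synced", {**live, **diff}  # apply Git's desired state

git_state = {"image": "app:v1.4.2", "replicas": 3}
cluster = {"image": "app:v1.4.2", "replicas": 10}  # manual kubectl scale drift
status, cluster = reconcile(git_state, cluster)
print(status, cluster["replicas"])  # drift corrected back to 3
```

The real controllers run this loop continuously, which is what turns drift detection into self-healing.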


3) Drift: how it happens (two common cases)

A) Intentional change in Git

  • You update Git: image 1.4.1 → 1.4.2

  • Cluster still running 1.4.1

  • Tool detects diff → OutOfSync

  • Then it applies and becomes Synced

B) Unwanted manual change (real drift)

  • Someone runs: kubectl scale deployment api --replicas=10

  • Git still says replicas=3

  • Tool detects drift → OutOfSync

  • Reconciliation will bring it back to 3 (if auto-sync/self-heal enabled)

So drift can be either:

  • “pending deployment” (Git ahead of cluster)

  • “configuration drift” (cluster changed without Git)


4) “Auto-fix drift” (Self-heal)

When auto-sync + self-heal is enabled:

  • The tool automatically applies Git’s desired state back onto the cluster.

  • That “fix” is usually done by kubectl apply-like patching (server-side apply / strategic merge / replace depending on tool settings).

So reconciliation is:

  • Detect drift

  • Correct drift

  • Repeat forever
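In Argo CD, auto-sync and self-heal are enabled on the Application's syncPolicy; the repo URL, paths, and names below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-repo.git  # placeholder repo
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```

With selfHeal enabled, a manual `kubectl scale` is reverted on the next reconciliation rather than merely flagged as OutOfSync.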

GitOps – Core Principles

 

GitOps – Core Principles (Important Points)

  • Git as Single Source of Truth

    • All infrastructure and application definitions live in Git

    • Git reflects the desired state

  • Declarative Configuration

    • Define what the system should look like, not how to do it

    • Typically Kubernetes YAML, Helm, Kustomize

  • Automated Reconciliation

    • Continuous comparison of desired state (Git) vs actual state (cluster)

    • Auto-fix drift if they differ

  • Pull-Based Model

    • Cluster pulls changes from Git

    • No direct kubectl apply to production

  • Change via Pull Requests

    • Review, approve, audit before deployment

    • Safer and more controlled

  • Versioning & Rollback

    • Git history = audit log

    • Easy rollback to last known good commit

  • Security & Access Control

    • No human access to production clusters

    • Git permissions define who can change what

  • Collaboration

    • Developers, Ops, SREs collaborate via Git workflows


Imperative vs Declarative Approach (Argo CD Context)

Imperative Approach

  • You tell the system step by step

  • Example:

    • kubectl create

    • kubectl scale

    • kubectl delete

  • Manual and command-driven

  • State lives in the cluster, not Git

  • Harder to track, audit, and rollback

Declarative Approach (GitOps Way)

  • You define the final desired state

  • Example:

    • replicas: 3

    • image: app:v1.4.2

  • Argo CD figures out how to reach that state

  • Git is the source of truth

  • Easy rollback, audit, automation

Argo CD is fully declarative by design


Reconciliation – Meaning (Very Important Concept)

  • Reconciliation = Desired State vs Actual State

  • Argo CD continuously:

    1. Reads desired state from Git

    2. Reads actual state from Kubernetes

    3. Compares both

    4. Fixes differences automatically or flags them

Why it matters

  • Prevents configuration drift

  • Recovers from manual or accidental changes

  • Keeps environments consistent

  • Enables self-healing systems


GitOps Feature Set (Exam / Interview Points)

  • Git as single source of truth

  • Automated reconciliation

  • Pull-based deployments

  • Declarative infrastructure

  • Version-controlled changes

  • Easy rollback

  • Drift detection

  • Strong audit trail

  • Separation of CI and CD

  • Self-service deployments

  • Scalable for large environments

  • Works across multi-cloud and on-prem

  • Enables continuous delivery


Automated Tools Used in GitOps (Key Ones)

  • Argo CD

    • Kubernetes-native GitOps CD

    • Continuous reconciliation

    • Drift detection and auto-sync

  • Flux

    • GitOps operator for Kubernetes

    • Pulls changes from Git automatically

  • Jenkins X

    • GitOps-driven CI/CD for Kubernetes

    • Environment promotion via Git

  • Terraform (with GitOps)

    • Infrastructure as Code

    • Git stores desired infra state

    • Often paired with GitOps workflows

Tuesday, February 23, 2021

Jenkins Pipeline

 




What is Jenkins pipeline and types of Jenkins pipeline?

A pipeline is a set of interlinked jobs executed in a defined order. Jenkins Pipeline provides a suite of plugins for integrating and implementing continuous delivery (CI/CD) pipelines, with the instructions to be performed expressed as code. Jenkins supports two syntaxes for defining a pipeline:

  • Scripted pipeline

  • Declarative pipeline

The only difference between a Scripted and a Declarative pipeline is the syntactic approach.

A Declarative pipeline always starts with the pipeline block, while a Scripted pipeline starts with node. Declarative pipelines break work down into individual stages, each containing one or more steps. Scripted pipelines use Groovy code and the Jenkins pipeline DSL directly within stage elements, without the need for a steps block.
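As a minimal illustration of the Declarative syntax (the build command is a placeholder):

```groovy
// Declarative pipeline: starts with the pipeline block
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'make build'   // placeholder build command
            }
        }
    }
}
```

The Scripted equivalent wraps the same sh call in node { stage('Build') { ... } } using Groovy directly, with no steps block.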




 

  RPO — Recovery Point Objective “How much data can we afford to lose?” Simple definition RPO defines the maximum acceptable data loss , ...