Monday, February 9, 2026

Develop and maintain Infrastructure as Code using Terraform, Ansible, and cloud-native IaC tools

 

1) “Develop and maintain Infrastructure as Code using Terraform, Ansible, and cloud-native IaC tools.”

What it means (in simple words)

Instead of clicking in cloud portals, you define infrastructure in code and deploy it consistently:

  • Terraform → provisions cloud resources (VPC/VNet, subnets, EKS/AKS, IAM, RDS, etc.)

  • Ansible → configures servers/apps after provisioning (packages, config files, users, services)

  • Cloud-native IaC → AWS CloudFormation, Azure Bicep/ARM, GCP Deployment Manager (or Pulumi)

Why it’s helpful in development

  • Repeatable environments: Dev/Test/Prod become identical (less “works on my machine”)

  • Faster onboarding: New env in minutes

  • Safer changes: PR review + plan output

  • Audit trail: Git history shows who changed infra and why

  • Disaster recovery: You can recreate infra from code

How to implement (real-world approach)

A. Terraform structure (common enterprise pattern)

  • Create modules: network, eks/aks, rds, iam, monitoring

  • Separate environments using workspaces or folders:

    • environments/dev, environments/stage, environments/prod

  • Use remote state:

    • S3 + DynamoDB lock (AWS) / Azure Storage + locking / GCS (GCP)
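A minimal sketch of that layout, assuming AWS — the bucket, lock table, region, and module names are placeholders, not real resources:

```hcl
# environments/dev/main.tf — illustrative structure only
terraform {
  backend "s3" {
    bucket         = "example-tf-state"        # remote state bucket (placeholder)
    key            = "dev/terraform.tfstate"   # one key per environment
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"        # state locking
    encrypt        = true
  }
}

module "network" {
  source      = "../../modules/network"        # shared module, reused per env
  environment = "dev"
}
```

Each environment folder gets its own backend key, so dev, stage, and prod keep separate state while reusing the same modules.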

B. Terraform workflow

  1. Dev opens PR with Terraform changes

  2. CI runs:

    • terraform fmt (format)

    • terraform validate

    • tflint (lint)

    • checkov/tfsec (security scans)

    • terraform plan, with the plan output posted to the PR

  3. After approval:

    • terraform apply runs via pipeline (not a laptop)
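The CI steps above could look roughly like this as a GitHub Actions workflow — a hedged sketch; runner setup, cloud credentials, scan steps, and paths will differ per project:

```yaml
# .github/workflows/terraform.yml — sketch; cloud auth for plan is assumed
name: terraform
on: [pull_request]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive   # fail on unformatted code
      - run: terraform init                    # needs backend credentials
      - run: terraform validate
      - run: terraform plan -out=tfplan        # post this output to the PR
```

Apply then runs in a separate, approval-gated job against the saved plan, so no one applies from a laptop.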

C. Where Ansible fits
Use Ansible when you have:

  • VMs, legacy apps, or OS-level configs

  • A need to install agents (Datadog, Splunk), configure Nginx, or handle patching and hardening

Implementation:

  • Terraform creates the VM + network + security groups

  • Ansible configures the VM (idempotent playbooks)

  • Use Ansible Vault/Secrets manager for credentials

Rule of thumb: Terraform creates infrastructure; Ansible configures software and OS.
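A minimal idempotent playbook sketch for the "Ansible configures the VM" step — the host group and package are examples:

```yaml
# site.yml — re-running this changes nothing if the state already matches
- hosts: webservers
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```

Credentials used by tasks like this should come from Ansible Vault or a secrets manager, never from the playbook itself.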


2) “Implement HA/DR strategies, load balancing, and hybrid networking solutions.”

What it means

You design systems that:

  • Stay up when components fail (HA: High Availability)

  • Recover from disasters (DR: Disaster Recovery)

  • Distribute traffic safely (load balancing)

  • Connect on-prem + cloud securely (hybrid networking)

Why it’s helpful in development

  • Your app won’t go down during deployments, node failures, or AZ outages

  • You meet enterprise requirements (uptime/SLA)

  • You avoid “big bang outages” by designing for failure

How to implement HA (practical patterns)

Compute HA

  • Run services across multiple Availability Zones

  • Kubernetes:

    • replicas across AZs

    • PodDisruptionBudgets

    • readiness/liveness probes

    • autoscaling (HPA + Cluster Autoscaler)
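The Kubernetes points above can be sketched as a Deployment spread across zones plus a PodDisruptionBudget — names, image, and probe paths are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      topologySpreadConstraints:          # spread replicas across AZs
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: { app: web }
      containers:
        - name: web
          image: example/web:1.0          # placeholder image
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2                         # voluntary disruptions keep 2 pods up
  selector:
    matchLabels: { app: web }
```

The PDB is what keeps node drains and cluster upgrades from taking down too many replicas at once.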

Database HA

  • Managed DB Multi-AZ (RDS Multi-AZ / Azure SQL HA / Cloud SQL HA)

  • Read replicas for scaling reads

  • Backup + point-in-time recovery
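On AWS, for example, a Multi-AZ RDS instance with backups might look like this — identifiers and sizes are placeholders, and argument support varies by provider version:

```hcl
resource "aws_db_instance" "app" {
  identifier                  = "app-db"          # placeholder name
  engine                      = "postgres"
  instance_class              = "db.t3.medium"
  allocated_storage           = 50
  multi_az                    = true              # standby in a second AZ, automatic failover
  backup_retention_period     = 7                 # enables point-in-time recovery
  username                    = "app"
  manage_master_user_password = true              # credentials in Secrets Manager, not code
  skip_final_snapshot         = true              # for non-prod sketches only
}
```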

Load Balancing

  • L7 LB (HTTP/HTTPS): Ingress controller / ALB / App Gateway

  • L4 LB (TCP): NLB / Azure Load Balancer

Implement:

  • TLS termination

  • health checks

  • weighted routing / blue-green / canary (advanced)
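For L7 load balancing on Kubernetes, a TLS-terminating Ingress sketch — hostname, secret, and service names are examples:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  tls:
    - hosts: [app.example.com]
      secretName: web-tls            # TLS terminated at the ingress
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web            # backend service; its pods have health probes
                port: { number: 80 }
```

Health checks come from the backing pods' readiness probes; weighted/canary routing is layered on top via the ingress controller or a rollout tool.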

How to implement DR (real DR)

Define:

  • RPO (how much data loss is acceptable)

  • RTO (how fast must you recover)

Common DR strategies:

  • Backup & Restore (cheap, slower RTO)

  • Pilot Light (core running, scale on disaster)

  • Warm Standby (smaller version running)

  • Active-Active (best RTO, complex & expensive)

Implementation checklist:

  • Automated backups + tested restores

  • Infrastructure reproducible with IaC

  • Replicated data (cross-region where required)

  • Runbooks + game days (DR drills)

Hybrid networking (on-prem + cloud)

Typical solutions:

  • Site-to-Site VPN (quick, cheaper)

  • Dedicated private link: AWS Direct Connect / Azure ExpressRoute (more stable)

  • DNS strategy: split-horizon DNS, private endpoints

  • Routing: BGP, route tables, NAT, firewall rules

  • Security: least privilege, segmentation, inspection (firewalls), zero trust


3) “Integrate monitoring, SLOs, and quality gates into CI/CD pipelines for resilient delivery.”

What it means

You don’t just deploy fast — you deploy safely:

  • Monitoring tells you if systems are healthy

  • SLOs define reliability targets

  • Quality gates block bad changes before production

Why it’s helpful

  • Stops broken builds and insecure deployments early

  • Prevents outages by using measurable “go/no-go” checks

  • Makes release quality consistent across teams

How to implement Monitoring (standard stack)

  • Metrics: Prometheus / CloudWatch / Azure Monitor

  • Logs: ELK / Loki / Splunk

  • Traces: OpenTelemetry + Jaeger/Tempo

  • Alerts: PagerDuty / Opsgenie / Teams/Slack

Minimum monitoring baseline:

  • Golden signals: latency, traffic, errors, saturation

  • Dashboards per service + per dependency

  • Alerting tied to customer impact (not noisy)
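A Prometheus alerting rule tied to customer impact might look like this — metric names, labels, and thresholds are assumptions to adapt:

```yaml
groups:
  - name: service-slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="web",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="web"}[5m])) > 0.01
        for: 10m                      # sustained, not a blip — keeps it non-noisy
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 10m (customer impact likely)"
```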

What are SLOs (in practical terms)

  • SLA = promise to customer (contract)

  • SLO = internal target (engineering goal)

Example:

  • SLO: 99.9% successful requests per 30 days

  • Error budget = allowed failure (0.1%) → guides release speed
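The error-budget arithmetic behind that example is simple enough to check directly:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed failure in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of error budget
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

Once that budget is spent, the team slows releases and prioritizes reliability work until it refills.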

How to integrate “quality gates” in CI/CD

Think of gates as checkpoints:

CI gates (before merge)

  • Unit tests + coverage threshold

  • Linting + formatting

  • SAST (code security)

  • Dependency scan (SCA)

  • IaC scan (tfsec/checkov)

  • Container scan (Trivy)

  • Policy-as-code (OPA/Conftest)

CD gates (before production)

  • Manual approval for prod (optional)

  • Smoke tests on staging

  • Deployment strategy checks:

    • canary/blue-green

    • automatic rollback if health degrades

  • SLO-based gate:

    • “If error rate > X% or latency > Yms → block / rollback”
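That SLO-based gate is just a go/no-go comparison a pipeline step could run against metrics pulled from monitoring — the thresholds here are illustrative defaults, not standards:

```python
def gate_passes(error_rate: float, p95_latency_ms: float,
                max_error_rate: float = 0.01,
                max_latency_ms: float = 500) -> bool:
    """Go/no-go check: both signals must be within budget to promote."""
    return error_rate <= max_error_rate and p95_latency_ms <= max_latency_ms

print(gate_passes(0.002, 320))  # → True  (promote)
print(gate_passes(0.05, 320))   # → False (block / roll back)
```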

Post-deploy gates (progressive delivery)

  • Argo Rollouts / Flagger

  • Metrics analysis during rollout

  • Automated rollback on bad metrics
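With Argo Rollouts, for example, the canary steps look roughly like this — pod template omitted for brevity, and the weights, pauses, and analysis wiring are examples:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10               # 10% of traffic to the new version
        - pause: { duration: 5m }     # metrics analyzed here; bad metrics roll back
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100              # full promotion only if every step stays healthy
```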

✅ The “resilient delivery” part comes from progressive rollouts + SLO-based automated rollback.

RPO and RTO

  • RPO — Recovery Point Objective: “How much data can we afford to lose?”
    Simple definition: RPO defines the maximum acceptable data loss, ...