Monday, February 9, 2026

CPU / GPU / TPU / NPU in Kubernetes & MLOps

 




1️⃣ CPU / GPU / TPU / NPU in Kubernetes & MLOps

🧠 CPU — Control Plane & Orchestration

In Kubernetes:

  • Runs:

    • API servers

    • Controllers

    • CI/CD tools

    • FastAPI / backend services

  • Handles:

    • Request routing

    • Feature engineering

    • Pre/post-processing

📌 In MLOps:

  • Data ingestion

  • Model orchestration

  • Pipeline coordination (Airflow, Kubeflow)

  • Lightweight inference

👉 CPUs coordinate, not accelerate.
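
To make the CPU role concrete, here is a minimal sketch of a CPU-only Deployment for an API / orchestration service. The name, image, replica count, and resource figures are illustrative assumptions, not values from this post.

# Hypothetical CPU-only deployment for an API / orchestration service.
# Names, image, and sizes are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fastapi-gateway
  template:
    metadata:
      labels:
        app: fastapi-gateway
    spec:
      containers:
        - name: api
          image: registry.example.com/fastapi-gateway:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"       # half a core requested
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          # No accelerator requested: these pods schedule on ordinary CPU nodes.

The point is simply that nothing here asks for an accelerator, so the scheduler is free to place these pods on any general-purpose CPU node.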


🟢 GPU — Training & High-Throughput Inference

In Kubernetes:

  • GPUs are exposed as extended resources

  • Pods explicitly request them

  • Managed via NVIDIA device plugin

📌 In MLOps:

  • Model training

  • Batch inference

  • LLM serving

  • Embedding generation

👉 GPUs do heavy parallel math.
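
Because the NVIDIA device plugin advertises GPUs as an extended resource, a GPU node's capacity and allocatable fields carry an nvidia.com/gpu entry. The snippet below is a trimmed, illustrative view of such a node object (node name and counts are made up, values simplified), not real cluster output.

# Trimmed, illustrative view of a GPU node object (read-only status, simplified values).
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1              # hypothetical node name
status:
  capacity:
    cpu: "16"
    memory: "64Gi"
    nvidia.com/gpu: "1"         # extended resource advertised by the NVIDIA device plugin
  allocatable:
    cpu: "15"
    memory: "60Gi"
    nvidia.com/gpu: "1"

A pod only gets one of these GPUs if it explicitly requests nvidia.com/gpu, which is exactly what the scheduling section below shows.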


🟠 TPU — Large-Scale AI Training

In Kubernetes:

  • Mostly used in Google Cloud

  • Integrated via specialised runtimes

📌 In MLOps:

  • Training massive neural networks

  • Research-scale workloads

👉 TPUs are specialised accelerators, not general compute.
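
For completeness, on GKE a TPU slice is requested in much the same way as a GPU: an extended resource (google.com/tpu) plus TPU-specific node selectors. The sketch below follows GKE's documented conventions as I understand them; the accelerator type, topology, and chip count are assumptions that should be checked against current Google Cloud docs.

# Hypothetical GKE pod requesting a TPU slice.
# Node selector values and the TPU chip count are assumptions; verify against GKE docs.
apiVersion: v1
kind: Pod
metadata:
  name: tpu-training-job
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x2
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          google.com/tpu: 4    # TPU chips requested for this container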


🟣 NPU — Edge & Low-Latency AI

In Kubernetes:

  • Rare in standard clusters

  • Used in:

    • Edge Kubernetes

    • AI PCs

    • Embedded systems

📌 In MLOps:

  • On-device inference

  • Privacy-preserving AI

  • Real-time decisions

👉 NPUs run AI close to the user.
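
There is no single standard NPU resource name in Kubernetes today. In edge distributions, an NPU would normally be surfaced by a vendor device plugin as an extended resource, so a request looks structurally identical to the GPU case. The resource name below is purely hypothetical.

# Purely hypothetical: an NPU exposed by a vendor device plugin on an edge node.
apiVersion: v1
kind: Pod
metadata:
  name: edge-inference
spec:
  containers:
    - name: detector
      image: registry.example.com/edge-detector:latest   # placeholder image
      resources:
        limits:
          vendor.example.com/npu: 1   # hypothetical extended resource name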


2️⃣ 🎯 GPU Scheduling in Kubernetes (Interview Gold)

How Kubernetes schedules GPUs

  • GPUs are not shared by default

  • You request them explicitly:

resources:
  limits:
    nvidia.com/gpu: 1

What happens:

  • Scheduler places the pod on a node with a free GPU

  • GPU is exclusively assigned to that pod
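
Putting that snippet into a complete (but purely illustrative) manifest, a minimal GPU pod sketch looks like this; the pod name and image are placeholders.

# Minimal illustrative pod requesting one GPU (name and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: server
      image: registry.example.com/gpu-server:latest
      resources:
        limits:
          nvidia.com/gpu: 1   # a whole GPU, exclusively assigned to this pod

Because nvidia.com/gpu is an extended resource, requests must equal limits, which is why the GPU count is normally written only under limits.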


Advanced GPU Scheduling Concepts

  • Node labels & taints
    → ensure only GPU workloads land on GPU nodes (see the sketch after this list)

  • GPU pools
    → separate training vs inference clusters

  • Autoscaling
    → scale GPU nodes based on demand

  • Cost optimisation
    → avoid idle GPUs (very expensive!)
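
As a concrete sketch of the "node labels & taints" idea above: the GPU node is tainted so ordinary pods stay off it, and GPU pods carry a matching toleration plus a node selector. The label and taint keys here are illustrative conventions, not required names.

# Illustrative GPU node: the label and taint keep non-GPU pods away.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-gpu          # hypothetical label
spec:
  taints:
    - key: nvidia.com/gpu
      value: "present"
      effect: NoSchedule
---
# Illustrative GPU pod: tolerates the taint and targets labelled nodes.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    accelerator: nvidia-gpu
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1

In practice the taint is usually applied with kubectl taint or by the cloud provider when the GPU node pool is created.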

👉 Senior takeaway:

“GPUs must be carefully scheduled to balance performance, cost, and availability.”


3️⃣ 🧪 Real Cloud Instance Examples (Practical)

AWS

  • CPU: t3, m5

  • GPU: g5, p4d

  • Used for:

    • LLM inference

    • Model training

Azure

  • CPU: D-series

  • GPU: NC, ND, NV

  • Used for:

    • Enterprise AI

    • Regulated workloads

GCP

  • CPU: n2

  • GPU: A100, L4

  • TPU: v4, v5

  • Used for:

    • Large-scale training

    • Research & AI platforms

👉 In interviews:

“We typically separate CPU workloads from GPU workloads using dedicated node pools.”
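
One hedged illustration of that quote: GPU workloads select the dedicated GPU pool by label, while CPU services carry no selector and land on the default pool. The pool name below is made up; on GKE each node is labelled cloud.google.com/gke-nodepool with its pool name, and the other clouds expose equivalent labels.

# Illustrative: pin a GPU workload to a dedicated GPU node pool.
# "gpu-pool" is a made-up pool name; check your provider's node pool labels.
apiVersion: v1
kind: Pod
metadata:
  name: batch-inference
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-pool
  containers:
    - name: worker
      image: registry.example.com/batch-inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1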


4️⃣ 🧠 Connecting This to LLMs & RAG Systems

Typical LLM + RAG Architecture

User Request
   ↓
CPU (FastAPI / Gateway)
   ↓
GPU (Embedding Model)
   ↓
Vector DB (CPU)
   ↓
GPU (LLM Inference)
   ↓
CPU (Post-processing)

Where each processor fits

  • CPU

    • API layer

    • Prompt orchestration

    • Retrieval logic

  • GPU

    • Embeddings

    • LLM inference

  • TPU

    • Model training (if applicable)

  • NPU

    • Edge inference (future AI PCs)

👉 Key insight:

CPUs orchestrate the flow, GPUs do the intelligence.
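
Sketched as manifests, that split could look roughly like the pair below: a CPU-only gateway pod for orchestration and retrieval, and a GPU-backed pod for embeddings and LLM inference. All names, images, and sizes are illustrative assumptions.

# Illustrative only: CPU orchestration pod vs GPU inference pod in a RAG stack.
apiVersion: v1
kind: Pod
metadata:
  name: rag-gateway            # FastAPI / retrieval logic, CPU only
spec:
  containers:
    - name: gateway
      image: registry.example.com/rag-gateway:latest   # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-server             # embeddings + LLM inference, GPU backed
spec:
  containers:
    - name: llm
      image: registry.example.com/llm-server:latest    # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1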


5️⃣ 🔥 LinkedIn-Friendly Post (Copy–Paste Ready)

🚀 CPU, GPU, TPU, NPU — How Modern AI Platforms Really Work

In real-world AI and MLOps platforms, not all compute is equal.

🧠 CPUs orchestrate pipelines, APIs, and control logic.
🟢 GPUs power model training and high-throughput inference.
🧪 TPUs accelerate large-scale AI training.
📱 NPUs enable low-power, real-time AI on edge devices.

In Kubernetes, GPUs are scheduled explicitly and used only where needed—because idle GPUs are expensive.

In LLM and RAG systems, CPUs handle orchestration and retrieval, while GPUs perform embeddings and inference. Choosing the right compute at each stage is key to performance, cost optimisation, and scalability.

🔑 Takeaway: Successful MLOps isn’t just about models—it’s about matching the right workload to the right compute.

#Kubernetes #MLOps #LLM #RAG #CloudComputing #DevOps #AIInfrastructure
