Monday, February 9, 2026

CPU / GPU / TPU / NPU in Kubernetes & MLOps

 




1️⃣ CPU / GPU / TPU / NPU in Kubernetes & MLOps

🧠 CPU — Control Plane & Orchestration

In Kubernetes:

  • Runs:

    • API servers

    • Controllers

    • CI/CD tools

    • FastAPI / backend services

  • Handles:

    • Request routing

    • Feature engineering

    • Pre/post-processing

📌 In MLOps:

  • Data ingestion

  • Model orchestration

  • Pipeline coordination (Airflow, Kubeflow)

  • Lightweight inference

👉 CPUs coordinate, not accelerate.
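
To make the CPU role concrete, here is a minimal sketch of a CPU-only Deployment for an API / orchestration service. The name, image, replica count, and resource figures are illustrative assumptions, not values from this post.

# Hypothetical CPU-only deployment for an API / orchestration service.
# Names, image, and sizes are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fastapi-gateway
  template:
    metadata:
      labels:
        app: fastapi-gateway
    spec:
      containers:
        - name: api
          image: registry.example.com/fastapi-gateway:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"       # half a core requested
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          # No accelerator requested: these pods schedule on ordinary CPU nodes.

The point is simply that nothing here asks for an accelerator, so the scheduler is free to place these pods on any general-purpose CPU node.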


🟢 GPU — Training & High-Throughput Inference

In Kubernetes:

  • GPUs are exposed as extended resources

  • Pods explicitly request them

  • Managed via NVIDIA device plugin

📌 In MLOps:

  • Model training

  • Batch inference

  • LLM serving

  • Embedding generation

👉 GPUs do heavy parallel math.
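
Because the NVIDIA device plugin advertises GPUs as an extended resource, a GPU node's capacity and allocatable fields carry an nvidia.com/gpu entry. The snippet below is a trimmed, illustrative view of such a node object (node name and counts are made up, values simplified), not real cluster output.

# Trimmed, illustrative view of a GPU node object (read-only status, simplified values).
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1              # hypothetical node name
status:
  capacity:
    cpu: "16"
    memory: "64Gi"
    nvidia.com/gpu: "1"         # extended resource advertised by the NVIDIA device plugin
  allocatable:
    cpu: "15"
    memory: "60Gi"
    nvidia.com/gpu: "1"

A pod only gets one of these GPUs if it explicitly requests nvidia.com/gpu, which is exactly what the scheduling section below shows.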


🟠 TPU — Large-Scale AI Training

In Kubernetes:

  • Mostly used in Google Cloud

  • Integrated via specialised runtimes

📌 In MLOps:

  • Training massive neural networks

  • Research-scale workloads

👉 TPUs are specialised accelerators, not general compute.
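
For completeness, on GKE a TPU slice is requested in much the same way as a GPU: an extended resource (google.com/tpu) plus TPU-specific node selectors. The sketch below follows GKE's documented conventions as I understand them; the accelerator type, topology, and chip count are assumptions that should be checked against current Google Cloud docs.

# Hypothetical GKE pod requesting a TPU slice.
# Node selector values and the TPU chip count are assumptions; verify against GKE docs.
apiVersion: v1
kind: Pod
metadata:
  name: tpu-training-job
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x2
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          google.com/tpu: 4    # TPU chips requested for this container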


🟣 NPU — Edge & Low-Latency AI

In Kubernetes:

  • Rare in standard clusters

  • Used in:

    • Edge Kubernetes

    • AI PCs

    • Embedded systems

📌 In MLOps:

  • On-device inference

  • Privacy-preserving AI

  • Real-time decisions

👉 NPUs run AI close to the user.
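
There is no single standard NPU resource name in Kubernetes today. In edge distributions, an NPU would normally be surfaced by a vendor device plugin as an extended resource, so a request looks structurally identical to the GPU case. The resource name below is purely hypothetical.

# Purely hypothetical: an NPU exposed by a vendor device plugin on an edge node.
apiVersion: v1
kind: Pod
metadata:
  name: edge-inference
spec:
  containers:
    - name: detector
      image: registry.example.com/edge-detector:latest   # placeholder image
      resources:
        limits:
          vendor.example.com/npu: 1   # hypothetical extended resource name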


2️⃣ 🎯 GPU Scheduling in Kubernetes (Interview Gold)

How Kubernetes schedules GPUs

  • GPUs are not shared by default

  • You request them explicitly:

resources:
  limits:
    nvidia.com/gpu: 1

What happens:

  • Scheduler places the pod on a node with a free GPU

  • GPU is exclusively assigned to that pod
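
Putting that snippet into a complete (but purely illustrative) manifest, a minimal GPU pod sketch looks like this; the pod name and image are placeholders.

# Minimal illustrative pod requesting one GPU (name and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: server
      image: registry.example.com/gpu-server:latest
      resources:
        limits:
          nvidia.com/gpu: 1   # a whole GPU, exclusively assigned to this pod

Because nvidia.com/gpu is an extended resource, requests must equal limits, which is why the GPU count is normally written only under limits.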


Advanced GPU Scheduling Concepts

  • Node labels & taints
    → ensure only GPU workloads land on GPU nodes (see the sketch after this list)

  • GPU pools
    → separate training vs inference clusters

  • Autoscaling
    → scale GPU nodes based on demand

  • Cost optimisation
    → avoid idle GPUs (very expensive!)
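
As a concrete sketch of the "node labels & taints" idea above: the GPU node is tainted so ordinary pods stay off it, and GPU pods carry a matching toleration plus a node selector. The label and taint keys here are illustrative conventions, not required names.

# Illustrative GPU node: the label and taint keep non-GPU pods away.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-gpu          # hypothetical label
spec:
  taints:
    - key: nvidia.com/gpu
      value: "present"
      effect: NoSchedule
---
# Illustrative GPU pod: tolerates the taint and targets labelled nodes.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    accelerator: nvidia-gpu
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1

In practice the taint is usually applied with kubectl taint or by the cloud provider when the GPU node pool is created.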

👉 Senior takeaway:

“GPUs must be carefully scheduled to balance performance, cost, and availability.”


3️⃣ 🧪 Real Cloud Instance Examples (Practical)

AWS

  • CPU: t3, m5

  • GPU: g5, p4d

  • Used for:

    • LLM inference

    • Model training

Azure

  • CPU: D-series

  • GPU: NC, ND, NV

  • Used for:

    • Enterprise AI

    • Regulated workloads

GCP

  • CPU: n2

  • GPU: A100, L4

  • TPU: v4, v5

  • Used for:

    • Large-scale training

    • Research & AI platforms

👉 In interviews:

“We typically separate CPU workloads from GPU workloads using dedicated node pools.”
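
One hedged illustration of that quote: GPU workloads select the dedicated GPU pool by label, while CPU services carry no selector and land on the default pool. The pool name below is made up; on GKE each node is labelled cloud.google.com/gke-nodepool with its pool name, and the other clouds expose equivalent labels.

# Illustrative: pin a GPU workload to a dedicated GPU node pool.
# "gpu-pool" is a made-up pool name; check your provider's node pool labels.
apiVersion: v1
kind: Pod
metadata:
  name: batch-inference
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-pool
  containers:
    - name: worker
      image: registry.example.com/batch-inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1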


4️⃣ 🧠 Connecting This to LLMs & RAG Systems

Typical LLM + RAG Architecture

User Request
   ↓
CPU (FastAPI / Gateway)
   ↓
GPU (Embedding Model)
   ↓
Vector DB (CPU)
   ↓
GPU (LLM Inference)
   ↓
CPU (Post-processing)

Where each processor fits

  • CPU

    • API layer

    • Prompt orchestration

    • Retrieval logic

  • GPU

    • Embeddings

    • LLM inference

  • TPU

    • Model training (if applicable)

  • NPU

    • Edge inference (future AI PCs)

👉 Key insight:

CPUs orchestrate the flow, GPUs do the intelligence.
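
Sketched as manifests, that split could look roughly like the pair below: a CPU-only gateway pod for orchestration and retrieval, and a GPU-backed pod for embeddings and LLM inference. All names, images, and sizes are illustrative assumptions.

# Illustrative only: CPU orchestration pod vs GPU inference pod in a RAG stack.
apiVersion: v1
kind: Pod
metadata:
  name: rag-gateway            # FastAPI / retrieval logic, CPU only
spec:
  containers:
    - name: gateway
      image: registry.example.com/rag-gateway:latest   # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-server             # embeddings + LLM inference, GPU backed
spec:
  containers:
    - name: llm
      image: registry.example.com/llm-server:latest    # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1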


5️⃣ 🔥 LinkedIn-Friendly Post (Copy–Paste Ready)

🚀 CPU, GPU, TPU, NPU — How Modern AI Platforms Really Work

In real-world AI and MLOps platforms, not all compute is equal.

🧠 CPUs orchestrate pipelines, APIs, and control logic.
🟢 GPUs power model training and high-throughput inference.
🧪 TPUs accelerate large-scale AI training.
📱 NPUs enable low-power, real-time AI on edge devices.

In Kubernetes, GPUs are scheduled explicitly and used only where needed—because idle GPUs are expensive.

In LLM and RAG systems, CPUs handle orchestration and retrieval, while GPUs perform embeddings and inference. Choosing the right compute at each stage is key to performance, cost optimisation, and scalability.

🔑 Takeaway: Successful MLOps isn’t just about models—it’s about matching the right workload to the right compute.

#Kubernetes #MLOps #LLM #RAG #CloudComputing #DevOps #AIInfrastructure
