1️⃣ CPU / GPU / TPU / NPU in Kubernetes & MLOps
🧠 CPU — Control Plane & Orchestration
In Kubernetes:
- Runs:
  - API servers
  - Controllers
  - CI/CD tools
  - FastAPI / backend services
- Handles:
  - Request routing
  - Feature engineering
  - Pre/post-processing
👉 In MLOps
- Data ingestion
- Model orchestration
- Pipeline coordination (Airflow, Kubeflow)
- Lightweight inference
👉 CPUs coordinate, not accelerate.
🟢 GPU — Training & High-Throughput Inference
In Kubernetes:
- GPUs are exposed as extended resources
- Pods explicitly request them
- Managed via the NVIDIA device plugin
👉 In MLOps
- Model training
- Batch inference
- LLM serving
- Embedding generation
👉 GPUs do heavy parallel math.
🟠 TPU — Large-Scale AI Training
In Kubernetes:
- Mostly used in Google Cloud
- Integrated via specialised runtimes (see the manifest sketch below)
👉 In MLOps
- Training massive neural networks
- Research-scale workloads
👉 TPUs are specialised accelerators, not general compute.
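For illustration, a hedged sketch of how a TPU request can look on GKE, assuming a TPU node pool that exposes the google.com/tpu extended resource; the node-selector value, image, and chip count are placeholders and vary by TPU generation.

```yaml
# Illustrative sketch only: assumes a GKE TPU node pool exposing the
# google.com/tpu extended resource. The selector value and chip count
# vary by TPU generation; the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: tpu-training
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # assumed value
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # placeholder image
      resources:
        limits:
          google.com/tpu: 4  # number of TPU chips requested
```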
🟣 NPU — Edge & Low-Latency AI
In Kubernetes:
- Rare in standard clusters
- Used in:
  - Edge Kubernetes
  - AI PCs
  - Embedded systems
👉 In MLOps
- On-device inference
- Privacy-preserving AI
- Real-time decisions
👉 NPUs run AI close to the user.
2️⃣ 🎯 GPU Scheduling in Kubernetes (Interview Gold)
How Kubernetes schedules GPUs
- GPUs are not shared by default
- You request them explicitly:
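For example, a minimal pod spec sketch with an explicit GPU request; the image is a placeholder, and nvidia.com/gpu assumes the NVIDIA device plugin is running on the GPU nodes.

```yaml
# Minimal sketch: request one whole GPU via the nvidia.com/gpu
# extended resource (exposed by the NVIDIA device plugin).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  containers:
    - name: model-server
      image: registry.example.com/llm-server:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1  # this GPU is assigned exclusively to the pod
```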
What happens:
- The scheduler places the pod on a node with a free GPU
- The GPU is exclusively assigned to that pod
Advanced GPU Scheduling Concepts
- Node labels & taints → ensure only GPU workloads land on GPU nodes (see the sketch after this list)
- GPU pools → separate training vs inference clusters
- Autoscaling → scale GPU nodes based on demand
- Cost optimisation → avoid idle GPUs (very expensive!)
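A sketch of the node-labels-and-taints idea, assuming an operator-chosen label pool=gpu and taint dedicated=gpu:NoSchedule; these names are conventions, not built-in Kubernetes values.

```yaml
# Illustrative: the label/taint names below are operator-chosen conventions.
# First, mark the GPU node (hypothetical node name gpu-node-1):
#   kubectl label nodes gpu-node-1 pool=gpu
#   kubectl taint nodes gpu-node-1 dedicated=gpu:NoSchedule
# Then GPU workloads select the node and tolerate the taint:
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    pool: gpu
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

The taint keeps CPU-only pods off expensive GPU nodes, which directly supports the cost-optimisation point above.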
👉 Senior takeaway:
“GPUs must be carefully scheduled to balance performance, cost, and availability.”
3️⃣ 🧪 Real Cloud Instance Examples (Practical)
AWS
- CPU: t3, m5
- GPU: g5, p4d
- Used for:
  - LLM inference
  - Model training
Azure
- CPU: D-series
- GPU: NC, ND, NV
- Used for:
  - Enterprise AI
  - Regulated workloads
GCP
- CPU: n2
- GPU: A100, L4
- TPU: v4, v5
- Used for:
  - Large-scale training
  - Research & AI platforms
👉 In interviews:
“We typically separate CPU workloads from GPU workloads using dedicated node pools.”
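One hedged way to express that separation on AWS is an eksctl-style config with distinct CPU and GPU node groups; the cluster name, region, instance sizes, and capacities below are placeholders.

```yaml
# Sketch of separate CPU and GPU node pools (eksctl-style config).
# Names, region, instance types, and capacities are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mlops-cluster
  region: us-east-1
managedNodeGroups:
  - name: cpu-pool            # APIs, pipelines, retrieval
    instanceType: m5.xlarge
    desiredCapacity: 3
    labels:
      pool: cpu
  - name: gpu-pool            # training and GPU inference
    instanceType: g5.xlarge
    desiredCapacity: 1
    labels:
      pool: gpu
    taints:
      - key: dedicated
        value: gpu
        effect: NoSchedule
```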
4️⃣ 🧠 Connecting This to LLMs & RAG Systems
Typical LLM + RAG Architecture
Where each processor fits
- CPU
  - API layer
  - Prompt orchestration
  - Retrieval logic
- GPU
  - Embeddings
  - LLM inference
- TPU
  - Model training (if applicable)
- NPU
  - Edge inference (future AI PCs)
👉 Key insight:
CPUs orchestrate the flow, GPUs do the intelligence.
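A sketch of how that split can look in a cluster: a CPU-only Deployment for the API / retrieval layer and a GPU-backed Deployment for embeddings and LLM inference. Names, images, and the pool label are placeholders.

```yaml
# CPU pool: API layer, prompt orchestration, retrieval logic.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 3
  selector:
    matchLabels: { app: rag-api }
  template:
    metadata:
      labels: { app: rag-api }
    spec:
      containers:
        - name: api
          image: registry.example.com/rag-api:latest   # placeholder image
          resources:
            requests: { cpu: "500m", memory: "1Gi" }
---
# GPU pool: embeddings and LLM inference.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels: { app: llm-inference }
  template:
    metadata:
      labels: { app: llm-inference }
    spec:
      nodeSelector: { pool: gpu }   # illustrative node-pool label
      containers:
        - name: model-server
          image: registry.example.com/llm-server:latest  # placeholder image
          resources:
            limits: { nvidia.com/gpu: 1 }
```

Keeping the two Deployments separate lets the cheap CPU replicas scale with traffic while GPU replicas stay at the minimum needed for inference throughput.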
5️⃣ 🔥 LinkedIn-Friendly Post (Copy–Paste Ready)
🚀 CPU, GPU, TPU, NPU — How Modern AI Platforms Really Work
In real-world AI and MLOps platforms, not all compute is equal.
🧠 CPUs orchestrate pipelines, APIs, and control logic.
⚡ GPUs power model training and high-throughput inference.
🧪 TPUs accelerate large-scale AI training.
📱 NPUs enable low-power, real-time AI on edge devices.
In Kubernetes, GPUs are scheduled explicitly and used only where needed, because idle GPUs are expensive.
In LLM and RAG systems, CPUs handle orchestration and retrieval, while GPUs perform embeddings and inference. Choosing the right compute at each stage is key to performance, cost optimisation, and scalability.
👉 Takeaway: Successful MLOps isn’t just about models—it’s about matching the right workload to the right compute.
#Kubernetes #MLOps #LLM #RAG #CloudComputing #DevOps #AIInfrastructure

