These are core performance metrics used in:
- ML APIs
- Microservices
- Cloud systems
- Trading systems
- High-frequency systems

Let’s break these down properly.
1️⃣ Latency
**Definition**
Latency = time taken to respond to a single request
If you send one API request, how long until you receive the response?
That time = latency.
It is usually measured in milliseconds (ms).
**Simple Analogy — Restaurant**
You order food. The time between:
- placing the order
- the food arriving
That waiting time = latency.
**Example**
| System | Latency |
|---|---|
| Local ML model | 10–50ms |
| Cloud API | 50–200ms |
| Large LLM | 500ms–2s |
Lower latency = faster response.
**Interview Explanation**
Latency measures the time taken for a system to process and return a response for a single request. For real-time ML systems, we aim for low latency, typically under 100ms.
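The definition above can be checked directly in code. A minimal sketch, assuming a stand-in `predict` function (the `time.sleep` call simulates ~20 ms of model work, since no real API is given here):

```python
import time

def predict(features):
    """Stand-in for a real model call; sleeps to simulate ~20 ms of work."""
    time.sleep(0.02)
    return sum(features)

start = time.perf_counter()
predict([1.0, 2.0, 3.0])
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.1f} ms")
```

In a real service you would record this per request and track percentiles (p50, p95, p99), since averages hide slow outliers.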
2️⃣ Throughput
**Definition**
Throughput = number of requests processed per second
Measured as:
Throughput = completed requests ÷ elapsed time, usually in requests per second (req/s)
**Analogy — Cashiers in a Supermarket**
If:
- 1 cashier → 10 customers per minute
- 10 cashiers → 100 customers per minute
More customers served per unit time = higher throughput.
**Example**
| System | Throughput |
|---|---|
| Basic Flask app | 100 req/s |
| Scaled API | 500 req/s |
| High-performance service | 10,000+ req/s |
**Important Relationship**
Latency and throughput are related. If the request rate rises beyond what the system can handle:
- latency increases (requests queue up)
- throughput may drop (the system saturates)
**Interview Explanation**
Throughput represents how many requests a system can handle per unit time, typically measured in requests per second. High-throughput systems are critical in high-load applications like trading or recommendation engines.
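Throughput can be estimated by counting completed requests over a time window. A toy sketch with a fake `handle_request` (the 5 ms sleep is an assumption standing in for real work, not a real workload):

```python
import time

def handle_request():
    """Stand-in for real request handling; pretend each request takes ~5 ms."""
    time.sleep(0.005)

n_requests = 50
start = time.perf_counter()
for _ in range(n_requests):
    handle_request()
elapsed = time.perf_counter() - start
throughput = n_requests / elapsed
print(f"throughput: {throughput:.0f} req/s")
```

Note this serial loop caps throughput at 1/latency; real servers exceed that by handling many requests concurrently, which is exactly why the two metrics are distinct.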
3️⃣ Availability
**Definition**
Availability = percentage of time the system is up and running
Formula:
Availability = uptime / (uptime + downtime) × 100%
**Analogy — ATM Machine**
If an ATM works 99% of the time → good availability.
If it fails often → bad availability.
**Example**
| Availability | Downtime per year |
|---|---|
| 95% | ~18 days |
| 99% | ~3.6 days |
| 99.9% | ~8.7 hours |
| 99.99% | ~52 minutes |
Each “9” matters a lot.
**Interview Explanation**
Availability measures the percentage of time a system remains operational. In production ML systems, 99.9% or higher availability is typically required.
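The downtime table above follows directly from the availability percentage. A small sketch that reproduces those numbers over a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

def downtime_hours_per_year(availability_pct):
    """Downtime implied by an availability percentage over one year."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for a in (95.0, 99.0, 99.9, 99.99):
    print(f"{a}% -> {downtime_hours_per_year(a):.1f} h/year")
```

For example, 99.9% availability allows only about 8.8 hours of downtime per year, which matches the table.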
4️⃣ Scalability
**Definition**
Scalability = Ability to handle increasing load without performance degradation
**Analogy — Expandable Restaurant**
If 10 customers arrive → fine.
If 100 customers arrive → open more tables.
The system adjusts automatically.
Types of Scalability
- Horizontal scaling: add more servers.
- Vertical scaling: increase the power of a single machine.
**Real Example**
An AWS auto-scaling group:
- traffic increases
- new EC2 instances are created
- latency stays low
**Interview Explanation**
Scalability is the ability of a system to handle increased traffic by adding resources either horizontally or vertically without significant performance loss.
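To build intuition for why adding servers keeps latency low, here is a toy model (an illustrative formula, not a real queueing analysis): latency stays near its base value at low utilization and blows up as per-server load approaches capacity. The capacity and base-latency numbers are assumptions for illustration.

```python
def approx_latency_ms(load_rps, servers, capacity_per_server=100, base_ms=50):
    """Toy model: latency is flat at low utilization and grows sharply
    as per-server load nears capacity (roughly like a queue filling up)."""
    utilization = load_rps / (servers * capacity_per_server)
    if utilization >= 1:
        return float("inf")  # saturated: requests queue without bound
    return base_ms / (1 - utilization)

print(approx_latency_ms(300, 4))  # 200.0 ms at 75% utilization
print(approx_latency_ms(300, 1))  # inf: one server is overloaded
print(approx_latency_ms(300, 8))  # 80.0 ms after scaling out
```

The design point: scaling horizontally lowers utilization, which is what keeps latency stable under growing load.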
**Deep Understanding (How These Metrics Interact)**
Think of the system as a highway:
- Latency = travel time per car
- Throughput = cars passing per second
- Availability = percentage of time the road is open
- Scalability = the ability to add more lanes

If traffic increases:
- Without scaling → congestion → latency ↑
- With scaling → lanes are added → latency stays stable
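The highway picture maps onto Little's Law, which ties these quantities together: the average number of requests in flight equals throughput × average latency. A quick check with hypothetical numbers:

```python
throughput_rps = 500   # hypothetical: 500 requests per second
latency_s = 0.1        # hypothetical: 100 ms average latency

# Little's Law: L = lambda * W (requests in system = arrival rate * time in system)
in_flight = throughput_rps * latency_s
print(in_flight)  # 50.0 requests concurrently in the system
```

This is why a latency increase under constant traffic means more requests piling up inside the system, and why capacity planning needs both numbers, not just one.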
**How To Prepare In-Depth**
To master this for interviews:
Step 1: Understand Measurement Tools
- Latency → Prometheus, Grafana
- Throughput → load testing (Locust, JMeter)
- Availability → uptime monitoring
- Scalability → auto-scaling configuration
Step 2: Learn Real Numbers
For an ML API:
- Latency target: <100 ms
- Throughput: depends on the use case
- Availability: 99.9%+
- Autoscaling via Kubernetes HPA
Step 3: Practice Explaining With Examples
Example scenario:
“If we deploy a house price prediction API, latency should remain under 100ms. If traffic increases beyond 500 requests per second, we enable horizontal scaling via Kubernetes HPA. To maintain 99.9% availability, we deploy in multiple availability zones.”
**Final Memory Trick**
L T A S
- Latency → How fast
- Throughput → How many
- Availability → How reliable
- Scalability → How expandable
**Advanced Interview-Level Answer**
In production ML systems, latency measures response time per request, throughput measures request handling capacity, availability reflects system uptime percentage, and scalability determines the system's ability to maintain performance under increasing load. These metrics are interrelated and must be balanced based on business requirements and SLA constraints.