📊 The Complete Guide to Machine Learning Evaluation Metrics
(With Intuition, Examples & Real-World Analogies)
When building ML systems, the most important question is:
“How do we know if the model is actually good?”
Different problems need different metrics. Here’s a structured breakdown 👇
🔵 1️⃣ Regression Metrics (Predicting Numbers)
Used when predicting continuous values like:
- House price
- Sales revenue
- Temperature
📌 MAE (Mean Absolute Error)
What it measures:
Average absolute difference between predicted and actual values.
Analogy:
Throwing darts at a board.
MAE = average distance from bullseye.
Use case:
When all errors are equally important.
📌 MSE (Mean Squared Error)
What it measures:
Squares errors → large mistakes punished heavily.
Analogy:
Small mistake = slap on wrist
Big mistake = heavy penalty
Use case:
When large errors are dangerous.
📌 RMSE (Root MSE)
What it measures:
Square root of MSE → brings error back to original unit.
Analogy:
MSE is in “squared units”; RMSE makes it human-readable.
📌 R² (R-Squared)
What it measures:
How much variance in target is explained by the model.
If R² = 0.8 → model explains 80% of variation.
Analogy:
How well the model explains why students score differently.
📌 MAPE (Mean Absolute Percentage Error)
Error expressed as percentage.
Use case:
Sales forecasting.
⚠ Problem: Undefined when any actual value = 0.
📌 MBE (Mean Bias Error)
Shows if model systematically overpredicts or underpredicts.
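The regression metrics above can all be sketched in a few lines of NumPy (the house-price numbers below are made up purely for illustration):

```python
import numpy as np

# Toy data: actual vs. predicted house prices (hypothetical values)
y_true = np.array([200.0, 250.0, 300.0, 150.0])
y_pred = np.array([210.0, 240.0, 330.0, 140.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))                 # MAE: average absolute miss
mse = np.mean(errors ** 2)                    # MSE: big misses punished heavily
rmse = np.sqrt(mse)                           # RMSE: back in the original units
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R²
mape = np.mean(np.abs(errors / y_true)) * 100  # MAPE: breaks if y_true has zeros
mbe = np.mean(errors)                          # MBE: sign shows over/under-prediction
```

Note how the single 30-unit miss dominates MSE (300) far more than MAE (15), which is exactly the “heavy penalty for big mistakes” behaviour described above.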
🔴 2️⃣ Classification Metrics (Predicting Categories)
Used in:
- Spam detection
- Fraud detection
- Disease prediction
📌 Accuracy
Overall correctness.
⚠ Misleading in imbalanced datasets.
Analogy:
If 90% of emails are normal, predicting every email as normal gives 90% accuracy — but the model is useless.
📌 Precision
Out of predicted positives, how many were correct?
Analogy:
When the doctor says “you have the disease”, how often is the doctor right?
📌 Recall (Sensitivity)
Out of actual positives, how many did we detect?
Analogy:
Out of all criminals, how many did the police catch?
📌 F1 Score
Balances precision & recall.
Used when both false positives & false negatives matter.
📌 ROC-AUC
Measures ranking ability of model.
Interpretation:
Probability that a random positive is ranked higher than a random negative.
📌 PR-AUC
Better than ROC for imbalanced datasets.
Important in:
- Fraud detection
- Rare disease prediction
📌 MCC (Matthews Correlation Coefficient)
Strong metric for imbalanced data.
Considers all confusion matrix values.
📌 Log Loss
Measures probability confidence.
Punishes confident wrong predictions heavily.
Used in:
- Logistic Regression
- Neural Networks
📌 Cohen’s Kappa
Measures agreement beyond random chance.
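The accuracy trap on imbalanced data is easy to demonstrate; here is a minimal sketch assuming scikit-learn is available (the email data is hypothetical):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Imbalanced toy data: 9 normal emails (0), 1 spam (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10  # lazy model: predicts "normal" for everything

acc = accuracy_score(y_true, y_pred)                     # 0.9, looks impressive
prec = precision_score(y_true, y_pred, zero_division=0)  # 0.0, no positives predicted
rec = recall_score(y_true, y_pred, zero_division=0)      # 0.0, the one spam was missed
f1 = f1_score(y_true, y_pred, zero_division=0)           # 0.0
```

The `zero_division=0` argument just silences the warning when no positives are predicted; the point is that precision, recall, and F1 all expose the failure that accuracy hides.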
🟢 3️⃣ Clustering Metrics (Unsupervised Learning)
Used in:
- Customer segmentation
- Market analysis
📌 Silhouette Score
Measures how well points fit into clusters.
Range: -1 to 1.
📌 Davies-Bouldin Index
Ratio of within-cluster scatter to between-cluster separation.
Lower is better.
📌 Calinski-Harabasz Index
Higher means better defined clusters.
📌 Adjusted Rand Index (ARI)
Measures similarity between clustering and ground truth.
📌 Dunn Index
Ratio of separation to compactness.
Higher is better.
📌 Mutual Information (MI)
Measures shared information between clusterings.
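With scikit-learn, most of these clustering scores are one-liners; a sketch on two well-separated toy clusters (the points are illustrative):

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Two well-separated toy clusters in 2-D
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])

sil = silhouette_score(X, labels)        # near 1: tight, well-separated clusters
db = davies_bouldin_score(X, labels)     # low: good separation
ch = calinski_harabasz_score(X, labels)  # high: well-defined clusters

# ARI compares a clustering against ground-truth labels,
# ignoring how the cluster ids themselves are numbered
truth = np.array([1, 1, 1, 0, 0, 0])
ari = adjusted_rand_score(truth, labels)  # 1.0: identical partitions
```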
🟣 4️⃣ NLP Metrics
Used in:
- Machine Translation
- Text Summarization
- Chatbots
📌 BLEU
Measures n-gram overlap between the predicted and reference translations.
📌 ROUGE
Measures recall-based overlap.
Used in summarization.
📌 METEOR
Improves on BLEU with synonym matching.
📌 BERTScore
Uses embeddings to measure semantic similarity.
More intelligent comparison.
📌 Perplexity
Measures how well language model predicts next word.
Lower is better.
📌 WER (Word Error Rate)
Used in speech recognition.
Counts insertion, deletion, and substitution errors against a reference transcript.
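WER is just a word-level edit distance divided by the reference length; a self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```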
🟡 5️⃣ Recommendation System Metrics
Used in:
- Netflix
- Amazon
- Spotify
📌 MAP@K
Mean Average Precision at top K results.
📌 NDCG
Measures ranking quality with position weighting.
Top results matter more.
📌 MRR
How quickly the first correct item appears in the ranking.
📌 Coverage
How many unique items get recommended.
📌 Diversity
Are recommendations varied?
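The ranking metrics above are short functions; a sketch with illustrative helper names (MAP@K is simply AP@K averaged over users):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def average_precision_at_k(recommended, relevant, k):
    """AP@K: precision at each relevant hit's rank, averaged over relevant items."""
    score, hits = 0.0, 0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)

def mrr(recommended, relevant):
    """Reciprocal rank of the first relevant item (0 if none appear)."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1 / rank
    return 0.0

recs = ["A", "B", "C", "D"]
relevant = {"B", "D"}
print(precision_at_k(recs, relevant, 2))          # 0.5: one of the top 2 is relevant
print(average_precision_at_k(recs, relevant, 4))  # (1/2 + 2/4) / 2 = 0.5
print(mrr(recs, relevant))                        # first hit at rank 2, so 0.5
```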
🟠 6️⃣ Computer Vision Metrics
Used in:
- Object detection
- Image segmentation
📌 mAP
Mean Average Precision across classes.
Standard in object detection.
📌 IoU (Intersection over Union)
Overlap between the predicted box and the ground-truth box.
📌 Dice Coefficient
Measures overlap between predicted and ground-truth masks; common in medical image segmentation.
📌 Hausdorff Distance
Measures worst-case boundary difference.
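IoU and Dice both reduce to a ratio of overlap to size; a sketch with a toy box pair and toy masks:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)           # intersection / union

def dice(mask_a, mask_b):
    """Dice coefficient for binary masks: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2 * inter / (mask_a.sum() + mask_b.sum())

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7: ≈ 0.143
```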
🟣 7️⃣ Fairness & Ethics Metrics
Modern AI must measure fairness.
📌 Demographic Parity
Positive-prediction rates are equal across groups.
📌 Equalized Odds
Equal true-positive and false-positive rates across groups.
📌 Bounded Group Loss
Ensures no group is disproportionately harmed.
📌 Construct Validity
Are we measuring what we claim to measure?
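Demographic parity, for example, can be checked directly from predictions; a sketch with hypothetical group labels and data:

```python
import numpy as np

# Hypothetical predictions and group membership (purely illustrative)
preds = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

rate_a = preds[group == "a"].mean()  # positive-prediction rate for group a
rate_b = preds[group == "b"].mean()  # positive-prediction rate for group b

# Demographic parity difference: 0 means equal selection rates
dpd = abs(rate_a - rate_b)
print(dpd)  # |0.75 - 0.25| = 0.5, a large parity gap
```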
🎯 Final Takeaway
There is no “best metric”.
Metrics depend on:
- Business objective
- Cost of mistakes
- Data imbalance
- Domain risk
Choosing the wrong metric = building the wrong system.
