📊 The Complete Guide to Machine Learning Evaluation Metrics
(With Intuition, Examples & Real-World Analogies)
When building ML systems, the most important question is:
“How do we know if the model is actually good?”
Different problems need different metrics. Here’s a structured breakdown 👇
🔵 1️⃣ Regression Metrics (Predicting Numbers)
Used when predicting continuous values like:
- House price
- Sales revenue
- Temperature
📌 MAE (Mean Absolute Error)
What it measures:
Average absolute difference between predicted and actual values.
Analogy:
Throwing darts at a board.
MAE = average distance from bullseye.
Use case:
When all errors are equally important.
📌 MSE (Mean Squared Error)
What it measures:
Squares errors → large mistakes punished heavily.
Analogy:
Small mistake = slap on wrist
Big mistake = heavy penalty
Use case:
When large errors are dangerous.
📌 RMSE (Root MSE)
What it measures:
Square root of MSE → brings error back to original unit.
Analogy:
MSE is in “squared units”; RMSE makes it human-readable.
📌 R² (R-Squared)
What it measures:
How much variance in target is explained by the model.
If R² = 0.8 → model explains 80% of variation.
Analogy:
How well the model explains why students score differently.
📌 MAPE (Mean Absolute Percentage Error)
Error expressed as percentage.
Use case:
Sales forecasting.
⚠ Problem: Undefined when any actual value = 0.
📌 MBE (Mean Bias Error)
Shows if model systematically overpredicts or underpredicts.
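The regression metrics above can all be sketched in a few lines of NumPy (the house-price numbers below are made up purely for illustration):

```python
import numpy as np

# Toy data: actual vs. predicted house prices (hypothetical values)
y_true = np.array([200.0, 250.0, 300.0, 150.0])
y_pred = np.array([210.0, 240.0, 330.0, 140.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))                 # MAE: average absolute miss
mse = np.mean(errors ** 2)                    # MSE: big misses punished heavily
rmse = np.sqrt(mse)                           # RMSE: back in the original units
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R²
mape = np.mean(np.abs(errors / y_true)) * 100  # MAPE: breaks if y_true has zeros
mbe = np.mean(errors)                          # MBE: sign shows over/under-prediction
```

Note how the single 30-unit miss dominates MSE (300) far more than MAE (15), which is exactly the “heavy penalty for big mistakes” behaviour described above.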
🔴 2️⃣ Classification Metrics (Predicting Categories)
Used in:
- Spam detection
- Fraud detection
- Disease prediction
📌 Accuracy
Overall correctness.
⚠ Misleading in imbalanced datasets.
Analogy:
If 90% of emails are normal, predicting every email as normal gives 90% accuracy — but the model is useless.
📌 Precision
Out of predicted positives, how many were correct?
Analogy:
When the doctor says “you have the disease”, how often is the doctor right?
📌 Recall (Sensitivity)
Out of actual positives, how many did we detect?
Analogy:
Out of all criminals, how many did the police catch?
📌 F1 Score
Balances precision & recall.
Used when both false positives & false negatives matter.
📌 ROC-AUC
Measures ranking ability of model.
Interpretation:
Probability that a random positive is ranked higher than a random negative.
📌 PR-AUC
Better than ROC for imbalanced datasets.
Important in:
- Fraud detection
- Rare disease prediction
📌 MCC (Matthews Correlation Coefficient)
Strong metric for imbalanced data.
Considers all confusion matrix values.
📌 Log Loss
Measures probability confidence.
Punishes confident wrong predictions heavily.
Used in:
- Logistic Regression
- Neural Networks
📌 Cohen’s Kappa
Measures agreement beyond random chance.
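The accuracy trap on imbalanced data is easy to demonstrate; here is a minimal sketch assuming scikit-learn is available (the email data is hypothetical):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Imbalanced toy data: 9 normal emails (0), 1 spam (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10  # lazy model: predicts "normal" for everything

acc = accuracy_score(y_true, y_pred)                     # 0.9, looks impressive
prec = precision_score(y_true, y_pred, zero_division=0)  # 0.0, no positives predicted
rec = recall_score(y_true, y_pred, zero_division=0)      # 0.0, the one spam was missed
f1 = f1_score(y_true, y_pred, zero_division=0)           # 0.0
```

The `zero_division=0` argument just silences the warning when no positives are predicted; the point is that precision, recall, and F1 all expose the failure that accuracy hides.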
🟢 3️⃣ Clustering Metrics (Unsupervised Learning)
Used in:
- Customer segmentation
- Market analysis
📌 Silhouette Score
Measures how well points fit into clusters.
Range: -1 to 1.
📌 Davies-Bouldin Index
Ratio of within-cluster scatter to between-cluster separation.
Lower is better.
📌 Calinski-Harabasz Index
Higher means better defined clusters.
📌 Adjusted Rand Index (ARI)
Measures similarity between clustering and ground truth.
📌 Dunn Index
Ratio of separation to compactness.
Higher is better.
📌 Mutual Information (MI)
Measures shared information between clusterings.
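With scikit-learn, most of these clustering scores are one-liners; a sketch on two well-separated toy clusters (the points are illustrative):

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Two well-separated toy clusters in 2-D
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])

sil = silhouette_score(X, labels)        # near 1: tight, well-separated clusters
db = davies_bouldin_score(X, labels)     # low: good separation
ch = calinski_harabasz_score(X, labels)  # high: well-defined clusters

# ARI compares a clustering against ground-truth labels,
# ignoring how the cluster ids themselves are numbered
truth = np.array([1, 1, 1, 0, 0, 0])
ari = adjusted_rand_score(truth, labels)  # 1.0: identical partitions
```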
🟣 4️⃣ NLP Metrics
Used in:
- Machine Translation
- Text Summarization
- Chatbots
📌 BLEU
Measures n-gram overlap between the predicted and reference translations.
📌 ROUGE
Measures recall-based overlap.
Used in summarization.
📌 METEOR
Improves on BLEU with synonym matching.
📌 BERTScore
Uses embeddings to measure semantic similarity.
More intelligent comparison.
📌 Perplexity
Measures how well language model predicts next word.
Lower is better.
📌 WER (Word Error Rate)
Used in speech recognition.
Counts insertion, deletion, and substitution errors against a reference transcript.
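WER is just a word-level edit distance divided by the reference length; a self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```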
🟡 5️⃣ Recommendation System Metrics
Used in:
- Netflix
- Amazon
- Spotify
📌 MAP@K
Mean Average Precision at top K results.
📌 NDCG
Measures ranking quality with position weighting.
Top results matter more.
📌 MRR
How quickly the first correct item appears in the ranking.
📌 Coverage
How many unique items get recommended.
📌 Diversity
Are recommendations varied?
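The ranking metrics above are short functions; a sketch with illustrative helper names (MAP@K is simply AP@K averaged over users):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def average_precision_at_k(recommended, relevant, k):
    """AP@K: precision at each relevant hit's rank, averaged over relevant items."""
    score, hits = 0.0, 0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)

def mrr(recommended, relevant):
    """Reciprocal rank of the first relevant item (0 if none appear)."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1 / rank
    return 0.0

recs = ["A", "B", "C", "D"]
relevant = {"B", "D"}
print(precision_at_k(recs, relevant, 2))          # 0.5: one of the top 2 is relevant
print(average_precision_at_k(recs, relevant, 4))  # (1/2 + 2/4) / 2 = 0.5
print(mrr(recs, relevant))                        # first hit at rank 2, so 0.5
```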
🟠 6️⃣ Computer Vision Metrics
Used in:
- Object detection
- Image segmentation
📌 mAP
Mean Average Precision across classes.
Standard in object detection.
📌 IoU (Intersection over Union)
Overlap between the predicted box and the ground-truth box.
📌 Dice Coefficient
Measures overlap between predicted and ground-truth masks; common in medical image segmentation.
📌 Hausdorff Distance
Measures worst-case boundary difference.
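IoU and Dice both reduce to a ratio of overlap to size; a sketch with a toy box pair and toy masks:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)           # intersection / union

def dice(mask_a, mask_b):
    """Dice coefficient for binary masks: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2 * inter / (mask_a.sum() + mask_b.sum())

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7: ≈ 0.143
```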
🟣 7️⃣ Fairness & Ethics Metrics
Modern AI must measure fairness.
📌 Demographic Parity
Positive-prediction rates are equal across groups.
📌 Equalized Odds
Equal true-positive and false-positive rates across groups.
📌 Bounded Group Loss
Ensures no group is disproportionately harmed.
📌 Construct Validity
Are we measuring what we claim to measure?
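Demographic parity, for example, can be checked directly from predictions; a sketch with hypothetical group labels and data:

```python
import numpy as np

# Hypothetical predictions and group membership (purely illustrative)
preds = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

rate_a = preds[group == "a"].mean()  # positive-prediction rate for group a
rate_b = preds[group == "b"].mean()  # positive-prediction rate for group b

# Demographic parity difference: 0 means equal selection rates
dpd = abs(rate_a - rate_b)
print(dpd)  # |0.75 - 0.25| = 0.5, a large parity gap
```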
🎯 Final Takeaway
There is no “best metric”.
Metrics depend on:
- Business objective
- Cost of mistakes
- Data imbalance
- Domain risk
Choosing the wrong metric = building the wrong system.
