Sunday, February 22, 2026

Data transformation step-by-step


You’ll see what happens at every stage.

Think of this as:

Raw vegetables → Clean → Cut → Cook → Taste → Serve

Same with ML pipeline.


🧠 BIG VISUAL FLOW

RAW DATA

data_ingest.py

DIRTY TABLE

data_cleaning.py

CLEAN TABLE

feature_engineering.py

MODEL-READY NUMBERS

train.py

TRAINED MODEL (learned patterns)

evaluate.py

PERFORMANCE REPORT

predict.py

REAL-WORLD PREDICTIONS

Now let’s go step-by-step visually.


1️⃣ data_ingest.py — Getting the Raw Material

What happens here?

  • Download CSV

  • Load into pandas

  • Store in /data/raw/

Example:

| email_text | label |
|------------|--------|
| Win money | spam |
| Hello sir | ham |

At this stage:

🚨 Data is untouched
🚨 Might have missing values
🚨 Might have duplicates
🚨 Might have imbalance
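As a rough sketch, data_ingest.py could look like this (the file layout, column names, and `ingest` helper are illustrative assumptions; the CSV is written to a temp directory here instead of /data/raw/):

```python
# data_ingest.py -- minimal ingestion sketch (paths/columns are assumptions)
import tempfile
from pathlib import Path

import pandas as pd

def ingest(rows, raw_dir):
    """Store raw records untouched in raw_dir and return them as a DataFrame."""
    Path(raw_dir).mkdir(parents=True, exist_ok=True)
    df = pd.DataFrame(rows, columns=["email_text", "label"])
    df.to_csv(Path(raw_dir) / "emails.csv", index=False)  # no cleaning yet
    return df

raw = ingest([("Win money", "spam"), ("Hello sir", "ham")], tempfile.mkdtemp())
```

Note the deliberate absence of any cleaning: ingestion only fetches and stores.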


Analogy

You just bought vegetables from the market.

They are dirty.


2️⃣ data_cleaning.py — Cleaning the Data

What happens here?

We:

  • Remove duplicates

  • Handle missing values

  • Fix formats

  • Remove noise

Before cleaning:

| text | label |
|-------------|-------|
| Win money | spam |
| Win money | spam | ← duplicate
| NULL | ham | ← missing

After cleaning:

| text | label |
|-------------|-------|
| Win money | spam |

What do we get after cleaning?

✅ Consistent
✅ No missing
✅ No duplicates
✅ Correct types
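A minimal sketch of what data_cleaning.py might do with pandas (the `text`/`label` column names are assumptions):

```python
# data_cleaning.py -- cleaning sketch (column names are assumptions)
import pandas as pd

def clean(df):
    """Drop duplicate rows and rows with missing text, trim whitespace."""
    out = df.drop_duplicates()
    out = out.dropna(subset=["text"])
    out = out.assign(text=out["text"].str.strip())
    return out.reset_index(drop=True)

dirty = pd.DataFrame({
    "text": ["Win money", "Win money", None],  # duplicate + missing value
    "label": ["spam", "spam", "ham"],
})
cleaned = clean(dirty)  # only one clean row survives
```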


Analogy

Wash vegetables. Remove rotten ones.


3️⃣ feature_engineering.py — Converting to Machine Language

This is VERY IMPORTANT.

Models cannot understand:

"Win money now!!!"

They only understand numbers.

So we convert text → numbers.


Example: Spam Detection

Original:

"Win money now"

After TF-IDF:

[0.8, 0.0, 0.3, 0.1, 0.9, ...]

Now data looks like:

| f1 | f2 | f3 | f4 | ... | label |
|----|----|----|----|-----|-------|
|0.8 |0.0 |0.3 |0.1 | ... | 1 |

What is stored?

Usually:

  • X_train

  • X_test

  • y_train

  • y_test

Or processed CSV in /data/processed/
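A sketch of the text → numbers step using scikit-learn's TfidfVectorizer (the toy corpus and labels are made up for illustration):

```python
# feature_engineering.py -- TF-IDF sketch (toy corpus, default settings)
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Win money now", "Hello sir how are you", "Free money win"]
labels = [1, 0, 1]  # 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix: one row per text,
                                     # one column per unique word
```

Each row of `X` is the numeric vector the model will actually see.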


Why is feature engineering needed?

Because:

Models only understand mathematics.

Humans understand meaning.
Machines understand numbers.


Analogy

Cut vegetables into pieces suitable for cooking.


4️⃣ train.py — Model Training

Now this is where the magic happens.


What is Training?

Training means:

Model learns relationship between:

Input (X)
Output (y)


Example:

If email contains:

  • “win”

  • “free”

  • “money”

Label is often spam.

Model learns:

Weight for word “win” = strong spam signal.


What exactly happens mathematically?

Model tries to:

Minimize loss function.

For classification:

  • Cross entropy

For regression:

  • MSE

It adjusts internal weights using:

Gradient Descent.


What do we get?

A trained model file:

models/model.pkl

This contains learned weights.
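A minimal training sketch, assuming scikit-learn with TF-IDF features and logistic regression (the model choice, toy data, and temp path standing in for models/model.pkl are all assumptions):

```python
# train.py -- minimal training sketch (model choice, data, path are assumptions)
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["win money now", "free money", "hello sir", "meeting at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# A pipeline keeps feature extraction and the classifier together,
# so the saved file can score raw text directly.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)  # loss minimization happens inside .fit()

# Stand-in for models/model.pkl: the learned weights live in this file.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)
```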


Why is training needed?

Without training:

Model doesn’t know patterns.

It’s just random.


Analogy

Training = studying past exam papers to learn patterns.


5️⃣ evaluate.py — Testing Performance

Now we ask:

“Is this model actually good?”

We use test data (unseen).

Calculate:

  • Precision

  • Recall

  • F1

  • PR-AUC

Example output:

Recall = 0.95
Precision = 0.90
F1 = 0.92
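These metrics can be computed with scikit-learn; the toy true/predicted labels below are made up just to show the calls:

```python
# evaluate.py -- metrics sketch on a toy held-out set
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1]  # actual labels (unseen test set)
y_pred = [1, 1, 0, 0, 0, 1]  # model predictions

precision = precision_score(y_true, y_pred)  # of predicted spam, how many were spam
recall = recall_score(y_true, y_pred)        # of actual spam, how many we caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```

Here the model made no false alarms (precision = 1.0) but missed one spam email (recall = 0.75).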

Why is evaluation needed?

Because:

Model may memorize training data.

Evaluation checks generalization.


Analogy

After studying, give mock exam.

Evaluation = mock exam result.


6️⃣ predict.py — Real World Usage

Now the model is ready.

User sends new email:

"Claim your free lottery now"

Pipeline:

  • Clean text

  • Convert to features

  • Pass into model

  • Model outputs probability:

Spam probability = 0.97

System says:

“Spam”
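The whole prediction path can be sketched like this (the tiny inline training set is an assumption; in production the fitted pipeline would be loaded from models/model.pkl):

```python
# predict.py -- inference sketch (inline training set is a stand-in
# for a pipeline loaded from models/model.pkl)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(
    ["win free lottery", "claim money now", "hello sir", "see you at lunch"],
    [1, 1, 0, 0],  # 1 = spam, 0 = ham
)

new_email = "Claim your free lottery now"
# The vectorizer inside the pipeline handles lowercasing + features;
# predict_proba returns [P(ham), P(spam)] for each input.
prob_spam = model.predict_proba([new_email])[0][1]
verdict = "Spam" if prob_spam > 0.5 else "Ham"
```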


Why is prediction needed?

Because:

Training is past.
Prediction is future.


Analogy

After studying and passing mock exam,
you now write real exam.


🧠 HOW DATA CHANGES AT EACH STEP

| Stage | Data Format |
|--------------------|----------------------|
| Raw | Text |
| Cleaned | Clean text |
| Feature Engineered | Numbers |
| Training | Model learns weights |
| Evaluation | Metrics |
| Prediction | Final output |

🧠 Big Concept Ladder

Raw Data
→ Clean Data
→ Structured Features
→ Learned Parameters
→ Performance Measurement
→ Real-world Decisions


🧠 Why Entire Pipeline Is Needed

Without cleaning → garbage model
Without features → model cannot learn
Without training → no pattern learning
Without evaluation → fake performance
Without prediction → useless model


🎯 Interview-Level Explanation

You can say:

The pipeline starts with data ingestion, followed by cleaning to ensure quality. Feature engineering transforms raw data into model-ready numerical features. During training, the model learns patterns by minimizing a loss function. Evaluation assesses performance using appropriate metrics on unseen data. Finally, the prediction stage uses the trained model to generate outputs for new real-world inputs.


🧠 Final Intuition

Machine Learning = Pattern Extraction System

Training = learning from past
Evaluation = checking learning quality
Prediction = applying learning to future

Next, let's go deeper:

  • Mathematical flow (loss → gradient → update)

  • Internal weight updates (numeric example)

  • MLflow tracking (visual)

  • MLOps pipeline connection (visual)

  • Exam-ready notes (copy/paste)


    1) Mathematical Flow: Prediction → Loss → Gradient → Update

    A. Model (Regression example)

    We predict with a simple linear model:

    \hat{y} = wx + b

    • x = input feature (e.g., house size)

    • w = weight (importance)

    • b = bias (baseline)

    • \hat{y} = predicted output


    B. Loss (How wrong are we?)

    Use MSE for regression:

    L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2

    For one sample:

    L = (y - \hat{y})^2


    C. Gradient (Direction to reduce loss)

    Gradient means:

    “If I change w a little, does loss increase or decrease?”

    Compute derivatives:

    \frac{\partial L}{\partial w}, \quad \frac{\partial L}{\partial b}


    D. Gradient Descent Update

    Update weights to reduce loss:

    w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \quad b \leftarrow b - \alpha \frac{\partial L}{\partial b}

    where \alpha is the learning rate (step size).


    2) Visual: Training Loop (Simple Diagram)

    (1) Take X, y

    (2) Predict: ŷ = f(X; w,b)

    (3) Compute Loss: L(y, ŷ)

    (4) Compute Gradient: dL/dw, dL/db

    (5) Update weights: w,b

    Repeat many times (epochs) until loss becomes small
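Under the stated assumptions (linear model, squared-error loss), the loop above can be sketched in plain Python on toy data:

```python
# Gradient-descent training loop sketch for y_hat = w*x + b (toy data)
xs = [1.0, 2.0, 3.0]
ys = [5.0, 8.0, 11.0]  # generated by y = 3x + 2, so ideal w = 3, b = 2

w, b, lr = 0.0, 0.0, 0.05

for epoch in range(2000):                     # repeat many times (epochs)
    dw = db = 0.0
    for x, y in zip(xs, ys):
        y_hat = w * x + b                     # (2) predict
        dw += -2 * x * (y - y_hat) / len(xs)  # (4) gradient dL/dw (MSE)
        db += -2 * (y - y_hat) / len(xs)      # (4) gradient dL/db
    w -= lr * dw                              # (5) update weights
    b -= lr * db
# w and b end up very close to 3 and 2
```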

    3) Full Numeric Example: Weight Updates Step-by-Step

    Let’s use 1 data point so it’s super clear:

    • Input: x = 2

    • Actual: y = 10

    Start with:

    • w = 1

    • b = 0

    • learning rate \alpha = 0.1

    Step 1: Predict

    \hat{y} = wx + b = 1 \cdot 2 + 0 = 2

    Step 2: Loss (Squared Error)

    L = (y - \hat{y})^2 = (10 - 2)^2 = 8^2 = 64

    Step 3: Compute Gradients

    For squared error L = (y - \hat{y})^2:

    \frac{\partial L}{\partial w} = -2x(y - \hat{y}), \quad \frac{\partial L}{\partial b} = -2(y - \hat{y})

    Compute the error term:

    (y - \hat{y}) = 10 - 2 = 8

    So:

    \frac{\partial L}{\partial w} = -2 \cdot 2 \cdot 8 = -32, \quad \frac{\partial L}{\partial b} = -2 \cdot 8 = -16

    Step 4: Update weights

    w \leftarrow 1 - 0.1 \cdot (-32) = 1 + 3.2 = 4.2, \quad b \leftarrow 0 - 0.1 \cdot (-16) = 0 + 1.6 = 1.6

    Step 5: Predict again (after 1 update)

    \hat{y} = 4.2 \cdot 2 + 1.6 = 8.4 + 1.6 = 10

    Step 6: New Loss

    L = (10 - 10)^2 = 0

    ✅ In one update, it learned perfectly (because we used one point).
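The arithmetic above can be verified in a few lines of Python:

```python
# Verify the one-step worked example: x=2, y=10, w=1, b=0, lr=0.1
x, y = 2.0, 10.0
w, b, lr = 1.0, 0.0, 0.1

y_hat = w * x + b                  # Step 1: predict -> 2.0
loss = (y - y_hat) ** 2            # Step 2: loss -> 64.0
dw = -2 * x * (y - y_hat)          # Step 3: dL/dw -> -32.0
db = -2 * (y - y_hat)              #         dL/db -> -16.0
w -= lr * dw                       # Step 4: w -> 4.2
b -= lr * db                       #         b -> 1.6
new_loss = (y - (w * x + b)) ** 2  # Steps 5-6: loss drops to ~0
```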

    What you should teach:

    • Prediction wrong → loss high

    • Gradient tells direction

    • Update moves weights toward better prediction


    4) Visual: Loss Landscape + Gradient

    Loss
      ^
      |  • (start: high loss)
      |   \
      |    \   gradient points downhill
      |     \
      +------•------------------> w
           (minimum)

    Gradient points toward steepest increase; we move opposite to go down.


    5) MLflow Tracking (Visual Explanation)

    When you run many experiments, you forget:

    • which parameters you used

    • which metrics you got

    • which model file is best

    MLflow stores everything.

    Visual

    train.py
    ├─ params: lr=0.1, model=logreg, C=1.0
    ├─ metrics: f1=0.92, recall=0.95, pr_auc=0.97
    ├─ artifacts: confusion_matrix.png, model.pkl
    └─ run_id: 8f3a...

    mlruns/
    └─ experiment_id/
       └─ run_id/
          ├─ metrics/
          ├─ params/
          └─ artifacts/

    What you see in MLflow UI

    • a table of runs

    • compare metrics

    • click best run

    • download the model


    6) Connect to MLOps Pipeline (End-to-End Visual)

    Git Push

    CI/CD (GitHub Actions / Jenkins)

    Data Ingest → Clean → Features

    Train (MLflow logs runs)

    Evaluate (choose best model)

    Register Model (versioning)

    Deploy (FastAPI + Docker)

    Monitor (drift + latency)

    Retrain trigger if drift/perf drops

    This is real production ML: not only training, but also deployment + monitoring + retraining.


    7) Exam-Ready Notes (Copy/Paste)

    Key definitions

    • Model: a function f(X; \theta), where \theta are the parameters (weights).

    • Loss function: measures prediction error (training objective).

    • Gradient: slope of loss w.r.t parameters; shows how to change parameters.

    • Gradient Descent: iterative method to minimize loss.

    Core equations

    • Regression prediction: \hat{y} = wx + b

    • MSE: L = \frac{1}{n}\sum (y - \hat{y})^2

    • Updates:

      • w \leftarrow w - \alpha \frac{\partial L}{\partial w}

      • b \leftarrow b - \alpha \frac{\partial L}{\partial b}

    Why minimize loss?

    Lower loss ⇒ predictions closer to truth ⇒ better generalization (when validated on test data).

    Why MLflow?

    Tracks:

    • parameters

    • metrics

    • artifacts (plots/models)

    • best model reproducibility

    Why evaluation?

    Training performance can be misleading (overfitting). Evaluate on unseen data using suitable metrics.
