Machine Learning Lifecycle & Tooling Reference Guide
1. Data Source
- What it is: The starting point where raw data is gathered and stored before processing.
- Source Diagram Tool: Kaggle CSV (typically used for practice or initial proof-of-concept projects).
- General Industry Alternatives: Relational databases (PostgreSQL), data warehouses (Snowflake), or distributed file systems.
- Cloud Equivalents:
- AWS: Amazon S3 (object storage), Amazon Redshift (data warehouse).
- Azure: Azure Blob Storage / Data Lake Storage, Azure Synapse Analytics.
- Google Cloud (GCP): Google Cloud Storage (GCS), Google BigQuery.
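As a minimal illustration of the practice setup above (a local Kaggle-style CSV as the raw source), the standard library alone can load and inspect such a file. The column names and values here are hypothetical stand-ins, not from any real dataset:

```python
import csv
import io

# Hypothetical raw export in Kaggle-CSV style: a header row plus records,
# with one missing value left blank (to be handled in the pipeline stage).
raw = """user_id,age,country
1,34,US
2,,DE
3,29,FR
"""

# Read the raw data as a list of dicts, one per row.
rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows))        # 3 records
print(rows[1]["age"])   # "" — missing value, untouched at this stage
```

In a real project the same rows would come from S3, Blob Storage, or GCS rather than an inline string, but the shape of the raw data is the same.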
2. Data Pipeline
- What it is: The stage where raw data is prepared for machine learning, involving cleaning, handling missing values, and analyzing patterns.
- Source Diagram Tool: Clean + EDA (Exploratory Data Analysis).
- General Industry Alternatives: Python libraries like Pandas or Polars for smaller datasets; Apache Spark for massive datasets; Apache Airflow or Prefect to orchestrate and automate these cleaning tasks.
- Cloud Equivalents:
- AWS: AWS Glue (serverless data integration) or Amazon EMR.
- Azure: Azure Data Factory (pipeline orchestration) or Azure Databricks.
- GCP: Cloud Dataflow (stream/batch processing) or Cloud Dataproc.
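A tiny Pandas sketch of the "Clean + EDA" step described above: fill a missing value, then run basic summary statistics. The toy frame and its columns are illustrative, and median-imputation is just one common cleaning strategy among many:

```python
import pandas as pd

# Toy frame standing in for the raw extract; columns are hypothetical.
df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "country": ["US", "DE", "FR", "US"],
})

# Cleaning: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Light EDA: summary statistics and a category count.
print(df["age"].describe())
print(df["country"].value_counts())
```

At scale, the same `fillna`/`describe`-style operations would run in Spark or a managed service like Glue or Dataflow instead of in-memory Pandas.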
3. Feature Store
- What it is: A centralized data management system to organize, store, and serve the cleaned data features so they can be easily reused across multiple machine learning models.
- Source Diagram Tool: Feast (an open-source feature store).
- General Industry Alternatives: Hopsworks.
- Cloud Equivalents:
- AWS: Amazon SageMaker Feature Store.
- Azure: Azure Machine Learning Managed Feature Store.
- GCP: Vertex AI Feature Store.
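To make the "compute once, reuse across models" idea concrete, here is a toy in-memory sketch of what a feature store does. The class, method names, and feature names are hypothetical illustrations of the pattern, not the Feast API:

```python
# A toy in-memory "feature store": features are computed once,
# keyed by entity id, and served to any model that needs them.
class TinyFeatureStore:
    def __init__(self):
        self._features = {}  # feature name -> {entity_id: value}

    def register(self, name, values):
        """Store a computed feature table for later reuse."""
        self._features[name] = dict(values)

    def get(self, names, entity_id):
        """Serve a feature vector for one entity, e.g. at inference time."""
        return [self._features[n][entity_id] for n in names]

store = TinyFeatureStore()
store.register("avg_order_value", {101: 42.5, 102: 17.0})
store.register("days_since_signup", {101: 300, 102: 12})

# Two different models can request the same stored features.
print(store.get(["avg_order_value", "days_since_signup"], 101))  # [42.5, 300]
```

Real feature stores like Feast add what this sketch omits: persistence, point-in-time correctness for training data, and a low-latency online serving layer.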
4. Model Training
- What it is: Feeding the prepared features into an algorithm so it can learn patterns, while rigorously logging experiments, tracking metrics (such as accuracy), and versioning the resulting models.
- Source Diagram Tool: MLflow Track (i.e., MLflow Tracking, the experiment-logging component of MLflow).
- General Industry Alternatives: Weights & Biases (W&B), Comet.ml, or Neptune.ai for experiment tracking.
- Cloud Equivalents:
- AWS: Amazon SageMaker Training and SageMaker Experiments (MLflow is also supported natively here).
- Azure: Azure Machine Learning Workspaces (natively integrates with MLflow).
- GCP: Vertex AI Training and Vertex AI Experiments.
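The record an experiment tracker keeps per run can be sketched with the standard library alone. This is the shape of the data MLflow, W&B, and similar tools capture, not their actual APIs; the parameter values and metric below are illustrative, not from a real training job:

```python
import json
import time

# Sketch of one tracked run: parameters in, metrics out, plus a model
# version tag. A real tracker writes this to a server, not a string.
def log_run(params, metrics, model_version):
    run = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "model_version": model_version,
    }
    return json.dumps(run)

record = log_run(
    params={"learning_rate": 0.1, "n_estimators": 100},
    metrics={"accuracy": 0.91},
    model_version="v3",
)
print(json.loads(record)["metrics"]["accuracy"])  # 0.91
```

Keeping every run in this structured form is what makes later questions answerable, such as "which hyperparameters produced the model currently in production?"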
5. Deployment
- What it is: Taking the finalized, trained model and hosting it as an API endpoint so that software applications, websites, or users can send it data and receive predictions.
- Source Diagram Tool: FastAPI + Docker. Docker packages the model into a standalone container, and FastAPI serves it over the web.
- General Industry Alternatives:
- Web Frameworks: Flask, Django.
- Specialized ML Serving: BentoML, Seldon Core, TensorFlow Serving, TorchServe, or Ray Serve (these are often preferred over plain FastAPI in production because they handle concerns like request batching and GPU utilization out of the box).
- Cloud Equivalents:
- AWS: Amazon SageMaker Endpoints (managed hosting) or Amazon ECS / Fargate (for running Dockerized FastAPI apps).
- Azure: Azure Machine Learning Online Endpoints or Azure Kubernetes Service (AKS).
- GCP: Vertex AI Endpoints or Google Cloud Run (a serverless option well suited to FastAPI + Docker containers).
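The serving pattern itself (JSON request in, prediction out) can be sketched without any web framework. In the FastAPI + Docker setup above, `handle_request` would be the body of a FastAPI route; the "model" here is a hand-written stand-in rule, and the feature names are hypothetical:

```python
import json

# Stand-in for a real trained model: a fixed linear scoring rule.
def model_predict(features):
    return 0.5 * features["age"] + 2.0 * features["visits"]

def handle_request(body: str) -> str:
    """Parse a JSON request, run the model, return a JSON response."""
    features = json.loads(body)
    return json.dumps({"prediction": model_predict(features)})

print(handle_request('{"age": 30, "visits": 4}'))  # {"prediction": 23.0}
```

Everything a serving framework adds — routing, validation, batching, scaling — wraps around this same in/out contract.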
6. Monitoring
- What it is: Continuously observing the deployed model in the real world to ensure it remains accurate over time and alerting teams if the incoming data changes significantly.
- Source Diagram Tool: Drift Check (monitoring "data drift" to know when the model requires retraining).
- General Industry Alternatives: Evidently AI, Arize AI, Fiddler, or a general observability stack like Prometheus + Grafana.
- Cloud Equivalents:
- AWS: Amazon SageMaker Model Monitor.
- Azure: Azure Machine Learning Model Monitoring (Data Drift Detection).
- GCP: Vertex AI Model Monitoring.
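A basic "Drift Check" like the one above can be sketched with the standard library: compare the mean of live feature values against the training baseline, and alert when the shift exceeds a threshold measured in baseline standard deviations. The threshold, data, and single-feature mean-shift test are all simplifications of what tools like Evidently or Model Monitor compute:

```python
import statistics

def drift_alert(baseline, live, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold`
    baseline standard deviations away from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(live) - mu) / sigma
    return shift > threshold

baseline_ages = [30, 32, 29, 31, 30, 33, 28, 31]  # training-time distribution
stable_live = [31, 30, 32, 29]                    # similar incoming data
drifted_live = [55, 60, 58, 57]                   # distribution has shifted

print(drift_alert(baseline_ages, stable_live))   # False
print(drift_alert(baseline_ages, drifted_live))  # True
```

Production monitors typically use distribution-level tests (e.g. population stability index or Kolmogorov–Smirnov) per feature rather than a single mean comparison, but the alert-and-retrain loop is the same.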