Skip to content

ashrane111/ML-Observability-Platform

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

22 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

ML Observability Platform

A production-grade ML observability platform demonstrating real-time model monitoring, drift detection, and alerting capabilities.

Python FastAPI License

๐ŸŽฏ Overview

This platform provides comprehensive monitoring for machine learning models in production, featuring:

  • Three ML Models: Fraud Detection (XGBoost), Price Prediction (LightGBM), Churn Prediction (Random Forest)
  • Data Drift Detection: PSI-based drift monitoring using statistical tests
  • Real-time Alerting: Configurable alerts with severity levels and lifecycle management
  • Prometheus Metrics: Full observability with custom ML metrics
  • Grafana Dashboards: Pre-configured dashboards for visualization
  • REST API: FastAPI-based service with OpenAPI documentation

๐Ÿ—๏ธ Architecture

Architecture Overview

๐Ÿš€ Quick Start

Prerequisites

  • Python 3.10+
  • Docker & Docker Compose (for full stack)
  • uv (recommended) or pip

Installation

# Clone the repository
git clone https://github.com/yourusername/ml-observability-platform.git
cd ml-observability-platform

# Install dependencies
make dev

# Generate synthetic data
make generate-data

# Train models
make train-models

# Setup reference data for drift detection
make setup-reference

Running the API

# Development mode (with hot reload)
make run-api-dev

# Production mode
make run-api

API will be available at http://localhost:8000

Running with Docker

# Build and start all services
make docker-up

# View logs
make docker-logs

# Stop services
make docker-down

Services:

๐Ÿ“ Project Structure

ml-observability-platform/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ api/                    # FastAPI application
โ”‚   โ”‚   โ”œโ”€โ”€ app.py              # Main application
โ”‚   โ”‚   โ”œโ”€โ”€ schemas.py          # Pydantic models
โ”‚   โ”‚   โ””โ”€โ”€ routes/             # API endpoints
โ”‚   โ”‚       โ”œโ”€โ”€ health.py       # Health checks
โ”‚   โ”‚       โ”œโ”€โ”€ predictions.py  # Prediction endpoints
โ”‚   โ”‚       โ””โ”€โ”€ monitoring.py   # Monitoring endpoints
โ”‚   โ”œโ”€โ”€ data/                   # Data generation
โ”‚   โ”‚   โ””โ”€โ”€ generator.py        # Synthetic data generator
โ”‚   โ”œโ”€โ”€ models/                 # ML models
โ”‚   โ”‚   โ”œโ”€โ”€ base.py             # Base model class
โ”‚   โ”‚   โ”œโ”€โ”€ fraud_detector.py   # Fraud detection model
โ”‚   โ”‚   โ”œโ”€โ”€ price_predictor.py  # Price prediction model
โ”‚   โ”‚   โ”œโ”€โ”€ churn_predictor.py  # Churn prediction model
โ”‚   โ”‚   โ””โ”€โ”€ preprocessing.py    # Feature preprocessing
โ”‚   โ””โ”€โ”€ monitoring/             # Monitoring components
โ”‚       โ”œโ”€โ”€ drift_detector.py   # Drift detection
โ”‚       โ”œโ”€โ”€ metrics.py          # Prometheus metrics
โ”‚       โ””โ”€โ”€ alerts.py           # Alert management
โ”œโ”€โ”€ scripts/
โ”‚   โ”œโ”€โ”€ generate_data.py        # Data generation script
โ”‚   โ”œโ”€โ”€ train_models.py         # Model training script
โ”‚   โ”œโ”€โ”€ demo.py                 # Interactive demo
โ”‚   โ””โ”€โ”€ simulate_traffic.py     # Traffic simulation
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ unit/                   # Unit tests
โ”‚   โ””โ”€โ”€ integration/            # Integration tests
โ”œโ”€โ”€ prometheus/                 # Prometheus configuration
โ”œโ”€โ”€ grafana/                    # Grafana dashboards
โ”œโ”€โ”€ Dockerfile                  # Container build
โ”œโ”€โ”€ docker-compose.yml          # Full stack deployment
โ”œโ”€โ”€ Makefile                    # Common commands
โ””โ”€โ”€ pyproject.toml              # Project configuration

๐Ÿ”Œ API Endpoints

Health

Endpoint Method Description
/health GET Health check with model status
/health/live GET Kubernetes liveness probe
/health/ready GET Kubernetes readiness probe

Predictions

Endpoint Method Description
/predict/fraud POST Fraud detection prediction
/predict/price POST Property price prediction
/predict/churn POST Customer churn prediction
/predict/batch POST Batch predictions
/predict/models GET List available models

Monitoring

Endpoint Method Description
/monitoring/drift/check POST Check data drift
/monitoring/drift/status/{model} GET Get drift detector status
/monitoring/quality/check POST Check data quality
/monitoring/alerts GET List alerts
/monitoring/alerts/summary GET Alert summary
/monitoring/alerts/{id} GET Get alert details
/monitoring/alerts/{id}/acknowledge POST Acknowledge alert
/monitoring/alerts/{id}/resolve POST Resolve alert
/monitoring/metrics GET Prometheus metrics

๐Ÿ“Š Monitoring Features

Drift Detection

The platform monitors for several types of drift:

  • Data Drift: Distribution changes in input features
  • Concept Drift: Changes in the relationship between features and target
  • Prediction Drift: Changes in model output distribution

Drift is detected using:

  • PSI (Population Stability Index) for numerical features
  • Jensen-Shannon Divergence for categorical features

Metrics Collected

# Prediction metrics
mlobs_predictions_total{model_name, status}
mlobs_prediction_latency_seconds{model_name}
mlobs_prediction_value{model_name}

# Drift metrics
mlobs_drift_score{model_name, feature}
mlobs_dataset_drift_detected{model_name, dataset_name}
mlobs_drift_share{model_name}
mlobs_drifted_features_count{model_name}

# Data quality metrics
mlobs_missing_values_share{model_name, dataset_name}
mlobs_duplicate_rows_count{model_name, dataset_name}

# Alert metrics
mlobs_alerts_total{model_name, alert_type, severity}
mlobs_active_alerts{model_name}

Alert Types

Alert Type Description Default Threshold
drift_detected Data drift detected drift_share > 0.2
drift_critical Critical drift level drift_share > 0.3
performance_degradation Model performance drop accuracy < 0.8
data_quality_issue Data quality problems missing > 10%
high_latency Slow predictions p99 > 500ms

๐Ÿงช Testing

# Run all tests
make test

# Run unit tests only
make test-unit

# Run integration tests
make test-integration

# Run with coverage
pytest tests/ -v --cov=src --cov-report=html

๐ŸŽฎ Demo

Run the interactive demo to see all features:

python scripts/demo.py

Simulate traffic for monitoring demonstration:

# Start the API first
make run-api-dev

# In another terminal, run traffic simulation
python scripts/simulate_traffic.py --duration 120 --rate 5 --drift 0.3

๐Ÿ”ง Configuration

Environment Variables

Variable Description Default
LOG_LEVEL Logging level INFO
MODEL_DIR Model storage directory models/
DATA_DIR Data storage directory data/

Drift Detector Configuration

DriftDetector(
    psi_threshold_warning=0.1,    # Warning threshold
    psi_threshold_critical=0.2,   # Critical threshold
    drift_share_threshold=0.3,    # % features to trigger alert
)

๐Ÿ“ˆ Grafana Dashboards

The platform includes pre-configured Grafana dashboards:

  1. ML Observability Platform - Main dashboard
    • Model status indicators
    • Prediction throughput
    • Latency percentiles
    • Drift status
    • Data quality metrics
    • Active alerts

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (make test)
  4. Run linters (make lint)
  5. Commit changes (git commit -m 'Add amazing feature')
  6. Push to branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

About

Production-grade MLOps platform for real-time observability, drift detection, and automated retraining. Built with FastAPI, OpenTelemetry, Evidently AI, and Prefect.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages