A production-grade ML observability platform demonstrating real-time model monitoring, drift detection, and alerting capabilities.
This platform provides comprehensive monitoring for machine learning models in production, featuring:
- Three ML Models: Fraud Detection (XGBoost), Price Prediction (LightGBM), Churn Prediction (Random Forest)
- Data Drift Detection: Drift monitoring based on PSI (numerical features) and Jensen-Shannon divergence (categorical features)
- Real-time Alerting: Configurable alerts with severity levels and lifecycle management
- Prometheus Metrics: Full observability with custom ML metrics
- Grafana Dashboards: Pre-configured dashboards for visualization
- REST API: FastAPI-based service with OpenAPI documentation
- Python 3.10+
- Docker & Docker Compose (for full stack)
- uv (recommended) or pip
# Clone the repository
git clone https://github.com/yourusername/ml-observability-platform.git
cd ml-observability-platform
# Install dependencies
make dev
# Generate synthetic data
make generate-data
# Train models
make train-models
# Setup reference data for drift detection
make setup-reference
# Development mode (with hot reload)
make run-api-dev
# Production mode
make run-api
API will be available at http://localhost:8000
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
# Build and start all services
make docker-up
# View logs
make docker-logs
# Stop services
make docker-down
Services:
- API: http://localhost:8000
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (default login: admin/admin)
ml-observability-platform/
├── src/
│   ├── api/                      # FastAPI application
│   │   ├── app.py                # Main application
│   │   ├── schemas.py            # Pydantic models
│   │   └── routes/               # API endpoints
│   │       ├── health.py         # Health checks
│   │       ├── predictions.py    # Prediction endpoints
│   │       └── monitoring.py     # Monitoring endpoints
│   ├── data/                     # Data generation
│   │   └── generator.py          # Synthetic data generator
│   ├── models/                   # ML models
│   │   ├── base.py               # Base model class
│   │   ├── fraud_detector.py     # Fraud detection model
│   │   ├── price_predictor.py    # Price prediction model
│   │   ├── churn_predictor.py    # Churn prediction model
│   │   └── preprocessing.py      # Feature preprocessing
│   └── monitoring/               # Monitoring components
│       ├── drift_detector.py     # Drift detection
│       ├── metrics.py            # Prometheus metrics
│       └── alerts.py             # Alert management
├── scripts/
│   ├── generate_data.py          # Data generation script
│   ├── train_models.py           # Model training script
│   ├── demo.py                   # Interactive demo
│   └── simulate_traffic.py       # Traffic simulation
├── tests/
│   ├── unit/                     # Unit tests
│   └── integration/              # Integration tests
├── prometheus/                   # Prometheus configuration
├── grafana/                      # Grafana dashboards
├── Dockerfile                    # Container build
├── docker-compose.yml            # Full stack deployment
├── Makefile                      # Common commands
└── pyproject.toml                # Project configuration

Health endpoints:

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with model status |
| `/health/live` | GET | Kubernetes liveness probe |
| `/health/ready` | GET | Kubernetes readiness probe |

Prediction endpoints:

| Endpoint | Method | Description |
|---|---|---|
| `/predict/fraud` | POST | Fraud detection prediction |
| `/predict/price` | POST | Property price prediction |
| `/predict/churn` | POST | Customer churn prediction |
| `/predict/batch` | POST | Batch predictions |
| `/predict/models` | GET | List available models |

Monitoring endpoints:

| Endpoint | Method | Description |
|---|---|---|
| `/monitoring/drift/check` | POST | Check data drift |
| `/monitoring/drift/status/{model}` | GET | Get drift detector status |
| `/monitoring/quality/check` | POST | Check data quality |
| `/monitoring/alerts` | GET | List alerts |
| `/monitoring/alerts/summary` | GET | Alert summary |
| `/monitoring/alerts/{id}` | GET | Get alert details |
| `/monitoring/alerts/{id}/acknowledge` | POST | Acknowledge alert |
| `/monitoring/alerts/{id}/resolve` | POST | Resolve alert |
| `/monitoring/metrics` | GET | Prometheus metrics |
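
The endpoints listed above can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. It assumes the API is running at http://localhost:8000 and that the endpoints return JSON (`/monitoring/metrics` serves Prometheus text, so it is not covered here). The helper names `endpoint_url`, `call`, and `alert_action_path` are hypothetical and not part of the project, and no request payloads are shown because the schemas are defined in `src/api/schemas.py`.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # address given in the setup instructions


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def alert_action_path(alert_id: str, action: str) -> str:
    """Build the path for acknowledging or resolving an alert."""
    if action not in {"acknowledge", "resolve"}:
        raise ValueError(f"unsupported action: {action}")
    return f"/monitoring/alerts/{alert_id}/{action}"


def call(path: str, method: str = "GET", payload: Optional[dict] = None) -> dict:
    """Send a request and decode the JSON body (assumes a JSON response)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        endpoint_url(path),
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Example calls (these need a running API, so they are left commented out):
# call("/health")
# call(alert_action_path("<alert-id>", "acknowledge"), method="POST")
```

For the exact request and response fields, the Swagger UI at http://localhost:8000/docs remains the authoritative reference.
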
The platform monitors for several types of drift:
- Data Drift: Distribution changes in input features
- Concept Drift: Changes in the relationship between features and target
- Prediction Drift: Changes in model output distribution
Drift is detected using:
- PSI (Population Stability Index) for numerical features
- Jensen-Shannon Divergence for categorical features
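
To make the two drift scores concrete, here is a small standard-library sketch of how PSI and Jensen-Shannon divergence can be computed. It is an illustration only and not the code in `src/monitoring/drift_detector.py`: the equal-width binning, the `EPS` floor for empty bins, the function names, and the `>=` comparisons in `classify` are all assumptions. The 0.1 and 0.2 cut-offs mirror the `DriftDetector` defaults listed in the configuration section.

```python
import math
from collections import Counter

EPS = 1e-6  # assumed floor for empty bins so the logarithm stays finite


def psi(reference, current, bins=10):
    """Population Stability Index between two numerical samples.

    Bin edges are equal-width over the reference range; current values
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), EPS) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def js_divergence(reference, current):
    """Jensen-Shannon divergence (log base 2, range 0 to 1) for categorical samples."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        p = ref_counts[category] / len(reference)
        q = cur_counts[category] / len(current)
        m = (p + q) / 2
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


def classify(score, warning=0.1, critical=0.2):
    """Map a PSI score to a status label using the assumed default thresholds."""
    if score >= critical:
        return "critical"
    if score >= warning:
        return "warning"
    return "ok"


# Usage: compare a reference sample with a shifted production sample.
reference_amounts = [i / 100 for i in range(100)]
current_amounts = [0.5 + i / 200 for i in range(100)]
status = classify(psi(reference_amounts, current_amounts))
```

Both scores are zero when the two samples have the same binned distribution and grow as the distributions diverge, which is what makes a fixed threshold usable as an alert trigger.
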
# Prediction metrics
mlobs_predictions_total{model_name, status}
mlobs_prediction_latency_seconds{model_name}
mlobs_prediction_value{model_name}
# Drift metrics
mlobs_drift_score{model_name, feature}
mlobs_dataset_drift_detected{model_name, dataset_name}
mlobs_drift_share{model_name}
mlobs_drifted_features_count{model_name}
# Data quality metrics
mlobs_missing_values_share{model_name, dataset_name}
mlobs_duplicate_rows_count{model_name, dataset_name}
# Alert metrics
mlobs_alerts_total{model_name, alert_type, severity}
mlobs_active_alerts{model_name}
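
As an illustration of how metrics with these names and labels could be registered, the sketch below uses the `prometheus_client` library. The metric types (counter, histogram, gauge), the help strings, and the label values `fraud`, `success`, and `amount` are assumptions for the example; the actual definitions live in `src/monitoring/metrics.py` and may differ.

```python
from prometheus_client import (
    CollectorRegistry,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

# A private registry keeps this sketch isolated from any global metrics.
registry = CollectorRegistry()

predictions_total = Counter(
    "mlobs_predictions_total",
    "Number of predictions served",
    ["model_name", "status"],
    registry=registry,
)
prediction_latency = Histogram(
    "mlobs_prediction_latency_seconds",
    "Prediction latency in seconds",
    ["model_name"],
    registry=registry,
)
drift_score = Gauge(
    "mlobs_drift_score",
    "Latest drift score per feature",
    ["model_name", "feature"],
    registry=registry,
)

# Record one prediction and one drift measurement (hypothetical values).
predictions_total.labels(model_name="fraud", status="success").inc()
prediction_latency.labels(model_name="fraud").observe(0.042)
drift_score.labels(model_name="fraud", feature="amount").set(0.15)

# Text in the Prometheus exposition format, as a scrape endpoint would serve it.
exposition = generate_latest(registry).decode("utf-8")
```
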
| Alert Type | Description | Default Threshold |
|---|---|---|
| `drift_detected` | Data drift detected | drift_share > 0.2 |
| `drift_critical` | Critical drift level | drift_share > 0.3 |
| `performance_degradation` | Model performance drop | accuracy < 0.8 |
| `data_quality_issue` | Data quality problems | missing > 10% |
| `high_latency` | Slow predictions | p99 > 500ms |
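
The default thresholds in the table can be read as simple predicates over a model's current state. The sketch below shows one way to express them. The `ModelSnapshot` container and the `evaluate` function are hypothetical, and each rule is evaluated independently, so a drift share above 0.3 reports both drift alerts. Whether the platform deduplicates them is not stated here.

```python
from dataclasses import dataclass


@dataclass
class ModelSnapshot:
    """Hypothetical container for the values the alert rules inspect."""

    drift_share: float     # fraction of features flagged as drifted
    accuracy: float        # latest measured accuracy
    missing_share: float   # fraction of missing values in recent data
    p99_latency_ms: float  # 99th percentile prediction latency in ms


# Default thresholds copied from the alert table, in table order.
RULES = [
    ("drift_detected", lambda s: s.drift_share > 0.2),
    ("drift_critical", lambda s: s.drift_share > 0.3),
    ("performance_degradation", lambda s: s.accuracy < 0.8),
    ("data_quality_issue", lambda s: s.missing_share > 0.10),
    ("high_latency", lambda s: s.p99_latency_ms > 500),
]


def evaluate(snapshot):
    """Return the names of all alert types whose threshold is exceeded."""
    return [name for name, triggered in RULES if triggered(snapshot)]


# Usage: a model with moderate drift and otherwise healthy values.
moderate_drift = ModelSnapshot(
    drift_share=0.25, accuracy=0.9, missing_share=0.02, p99_latency_ms=120.0
)
fired = evaluate(moderate_drift)
```
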
# Run all tests
make test
# Run unit tests only
make test-unit
# Run integration tests
make test-integration
# Run with coverage
pytest tests/ -v --cov=src --cov-report=html
Run the interactive demo to see all features:
python scripts/demo.py
Simulate traffic for monitoring demonstration:
# Start the API first
make run-api-dev
# In another terminal, run traffic simulation
python scripts/simulate_traffic.py --duration 120 --rate 5 --drift 0.3

| Variable | Description | Default |
|---|---|---|
| `LOG_LEVEL` | Logging level | INFO |
| `MODEL_DIR` | Model storage directory | models/ |
| `DATA_DIR` | Data storage directory | data/ |
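
A minimal sketch of reading these variables with their documented defaults is shown below. The `Settings` dataclass and `load_settings` function are hypothetical names used for illustration, and normalizing the log level to upper case is an assumption. The project may load its configuration differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object mirroring the environment variable table."""

    log_level: str
    model_dir: Path
    data_dir: Path


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default) with table defaults."""
    env = os.environ if env is None else env
    return Settings(
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        model_dir=Path(env.get("MODEL_DIR", "models/")),
        data_dir=Path(env.get("DATA_DIR", "data/")),
    )


# Usage: an empty mapping falls back to the documented defaults.
defaults = load_settings({})
```

Passing the mapping explicitly keeps the function easy to test without modifying the real process environment.
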
DriftDetector(
psi_threshold_warning=0.1, # Warning threshold
psi_threshold_critical=0.2, # Critical threshold
drift_share_threshold=0.3, # Share of drifted features that triggers an alert
)
The platform includes pre-configured Grafana dashboards:
- ML Observability Platform - Main dashboard
  - Model status indicators
  - Prediction throughput
  - Latency percentiles
  - Drift status
  - Data quality metrics
  - Active alerts
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Run tests (`make test`)
- Run linters (`make lint`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- FastAPI - Modern web framework
- Evidently AI - ML monitoring inspiration
- Prometheus - Metrics collection
- Grafana - Visualization
- XGBoost - Gradient boosting
- LightGBM - Fast gradient boosting
- scikit-learn - ML utilities
