AwardPredictor is a cross-domain data science project that predicts award nominations and winners across the entertainment industry's biggest ceremonies — the Grammys, Oscars, and Emmys — by combining historical award data with Spotify audio features.
Award outcomes are influenced by measurable signals: an artist's streaming momentum, audio characteristics of nominated tracks, historical voting patterns, and cross-industry trends. This project builds a unified prediction pipeline that:
- Ingests and harmonizes award histories across three major ceremonies
- Enriches music-related nominations with Spotify audio features (danceability, energy, valence, tempo, etc.)
- Trains classification models to predict both nominations and winners
- Surfaces insights through exploratory analysis and interactive dashboards
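The pipeline stages above can be sketched as plain functions chained together. This is an illustrative outline, not the project's actual API; function names, the shared schema, and the matching key are assumptions.

```python
# Minimal sketch of "harmonize award histories, then enrich with
# Spotify features". Names and schema are illustrative only.

def harmonize(records):
    """Map each ceremony's raw rows onto one shared schema."""
    return [
        {"ceremony": r["source"], "title": r["title"].strip().lower(),
         "year": int(r["year"]), "winner": bool(r.get("winner", False))}
        for r in records
    ]

def enrich(records, audio_features):
    """Attach Spotify audio features where a track-level match exists."""
    return [{**r, **audio_features.get(r["title"], {})} for r in records]

raw = [
    {"source": "grammys", "title": " Flowers ", "year": "2024", "winner": True},
    {"source": "oscars", "title": "Oppenheimer", "year": "2024"},
]
features = {"flowers": {"danceability": 0.71, "energy": 0.68}}
rows = enrich(harmonize(raw), features)
```

Non-music nominations simply pass through `enrich` unchanged, since no audio-feature match exists for them.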
| Source | Description | Ingestion Method |
|---|---|---|
| Grammy Awards | Historical nominees & winners (1959–present) | On-hand CSV |
| Oscar Awards | Historical nominees & winners (1929–present) | On-hand CSV |
| Emmy Awards | Historical nominees & winners (1949–present) | On-hand CSV |
| Spotify Playlists | Curated award-related playlists | Spotify Web API |
| Spotify Tracks | Audio features for nominated songs/soundtracks | Spotify Web API |
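The API-sourced rows in the table land in the Bronze layer as raw responses. A hedged sketch of that step follows; the payload shape mirrors Spotify's documented audio-features response, but the canned dict here stands in for a live authenticated call, and the path and batch name are illustrative.

```python
# Bronze-layer ingestion sketch: persist the raw API response
# verbatim, with no transformations (source-of-truth snapshot).

import json
import tempfile
from pathlib import Path

def store_bronze(payload: dict, out_dir: Path, name: str) -> Path:
    """Write the raw API response as-is to the Bronze layer."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.json"
    path.write_text(json.dumps(payload))
    return path

# Canned payload standing in for a live Spotify Web API call.
response = {"audio_features": [
    {"id": "abc123", "danceability": 0.64, "energy": 0.82,
     "valence": 0.33, "tempo": 118.0},
]}
bronze_root = Path(tempfile.mkdtemp())  # in practice: data/bronze/spotify
bronze_path = store_bronze(response, bronze_root, "audio_features_batch_001")
```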
This project follows the Lakehouse Medallion pattern, designed for an initial Python/Jupyter implementation with a clear migration path to Microsoft Fabric Spark.
┌─────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
├──────────┬──────────┬──────────┬────────────────────────────────────┤
│ Grammy │ Oscar │ Emmy │ Spotify API │
│ CSV │ CSV │ CSV │ (Playlists + Audio Features) │
└────┬─────┴────┬─────┴────┬─────┴──────────┬────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🥉 BRONZE LAYER (Raw Ingestion) │
│ ───────────────────────────────────────────────────────────────── │
│ • Raw CSVs loaded as-is │
│ • API responses stored in JSON/Parquet │
│ • No transformations — source-of-truth snapshots │
└─────────────────────────────┬───────────────────────────────────────┘
│ Cleanse / Normalize / Deduplicate
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🥈 SILVER LAYER (Cleaned & Conformed) │
│ ───────────────────────────────────────────────────────────────── │
│ • Standardized schemas across award sources │
│ • Entity resolution (artist/film/show matching) │
│ • Spotify features joined to nominations │
│ • Data quality checks applied │
└─────────────────────────────┬───────────────────────────────────────┘
│ Aggregate / Feature Engineering
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🥇 GOLD LAYER (Analytics-Ready) │
│ ───────────────────────────────────────────────────────────────── │
│ • Feature tables for ML models │
│ • Aggregated stats & trend tables │
│ • Prediction outputs (nominations + winners) │
│ • Dashboard-ready views │
└─────────────────────────────────────────────────────────────────────┘
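A concrete Bronze → Silver step under this pattern might look like the following pandas sketch. Column names and the deduplication key are assumptions for illustration, not the project's actual Silver schema.

```python
# Bronze -> Silver sketch: lowercase the schema, conform it across
# ceremonies, normalize nominee names, and deduplicate.

import pandas as pd

def to_silver(bronze: pd.DataFrame, ceremony: str) -> pd.DataFrame:
    silver = bronze.rename(columns=str.lower).copy()
    silver["ceremony"] = ceremony  # tag the source for cross-ceremony joins
    silver["nominee"] = silver["nominee"].str.strip().str.title()
    return silver.drop_duplicates(subset=["ceremony", "year", "category", "nominee"])

bronze = pd.DataFrame({
    "Year": [2024, 2024],
    "Category": ["Record of the Year"] * 2,
    "Nominee": ["  flowers ", "Flowers"],  # same nominee, messy casing
    "Winner": [True, True],
})
silver = to_silver(bronze, "grammys")
```

After normalization both rows collapse to one nominee, which is exactly the entity-resolution problem the Silver layer exists to handle (real matching across artists, films, and shows is of course fuzzier than a `str.title()` call).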
AwardPredictor/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── Dockerfile # Container image for API + dashboard
├── docker-entrypoint.sh # Startup script (FastAPI + Streamlit)
├── bootstrap.ps1 # Azure SP, OIDC, Key Vault provisioning
│
├── app/ # Web application
│ ├── api/ # FastAPI prediction endpoints
│ └── dashboard/ # Streamlit multi-page dashboard
│
├── data/
│ ├── bronze/ # Raw ingested data
│ │ ├── grammys/
│ │ ├── oscars/
│ │ ├── emmys/
│ │ └── spotify/
│ ├── silver/ # Cleaned & conformed
│ └── gold/ # Analytics-ready tables
│
├── models/ # Trained model artifacts
│
├── notebooks/
│ ├── cleaning/ # Bronze → Silver notebooks
│ ├── features/ # Silver → Gold notebooks
│ ├── modeling/ # ML training & evaluation
│ └── exploration/ # EDA & visualization
│
├── src/
│ ├── ingestion/ # Data collection scripts
│ ├── cleaning/ # Cleaning & joining logic
│ ├── features/ # Feature engineering
│ ├── models/ # Model training & inference
│ └── utils/ # Shared helpers
│
├── tests/ # Unit & integration tests
├── config/ # Pipeline configurations
├── infra/ # Bicep IaC (Fabric, Key Vault, ACR, ACA)
└── docs/ # Additional documentation
The first end-to-end model is trained and reproducible.
| Split | Rows | AUC-ROC |
|---|---|---|
| Train | 37,815 | 0.9989 |
| Test | 366 | 0.9999 |
- Algorithm: XGBoost with median imputation and class-imbalance weighting
- Model card: `docs/MODEL_CARD.md`
- Inference API: `src/models/predict.py` (`load_model`, `predict`, `write_predictions`)
- CLI: `python -m src.models.train` and `python -m src.models.predict`
- Demo notebook: `notebooks/modeling/04_predict_next_season.ipynb`
⚠️ Known limitations — the ~1.0 AUC reflects strong legitimate signal from historical patterns, not leakage (confirmed in #21). The test set now covers all 3 domains via domain-stratified temporal holdout (fixed in #20). Per-domain metrics should still be evaluated separately. See the model card for the full caveats.
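One way to read "domain-stratified temporal holdout" is: split on time so the model never trains on the future, while checking that every ceremony (domain) is represented in the test years. The cutoff year and row shape below are illustrative assumptions, not the project's actual split logic.

```python
# Temporal holdout sketch: train on seasons before the cutoff,
# test on the rest, and verify all domains appear in the test split.

def temporal_holdout(rows, cutoff_year):
    train = [r for r in rows if r["year"] < cutoff_year]
    test = [r for r in rows if r["year"] >= cutoff_year]
    return train, test

rows = [
    {"domain": "grammys", "year": 2018}, {"domain": "grammys", "year": 2024},
    {"domain": "oscars", "year": 2017}, {"domain": "oscars", "year": 2024},
    {"domain": "emmys", "year": 2019}, {"domain": "emmys", "year": 2024},
]
train, test = temporal_holdout(rows, cutoff_year=2024)
domains_in_test = {r["domain"] for r in test}  # all three ceremonies
```

This is also why per-domain metrics still matter: a single pooled AUC can hide weak performance on the smallest domain.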
| Document | Purpose |
|---|---|
| Deployment Runbook | Full deployment lifecycle: bootstrap, infrastructure, application, and troubleshooting |
| API Reference | FastAPI endpoint documentation, authentication, error codes, and integration examples |
| Model Card | Model performance, limitations, and ethical considerations |
- Python 3.10+
- A Spotify Developer account (for API access)
# Clone the repository
git clone https://github.com/IBuySpy-Dev/AwardPredictor.git
cd AwardPredictor
# Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
# source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Copy environment template
cp .env.example .env
# Launch Jupyter
jupyter notebook

Create a `.env` file for local development (non-sensitive settings only):
# Spotify uses OAuth2 Authorization Code flow via Entra ID or Spotipy's PKCE
# No client secrets stored — use Spotify's PKCE flow or Managed Identity
SPOTIFY_REDIRECT_URI=http://localhost:8888/callback
⚠️ No API keys or secrets in .env files. All external service auth uses OAuth2/OIDC flows or Managed Identity.
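The secret-free piece of the PKCE flow mentioned above can be shown with the standard library alone (per RFC 7636): the client generates a random `code_verifier` and sends its SHA-256 `code_challenge` in the authorize URL, then proves possession of the verifier when exchanging the authorization code for a token. This is a sketch of the mechanism, not the app's auth code.

```python
# RFC 7636 PKCE pair: random verifier + SHA-256 challenge,
# base64url-encoded without padding.

import base64
import hashlib
import secrets

def pkce_pair():
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

verifier, challenge = pkce_pair()
# challenge goes in the authorize request; verifier is sent only in the
# token exchange, so no client secret ever needs to be stored.
```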
| Sprint | Focus | Status |
|---|---|---|
| Sprint 1 | Data Ingestion & Bronze Layer | ✅ Complete |
| Sprint 2 | Cleaning & Silver Layer | ✅ Complete |
| Sprint 3 | Feature Engineering & Gold Layer | ✅ Complete |
| Sprint 4 | Baseline & Advanced Models | ✅ Complete |
| Sprint 5 | Reusable Prediction Pipeline & Model Card | ✅ Complete |
| Sprint 6 | FastAPI + Streamlit Web Application | ✅ Complete |
| Sprint 7 | Containerization & Azure Container Apps | ✅ Complete |
| Sprint 8 | E2E Tests & Infrastructure Hardening | ✅ Complete |
Future: Migrate data pipeline to Microsoft Fabric Spark for scalable, scheduled execution.
Contributions are welcome! To get started:
- Open or reference a GitHub Issue for your change
- Create a branch: `<type>/<issue-number>-<short-description>` (e.g., `feat/42-add-spotify-features`)
- Commit using Conventional Commits: `<type>(<scope>): <summary> (#<issue>)`
- Push and open a Pull Request referencing the issue
Please ensure:
- Code follows existing style conventions
- Notebooks are cleared of output before committing
- New data sources include documentation in `docs/`
- Tests pass before submitting a PR (`pytest tests/ -v`)
This project is licensed under the MIT License — see the LICENSE file for details.
Built with curiosity about what makes a winner. 🎬🎵📺