AwardPredictor is a cross-domain data science project that predicts award nominations and winners across the entertainment industry's biggest ceremonies — the Grammys, Oscars, and Emmys — by combining historical award data with Spotify audio features.
Award outcomes are influenced by measurable signals: an artist's streaming momentum, audio characteristics of nominated tracks, historical voting patterns, and cross-industry trends. This project builds a unified prediction pipeline that:
- Ingests and harmonizes award histories across three major ceremonies
- Enriches music-related nominations with Spotify audio features (danceability, energy, valence, tempo, etc.)
- Trains classification models to predict both nominations and winners
- Surfaces insights through exploratory analysis and interactive dashboards
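The pipeline stages above can be sketched as plain functions chained together. This is an illustrative outline, not the project's actual API; function names, the shared schema, and the matching key are assumptions.

```python
# Minimal sketch of "harmonize award histories, then enrich with
# Spotify features". Names and schema are illustrative only.

def harmonize(records):
    """Map each ceremony's raw rows onto one shared schema."""
    return [
        {"ceremony": r["source"], "title": r["title"].strip().lower(),
         "year": int(r["year"]), "winner": bool(r.get("winner", False))}
        for r in records
    ]

def enrich(records, audio_features):
    """Attach Spotify audio features where a track-level match exists."""
    return [{**r, **audio_features.get(r["title"], {})} for r in records]

raw = [
    {"source": "grammys", "title": " Flowers ", "year": "2024", "winner": True},
    {"source": "oscars", "title": "Oppenheimer", "year": "2024"},
]
features = {"flowers": {"danceability": 0.71, "energy": 0.68}}
rows = enrich(harmonize(raw), features)
```

Non-music nominations simply pass through `enrich` unchanged, since no audio-feature match exists for them.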
| Source | Description | Ingestion Method |
|---|---|---|
| Grammy Awards | Historical nominees & winners (1959–present) | On-hand CSV |
| Oscar Awards | Historical nominees & winners (1929–present) | On-hand CSV |
| Emmy Awards | Historical nominees & winners (1949–present) | On-hand CSV |
| Spotify Playlists | Curated award-related playlists | Spotify Web API |
| Spotify Tracks | Audio features for nominated songs/soundtracks | Spotify Web API |
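The API-sourced rows in the table land in the Bronze layer as raw responses. A hedged sketch of that step follows; the payload shape mirrors Spotify's documented audio-features response, but the canned dict here stands in for a live authenticated call, and the path and batch name are illustrative.

```python
# Bronze-layer ingestion sketch: persist the raw API response
# verbatim, with no transformations (source-of-truth snapshot).

import json
import tempfile
from pathlib import Path

def store_bronze(payload: dict, out_dir: Path, name: str) -> Path:
    """Write the raw API response as-is to the Bronze layer."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.json"
    path.write_text(json.dumps(payload))
    return path

# Canned payload standing in for a live Spotify Web API call.
response = {"audio_features": [
    {"id": "abc123", "danceability": 0.64, "energy": 0.82,
     "valence": 0.33, "tempo": 118.0},
]}
bronze_root = Path(tempfile.mkdtemp())  # in practice: data/bronze/spotify
bronze_path = store_bronze(response, bronze_root, "audio_features_batch_001")
```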
This project follows the Lakehouse Medallion pattern, designed for an initial Python/Jupyter implementation with a clear migration path to Microsoft Fabric Spark.
┌─────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
├──────────┬──────────┬──────────┬────────────────────────────────────┤
│ Grammy │ Oscar │ Emmy │ Spotify API │
│ CSV │ CSV │ CSV │ (Playlists + Audio Features) │
└────┬─────┴────┬─────┴────┬─────┴──────────┬────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🥉 BRONZE LAYER (Raw Ingestion) │
│ ───────────────────────────────────────────────────────────────── │
│ • Raw CSVs loaded as-is │
│ • API responses stored in JSON/Parquet │
│ • No transformations — source-of-truth snapshots │
└─────────────────────────────┬───────────────────────────────────────┘
│ Cleanse / Normalize / Deduplicate
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🥈 SILVER LAYER (Cleaned & Conformed) │
│ ───────────────────────────────────────────────────────────────── │
│ • Standardized schemas across award sources │
│ • Entity resolution (artist/film/show matching) │
│ • Spotify features joined to nominations │
│ • Data quality checks applied │
└─────────────────────────────┬───────────────────────────────────────┘
│ Aggregate / Feature Engineering
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🥇 GOLD LAYER (Analytics-Ready) │
│ ───────────────────────────────────────────────────────────────── │
│ • Feature tables for ML models │
│ • Aggregated stats & trend tables │
│ • Prediction outputs (nominations + winners) │
│ • Dashboard-ready views │
└─────────────────────────────────────────────────────────────────────┘
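A concrete Bronze → Silver step under this pattern might look like the following pandas sketch. Column names and the deduplication key are assumptions for illustration, not the project's actual Silver schema.

```python
# Bronze -> Silver sketch: lowercase the schema, conform it across
# ceremonies, normalize nominee names, and deduplicate.

import pandas as pd

def to_silver(bronze: pd.DataFrame, ceremony: str) -> pd.DataFrame:
    silver = bronze.rename(columns=str.lower).copy()
    silver["ceremony"] = ceremony  # tag the source for cross-ceremony joins
    silver["nominee"] = silver["nominee"].str.strip().str.title()
    return silver.drop_duplicates(subset=["ceremony", "year", "category", "nominee"])

bronze = pd.DataFrame({
    "Year": [2024, 2024],
    "Category": ["Record of the Year"] * 2,
    "Nominee": ["  flowers ", "Flowers"],  # same nominee, messy casing
    "Winner": [True, True],
})
silver = to_silver(bronze, "grammys")
```

After normalization both rows collapse to one nominee, which is exactly the entity-resolution problem the Silver layer exists to handle (real matching across artists, films, and shows is of course fuzzier than a `str.title()` call).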
AwardPredictor/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── Dockerfile # Container image for API + dashboard
├── docker-entrypoint.sh # Startup script (FastAPI + Streamlit)
├── bootstrap.ps1 # Azure SP, OIDC, Key Vault provisioning
│
├── app/ # Web application
│ ├── api/ # FastAPI prediction endpoints
│ └── dashboard/ # Streamlit multi-page dashboard
│
├── data/
│ ├── bronze/ # Raw ingested data
│ │ ├── grammys/
│ │ ├── oscars/
│ │ ├── emmys/
│ │ └── spotify/
│ ├── silver/ # Cleaned & conformed
│ └── gold/ # Analytics-ready tables
│
├── models/ # Trained model artifacts
│
├── notebooks/
│ ├── cleaning/ # Bronze → Silver notebooks
│ ├── features/ # Silver → Gold notebooks
│ ├── modeling/ # ML training & evaluation
│ └── exploration/ # EDA & visualization
│
├── src/
│ ├── ingestion/ # Data collection scripts
│ ├── cleaning/ # Cleaning & joining logic
│ ├── features/ # Feature engineering
│ ├── models/ # Model training & inference
│ └── utils/ # Shared helpers
│
├── tests/ # Unit & integration tests
├── config/ # Pipeline configurations
├── infra/ # Bicep IaC (Fabric, Key Vault, ACR, ACA)
└── docs/ # Additional documentation
The first end-to-end model is trained and reproducible.
| Split | Rows | AUC-ROC |
|---|---|---|
| Train | 37,815 | 0.9989 |
| Test | 366 | 0.9999 |
- Algorithm: XGBoost with median imputation and class-imbalance weighting
- Model card: `docs/MODEL_CARD.md`
- Inference API: `src/models/predict.py` (`load_model`, `predict`, `write_predictions`)
- CLI: `python -m src.models.train` and `python -m src.models.predict`
- Demo notebook: `notebooks/modeling/04_predict_next_season.ipynb`
⚠️ Known limitations — the ~1.0 AUC reflects strong legitimate signal from historical patterns, not leakage (confirmed in #21). The test set now covers all 3 domains via domain-stratified temporal holdout (fixed in #20). Per-domain metrics should still be evaluated separately. See the model card for the full caveats.
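One way to read "domain-stratified temporal holdout" is: split on time so the model never trains on the future, while checking that every ceremony (domain) is represented in the test years. The cutoff year and row shape below are illustrative assumptions, not the project's actual split logic.

```python
# Temporal holdout sketch: train on seasons before the cutoff,
# test on the rest, and verify all domains appear in the test split.

def temporal_holdout(rows, cutoff_year):
    train = [r for r in rows if r["year"] < cutoff_year]
    test = [r for r in rows if r["year"] >= cutoff_year]
    return train, test

rows = [
    {"domain": "grammys", "year": 2018}, {"domain": "grammys", "year": 2024},
    {"domain": "oscars", "year": 2017}, {"domain": "oscars", "year": 2024},
    {"domain": "emmys", "year": 2019}, {"domain": "emmys", "year": 2024},
]
train, test = temporal_holdout(rows, cutoff_year=2024)
domains_in_test = {r["domain"] for r in test}  # all three ceremonies
```

This is also why per-domain metrics still matter: a single pooled AUC can hide weak performance on the smallest domain.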
| Document | Purpose |
|---|---|
| Deployment Runbook | Full deployment lifecycle: bootstrap, infrastructure, application, and troubleshooting |
| API Reference | FastAPI endpoint documentation, authentication, error codes, and integration examples |
| Model Card | Model performance, limitations, and ethical considerations |
- Python 3.10+
- A Spotify Developer account (for API access)
# Clone the repository
git clone https://github.com/IBuySpy-Dev/AwardPredictor.git
cd AwardPredictor
# Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
# source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Copy environment template
cp .env.example .env
# Launch Jupyter
jupyter notebook

Create a `.env` file for local development (non-sensitive settings only):
# Spotify uses OAuth2 Authorization Code flow via Entra ID or Spotipy's PKCE
# No client secrets stored — use Spotify's PKCE flow or Managed Identity
SPOTIFY_REDIRECT_URI=http://localhost:8888/callback
⚠️ No API keys or secrets in .env files. All external service auth uses OAuth2/OIDC flows or Managed Identity.
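The secret-free piece of the PKCE flow mentioned above can be shown with the standard library alone (per RFC 7636): the client generates a random `code_verifier` and sends its SHA-256 `code_challenge` in the authorize URL, then proves possession of the verifier when exchanging the authorization code for a token. This is a sketch of the mechanism, not the app's auth code.

```python
# RFC 7636 PKCE pair: random verifier + SHA-256 challenge,
# base64url-encoded without padding.

import base64
import hashlib
import secrets

def pkce_pair():
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

verifier, challenge = pkce_pair()
# challenge goes in the authorize request; verifier is sent only in the
# token exchange, so no client secret ever needs to be stored.
```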
| Sprint | Focus | Status |
|---|---|---|
| Sprint 1 | Data Ingestion & Bronze Layer | ✅ Complete |
| Sprint 2 | Cleaning & Silver Layer | ✅ Complete |
| Sprint 3 | Feature Engineering & Gold Layer | ✅ Complete |
| Sprint 4 | Baseline & Advanced Models | ✅ Complete |
| Sprint 5 | Reusable Prediction Pipeline & Model Card | ✅ Complete |
| Sprint 6 | FastAPI + Streamlit Web Application | ✅ Complete |
| Sprint 7 | Containerization & Azure Container Apps | ✅ Complete |
| Sprint 8 | E2E Tests & Infrastructure Hardening | ✅ Complete |
Future: Migrate data pipeline to Microsoft Fabric Spark for scalable, scheduled execution.
Contributions are welcome! To get started:
- Open or reference a GitHub Issue for your change
- Create a branch: `<type>/<issue-number>-<short-description>` (e.g., `feat/42-add-spotify-features`)
- Commit using Conventional Commits: `<type>(<scope>): <summary> (#<issue>)`
- Push and open a Pull Request referencing the issue
Please ensure:
- Code follows existing style conventions
- Notebooks are cleared of output before committing
- New data sources include documentation in `docs/`
- Tests pass before submitting a PR (`pytest tests/ -v`)
This project is licensed under the MIT License — see the LICENSE file for details.
Built with curiosity about what makes a winner. 🎬🎵📺