
🏆 AwardPredictor


Overview

AwardPredictor is a cross-domain data science project that predicts award nominations and winners across the entertainment industry's biggest ceremonies — the Grammys, Oscars, and Emmys — by combining historical award data with Spotify audio features.

Motivation

Award outcomes are influenced by measurable signals: an artist's streaming momentum, audio characteristics of nominated tracks, historical voting patterns, and cross-industry trends. This project builds a unified prediction pipeline that:

  • Ingests and harmonizes award histories across three major ceremonies
  • Enriches music-related nominations with Spotify audio features (danceability, energy, valence, tempo, etc.)
  • Trains classification models to predict both nominations and winners
  • Surfaces insights through exploratory analysis and interactive dashboards
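To make the enrichment step concrete, here is a minimal pandas sketch of joining nomination records to Spotify audio features. The column names (`spotify_track_id`, `won`, etc.) and values are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical nomination records (columns are illustrative,
# not the project's actual Silver-layer schema).
nominations = pd.DataFrame({
    "ceremony": ["Grammys", "Grammys", "Oscars"],
    "work": ["Track A", "Track B", "Song C"],
    "spotify_track_id": ["t1", "t2", "t3"],
    "won": [1, 0, 0],
})

# Audio features, in simplified form, keyed by the same track id.
features = pd.DataFrame({
    "spotify_track_id": ["t1", "t2", "t3"],
    "danceability": [0.72, 0.55, 0.61],
    "energy": [0.80, 0.43, 0.67],
    "valence": [0.65, 0.30, 0.51],
    "tempo": [118.0, 92.5, 104.2],
})

# A left join keeps every nomination even when features are missing.
enriched = nominations.merge(features, on="spotify_track_id", how="left")
print(enriched[["work", "danceability", "won"]])
```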

Data Sources

| Source | Description | Ingestion Method |
|--------|-------------|------------------|
| Grammy Awards | Historical nominees & winners (1959–present) | On-hand CSV |
| Oscar Awards | Historical nominees & winners (1929–present) | On-hand CSV |
| Emmy Awards | Historical nominees & winners (1949–present) | On-hand CSV |
| Spotify Playlists | Curated award-related playlists | Spotify Web API |
| Spotify Tracks | Audio features for nominated songs/soundtracks | Spotify Web API |
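As a sketch of what Bronze-layer ingestion of audio features might look like, the snippet below flattens a payload shaped like the Spotify Web API's audio-features response into rows. The sample values and the `to_rows` helper are invented for illustration:

```python
import json

# A truncated sample payload shaped like the Spotify Web API's
# audio-features response (values here are made up).
payload = json.loads("""
{
  "audio_features": [
    {"id": "t1", "danceability": 0.72, "energy": 0.80, "valence": 0.65, "tempo": 118.0},
    {"id": "t2", "danceability": 0.55, "energy": 0.43, "valence": 0.30, "tempo": 92.5}
  ]
}
""")

KEEP = ("id", "danceability", "energy", "valence", "tempo")

def to_rows(payload: dict) -> list[dict]:
    """Flatten the API response into Bronze-layer rows, skipping nulls
    (the API returns null for tracks it cannot analyze)."""
    return [
        {k: f[k] for k in KEEP}
        for f in payload["audio_features"]
        if f is not None
    ]

rows = to_rows(payload)
print(rows[0]["tempo"])
```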

Architecture

This project follows the Lakehouse Medallion pattern, designed for an initial Python/Jupyter implementation with a clear migration path to Microsoft Fabric Spark.

```
┌─────────────────────────────────────────────────────────────────────┐
│                        DATA SOURCES                                  │
├──────────┬──────────┬──────────┬────────────────────────────────────┤
│  Grammy  │  Oscar   │  Emmy    │  Spotify API                       │
│  CSV     │  CSV     │  CSV     │  (Playlists + Audio Features)      │
└────┬─────┴────┬─────┴────┬─────┴──────────┬────────────────────────┘
     │          │          │                 │
     ▼          ▼          ▼                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│  🥉 BRONZE LAYER (Raw Ingestion)                                    │
│  ─────────────────────────────────────────────────────────────────  │
│  • Raw CSVs loaded as-is                                            │
│  • API responses stored in JSON/Parquet                             │
│  • No transformations — source-of-truth snapshots                   │
└─────────────────────────────┬───────────────────────────────────────┘
                              │ Cleanse / Normalize / Deduplicate
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  🥈 SILVER LAYER (Cleaned & Conformed)                              │
│  ─────────────────────────────────────────────────────────────────  │
│  • Standardized schemas across award sources                        │
│  • Entity resolution (artist/film/show matching)                    │
│  • Spotify features joined to nominations                           │
│  • Data quality checks applied                                      │
└─────────────────────────────┬───────────────────────────────────────┘
                              │ Aggregate / Feature Engineering
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  🥇 GOLD LAYER (Analytics-Ready)                                    │
│  ─────────────────────────────────────────────────────────────────  │
│  • Feature tables for ML models                                     │
│  • Aggregated stats & trend tables                                  │
│  • Prediction outputs (nominations + winners)                       │
│  • Dashboard-ready views                                            │
└─────────────────────────────────────────────────────────────────────┘
```
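The three layers can be sketched in miniature with pandas. The tiny inline dataset and column names are illustrative only; the real pipeline reads from `data/bronze/` and applies far richer cleaning:

```python
import pandas as pd

# Bronze: raw rows as ingested (duplicates and inconsistent casing intact).
bronze = pd.DataFrame({
    "ceremony": ["GRAMMYS", "Grammys", "oscars", "oscars"],
    "artist": ["Artist A", "Artist A", "Artist B", "Artist B"],
    "year": [2020, 2020, 2021, 2021],
    "won": [1, 1, 0, 0],
})

# Silver: standardize the schema, then deduplicate.
silver = (
    bronze
    .assign(ceremony=bronze["ceremony"].str.title())
    .drop_duplicates()
)

# Gold: aggregate into an analytics-ready table (win rate per ceremony).
gold = (
    silver.groupby("ceremony", as_index=False)["won"]
    .mean()
    .rename(columns={"won": "win_rate"})
)
print(gold)
```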

Project Structure

```
AwardPredictor/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── Dockerfile                 # Container image for API + dashboard
├── docker-entrypoint.sh       # Startup script (FastAPI + Streamlit)
├── bootstrap.ps1              # Azure SP, OIDC, Key Vault provisioning
│
├── app/                       # Web application
│   ├── api/                   # FastAPI prediction endpoints
│   └── dashboard/             # Streamlit multi-page dashboard
│
├── data/
│   ├── bronze/                # Raw ingested data
│   │   ├── grammys/
│   │   ├── oscars/
│   │   ├── emmys/
│   │   └── spotify/
│   ├── silver/                # Cleaned & conformed
│   └── gold/                  # Analytics-ready tables
│
├── models/                    # Trained model artifacts
│
├── notebooks/
│   ├── cleaning/              # Bronze → Silver notebooks
│   ├── features/              # Silver → Gold notebooks
│   ├── modeling/              # ML training & evaluation
│   └── exploration/           # EDA & visualization
│
├── src/
│   ├── ingestion/             # Data collection scripts
│   ├── cleaning/              # Cleaning & joining logic
│   ├── features/              # Feature engineering
│   ├── models/                # Model training & inference
│   └── utils/                 # Shared helpers
│
├── tests/                     # Unit & integration tests
├── config/                    # Pipeline configurations
├── infra/                     # Bicep IaC (Fabric, Key Vault, ACR, ACA)
└── docs/                      # Additional documentation
```

Sprint 5 Results — First Trained Model

The first end-to-end model is trained and reproducible.

| Split | Rows | AUC-ROC |
|-------|------|---------|
| Train | 37,815 | 0.9989 |
| Test | 366 | 0.9999 |

⚠️ Known limitations — the ~1.0 AUC reflects strong legitimate signal from historical patterns, not leakage (confirmed in #21). The test set now covers all three domains via a domain-stratified temporal holdout (fixed in #20). Per-domain metrics should still be evaluated separately. See the model card for the full caveats.
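A domain-stratified temporal holdout of the kind described can be sketched as follows. This uses synthetic data and a hypothetical `temporal_holdout` helper; the real split logic lives in #20 and may differ:

```python
import pandas as pd

def temporal_holdout(df: pd.DataFrame, test_years: int = 2) -> tuple:
    """Hold out the most recent `test_years` of each domain, so the
    test set covers all ceremonies (illustrative, not the project's
    exact implementation)."""
    test_mask = pd.Series(False, index=df.index)
    for dom, grp in df.groupby("domain"):
        cutoff = sorted(grp["year"].unique())[-test_years]
        test_mask |= (df["domain"] == dom) & (df["year"] >= cutoff)
    return df[~test_mask], df[test_mask]

df = pd.DataFrame({
    "domain": ["grammys"] * 4 + ["oscars"] * 4 + ["emmys"] * 4,
    "year":   [2018, 2019, 2020, 2021] * 3,
    "won":    [0, 1, 0, 1] * 3,
})
train, test = temporal_holdout(df)
print(sorted(test["domain"].unique()))
```

Because the cutoff is computed per domain, no single ceremony can dominate the holdout — the failure mode the fix in #20 addressed.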

Documentation

| Document | Purpose |
|----------|---------|
| Deployment Runbook | Full deployment lifecycle: bootstrap, infrastructure, application, and troubleshooting |
| API Reference | FastAPI endpoint documentation, authentication, error codes, and integration examples |
| Model Card | Model performance, limitations, and ethical considerations |

Getting Started

Prerequisites

  • Python 3.10+
  • A Spotify Developer account (for API access)

Installation

```shell
# Clone the repository
git clone https://github.com/IBuySpy-Dev/AwardPredictor.git
cd AwardPredictor

# Create and activate a virtual environment
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
# source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env

# Launch Jupyter
jupyter notebook
```

Configuration

Create a .env file for local development (non-sensitive settings only):

```shell
# Spotify uses OAuth2 Authorization Code flow via Entra ID or Spotipy's PKCE
# No client secrets stored — use Spotify's PKCE flow or Managed Identity
SPOTIFY_REDIRECT_URI=http://localhost:8888/callback
```

⚠️ No API keys or secrets in .env files. All external service auth uses OAuth2/OIDC flows or Managed Identity.
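For context on why no secret is needed: PKCE (RFC 7636) replaces the client secret with a one-time verifier/challenge pair. OAuth client libraries such as Spotipy generate this pair internally; the stdlib sketch below just shows the mechanism:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge
    (RFC 7636). The challenge is sent in the authorize request; the
    verifier is revealed only in the token exchange, so no long-lived
    secret ever needs to live in a .env file."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

verifier, challenge = make_pkce_pair()
print(len(verifier), len(challenge))  # 43 43
```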

Sprint Roadmap

| Sprint | Focus | Status |
|--------|-------|--------|
| Sprint 1 | Data Ingestion & Bronze Layer | ✅ Complete |
| Sprint 2 | Cleaning & Silver Layer | ✅ Complete |
| Sprint 3 | Feature Engineering & Gold Layer | ✅ Complete |
| Sprint 4 | Baseline & Advanced Models | ✅ Complete |
| Sprint 5 | Reusable Prediction Pipeline & Model Card | ✅ Complete |
| Sprint 6 | FastAPI + Streamlit Web Application | ✅ Complete |
| Sprint 7 | Containerization & Azure Container Apps | ✅ Complete |
| Sprint 8 | E2E Tests & Infrastructure Hardening | ✅ Complete |

Future: Migrate data pipeline to Microsoft Fabric Spark for scalable, scheduled execution.

Contributing

Contributions are welcome! To get started:

  1. Open or reference a GitHub Issue for your change
  2. Create a branch: <type>/<issue-number>-<short-description> (e.g., feat/42-add-spotify-features)
  3. Commit using Conventional Commits: <type>(<scope>): <summary> (#<issue>)
  4. Push and open a Pull Request referencing the issue

Please ensure:

  • Code follows existing style conventions
  • Notebooks are cleared of output before committing
  • New data sources include documentation in docs/
  • Tests pass before submitting a PR (pytest tests/ -v)

License

This project is licensed under the MIT License — see the LICENSE file for details.


Built with curiosity about what makes a winner. 🎬🎵📺
