A machine learning project to predict score differentials in NCAA Division I Men's Basketball games, backed by a Streamlit web application for interactive predictions.
This project scrapes historical game log data from Sports Reference, builds a predictive model for score differentials, and provides an interactive web interface where users can input two teams and receive a predicted score differential.
Basketball_Modeling/
├── NCAAB_Sports_Reference_Scraper/ # Data collection scripts
│ ├── data/ # Raw scraped data (not tracked in git)
│ └── scraper.py # Sports Reference scraper
├── data/ # Processed datasets (not tracked in git)
│ ├── raw/ # Raw scraped data
│ ├── processed/ # Cleaned and feature-engineered data
│ └── training/ # Train/test splits
├── notebooks/ # Jupyter notebooks for exploration
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ └── 03_model_development.ipynb
├── src/ # Source code
│ ├── data/
│ │ ├── scraper.py # Data collection functions
│ │ └── preprocessing.py # Data cleaning and feature engineering
│ ├── models/
│ │ ├── train.py # Model training scripts
│ │ └── predict.py # Prediction functions
│ └── utils/
│ └── helpers.py # Utility functions
├── models/ # Saved trained models
│ └── score_differential_model.pkl
├── app/ # Streamlit application
│ ├── streamlit_app.py # Main Streamlit app
│ └── components/ # UI components
├── tests/ # Unit tests
├── requirements.txt # Python dependencies
└── README.md # This file
- Scrapes NCAAB game logs from Sports Reference for multiple seasons
- Collects box score statistics including:
- Team statistics (points, rebounds, assists, turnovers, etc.)
- Shooting percentages (FG%, 3P%, FT%)
- Advanced metrics
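The shooting percentages above are ratios of makes to attempts. The following is a minimal sketch, assuming each scraped box-score row exposes made and attempted counts. The key names (`fg`, `fga`, `fg3`, `fg3a`, `ft`, `fta`) and both helper functions are hypothetical; the real column names depend on what the scraper emits.

```python
from typing import Optional


def shooting_pct(made: int, attempted: int) -> Optional[float]:
    """Return makes / attempts, or None when there were no attempts."""
    if attempted == 0:
        # Undefined rather than 0.0, so it is not mistaken for a 0% night.
        return None
    return made / attempted


def add_shooting_pcts(box: dict) -> dict:
    """Return a copy of a box-score row with FG%, 3P%, and FT% added.

    The key names used here are assumptions made for illustration.
    """
    out = dict(box)
    out["fg_pct"] = shooting_pct(box["fg"], box["fga"])
    out["fg3_pct"] = shooting_pct(box["fg3"], box["fg3a"])
    out["ft_pct"] = shooting_pct(box["ft"], box["fta"])
    return out


# Made-up box-score row, not real game data.
row = add_shooting_pcts({"fg": 28, "fga": 56, "fg3": 6, "fg3a": 24, "ft": 0, "fta": 0})
```

Returning `None` for zero attempts keeps missing values distinguishable from genuine zeros when the rows are later cleaned.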
- Target Variable: Score differential (Team A points minus Team B points, so a positive value means Team A won)
- Features:
- Team statistics (offensive/defensive ratings)
- Recent form (rolling averages)
- Head-to-head history
- Home/away status
- Season-to-date performance metrics
- Models to Explore:
- Linear Regression (baseline)
- Random Forest
- Gradient Boosting (XGBoost/LightGBM)
- Neural Networks
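The target variable and the "recent form" feature can be sketched in plain Python. This is an illustration under stated assumptions, not the project's preprocessing code: the function names, the window size, and the sample scores are all hypothetical. The one design point it tries to show is that a rolling average used as a feature should only look at games played before the one being predicted.

```python
from collections import deque
from typing import Iterable, List, Optional


def score_differential(team_a_pts: int, team_b_pts: int) -> int:
    """Target variable: Team A points minus Team B points."""
    return team_a_pts - team_b_pts


def prior_rolling_mean(values: Iterable[float], window: int) -> List[Optional[float]]:
    """For each game, average the up-to-`window` games played BEFORE it.

    The current game's value is excluded so the feature never leaks the
    outcome being predicted. The first game has no history and gets None.
    """
    history: deque = deque(maxlen=window)
    out: List[Optional[float]] = []
    for value in values:
        out.append(sum(history) / len(history) if history else None)
        history.append(value)  # only now does this game count as "past"
    return out


# Made-up scores for one team across five games, in date order.
games = [(70, 60), (65, 71), (80, 72), (55, 58), (77, 70)]
margins = [score_differential(a, b) for a, b in games]
recent_form = prior_rolling_mean(margins, window=3)
```

With pandas, the same idea is usually expressed by shifting a per-team series by one game before applying a rolling mean.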
- Interactive web interface for predictions
- Input: Two team names
- Output: Predicted score differential with a prediction interval
- Visualization of key factors influencing the prediction
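One simple way to attach an interval to a point prediction is to assume roughly normal errors and scale the residual standard deviation measured on held-out games. The sketch below assumes exactly that. The function name and the sample numbers are hypothetical, and a trained model may call for a different approach (for example quantile regression).

```python
from statistics import NormalDist
from typing import Tuple


def prediction_interval(
    predicted_diff: float, residual_std: float, level: float = 0.95
) -> Tuple[float, float]:
    """Symmetric interval around a point prediction, assuming normal errors."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    if residual_std < 0:
        raise ValueError("residual_std must be non-negative")
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * residual_std
    return predicted_diff - half_width, predicted_diff + half_width


# Made-up inputs: a 4.5-point predicted margin and a 10-point residual std.
low, high = prediction_interval(predicted_diff=4.5, residual_std=10.0)
```

In the app, `low` and `high` could be shown next to the predicted differential so users can see how uncertain a given matchup is.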
- Clone the repository:
git clone https://github.com/yourusername/Basketball_Modeling.git
cd Basketball_Modeling
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt

Before processing, check that your files are in the correct location:
python src/data/validate_data.py
python src/data/validate_data.py --quality-check # For detailed analysis

Process a single season:
python src/data/combine_data.py --season 2023

Or process all seasons at once:
python src/data/process_all_seasons.py

This creates:
- data/{season}/gamelogs_{season}.csv - Combined game logs
- data/{season}/full_stats_{season}.csv - Complete dataset with team stats
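The per-season output naming can be captured in a small helper. This is a sketch only: `season_output_paths` is a hypothetical function, and it assumes the files live under a `data/` root as the path templates above suggest.

```python
from pathlib import Path
from typing import Dict


def season_output_paths(season: int, data_root: Path = Path("data")) -> Dict[str, Path]:
    """Build the two per-season output paths from the documented templates."""
    season_dir = data_root / str(season)
    return {
        "gamelogs": season_dir / f"gamelogs_{season}.csv",
        "full_stats": season_dir / f"full_stats_{season}.csv",
    }


paths = season_output_paths(2023)
```

Keeping the templates in one place means the combine, preprocessing, and training steps cannot drift apart on file names.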
Preprocess the data:
python src/data/preprocessing.py

Train the model:
python src/models/train.py --model xgboost --output models/score_differential_model.pkl

Launch the Streamlit app:
streamlit run app/streamlit_app.py

The app will be available at http://localhost:8501
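The training command takes `--model` and `--output` flags. A command-line interface accepting them could look like the sketch below, built on the standard-library `argparse` module. The list of accepted model names and the default values are assumptions made for illustration; the actual `train.py` may differ.

```python
import argparse
from typing import List, Optional

# Assumed model names, loosely based on the models listed in this README.
SUPPORTED_MODELS = ("linear", "random_forest", "xgboost")


def build_parser() -> argparse.ArgumentParser:
    """Create a parser accepting the --model and --output flags."""
    parser = argparse.ArgumentParser(description="Train a score differential model.")
    parser.add_argument(
        "--model",
        choices=SUPPORTED_MODELS,
        default="xgboost",  # assumed default
        help="Which model family to train.",
    )
    parser.add_argument(
        "--output",
        default="models/score_differential_model.pkl",  # assumed default
        help="Where to save the trained model.",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


args = parse_args(["--model", "xgboost", "--output", "models/score_differential_model.pkl"])
```

Restricting `--model` with `choices` makes a typo fail fast with a usage message instead of surfacing later in training.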
Key dependencies (see requirements.txt for full list):
- pandas
- numpy
- scikit-learn
- xgboost
- streamlit
- beautifulsoup4
- requests
- matplotlib
- seaborn
- Build Sports Reference scraper
- Create data combination scripts
- Add data validation tools
- Data preprocessing and feature engineering
- Exploratory data analysis
- Baseline model development
- Advanced model development and tuning
- Model evaluation and validation
- Build Streamlit application
- Deploy application
- Add real-time data updates
To be updated after model training
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline | - | - | - |
| Random Forest | - | - | - |
| XGBoost | - | - | - |
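For reference, the three metrics in the table can be computed as follows. This is a dependency-free sketch with a hypothetical function name and made-up numbers; they are not project results. scikit-learn provides equivalent functions in `sklearn.metrics`.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, and R² for predicted score differentials."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R² is undefined when every true value is identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Made-up actual and predicted margins for four games.
metrics = regression_metrics([10, -6, 8, -4], [7, -2, 8, -9])
```

MAE and RMSE are both in points, which makes them easy to compare against a typical margin of victory; RMSE penalizes large misses more heavily.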
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Data sourced from Sports Reference
- Inspired by the sports analytics community
For questions or feedback, please open an issue or contact [your email/contact info].