This project is an open-source contribution toward building the most intelligent sports AI agent ever created. By leveraging comprehensive public cricket statistics, we've developed a sophisticated scoring system that analyzes player performance, team composition, and match dynamics with unprecedented depth.
Our system processes real-world Test cricket data spanning decades to create an AI-powered framework capable of evaluating team strength, predicting performance outcomes, and understanding the nuanced interplay of batting, bowling, fielding, and experience factors.
- Web scraping from ESPNcricinfo's public statistics database
- Four specialized scrapers for different performance dimensions:
- Batting Statistics: Runs, averages, centuries, consistency metrics
- Bowling Statistics: Wickets, economy rates, strike rates, five-wicket hauls
- Fielding Statistics: Catches, stumpings, dismissal rates
- All-rounder Statistics: Combined batting and bowling prowess
- Longevity-Adjusted Metrics: Uses Bayesian shrinkage to fairly evaluate players with different career lengths
- Multi-dimensional Assessment:
- Batting Score (0-100)
- Bowling Score (0-100)
- Fielding Score (0-100)
- Experience Score (0-100)
- Consistency Modifiers: Accounts for not-outs, ducks, and performance reliability
- Analyzes team composition (6 batsmen, 1 wicketkeeper, 4 bowlers)
- Weighted contributions from top performers
- Component-level scoring with realistic weights:
- Batting: 35%
- Bowling: 35%
- Fielding: 15%
- Experience: 15%
- Home/Away advantage modeling
- Luck factor simulation for match variability
- Game situation adaptability
- Monte Carlo simulation with 150,000+ team generations
- Percentile-based tier classification
- Distribution visualization and balance analysis
ESPNcricinfo Public Stats โ Web Scrapers โ Raw CSV Files
Four Python scripts collect data with proper rate limiting and error handling:
batting_data_scrapping.ipynb- Collects batting recordsbowling_data_scrapping.ipynb- Collects bowling recordsfielding_data_scrapping.ipynb- Collects fielding recordsallrounder_data_scrapping.ipynb- Collects all-rounder records
Raw CSVs โ Data Cleaning โ Player Merge โ Final Dataset
The cleaning script (data_merger.py):
- Standardizes player names and extracts countries
- Merges statistics across all dimensions
- Handles missing data intelligently
- Creates derived metrics (e.g., fielding catches vs keeper stumpings)
- Categorizes players (Batsman/Bowler/Wicketkeeper/All-rounder)
Final Dataset โ Individual Scoring โ Team Generation โ Distribution Analysis
Multiple analysis scripts provide:
- Individual player evaluation
- Team strength assessment
- Score distribution modeling
- Tier-based classification systems
Each player receives four component scores (0-100 scale):
- Batting Average: Adjusted for career length using Bayesian shrinkage
- Total Runs: Career run accumulation
- Centuries & Fifties: Major milestone achievements
- Consistency Modifier: Rewards not-outs, penalizes ducks proportionally
- Bowling Average: Lower is better, adjusted for career length
- Total Wickets: Career wicket-taking ability
- Five-Wicket Hauls: Match-winning performances
- Matches Played: Experience factor
- Catches: Outfield catching ability
- Stumpings: Wicketkeeper-specific skill (bonus weighting)
- Career Longevity: Matches played over career span
- Pure longevity metric: No overlap with batting/bowling stats
Individual โ Team Aggregation
- Calculate individual weighted score for all 11 players
- Sum individual totals (0-1100 range)
- Scale to 0-1000 range
- Apply component variance (5%) for realistic fluctuation
- Apply team variance (3%) for match-day factors
- Apply contextual modifiers (home/away, luck factor)
- Smooth scaling to expand range while preserving skill hierarchy
Key Principle: Skill dominates over luck. Elite teams consistently outperform average teams.
Five-tier system based on team scores:
| Tier | Score Range | Color | Rarity | Expected Distribution |
|---|---|---|---|---|
| Base | 0-571 | Black | Common | 40% |
| Rare | 571-623 | Bronze | Rare | 25% |
| Super Rare | 623-678 | Silver | Super Rare | 20% |
| Legendary | 678-733 | Gold | Legendary | 10% |
| Mythic | 733-1000 | Platinum | Mythic | 5% |
Note: Tier boundaries are determined purely by team performance scores, not luck factors.
Python 3.8+
pip install pandas numpy scipy matplotlib seaborn beautifulsoup4 requests scikit-learn- Statistical summaries (min, max, mean, median, std dev)
- Percentile breakdowns (every 5th and 10th percentile)
- Tier distribution analysis
- Component score analysis
- Home/away impact analysis
- Luck factor impact (gameplay only)
- Team score distribution with tier boundaries
- Tier distribution pie chart
- Component score box plots
- Luck factor impact analysis
- Comprehensive PDF reports
- Individual player scores
- Team compositions
- Score distributions
- Tier classifications
The system supports multiple simulation modes:
- Single Team Generation: Create one team with detailed breakdown
- Small-Scale Testing: 10,000 simulations for quick validation
- Large-Scale Analysis: 150,000+ simulations for stable distributions
- Comparative Studies: Analyze score distributions across parameters
Configurable Parameters:
- Component variance (how much scores fluctuate)
- Team variance (match-day variability)
- Home/away advantage magnitude
- Luck factor impact range
- Smooth scaling parameters for range expansion
The AI understands that:
- Elite batsmen need different evaluation criteria than part-time bowlers
- Wicketkeepers have unique value beyond their batting
- All-rounders require balanced assessment across multiple dimensions
- Career longevity impacts statistical reliability
- Young talents aren't unfairly penalized for short careers
- Legends with 150+ matches get their full statistical weight
- Performance consistency matters alongside peak achievements
- Home teams have statistical advantages (5% boost)
- Away teams face challenges (3% penalty)
- Luck exists but doesn't overwhelm skill (ยฑ5-8% range)
- Top performers contribute disproportionately (weighted averaging)
The framework is designed for:
- Integration with machine learning models
- Real-time prediction systems
- Strategy recommendation engines
- Player development tracking
adjusted_avg = (matches / (matches + k)) * player_avg +
((k / (matches + k)) * league_avg)
Where k represents the number of matches needed to establish statistical reliability.
normalized = (score - min_expected) / (max_expected - min_expected)
scaled = normalized ^ (1 / stretch_power)
final_score = scaled * (target_max - min_expected) + min_expected
This preserves skill hierarchy while expanding the score range naturally.
Top performers in each category (batting, bowling) receive higher weights, reflecting their disproportionate impact on match outcomes.
Score Distribution:
- Mean: ~565
- Median: ~563
- Std Dev: ~75
- Range: 300-950
Tier Breakdown:
- Base (40%): 60,000 teams
- Rare (25%): 37,500 teams
- Super Rare (20%): 30,000 teams
- Legendary (10%): 15,000 teams
- Mythic (5%): 7,500 teams
Elite Team Threshold:
- 95th percentile: ~700
- 99th percentile: ~750
- Maximum: ~950
This project demonstrates several principles crucial for building intelligent sports agents:
- Data Quality: Rigorous cleaning and validation of public statistics
- Domain Knowledge: Cricket-specific understanding of player roles and metrics
- Statistical Rigor: Proper handling of small samples and career variability
- Interpretability: Every score can be traced back to its components
- Scalability: Framework extensible to other sports and formats
Potential extensions for the AI agent:
- Format Expansion: ODI, T20, domestic leagues
- Temporal Analysis: Era-adjusted comparisons, form tracking
- Match Prediction: Win probability models
- Strategy Optimization: Batting order, bowling changes
- Player Recommendations: Team building, draft picks
- Real-time Integration: Live match scoring and updates
- Natural Language Interface: Query system for statistics
- Visual Analytics Dashboard: Interactive exploration tools
All data is collected from publicly available sources:
- ESPNcricinfo Statistics Database (https://stats.espncricinfo.com)
- International Cricket Council (ICC) public records
- Historical Test match archives
Data Ethics:
- No private or personal information collected
- Only publicly available performance statistics used
- Proper rate limiting and respectful scraping practices
- Attribution to original sources maintained
- ESPNcricinfo for maintaining comprehensive public cricket statistics
- The cricket analytics community for insights into player evaluation
- Open-source Python ecosystem (pandas, numpy, scipy, scikit-learn)
For questions, suggestions, or collaboration opportunities related to building the most intelligent sports AI agent:
- Open an issue on this repository
- Contribute via pull requests
- Share your analysis results and findings
This framework has potential applications in:
- Sports Analytics Research: Academic studies on player valuation
- Machine Learning: Feature engineering for predictive models
- Data Science Education: Real-world dataset for teaching
- AI Development: Training data for sports-specific agents
- Statistical Methods: Demonstrating Bayesian approaches to sports data
Built with โค๏ธ for advancing sports AI intelligence through open-source collaboration