emmacgodfrey/baseball

baseball

Baseball is a sport of strategy that generates seemingly endless statistics. A pitcher’s goal is to throw pitches that entice the batter to swing; a hitter’s goal is to recognize the incoming pitch and make good swing decisions to put the ball in play. In this project, I focus on the pitcher’s goal of throwing enticing, yet tricky, pitches that the batter will swing at. I sought to answer the following question: given that the batter swings, what influences the likelihood of the batter making contact with the ball? More specifically, I use machine learning techniques, including random forests, support vector machines, and gradient boosted trees, to classify whether a swing results in a hit or a miss based on Statcast pitch-by-pitch data.

The dataset I used for my analysis was sourced from Kaggle and contains pitch-level data for the 2015-2018 MLB seasons. Each observation represents a single pitch. Because baseball tracks so much detail, the dataset has over 40 explanatory variables, which can be overwhelming to anyone who is not a baseball connoisseur. For a detailed explanation of each variable, see this link (http://www.inalitic.com/datasets/mlb%20pitch%20data.html#fn-2).

I first restricted the data to pitches the batter swung at and split them into two groups: contact and no contact. Initially, I looked at pitch type, ball movement, and pitch location as predictors of contact. Based on my initial EDA, pitch location seemed the most meaningful. I then built a random forest, a gradient boosted tree model (XGBoost), and a support vector machine on a training dataset and tested each model on unseen data. The XGBoost model outperformed the other two in terms of area under the ROC curve; however, there is still significant unexplained variability. Most importantly, the horizontal and vertical location of the pitch were the most important features in all three models. This agrees with my initial EDA, which broke down the likelihood of contact by zone: there was a notable increase in no-contact swings just below the strike zone.
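The labeling step and the evaluation metric described above can be sketched in plain Python. This is a minimal illustration, not the code used in this project. The column names (`code`, `px`, `pz`) and the result codes in `CONTACT_CODES` and `MISS_CODES` are assumptions about how the pitch data is encoded, and the toy score (pitch height) is only a placeholder for a model's predicted contact probability.

```python
# Hypothetical result codes; the real dataset's coding may differ.
CONTACT_CODES = {"F", "X", "D", "E"}  # assumed: foul or ball put in play
MISS_CODES = {"S", "W"}               # assumed: swinging strike


def label_swings(pitches):
    """Keep only swings and attach a label: 1 = contact, 0 = no contact."""
    labeled = []
    for pitch in pitches:
        if pitch["code"] in CONTACT_CODES:
            labeled.append((pitch, 1))
        elif pitch["code"] in MISS_CODES:
            labeled.append((pitch, 0))
        # Any other code (ball, called strike, ...) is not a swing and is dropped.
    return labeled


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example receives a higher score than a random negative one
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy pitches: px / pz are assumed horizontal / vertical location columns.
pitches = [
    {"code": "X", "px": 0.1, "pz": 2.5},   # swing, contact
    {"code": "S", "px": 0.3, "pz": 1.2},   # swing, miss (low pitch)
    {"code": "B", "px": 1.5, "pz": 0.5},   # ball, no swing
    {"code": "F", "px": -0.4, "pz": 3.0},  # swing, contact
    {"code": "W", "px": 0.0, "pz": 0.9},   # swing, miss (low pitch)
    {"code": "S", "px": 0.2, "pz": 2.8},   # swing, miss (high pitch)
]

labeled = label_swings(pitches)
labels = [y for _, y in labeled]
# Placeholder score: higher pitches are scored as more likely to be hit.
scores = [pitch["pz"] for pitch, _ in labeled]
auc = roc_auc(labels, scores)
print(f"{len(labeled)} swings, AUC of placeholder score = {auc:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, so comparing models by AUC amounts to passing each model's predicted probabilities on the held-out swings to a function like `roc_auc`.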
