- Intro: The Nature of Baseball
- Technologies
- Metadata
- Data Cleaning
- EDA (Exploratory Data Analysis)
- Feature Scaling
- Multiple Linear Regression with Feature Selection
- Simple Linear Regression
- Model Validation
- Conclusion
In the previous project, I briefly discussed how a team wins in baseball. The first part of winning is Runs Scored (RS), and that project used linear regression models to examine what drives RS.
However, RS is not the only part of winning. While a team must score runs, it also has to prevent its opponents from scoring (or at least allow fewer runs than it scores) to win a game. This is measured by Runs Allowed (RA). So in this project, I analyzed how a team can allow as few runs as possible.
- Python 3.8
- Pandas - version 1.2.2
- Numpy - version 1.20.1
- matplotlib - version 3.3.4
- seaborn - version 0.11.1
- scikit-learn - version 0.24.1
- statsmodels - version 0.12.2
- scipy - version 1.6.1
Metadata | Information |
---|---|
Origin of Data | Baseball Prospectus |
Terms of Use | Terms and Conditions |
Data Structure | 10 datasets each consisting of 31 rows * 27 columns |
Data Feature | Data Meaning |
---|---|
LVL | Level of Play: MLB (the major league) |
YEAR | Each year refers to corresponding seasons |
TEAM | All 30 Major League Baseball Teams |
IP | Innings Pitched |
PA | Plate Appearance |
R | Runs Allowed |
ERA | Earned Run Average |
FIP | Fielding Independent Pitching |
cFIP | Contextual Fielding Independent Pitching |
cFIP_START | cFIP for Starting Pitchers |
cFIP_RELIEF | cFIP for Relief Pitchers |
FIP_MINUS_ERA | ERA Subtracted from FIP |
SO9 | Strikeouts Per 9 Innings |
BB9 | Walks Per 9 Innings |
SO/BB | Strikeout-to-Walk Ratio |
HR9 | Home Runs Per 9 Innings |
oppAVG | Batting Average Allowed by a Pitcher |
oppOBP | On-base Percentage Allowed by a Pitcher |
oppSLG | Slugging Percentage Allowed by a Pitcher |
oppOPS | On-base Plus Slugging Allowed by a Pitcher |
WHIP | Walks and Hits Per Inning Pitched |
DRA | Deserved Run Average |
DRA- | DRA-Minus |
DRA_START | DRA for Starting Pitchers |
DRA_RELIEF | DRA for Relief Pitchers |
PWARP | Pitcher Wins Above Replacement Player |
- Combined 10 different datasets (2010-2019 Season Pitching datasets).
- Dropped an unnecessary column made when combining datasets (Column: '#').
- Renamed 'R' data feature as 'RA' for clarity.
- Removed commas from some data features and converted their data types to numeric (IP, PA).
- Detected invalid 0 values in some data features (cFIP_START, cFIP_RELIEF, SO/BB, oppAVG, oppOBP, oppSLG, oppOPS, DRA_START, DRA_RELIEF).
- Looking at the data features that contain 0 values, I noticed that these invalid values were recorded in specific seasons because those data attributes didn't exist in the corresponding seasons. In other words, the invalid values are considered Missing At Random (MAR).
- Treated these invalid values as missing values and replaced them with values predicted by linear regression using IterativeImputer.
- Dropped categorical variables (LVL and TEAM), as they are irrelevant to this analysis.
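
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a small made-up frame, not the project's actual code: the values are hypothetical, only one zero-filled column (`oppOPS`) is shown, and passing `LinearRegression` as the imputer's estimator is an assumption based on the description of "linear regression" imputation (scikit-learn's default estimator is `BayesianRidge`).

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Hypothetical rows standing in for the combined 2010-2019 pitching data.
raw = pd.DataFrame({
    "IP": ["1,440.1", "1,452.0", "1,437.2", "1,461.0",
           "1,448.1", "1,455.2", "1,443.0", "1,459.1"],
    "PA": ["6,120", "6,205", "6,080", "6,310",
           "6,150", "6,240", "6,101", "6,275"],
    "R": [650, 701, 612, 745, 668, 690, 640, 722],
    # 0 marks a stat that was not recorded in that season.
    "oppOPS": [0.0, 0.742, 0.0, 0.768, 0.715, 0.731, 0.709, 0.755],
})

# Rename R to RA for clarity.
df = raw.rename(columns={"R": "RA"})

# Strip thousands separators, then convert the strings to numbers.
for col in ["IP", "PA"]:
    df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))

# Treat the invalid zeros as missing values.
zero_cols = ["oppOPS"]
df[zero_cols] = df[zero_cols].replace(0, np.nan)

# Fill each missing value with a prediction from a linear regression
# on the other columns.
imputer = IterativeImputer(estimator=LinearRegression(), random_state=0)
clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(clean.round(3))
```

Observed values pass through the imputer unchanged; only the entries marked as missing are replaced.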
- RA Skewness: 0.38340975864973814
- RA Kurtosis: -0.13976152269512854
According to the histogram and probability plot above, RA seems to follow a normal distribution. The skewness of 0.38 and kurtosis of -0.14, both close to 0, are also consistent with team RA being approximately normally distributed. Likewise, the boxplots above show that team RA has been roughly normally distributed in each of the last 10 seasons, with few outliers.
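
As a rough sketch of how such normality checks can be computed, the snippet below uses `scipy.stats` on synthetic data (the `ra` array is hypothetical, not the project's data). Note that scipy's default estimators are not bias-corrected, so they can differ slightly from the pandas `skew()` and `kurt()` methods; which of the two produced the figures above is not stated.

```python
import numpy as np
from scipy import stats

# Hypothetical team RA values drawn from a normal distribution.
rng = np.random.default_rng(0)
ra = rng.normal(loc=700, scale=80, size=300)

skewness = stats.skew(ra)
# Fisher definition: a normal distribution has excess kurtosis 0.
excess_kurtosis = stats.kurtosis(ra)

# Numeric counterpart of a probability plot: r close to 1 means the
# ordered data line up well with normal quantiles.
(_, _), (slope, intercept, r) = stats.probplot(ra, dist="norm")

print(f"skewness={skewness:.3f}, kurtosis={excess_kurtosis:.3f}, r={r:.4f}")
```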
5-2. Feature Selection: Filter Method
Correlation | RS | PA | TB | OBP | ISO | DRC+ | DRAA | BWARP |
---|---|---|---|---|---|---|---|---|
RS | 1.0 | 0.739 | 0.922 | 0.829 | 0.812 | 0.751 | 0.806 | 0.780 |
Initially, I had 23 independent variables. To avoid multicollinearity, I filtered them based on (i) the correlation between each pair of independent variables, and (ii) the correlation between the remaining features and the dependent variable, RA. As a result, I ended up with 7 independent variables, as indicated in the correlation matrix above.
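
One way to implement such a correlation-based filter is sketched below. The greedy rule, the thresholds (`pair_threshold`, `target_threshold`), the helper name `filter_features`, and the synthetic data are all assumptions made for illustration; the exact cut-offs used in this project are not stated.

```python
import numpy as np
import pandas as pd

# Synthetic data: the column names reuse real stat names, but the
# values and relationships are made up.
rng = np.random.default_rng(1)
n = 300
base = rng.normal(size=n)
df = pd.DataFrame({
    "WHIP": base + rng.normal(scale=0.3, size=n),
    "oppOBP": base + rng.normal(scale=0.3, size=n),  # nearly duplicates WHIP
    "HR9": rng.normal(size=n),
    "SO9": rng.normal(size=n),  # unrelated to the target in this toy data
})
df["RA"] = 3 * df["WHIP"] + 2 * df["HR9"] + rng.normal(scale=0.5, size=n)


def filter_features(data, target, pair_threshold=0.8, target_threshold=0.3):
    """Greedy filter: keep features that correlate with the target,
    skipping any that are highly correlated with one already kept."""
    corr = data.corr()
    target_corr = corr[target].drop(target).abs()
    candidates = target_corr[target_corr >= target_threshold]
    candidates = candidates.sort_values(ascending=False)
    selected = []
    for feature in candidates.index:
        if all(abs(corr.loc[feature, kept]) < pair_threshold for kept in selected):
            selected.append(feature)
    return selected


selected = filter_features(df, target="RA")
print(selected)
```

Of the two nearly duplicate columns, only the one with the stronger correlation to the target survives, and the column unrelated to the target is dropped.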
5-3. Filtered Independent Variables EDA
According to the histograms above, all of the independent variables are approximately normally distributed.
The scatter plots also show reasonably linear trends between each independent variable and RA without notable outliers, so it is safe to use a linear regression model.
Since the ranges of independent variables vary considerably, I scaled all the independent variables. As all the data attributes have normal distributions with few outliers, I used StandardScaler to scale them.
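
For illustration, `StandardScaler` transforms each column to z-scores, z = (x - mean) / std, so every feature ends up with mean 0 and standard deviation 1. The small array below is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; the columns stand for WHIP, HR9, and SO9.
X = np.array([
    [1.20, 0.95, 8.1],
    [1.35, 1.30, 7.4],
    [1.28, 1.10, 9.0],
    [1.41, 1.45, 6.9],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and (population) standard deviation 1.
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```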
The result of feature scaling is the following:
With all the independent variables filtered above, I built a multiple linear regression model to check the degree of multicollinearity based on VIF.
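
As a sketch of what VIF measures: the VIF of feature j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features, and these values are also the diagonal of the inverse correlation matrix. The NumPy version below uses that identity on synthetic data; it is a swapped-in illustration, not necessarily how the values here were computed (statsmodels provides `variance_inflation_factor` for the same purpose).

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of each column of X."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic features: b is nearly collinear with a, c is independent.
rng = np.random.default_rng(2)
n = 300
a = rng.normal(size=n)
b = a + rng.normal(scale=0.2, size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

values = vif(X)
print(values.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the collinear pair in this toy data lands well above that range, while the independent column stays near 1.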
According to the VIF table, there seems to be multicollinearity in the model because the independent variables are highly correlated with one another. Therefore, I used a wrapper method (Recursive Feature Elimination) to find the best two independent variables.
Through RFE, I got HR9 and WHIP as independent variables and built a multiple linear regression model. The result of the model is:
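
A minimal sketch of Recursive Feature Elimination with scikit-learn follows. The data are synthetic and the feature names are hypothetical stand-ins; only the choice of two features mirrors the text.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic, already standardized features.
rng = np.random.default_rng(3)
n = 300
feature_names = ["HR9", "WHIP", "SO9", "BB9"]
X = rng.normal(size=(n, 4))
# In this toy data only the first two features drive the target.
y = 25 * X[:, 0] + 60 * X[:, 1] + rng.normal(scale=10, size=n)

# Repeatedly fit the model and drop the weakest feature until two remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selected)
```

With a linear model, RFE ranks features by the size of their coefficients, which is one reason to scale the features before running it.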
Apart from the multiple linear regression model, I also built a simple linear regression model. To find the single best independent variable, I used the SelectKBest function. Based on the F-statistic of each independent variable, ERA was selected as the best independent variable.
Furthermore, I split the data into training (70%) and test (30%) datasets to evaluate the model's accuracy.
The result of the model is:
Measurement | Score |
---|---|
Intercept | 44.81409069091842 |
Coefficient | 163.46870405 |
R-squared | 0.977594476807553 |
RMSE | 12.37869911916763 |
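
The workflow behind a result like this (pick the single best feature by F-statistic, split 70/30, fit, score) can be sketched as follows. The data are synthetic and the feature names are hypothetical labels, so the printed numbers are unrelated to the results reported here; `random_state=0` is also an arbitrary choice.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data in which the first feature dominates the target.
rng = np.random.default_rng(4)
n = 300
feature_names = ["ERA", "WHIP", "HR9"]
X = rng.normal(size=(n, 3))
y = 700 + 80 * X[:, 0] + 20 * X[:, 1] + rng.normal(scale=15, size=n)

# Keep the one feature with the highest F-statistic.
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
best = feature_names[int(np.argmax(selector.scores_))]
X_best = selector.transform(X)

# 70% training / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_best, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(best, round(r2, 3), round(rmse, 2))
```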
As indicated in the ERA results table, the result was almost too accurate, yielding an R-squared of 0.978 and an RMSE of 12.38. Such a result seems to occur because ERA and RA are almost identical stats, except that ERA (Earned Run Average) doesn't take into account runs allowed via errors or passed balls, while RA does.
In modern baseball, the quality of fielding is outstanding compared to the baseball of the past (imagine ball games in the 1890s or 1910s). Therefore, the odds of scoring runs with the aid of errors are low these days. This is supported by the correlation of 0.99 between these two stats, which are almost identical.
So although ERA is the best single predictor of a team's RA, I believe there's no point in spending time building a machine learning model just to predict RA if we already have ERA. Thus, I selected the second-best predictor, WHIP, again using scikit-learn's SelectKBest.
With WHIP as the independent variable, the result of this model is:
Measurement | Score |
---|---|
Intercept | -465.10977839397117 |
Coefficient | 893.24724699 |
R-squared | 0.7837067997149497 |
RMSE | 38.460851534999485 |
To validate both multiple and simple linear regression models, I used the K-Fold Cross Validation method, where the number of folds is 10.
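
A minimal sketch of 10-fold cross-validation that reports mean R-squared and mean RMSE is shown below. The data are synthetic, and shuffling the folds with a fixed `random_state` is an assumption; whether the folds were shuffled in this project is not stated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-ins for two scaled features (e.g. HR9 and WHIP).
rng = np.random.default_rng(5)
n = 300
X = rng.normal(size=(n, 2))
y = 700 + 25 * X[:, 0] + 60 * X[:, 1] + rng.normal(scale=25, size=n)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring=("r2", "neg_root_mean_squared_error"),
)

mean_r2 = scores["test_r2"].mean()
# scikit-learn reports RMSE negated so that higher is always better.
mean_rmse = -scores["test_neg_root_mean_squared_error"].mean()
print(round(mean_r2, 3), round(mean_rmse, 2))
```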
9-1. Multiple Linear Regression model validation
Measurement | Score |
---|---|
Mean R-squared | 0.8905666784838221 |
Mean RMSE | 24.9749012355517 |
9-2. Simple Linear Regression model validation
Measurement | Score |
---|---|
Mean R-squared | 0.7360878269961807 |
Mean RMSE | 38.90144641675336 |
According to the results above, the simple linear regression model (x: WHIP / y: RA) also seems to perform well. However, its accuracy is not as high as that of the multiple linear regression model (x: HR9, WHIP / y: RA).
Comparing the two models through 10-fold cross-validation, WHIP alone is a good predictor of a team's RA (mean R-squared of 0.736, mean RMSE of 38.90), but using HR9 and WHIP together gives a much better prediction, with a mean R-squared of 0.891 and a mean RMSE of 24.97.
As I mentioned in the previous project, a team must reach base as often as possible to produce runs. If you can't reach base, how would you score? Now think about it from the pitching perspective. As a pitcher (or a team), your goal is to prevent your opponents from scoring as much as possible. How? The answer is simple: prevent your opponents from reaching base by allowing as few hits, bases on balls, and hit-by-pitches as you can.
This job is measured by a single statistic, WHIP. It measures how well a pitcher has kept runners off the basepaths and is calculated as the total number of hits and walks allowed divided by total innings pitched. In other words, it represents how many batters a pitcher allows to reach base per inning pitched (e.g. a WHIP of 0.84 means that the pitcher allows 0.84 baserunners per inning pitched).
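
As a small worked example of the WHIP formula, the function below reproduces the 0.84 figure from a hypothetical season line. The helper name `whip` and the input numbers are made up for illustration.

```python
def whip(hits: int, walks: int, innings_pitched: float) -> float:
    """Walks plus hits per inning pitched.

    innings_pitched is expected in true innings (partial innings as
    thirds), not the box-score notation where ".1" means one out.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return (hits + walks) / innings_pitched


# Hypothetical season line: 150 hits and 39 walks over 225 innings.
example = whip(hits=150, walks=39, innings_pitched=225.0)
print(round(example, 2))  # 0.84
```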
Even though the ability to keep runners off the basepaths is still important, it seems that preventing opposing batters from reaching base is not enough to avoid giving up runs. Given my analysis, there's one more thing you should do to give up as few runs as possible: not allow home runs.
In the 2010s (especially since the 2016 season), the way a team produces runs has changed compared to the past. Both the total number of home runs a team records per season and the total number of runs created via home runs have increased (see this article).
As the way the game is played changes, teams must adapt. Therefore, from a pitching perspective, the ability to avoid allowing home runs has become important these days. This is measured by HR9, the number of home runs allowed per 9 innings pitched.
To sum up, if a team allows its opponents to reach base as little as possible and also allows as few home runs as it can, it will give up as few runs as possible. This is supported by this analysis of RA prediction.