MLB Team Runs Allowed Prediction

Content

Intro: The Nature of Baseball
Technologies
Metadata
Data Cleaning
EDA (Exploratory Data Analysis)
Feature Scaling
Multiple Linear Regression with Feature Selection
Simple Linear Regression
Model Validation
Conclusion

1. Intro

In the previous project, I briefly talked about how a team wins in baseball. The first part of winning in baseball is Runs Scored (RS) and what makes that RS was dealt with in the previous project using linear regression models.

However RS is not the only part of winning in baseball. While a team must score runs, it also has to prevent its opponents from scoring runs (at least allow runs less than it scores) to win a game. This is indicated as Runs Allowed. So in this project, I analyzed how a team can allow runs as less as possible.

2. Techonologies

Python 3.8
- Pandas - version 1.2.2
- Numpy - version 1.20.1
- matplotlib - version 3.3.4
- seaborn - version 0.11.1
- scikit-learn - version 0.24.1
- statsmodels - version 0.12.2
- scipy - version 1.6.1

3. Metadata

Metadata	Information
Origin of Data	Baseball Prospectus
Terms of Use	Terms and Conditions
Data Structure	10 datasets each consisting of 31 rows * 27 columns

Data Feature	Data Meaning
LVL	Level of Play: MLB (the major league)
YEAR	Each year refers to corresponding seasons
TEAM	All 30 Major League Baseball Teams
IP	Innings Pitched
PA	Plate Appearance
R	Runs Allowed
ERA	Earned Run Average
FIP	Fielding Independent Pitching
cFIP	Contextual Fiedling Independent Pitching
cFIP_START	cFIP for Starting Pitchers
cFIP_RELIEF	cFIP for Relief Pitchers
FIP_MINUS_ERA	ERA Subtracted from FIP
SO9	Strikeouts Per 9 Innings
BB9	Walks Per 9 Innings
SO/BB	Strikeout-to-Walk Ratio
HR9	Home Runs Per 9 Innings
oppAVG	Batting Average Allowed by a Pitcher
oppOBP	On-base Percentage Allowed by a Pitcher
oppSLG	Slugging Percentage Allowed by a Pitcher
oppOPS	On-base Plus Slugging Allowed by a Pitcher
WHIP	Walk and Hits Per Inning Pitched
DRA	Deserved Run Average
DRA-	DRA-Minus
DRA_START	DRA for Starting Pitchers
DRA_RELIEF	DRA for Relief Pitchers
PWARP	Pitcher Wins Above Replacement Player

4. Data Cleaning

Combined 10 different datasets (2010-2019 Season Pitching datasets).
Dropped an unnecessary column made when combining datasets (Column: '#').
Renamed 'R' data feature as 'RA' for clarity.
Eliminated commas in some data features and convert their data types from integer into numeric (IP, PA).
Detected invalid 0 values in some data features (cFIP_START, cFIP_RELIEF, SO/BB, oppAVG, oppOBP, oppSLG, oppOPS, DRA_START, DRA_RELIEF).
By looking at data features that contain the 0 values, I noticed that these invalid values were recorded in specific seasons because such data atrributes didn't exist in that corresponding seasson. In other words, such invalid values are considered Missing At Random (MAR).
Treated these invalid values as missing values and replaced them with projected values based on linear regression result using IterativeImputer.
Dropped categorical variables (LVL and TEAM), as they are irrelevant to this analysis.

5. EDA (Exploratory Data Analysis)

5-1. RA EDA

RA Skewness: 0.38340975864973814
RA Kurtosis: -0.13976152269512854

According to the histogram and probability plot above, RA seems to follow a normal distribution. The skewness of 0.38 and kurtosis of -0.14 also indicate that team RA data is normallly distributed. Likewise, the boxplots above show that team RA has been normally distributed over the last 10 seasons with few outliers.

5-2. Feature Selection: Filter Method

Correlation	RS	PA	TB	OBP	ISO	DRC+	DRAA	BWARP
RS	1.0	0.739	0.922	0.829	0.812	0.751	0.806	0.780

Initially, I had 23 independent variables. To avoid multicollinearity, I filtered some of them based on (i) correlation between each independent variable, and (ii) correlation between those filtered features and the dependent variable, RA. As a result, I ended up 7 independent varaibles as indicated in the correlation matrix above.

5-3. Filtered Independent Variables EDA

According to the histograms of each independent variable above, all the variables are normally distributed.

Scatter plots also depict that there are reasonable linear trends between each independent variable and RS without notable outliers, and thus, it's safe to use the linear regression model.

6. Feature Scaling

Since the ranges of independent variables vary considerably, I scaled all the independent variables. As all the data attributes have normal distributions with few outliers, I used StandardScaler to scale them.

The result of feature scaling is the following:

7. Multiple Linear Regression with Feature Selection

With all the independent variables filtered above, I built a multiple linear regression model to check the degree of multicollinearity based on VIF.

According to the table above, there seems to be multicollinearity in the model because independent variables are highly corrleated one another. Therefore, I used the wrapper method (Recursive Feature Elimination) to find the best two independent variables.

Through RFE, I got HR9 and WHIP as independent variables and built a multiple linear regression. The result of the model is:

8. Simple Linear Regression

Apart from the multiple linear regression model, I also built a simple linear regression model. To find the sinlge best independent variable, I used the SelectKBest function. Based on F-statistics of each independent variable, ERA has beend selected as the best independent variable.

Furthermore, I also splitted data into training(70%) and test(30%) datasets for accuracy.

The result of the model is:

Measurement	Score
Intercept	44.81409069091842
Coefficient	163.46870405
R-squared	0.977594476807553
RMSE	12.37869911916763

As indicated in the table above, the result was TOO accurate yielding an R-squared of 0.978 and RMSE of 12.38. Such a result seems to occur because ERA and RA are almost indentical stats except the fact that ERA (Earned Run Average) doesn't take into account runs allowed recorded via errors or passed plays, while RA does.

In modern baseball, the quality of fielding is so outstanding compared to the past day's baseaball (imagine ball games in the 1890s or 1910s). Therefore, the odds of scoring runs with the aids of errors became so low these days. This is proven by the correlation of 0.99 between these two stats. These two stats are almost indentical.

So although ERA is the best single predictor of a team's RA, I believe there's no point in spending time on building machine learning algorithm to just predict RA, if we already have ERA. Thus, I got the second best predictor, WHIP, again using skelearn's SelectKBest.

With WHIP as an independent variable the result of this model is:

Measurement	Score
Intercept	-465.10977839397117
Coefficient	893.24724699
R-squared	0.7837067997149497
RMSE	38.460851534999485

9. Model Validation

To validate both multiple and simple linear regression models, I used the K-Fold Cross Validation method, where the number of folds is 10.

9-1. Multiple Linear Regression model validtion

Measurement	Score
Mean R-squared	0.8905666784838221
Mean RMSE	24.9749012355517

9-2. Simple Linear Regression model validtion

Measurement	Score
Mean R-squared	0.7360878269961807
Mean RMSE	38.90144641675336

Accoring to the results above, the simple linear regression model (x: WHIP / y:RA) also seems to perform well. However, the accuracy is not as high as that of the multiple linear regression model (x: HR9, WHIP / y:RA).

10. Conclusion

Comparing those two models through 10-Fold Cross Validation, although WHIP alone is a good measure when predicting a team's RA, it'll result a much better result to use HR9 and WHIP together for a team's RA prediction given the mean R-squared of 0.891 and RMSE of 24.97.

As I mentioned in the previous project, a team must reach bases as many as possible to produce runs. If you're not able to reach bases, then how would you score? So if we think about it from the pitching's perspective. As a pitcher (or a team) your goal is to prevent your opponents from scoring as many as possible. How? The answer is simple. You must prevent your opponents from reaching bases by allowing as less hits, bases on balls, or hit-by-pitches as you can.

And such a job is measured by a single statistic, WHIP. It measures how well a pitcher has kept runners off the basepaths and is calculated by the total number of hits and walks divided by his total innings pitched. In other words, it represents how may batters a pitcher allows to reach bases per innings pitched. (e.g. a WHIP of 0.84 means that this pitcher allows 0.84 hitters to reach bases per innings he pitched)

Even though the ability to keep runners off the baspaths is still important, it seems that preventing opponent batters from reaching bases is not enough not to give up runs. There's one more thing you should do to give up as less runs as possible given my analysis. Not allowing home runs.

In the 2010s (especially since the 2016 season), the way a team produces runs has changed compared to the past days. Somehow, both the total number of home runs a team records per season, and the total number of runs created via home runs has inclined (see this article).

As the way of playing the ball game changes, teams must be used to it. Therefore, from a pitching's perspective, the ability of not allowing home runs has become important these days. And such a job is measured by HR9, the number of home runs allowed per 9 innings pitched.

To sum, if a team allows its opponents to reach bases as less as possible, and also if it allows as less home runs as it can, such a team will give up as less runs as possible. This seems to be proven through this analysis: RA prediction.

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
data		data
images		images
.DS_Store		.DS_Store
Code.py		Code.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MLB Team Runs Allowed Prediction

Content

1. Intro

2. Techonologies

3. Metadata

4. Data Cleaning

5. EDA (Exploratory Data Analysis)

6. Feature Scaling

7. Multiple Linear Regression with Feature Selection

8. Simple Linear Regression

9. Model Validation

10. Conclusion

About

Releases

Packages

Languages

SanghyunKim1/MLB_Team_RunsAllowed_Prediction

Folders and files

Latest commit

History

Repository files navigation

MLB Team Runs Allowed Prediction

Content

1. Intro

2. Techonologies

3. Metadata

4. Data Cleaning

5. EDA (Exploratory Data Analysis)

6. Feature Scaling

7. Multiple Linear Regression with Feature Selection

8. Simple Linear Regression

9. Model Validation

10. Conclusion

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages