---
title: "Team 15 MGT Madness"
output: github_document
---
The goal of MGT Madness was to create a machine learning model that reliably predicts the outcome of both March Madness and regular season college basketball games at a level equal to or better than ESPN's industry-standard prediction models.
## GitHub File Structure
```{r setup, echo=FALSE,message=FALSE,warning=FALSE}
library(tidyverse)
library(data.tree)
```
If you plan to use this repository, ensure that the following folders are stored in the same directory as your .Rproj file. Each script depends on these directories, so make sure they include every .csv and .rds data file.
```{r, echo = FALSE,message=FALSE,warning=FALSE}
# List every folder and file under Data/, skipping deprecated or old material
paths <- unique(c(list.dirs(path = "Data", full.names = TRUE), list.files(path = "Data", full.names = TRUE, recursive = TRUE)))
paths <- paths[!grepl("deprecated|old", paths)]
library(data.tree)
# Split each path into its components, one column per directory level
x <- lapply(strsplit(paths, "/"), function(z) as.data.frame(t(z)))
# plyr is called by namespace so that it does not mask dplyr verbs used later
x <- plyr::rbind.fill(x)
# Rebuild the full path string that data.tree uses to place each node
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse = "/"))
(mytree <- data.tree::as.Node(x))
```
```{r, echo = FALSE}
# Same approach as above, applied to the Final Code/ folder
paths <- unique(c(list.dirs(path = "Final Code", full.names = TRUE), list.files(path = "Final Code", full.names = TRUE, recursive = TRUE)))
library(data.tree)
x <- lapply(strsplit(paths, "/"), function(z) as.data.frame(t(z)))
# plyr is called by namespace so that it does not mask dplyr verbs used later
x <- plyr::rbind.fill(x)
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse = "/"))
(mytree <- data.tree::as.Node(x))
```
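A quick check along the following lines can confirm the folders are in place before any script runs. This is a minimal sketch, not part of the repository: `check_required_dirs()` is a hypothetical helper, and it assumes the two top-level folders listed by the chunks above, `Data` and `Final Code`, sit next to the .Rproj file.

```r
# Hypothetical helper: report which required folders exist under `root`.
check_required_dirs <- function(dirs, root = ".") {
  present <- dir.exists(file.path(root, dirs))
  stats::setNames(present, dirs)
}

# Assumed folder names, taken from the two directory listings in this README.
required_dirs <- c("Data", "Final Code")
dir_status <- check_required_dirs(required_dirs)

missing_dirs <- names(dir_status)[!dir_status]
if (length(missing_dirs) > 0) {
  message("Missing folders: ", paste(missing_dirs, collapse = ", "))
} else {
  message("All required folders found.")
}
```

Run it from the project root; any folder it reports as missing needs to be pulled or created before the scripts will work.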
## Data Gathering & Cleaning
This directory contains five finalized R scripts that gather and clean our data. The scripts draw on several data sources:
1. ESPN web API
2. `hoopR::load_mbb_team_box()` function
3. <https://www.sports-reference.com/> AP poll data
4. Google Maps API to geocode arena information
Below are a few examples of our data gathering and cleaning process.
### Loading Packages
Most scripts list their packages at the top and will install or load each one depending on your setup. If a package has to be installed, you will need to run its line once more after the install completes in order to load it. Refer to `0. Package Check.R` if there are any concerns.
```{r, warning=FALSE, message=FALSE,eval=TRUE}
if (!require('devtools')) install.packages('devtools')
if (!require('tidyverse')) install.packages('tidyverse')
if (!require('tidymodels')) install.packages('tidymodels')
if (!require('tictoc')) install.packages('tictoc')
if (!require('hoopR')) devtools::install_github('sportsdataverse/hoopR')
if (!require('glue')) install.packages('glue')
if (!require('httr')) install.packages('httr')
if (!require('rvest')) install.packages('rvest')
if (!require('here')) install.packages('here')
if (!require('ggmap')) install.packages('ggmap')
if (!require('glmnet')) install.packages('glmnet')
if (!require('vip')) install.packages('vip')
if (!require('caret')) install.packages('caret')
if (!require('xgboost')) install.packages('xgboost')
if (!require('ggcorrplot')) install.packages('ggcorrplot')
if (!require('zoo')) install.packages('zoo')
if (!require('lubridate')) install.packages('lubridate')
if (!require('geosphere')) install.packages('geosphere')
if (!require('ggthemes')) install.packages('ggthemes')
if (!require('forcats')) install.packages('forcats')
if (!require('bookdown')) install.packages('bookdown')
if (!require('kableExtra')) install.packages('kableExtra')
if (!require('doParallel')) install.packages('doParallel')
if (!require('scales')) install.packages('scales')
if (!require('stringr')) install.packages('stringr')
require(parallel)
```
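The `if (!require(...)) install.packages(...)` pattern installs a missing package but does not attach it, which is why the line has to be run a second time. One way to avoid the second run is to install and attach in a single call. The helper below is only a sketch: `load_or_install()` is a hypothetical name, and it covers CRAN packages, not the GitHub install used for hoopR.

```r
# Hypothetical helper: install a CRAN package if needed, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  # Return (invisibly) whether the package is now on the search path
  invisible(paste0("package:", pkg) %in% search())
}

# Demonstrated on packages that ship with R, so nothing is downloaded here.
for (pkg in c("stats", "parallel")) {
  load_or_install(pkg)
}
```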
### Men's Basketball Boxscore
This code uses the hoopR package and the tidyverse to quickly collect box score data for all teams. The object name refers to 2012 through 2022, but this example loads only the 2022 season.
```{r, warning=FALSE, message=FALSE}
mbb_box_score_2012_2022_tbl <- hoopR::load_mbb_team_box(seasons = 2022)
mbb_box_score_2012_2022_tbl %>% head()
```
### AP Poll Data
This function scrapes the AP poll data from sports-reference using rvest and the tidyverse. Additional data cleaning was performed in our `data_cleanup.R` script.
```{r, message=FALSE,warning=FALSE}
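get_ap_polls <- function(year) {
  # Build the season-specific polls URL
  url <- glue::glue("https://www.sports-reference.com/cbb/seasons/men/{year}-polls.html#ap-polls")
  webpage <- rvest::read_html(url)
  # Take the first table on the page
  data <- rvest::html_table(rvest::html_nodes(webpage, "table")[1])[[1]]
  # Two identifier columns, then a number for each remaining column,
  # so the name vector matches the column count exactly
  names(data) <- c("school", "conference", seq_len(ncol(data) - 2))
  data <- data %>%
    mutate(year = year)
  return(data)
}
# Scrape each season and stack the results
ap_poll_2012_2022_tbl <- lapply(2012:2022, get_ap_polls) %>% bind_rows()
ap_poll_2012_2022_tbl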
```
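The scraped table is wide, with numbered columns that we assume correspond to successive polls. A common first cleaning step for a table like this is to reshape it to one row per school and poll. The sketch below shows that idea in base R on a made-up two-school table. It is not the code from `data_cleanup.R`, and the school names, ranks, and the `week` and `rank` column names are placeholders.

```r
# Toy stand-in for one scraped season; every value here is made up.
polls_wide <- data.frame(
  school = c("School A", "School B"),
  conference = c("Conf X", "Conf Y"),
  `1` = c("1", "2"),
  `2` = c("1", ""),   # blank means unranked in that poll
  year = 2022,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column that is not an identifier is treated as a poll column.
week_cols <- setdiff(names(polls_wide), c("school", "conference", "year"))

# Stack one small data frame per poll column into a long table.
polls_long <- do.call(rbind, lapply(week_cols, function(week) {
  data.frame(
    school = polls_wide$school,
    conference = polls_wide$conference,
    year = polls_wide$year,
    week = as.integer(week),
    rank = as.integer(polls_wide[[week]]),
    stringsAsFactors = FALSE
  )
}))

# Drop the school-week pairs where the school was unranked.
polls_long <- polls_long[!is.na(polls_long$rank), ]
polls_long
```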
### ESPN Attendance Data
This data comes from the ESPN API. Access is free but is throttled depending on use. If the script returns no data or missing data, a VPN may be required to get around the ESPN restrictions. This function can be modified to include box score data similar to the `hoopR::load_mbb_team_box()` function. Using the hoopR function saved us time for this project, but pulling directly from the API is still useful because its data is much more likely to be complete.
```{r}
# Select the first 5 game IDs from our tibble
game_ids_vec <- mbb_box_score_2012_2022_tbl %>%
  pull(game_id) %>%
  unique() %>%
  .[1:5]
# Define the ESPN attendance function
get_attendance_espn_api <- function(game_id) {
  tryCatch({
    url <- glue::glue("https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/summary?event={game_id}")
    txt <- httr::GET(url) %>% httr::content(as = "text")
    game_info <- jsonlite::fromJSON(txt)
    # Flatten the game info, predictor, and odds details into one row
    temp <- game_info %>%
      enframe() %>%
      pivot_wider() %>%
      select(gameInfo, predictor, odds) %>%
      unnest_wider(gameInfo) %>%
      unnest_wider(venue) %>%
      unnest_wider(predictor) %>%
      unnest_wider(homeTeam, names_sep = "_") %>%
      unnest_wider(awayTeam, names_sep = "_") %>%
      unnest_wider(address) %>%
      unnest_wider(odds) %>%
      select(-officials, -images, -grass, -id, -header) %>%
      mutate(game_id = game_id, type = 1)
    # The neutral-site flag is read from the header element
    temp2 <- game_info %>%
      enframe() %>%
      pivot_wider() %>%
      select(header) %>%
      unnest_wider(header)
    temp$neutral_site <- temp2$competitions$neutralSite
    return(temp)
  }, error = function(e) {
    # type = 0 marks games whose data could not be retrieved
    message(paste("Error getting data for game", game_id))
    temp <- data.frame(game_id = game_id, type = 0)
    return(temp)
  })
}
# Time the pull for the five sample games
tictoc::tic()
mbb_attendance_2012_2022_tbl <- lapply(game_ids_vec, get_attendance_espn_api) %>% bind_rows()
tictoc::toc()
mbb_attendance_2012_2022_tbl
```
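Because the API throttles requests, a pull that fails once may succeed on a later attempt. The wrapper below is one possible way to retry, assuming the fetch function marks failures with `type = 0` as `get_attendance_espn_api()` does. `retry_until_ok()` and the fake fetcher are hypothetical and exist only for illustration, and the wait times are arbitrary.

```r
# Hypothetical wrapper: call fetch_fn(id) until it reports type == 1,
# pausing a little longer after each failed attempt.
retry_until_ok <- function(fetch_fn, id, max_tries = 3, wait_seconds = 2) {
  result <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- fetch_fn(id)
    if (isTRUE(result$type[1] == 1)) break
    if (attempt < max_tries) Sys.sleep(wait_seconds * attempt)
  }
  result
}

# Stand-in fetcher that fails a set number of times, then succeeds.
# No network access is involved.
make_fake_fetcher <- function(failures = 2) {
  calls <- 0
  function(id) {
    calls <<- calls + 1
    data.frame(game_id = id, type = as.integer(calls > failures))
  }
}

fake_fetch <- make_fake_fetcher()
retry_until_ok(fake_fetch, id = 1, max_tries = 5, wait_seconds = 0)
```

With the real function, the same wrapper could be applied per game, for example `lapply(game_ids_vec, function(id) retry_until_ok(get_attendance_espn_api, id))`.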
## Model & Visualization
This directory contains 3 scripts in total:
1. model_building.R
2. model_predictions.R
3. draw_confusion_matrix.R
Model building should be run first in order to create the models, assuming you've already created or pulled all relevant data sources. Model predictions will access our stored models and create the relevant R objects for the RMarkdown file. The confusion matrix script is a helper script that adds a bit of flair to our contingency tables.
### Model Building
We will not cover the models in detail, but the general concept is to read in the data created in the previous steps and then apply several modeling techniques, including LASSO regression, logistic regression, probit regression, and decision trees. Please see the tidymodels documentation at <https://www.tidymodels.org/> to find out more about our approach and see examples.
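
For a feel of two of these techniques without opening the scripts, the sketch below fits a logistic and a probit regression with base R's `glm()` on simulated data. It is not our tidymodels workflow, and the `rating_diff` feature and `home_win` outcome are invented for the example.

```r
set.seed(15)
n_games <- 200

# Simulated data: one made-up feature and a binary outcome.
games <- data.frame(rating_diff = rnorm(n_games))
games$home_win <- rbinom(n_games, size = 1, prob = plogis(0.8 * games$rating_diff))

# Same formula, two link functions.
logit_fit <- glm(home_win ~ rating_diff, data = games,
                 family = binomial(link = "logit"))
probit_fit <- glm(home_win ~ rating_diff, data = games,
                  family = binomial(link = "probit"))

# Predicted win probabilities from each model.
games$p_logit <- predict(logit_fit, type = "response")
games$p_probit <- predict(probit_fit, type = "response")
head(games)
```

The two models differ only in the link function, so their predicted probabilities are usually close.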
### Model Predictions
Again, we will not cover the model information in detail, as the code is quite long. The concept is similar to Model Building: if you have the models created or pulled from the repository, this script can be run to create our contingency tables for evaluation.
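
A contingency table of predicted versus actual outcomes can be built with base R's `table()`. The sketch below uses made-up outcomes and probabilities with an assumed 0.5 cutoff. It does not reproduce `model_predictions.R` or `draw_confusion_matrix.R`.

```r
# Made-up outcomes and predicted probabilities for eight games.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
pred_prob <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

# Assumed cutoff: call a win when the predicted probability is at least 0.5.
predicted <- as.integer(pred_prob >= 0.5)

# Fixing the factor levels keeps the table 2 x 2 even if a class is absent.
confusion <- table(
  predicted = factor(predicted, levels = c(0, 1)),
  actual = factor(actual, levels = c(0, 1))
)

# Accuracy is the share of games on the diagonal of the table.
accuracy <- sum(diag(confusion)) / sum(confusion)
confusion
accuracy
```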
## Wrap up
Below is the session information that was used to create this report. The packages and their versions are listed. R 4.2.3 is required.
```{r}
sessionInfo()
```
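To compare your own setup against the R version noted above, a check like the following can be run first. It treats 4.2.3 as a minimum version, which is an assumption about the requirement and not something the scripts enforce.

```r
# Assumption: 4.2.3 is treated as the minimum acceptable R version.
required_r <- "4.2.3"
r_version_ok <- getRversion() >= required_r

if (r_version_ok) {
  message("R ", as.character(getRversion()), " meets the ", required_r, " requirement.")
} else {
  message("R ", as.character(getRversion()), " is older than ", required_r, "; consider upgrading.")
}
```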