README.Rmd

---
title: "Team 15 MGT Madness"
output: github_document
---

The goal of MGT Madness was to create a machine learning model that reliably predicted the outcome of both March Madness and regular season college basketball games at a level equal or better to ESPN's industry-standard prediction models.

## Github File Structure

```{r setup, echo=FALSE,message=FALSE,warning=FALSE}
library(tidyverse)
library(data.tree)
```

If you plan to use this repository ensure that you have the following folders stored in the same directory as your .Rproj file. These directories are required to ensure each script runs. Make sure to include each of the .csv and/or .rds data files.

```{r, echo = FALSE,message=FALSE,warning=FALSE}
paths = unique(c(list.dirs(path = "Data",full.names = T),list.files(path = "Data",full.names = T,recursive = TRUE)))
paths = paths[!grepl("deprecated|old",unique(c(list.dirs(path = "Data",full.names = T),list.files(path = "Data",full.names = T,recursive = TRUE))))]
library(data.tree)
library(plyr)

x <- lapply(strsplit(paths, "/"), function(z) as.data.frame(t(z)))
x <- rbind.fill(x)
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse="/"))
(mytree <- data.tree::as.Node(x))

```

```{r, echo = FALSE}
paths = unique(c(list.dirs(path = "Final Code",full.names = T),list.files(path = "Final Code",full.names = T,recursive = TRUE)))
library(data.tree)
library(plyr)

x <- lapply(strsplit(paths, "/"), function(z) as.data.frame(t(z)))
x <- rbind.fill(x)
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse="/"))
(mytree <- data.tree::as.Node(x))

```

## Data Gathering & Cleaning

Within this directory there are 5 R scripts that were finalized to gather and clean each of our data sources. These scripts utilize numerous data sources such as:

1.  ESPN web API
2.   `hoopR::load_mbb_team_box()` function
3.  <https://www.sports-reference.com/> AP poll data
4.  Google Maps API to geocode arena information

Below are a few examples of our data gathering and cleaning process.

### Loading Packages

Most scripts will have packages listed at the top and should install or load depending on your situation. If has to be installed, you will need to run the line once more when the install completes in order to load the package. Refer to the `0. Package Check.R` if there are any concerns.

```{r, warning=FALSE, message=FALSE,eval=TRUE}
if (!require('devtools')) install.packages('devtools')
if (!require('tidyverse')) install.packages('tidyverse')
if (!require('tidymodels')) install.packages('tidymodels')
if (!require('tictoc')) install.packages('tictoc')
if (!require('hoopR')) devtools::install_github('sportsdataverse/hoopR')
if (!require('glue')) install.packages('glue')
if (!require('httr')) install.packages('httr')
if (!require('rvest')) install.packages('rvest')
if (!require('here')) install.packages('here')
if (!require('ggmap')) install.packages('ggmap')
if (!require('glmnet')) install.packages('glmnet')
if (!require('vip')) install.packages('vip')
if (!require('caret')) install.packages('caret')
if (!require('xgboost')) install.packages('xgboost')
if (!require('ggcorrplot')) install.packages('ggcorrplot')
if (!require('zoo')) install.packages('zoo')
if (!require('lubridate')) install.packages('lubridate')
if (!require('geosphere')) install.packages('geosphere')
if (!require('ggthemes')) install.packages('ggthemes')
if (!require('forcats')) install.packages('forcats')
if(!require('bookdown')) install.packages('bookdown')
if(!require('kableExtra')) install.packages('kableExtra')
if(!require('doParallel')) install.packages('doParallel')
if(!require('scales')) install.packages('scales')
if(!require('stringr')) install.packages('stringr')
require(parallel)
```

### Men's Basketball Boxscore

This code will use the hoopR package and tidyverse to quickly collect the 2022 box score data for all teams.

```{r, warning=FALSE, message=FALSE}
mbb_box_score_2012_2022_tbl <- hoopR::load_mbb_team_box(seasons = 2022)

mbb_box_score_2012_2022_tbl %>% head()
```

### AP Poll Data

This function was built to scrape the AP poll data from sports-reference. Both the rvest and tidyverse suite of packages were used. Additional data cleaning was performed in our `data_cleanup.R` script.

```{r, message=FALSE,warning=FALSE}
get_ap_polls <- function(year){

  url <- glue::glue("https://www.sports-reference.com/cbb/seasons/men/{year}-polls.html#ap-polls")
  
  webpage <- read_html(url)
  
  data <- html_table(html_nodes(webpage,"table")[1])[[1]]
  names(data) <- c("school","conference",1:ncol(data))
  data <- data %>%
    mutate(year = year)
  
  return(data)
  
}

ap_poll_2012_2022_tbl <- lapply(2012:2022, get_ap_polls) %>% bind_rows()

ap_poll_2012_2022_tbl
```

### ESPN Attendance Data

This data requires access to the ESPN API. This access is free but will be throttled depending on use. If the script returns no data or missing data then a VPN may be required to bypass the ESPN restrictions. This function can be modified to include box score data similar to the `hoopR::load_mbb_team_box()` function. Using the hoopR function saved us time for this project but pulling directly from the API had many uses as the data is much more likely to be completely filled in.

```{r}

# Select First 5 Game IDs from our tibble
game_ids_vec <- mbb_box_score_2012_2022_tbl %>%
  pull(game_id) %>%
  unique() %>%
  .[1:5]

# Initialize ESPN Attendance function
get_attendance_espn_api <- function(game_id){

  tryCatch({url = glue::glue("https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/summary?event={game_id}")
    
    txt = httr::GET(url) %>% httr::content(as = "text")
    
    game_info <- jsonlite::fromJSON(txt)
    temp <- game_info %>% 
      enframe() %>% 
      pivot_wider() %>% 
      select(gameInfo, predictor, odds) %>%  
      unnest_wider(gameInfo) %>% 
      unnest_wider(venue) %>% 
      unnest_wider(predictor) %>%
      unnest_wider(homeTeam,names_sep = "_") %>% 
      unnest_wider(awayTeam,names_sep = "_") %>% 
      unnest_wider(address) %>% 
      unnest_wider(odds) %>%
      select(-officials, -images, -grass, -id, -header) %>%
      mutate(game_id = game_id, type = 1)
    
    temp2 <- game_info %>% 
      enframe() %>% 
      pivot_wider() %>% 
      select(header) %>%
      unnest_wider(header)
    
    temp$neutral_site <- temp2$competitions$neutralSite
    
    return(temp)
  }, error = function(e) {
    message(paste("Error getting data for date", game_id))
    temp <- data.frame(game_id = game_id, type = 0)
    return(temp)
  })
}

tictoc::tic()
mbb_attendance_2012_2022_tbl <- lapply(game_ids_vec, get_attendance_espn_api) %>% bind_rows()
tictoc::toc()

mbb_attendance_2012_2022_tbl
```

## Model & Visualization

This directory contains 3 scripts in total:

1.  model_building.R
2.  model_predictions.R
3.  draw_confusion_matrix.R

Model building should be run first in order to create the models, assuming you've already create/pulled all relevant data sources. Model predictions will access our stored models and create the relevant R objects for the RMarkdown file. The confusion matrix script is meant to be a helper script that adds a bit of flare to our contingency tables.

### Model Building

We will not cover the models in detail but the general concept here is to read in the data created in the previous steps then perform various modeling techniques ranging from LASSO regression, Logistic Regression, Probit Regression and Decision Trees. Please see the tidymodels documentation to find out more about our approach and see examples. You can find this documentation at <https://www.tidymodels.org/>

### Model Predictions

Again, we will not cover the model information in detail as the code is quite long. Once again the concept is similar to Model Building. If you have the models created or pulled from the repository, this script can be ran and our contingency tables created for evaluation.

## Wrap up

Below is the session information that was used to create this report. The packages and their versions are listed. R 4.2.3 is required.

```{r}

sessionInfo()
```