model-mlr.qmd

# Linear regression with multiple predictors {#sec-model-mlr}

```{r}
#| include: false
source("_common.R")
```

::: {.chapterintro data-latex=""}
Building on the ideas of one predictor variable in a linear regression model (from [Chapter -@sec-model-slr]), a multiple linear regression model is now fit to two or more predictor variables.
By considering how different explanatory variables interact, we can uncover complicated relationships between the predictor variables and the response variable.
One challenge to working with multiple variables is that it is sometimes difficult to know which variables are most important to include in the model.
Model building is an extensive topic, and we scratch the surface here by defining and utilizing the adjusted $R^2$ value.
:::

Multiple regression extends single predictor variable regression to the case that still has one response but many predictors (denoted $x_1$, $x_2$, $x_3$, ...).
The method is motivated by scenarios where many variables may be simultaneously connected to an output.

```{r}
#| label: loans-prep
#| include: false
loans <- loans_full_schema |>
  mutate(
    credit_util = total_credit_utilized / total_credit_limit,
    bankruptcy = as.factor(if_else(public_record_bankrupt == 0, 0, 1)),
    verified_income = droplevels(verified_income)
  ) |>
  rename(credit_checks = inquiries_last_12m) |>
  select(interest_rate, verified_income, debt_to_income, credit_util, bankruptcy, term, credit_checks, issue_month)
```

We will consider data about loans from the peer-to-peer lender, Lending Club, which is a dataset we first encountered in [Chapter -@sec-data-hello].
The loan data includes terms of the loan as well as information about the borrower.
The outcome variable we would like to better understand is the interest rate assigned to the loan.
For instance, all other characteristics held constant, does it matter how much debt someone already has?
Does it matter if their income has been verified?
Multiple regression will help us answer these and other questions.

```{r}
#| label: loans-n-rows-to-show
#| include: false
n_rows_to_show <- 6
n_rows_to_show_string <- "six"
```

The dataset includes information on `r nrow(loans)` loans, and we'll be looking at a subset of the available variables, some of which will be new from those we saw in earlier chapters.
The first `r n_rows_to_show_string` observations in the dataset are shown in @tbl-loans-data-matrix, and descriptions for each variable are shown in @tbl-loans-variables.
Notice that the past bankruptcy variable (`bankruptcy`) is an indicator variable, where it takes the value 1 if the borrower had a past bankruptcy in their record and 0 if not.
Using an indicator variable in place of a category name allows for these variables to be directly used in regression.
Two of the other variables are categorical (`verified_income` and `issue_month`), each of which can take one of a few different non-numerical values; we'll discuss how these are handled in the model in @sec-ind-and-cat-predictors.

::: {.data data-latex=""}
The [`loans_full_schema`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
Based on the data in this dataset we have created two new variables: `credit_util` which is calculated as the total credit utilized divided by the total credit limit and `bankruptcy` which turns the number of bankruptcies to an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcy).
We will refer to this modified dataset as `loans`.
:::

```{r}
#| label: tbl-loans-data-matrix
#| tbl-cap: First six rows of the `loans` dataset.
#| tbl-pos: H
loans |>
  slice_head(n = n_rows_to_show) |>
  kbl(
    linesep = "", booktabs = TRUE,
    align = "rlrrrr"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped", "scale_down"),
    full_width = FALSE
  )
```

```{r}
#| label: tbl-loans-variables
#| tbl-cap: Variables and their descriptions for the `loans` dataset.
#| tbl-pos: H
loans_var_def <- tribble(
  ~variable,         ~description,
  "interest_rate",   "Interest rate on the loan, in an annual percentage.",
  "verified_income", "Categorical variable describing whether the borrower's income source and amount have been verified, with levels Verified (source and amount verified), Source Verified (source only verified), and Not Verified.",
  "debt_to_income",  "Debt-to-income ratio, which is the percentage of total debt of the borrower divided by their total income.",
  "credit_util",     "Of all the credit available to the borrower, what fraction are they utilizing. For example, the credit utilization on a credit card would be the card's balance divided by the card's credit limit.",
  "bankruptcy",      "An indicator variable for whether the borrower has a past bankruptcy in their record. This variable takes a value of `1` if the answer is *yes* and `0` if the answer is *no*.",
  "term",            "The length of the loan, in months.",
  "issue_month",     "The month and year the loan was issued, which for these loans is always during the first quarter of 2018.",
  "credit_checks",   "Number of credit checks in the last 12 months. For example, when filing an application for a credit card, it is common for the company receiving the application to run a credit check.",
)

loans_var_def |>
  kbl(
    linesep = "", booktabs = TRUE,
    col.names = c("Variable", "Description"),
    format = "markdown"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped"), full_width = TRUE
  ) |>
  column_spec(1, monospace = TRUE) |>
  column_spec(2, width = "30em")
```

## Indicator and categorical predictors {#sec-ind-and-cat-predictors}

Let's start by fitting a linear regression model for interest rate with a single predictor indicating whether a person has a bankruptcy in their record:

$$
\widehat{\texttt{interest\_rate}} = 12.34 + 0.74 \times \texttt{bankruptcy}
$$

Results of this model are shown in @tbl-int-rate-bankruptcy.

```{r}
#| label: tbl-int-rate-bankruptcy
#| tbl-cap: |
#|   Summary of a linear model for predicting `interest_rate`
#|   based on whether the borrower has a bankruptcy in their
#|   record. Degrees of freedom for this model is 9998.
#| tbl-pos: H
m_bankruptcy <- lm(interest_rate ~ bankruptcy, data = loans)

m_bankruptcy |>
  tidy() |>
  mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
  kbl(
    linesep = "", booktabs = TRUE,
    digits = 2, align = "lrrrr"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  column_spec(1, width = "13em", monospace = TRUE) |>
  column_spec(2:5, width = "4em")
```

::: {.workedexample data-latex=""}
Interpret the coefficient for the past bankruptcy variable in the model.

------------------------------------------------------------------------

The variable takes one of two values: 1 when the borrower has a bankruptcy in their history and 0 otherwise.
A slope of 0.74 means that the model predicts a 0.74% higher interest rate for those borrowers with a bankruptcy in their record.
(See @sec-categorical-predictor-two-levels for a review of the interpretation for two-level categorical predictor variables.)
:::

Suppose we had fit a model using a 3-level categorical variable, such as `verified_income`.
The output from software is shown in @tbl-int-rate-ver-income.
This regression output provides multiple rows for the variable.
Each row represents the relative difference for each level of `verified_income`.
However, we are missing one of the levels: `Not Verified`.
The missing level is called the **reference level**\index{reference level} and it represents the default level that other levels are measured against.

```{r}
#| include: false
terms_chp_08 <- c("reference level")
```

```{r}
#| label: tbl-int-rate-ver-income
#| tbl-cap: |
#|   Summary of a linear model for predicting `interest_rate` from
#|   the borrower’s income source and amount verification. 
#|   This predictor has three levels, which results in 2
#|   rows in the regression output.
#| tbl-pos: H
m_verified_income <- lm(interest_rate ~ verified_income, data = loans)

m_verified_income |>
  tidy() |>
  mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
  kbl(
    linesep = "", booktabs = TRUE,
    digits = 2, align = "lrrrr"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  column_spec(1, width = "17em", monospace = TRUE) |>
  column_spec(2:5, width = "5em")
```

::: {.workedexample data-latex=""}
How would we write an equation for this regression model?

------------------------------------------------------------------------

The equation for the regression model may be written as a model with two predictors:

\vspace{-5mm}

$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} &= 11.10 \\
&+ 1.42 \times \texttt{verified\_income}_{\texttt{Source Verified}} \\
&+ 3.25 \times \texttt{verified\_income}_{\texttt{Verified}}
\end{aligned}
$$

We use the notation $\texttt{variable}_{\texttt{level}}$ to represent indicator variables for when the categorical variable takes a particular value.
For example, $\texttt{verified\_income}_{\texttt{Source Verified}}$ would take a value of 1 if it was for a borrower that was source verified, and it would take a value of 0 otherwise.
Likewise, $\texttt{verified\_income}_{\texttt{Verified}}$ would take a value of 1 if it was for a borrower that was verified, and 0 if it took any other value.
:::

The notation $\texttt{variable}_{\texttt{level}}$ may feel a bit confusing.
Let's figure out how to use the equation for each level of the `verified_income` variable.

::: {.workedexample data-latex=""}
Using the model for predicting interest rate from income verification type, compute the average interest rate for borrowers whose income source and amount are both *unverified*.

------------------------------------------------------------------------

When `verified_income` takes a value of `Not Verified`, then both indicator functions in the equation for the linear model are set to 0:

$$
\widehat{\texttt{interest\_rate}} = 11.10 + 1.42 \times 0 + 3.25 \times 0 = 11.10
$$

The average interest rate for these borrowers is 11.1%.
Because the level does not have its own coefficient and it is the reference value, the indicators for the other levels for this variable all drop out.
:::

::: {.workedexample data-latex=""}
Using the model for predicting interest rate from income verification type, compute the average interest rate for borrowers whose income source is *source verified*.

------------------------------------------------------------------------

When `verified_income` takes a value of `Source Verified`, then the corresponding variable takes a value of 1 while the other is 0:

$$
\widehat{\texttt{interest\_rate}} = 11.10 + 1.42 \times 1 + 3.25 \times 0 = 12.52
$$

The average interest rate for these borrowers is 12.52%.
:::

::: {.guidedpractice data-latex=""}
Compute the average interest rate for borrowers whose income source and amount are both *verified*.[^08-model-mlr-1]
:::

[^08-model-mlr-1]: When `verified_income` takes a value of `Verified`, then the corresponding variable takes a value of 1 while the other is 0: $11.10 + 1.42 \times 0 + 3.25 \times 1 = 14.35.$ The average interest rate for these borrowers is 14.35%.

::: {.important data-latex=""}
**Predictors with several categories.**

When fitting a regression model with a categorical variable that has $k$ levels where $k > 2$, software will provide a coefficient for $k - 1$ of those levels.
For the last level that does not receive a coefficient, this is the reference level, and the coefficients listed for the other levels are all considered relative to this reference level.
:::

::: {.guidedpractice data-latex=""}
Interpret the coefficients from the model above.[^08-model-mlr-2]
:::

[^08-model-mlr-2]: Each of the coefficients gives the incremental interest rate for the corresponding level relative to the `Not Verified` level, which is the reference level.
    For example, for a borrower whose income source and amount have been verified, the model predicts that they will have a 3.25% higher interest rate than a borrower who has not had their income source or amount verified.

The higher interest rate for borrowers who have verified their income source or amount is surprising.
Intuitively, we would think that a loan would look *less* risky if the borrower's income has been verified.
However, note that the situation may be more complex, and there may be confounding variables that we didn't account for.
For example, perhaps lenders require borrowers with poor credit to verify their income.
That is, verifying income in our dataset might be a signal of some concerns about the borrower rather than a reassurance that the borrower will pay back the loan.
For this reason, the borrower could be deemed higher risk, resulting in a higher interest rate.
(What other confounding variables might explain this counter-intuitive relationship suggested by the model?)

::: {.guidedpractice data-latex=""}
How much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?[^08-model-mlr-3]
:::

[^08-model-mlr-3]: Relative to the `Not Verified` category, the `Verified` category has an interest rate of 3.25% higher, while the `Source Verified` category is only 1.42% higher.
    Thus, `Verified` borrowers will tend to get an interest rate about $3.25\% - 1.42\% = 1.83\%$ higher than `Source Verified` borrowers.

## Many predictors in a model

The world is complex, and it can be helpful to consider many factors at once in statistical modeling.
For example, we might like to use the full context of borrowers to predict the interest rate they receive rather than using a single variable.
This is the strategy used in **multiple regression**\index{multiple regression}.
While we remain cautious about making any causal interpretations using multiple regression on observational data, such models are a common first step in gaining insights or providing some evidence of a causal connection.

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "multiple regression")
```

We want to construct a model that accounts not only for any past bankruptcy or whether the borrower had their income source or amount verified, but simultaneously accounts for all the variables in the `loans` dataset: `verified_income`, `debt_to_income`, `credit_util`, `bankruptcy`, `term`, `issue_month`, and `credit_checks`.

$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} &= b_0 \\
&+ b_1 \times \texttt{verified\_income}_{\texttt{Source Verified}} + b_2 \times \texttt{verified\_income}_{\texttt{Verified}} \\
&+ b_3 \times \texttt{debt\_to\_income} + b_4 \times \texttt{credit\_util} \\
&+ b_5 \times \texttt{bankruptcy} + b_6 \times \texttt{term} \\
&+ b_7 \times \texttt{credit\_checks} + b_8 \times \texttt{issue\_month}_{\texttt{Jan-2018}} \\
&+ b_9 \times \texttt{issue\_month}_{\texttt{Mar-2018}}
\end{aligned}
$$

This equation represents a holistic approach for modeling all of the variables simultaneously.
Notice that there are two coefficients for `verified_income` and two coefficients for `issue_month`, since both are 3-level categorical variables.

We calculate $b_0$, $b_1$, $b_2$, $\cdots$, $b_9$ the same way as we did in the case of a model with a single predictor -- we select values that minimize the sum of the squared residuals:

$$
SSE = e_1^2 + e_2^2 + \dots + e_{10000}^2 = \sum_{i=1}^{10000} e_i^2 = \sum_{i=1}^{10000} \left(y_i - \hat{y}_i\right)^2
$$

where $y_i$ and $\hat{y}_i$ represent the observed interest rates and their estimated values according to the model, respectively.
10,000 residuals are calculated, one for each observation.
Note that these values are sample statistics and in the case where the observed data is a random sample from a target population that we are interested in making inferences about, they are estimates of the population parameters $\beta_0$, $\beta_1$, $\beta_2$, $\cdots$, $\beta_9$.
We will discuss inference based on linear models in [Chapter -@sec-inf-model-mlr], for now we will focus on calculating sample statistics $b_i$.

We typically use a computer to minimize the sum of squares and compute point estimates, as shown in the sample output in @tbl-loans-full.
Using this output, we identify $b_i,$ just as we did in the one-predictor case.

```{r}
#| label: tbl-loans-full
#| tbl-cap: |
#|   Output for the regression model, where interest rate is the
#|   outcome and the variables listed are the predictors. Degrees
#|   of freedom for this model is 9990.
#| tbl-pos: H
m_full <- lm(interest_rate ~ ., data = loans)

m_full |>
  tidy() |>
  mutate(p.value = ifelse(p.value < 0.001, "<0.0001", round(p.value, 4))) |>
  kbl(linesep = "", booktabs = TRUE, digits = 2, align = "lrrrr") |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  column_spec(1, width = "15em", monospace = TRUE) |>
  column_spec(2:5, width = "5em")
```

::: {.important data-latex=""}
**Multiple regression model.**

A multiple regression model is a linear model with many predictors.
In general, we write the model as

$$
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k
$$

when there are $k$ predictors.
We always calculate $b_i$ using statistical software.
:::

::: {.workedexample data-latex=""}
Write out the regression model using the regression output from @tbl-loans-full.
How many predictors are there in this model?

------------------------------------------------------------------------

The fitted model for the interest rate is given by:

$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} &= 1.89 \\
&+ 1.00 \times \texttt{verified\_income}_{\texttt{Source Verified}} + 2.56 \times \texttt{verified\_income}_{\texttt{Verified}} \\
&+ 0.02 \times \texttt{debt\_to\_income} + 4.90 \times \texttt{credit\_util} \\
&+ 0.39 \times \texttt{bankruptcy} + 0.15 \times \texttt{term} \\
&+ 0.23 \times \texttt{credit\_checks} + 0.05 \times \texttt{issue\_month}_{\texttt{Jan-2018}} \\
&- 0.04 \times \texttt{issue\_month}_{\texttt{Mar-2018}}
\end{aligned}
$$

If we count up the number of predictor coefficients, we get the *effective* number of predictors in the model; there are nine of those.
Notice that the categorical predictor counts as two, once for each of the two levels shown in the model.
In general, a categorical predictor with $p$ different levels will be represented by $p - 1$ terms in a multiple regression model.
A total of seven variables were used as predictors to fit this model: `verified_income`, `debt_to_income`, `credit_util`, `bankruptcy`, `term`, `credit_checks`, `issue_month`.
:::

::: {.guidedpractice data-latex=""}
Interpret the coefficient of the variable `credit_checks`.[^08-model-mlr-4]
:::

[^08-model-mlr-4]: All else held constant, for each additional inquiry into the applicant's credit during the last 12 months, we would expect the interest rate for the loan to be higher, on average, by 0.23 points.

::: {.guidedpractice data-latex=""}
Compute the residual of the first observation in @tbl-loans-data-matrix using the full model.[^08-model-mlr-5]
:::

[^08-model-mlr-5]: To compute the residual, we first need the predicted value, which we compute by plugging values into the equation from earlier.
    For example, $\texttt{verified\_income}_{\texttt{Source Verified}}$ takes a value of 0, $\texttt{verified\_income}_{\texttt{Verified}}$ takes a value of 1 (since the borrower's income source and amount were verified), $\texttt{debt\_to\_income}$ was 18.01, and so on.
    This leads to a prediction of $\widehat{\texttt{interest\_rate}}_1 = 17.84$.
    The observed interest rate was 14.07%, which leads to a residual of $e_1 = 14.07 - 17.84 = -3.77$.

::: {.workedexample data-latex=""}
We calculated a slope coefficient of 0.74 for `bankruptcy` in @sec-ind-and-cat-predictors while the coefficient is 0.39 here.
Why is there a difference between the coefficient values between the models with single and multiple predictors?

------------------------------------------------------------------------

If we examined the data carefully, we would see that some predictors are correlated.
For instance, when we modeled the relationship of the outcome `interest_rate` and predictor `bankruptcy` using linear regression, we were unable to control for other variables like whether the borrower had their income verified, the borrower's debt-to-income ratio, and other variables.
That original model was constructed in a vacuum and did not consider the full context of everything that is considered when an interest rate is decided.
When we include all of the variables, underlying and unintentional bias that was missed by not including these other variables is reduced or eliminated.
Of course, bias can still exist from other confounding variables.
:::

The previous example describes a common issue in multiple regression: correlation among predictor variables.
We say the two predictor variables are **collinear** (pronounced as *co-linear*\index{collinear}) when they are correlated, and this **multicollinearity**\index{multicollinearity} complicates model estimation.
While it is impossible to prevent multicollinearity from arising in observational data, experiments are usually designed to prevent predictors from being multicollinear.

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "multicollinearity", "collinear")
```

::: {.guidedpractice data-latex=""}
The estimated value of the intercept is 1.89, and one might be tempted to make some interpretation of this coefficient, such as, it is the model's predicted interest rate when each of the variables take value zero: income source is not verified, the borrower has no debt (debt-to-income and credit utilization are zero), and so on.
Is this reasonable?
Is there any value gained by making this interpretation?[^08-model-mlr-6]
:::

[^08-model-mlr-6]: Many of the variables do take a value 0 for at least one data point, and for those variables, it is reasonable.
    However, one variable never takes a value of zero: `term`, which describes the length of the loan, in months.
    If `term` is set to zero, then the loan must be paid back immediately; the borrower must give the money back as soon as they receive it, which means it is not a real loan.
    Ultimately, the interpretation of the intercept in this setting is not insightful.

## Adjusted R-squared

We first used $R^2$ in @sec-r-squared to determine the amount of variability in the response that was explained by the model:

$$
R^2 = 1 - \frac{\text{variability in residuals}}{\text{variability in the outcome}}
    = 1 - \frac{Var(e_i)}{Var(y_i)}
$$ 

where $e_i$ represents the residuals of the model and $y_i$ the outcomes.
This equation remains valid in the multiple regression framework, but a small enhancement can make it even more informative when comparing models.

::: {.guidedpractice data-latex=""}
The variance of the residuals for the model given in the earlier Guided Practice is 18.53, and the variance of the total price in all the auctions is 25.01.
Calculate $R^2$ for this model.[^08-model-mlr-7]
:::

[^08-model-mlr-7]: $R^2 = 1 - \frac{18.53}{25.01} = 0.2591$.

This strategy for estimating $R^2$ works when there is a single predictor.
However, it becomes less helpful when there are many variables.
The regular $R^2$ is a biased estimate of the amount of variability explained by the model when applied to model with more than one predictor.
To get a better estimate, we use the adjusted $R^2$.

::: {.important data-latex=""}
**Adjusted R-squared as a tool for model assessment.**

The **adjusted R-squared**\index{adjusted R-squared} is computed as

\vspace{-5mm}

$$
\begin{aligned}
  R_{adj}^{2}
    &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)}
        {s_{\text{outcome}}^2 / (n-1)} = 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2}
        \times \frac{n-1}{n-k-1}
\end{aligned}
$$

where $n$ is the number of observations used to fit the model and $k$ is the number of predictor variables in the model.
Remember that a categorical predictor with $p$ levels will contribute $p - 1$ to the number of variables in the model.
:::

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "adjusted R-squared")
```

Because $k$ is never negative, the adjusted $R^2$ will be smaller -- often times just a little smaller -- than the unadjusted $R^2$.
The reasoning behind the adjusted $R^2$ lies in the **degrees of freedom**\index{degrees of freedom} associated with each variance, which is equal to $n - k - 1$ in the multiple regression context.
If we were to make predictions for *new data* using our current model, we would find that the unadjusted $R^2$ would tend to be slightly overly optimistic, while the adjusted $R^2$ formula helps correct this bias.

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "degrees of freedom")
```

::: {.guidedpractice data-latex=""}
There were n = 10,000 auctions in the dataset and $k=9$ predictor variables in the model.
Use $n$, $k$, and the variances from the earlier Guided Practice to calculate adjusted $R^2$ for the interest rate model.[^08-model-mlr-8]
:::

[^08-model-mlr-8]: $R_{adj}^2 = 1 - \frac{18.53}{25.01}\times \frac{10000-1}{10000-9-1} = 0.2584$.
    While the difference is very small, it will be important when we fine tune the model in the next section.

::: {.guidedpractice data-latex=""}
Suppose you added another predictor to the model, but the variance of the errors $Var(e_i)$ didn't go down.
What would happen to the $R^2$?
What would happen to the adjusted $R^2$?[^08-model-mlr-9]
:::

[^08-model-mlr-9]: The unadjusted $R^2$ would stay the same and the adjusted $R^2$ would go down.

Adjusted $R^2$ could also have been used in [Chapter -@sec-model-slr] where we introduced regression models with a single predictor.
However, when there is only $k = 1$ predictors, adjusted $R^2$ is very close to regular $R^2$, so this nuance isn't typically important when the model has only one predictor.

## Model selection {#sec-model-selection}

The best model is not always the most complicated.
Sometimes including predictors that are not evidently important can actually reduce the accuracy of predictions.
In this section, we discuss model selection strategies, which will help us eliminate predictors from the model that are found to be less important.
It's common (and hip, at least in the statistical world) to refer to models that have undergone such predictor pruning as **parsimonious**\index{parsimonious}.

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "parsimonious")
```

In practice, the model that includes all available predictors is often referred to as the **full model**\index{full model}.
The full model may not be the best model, and if it isn't, we want to identify a smaller model that is preferable.

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "full model")
```

### Stepwise selection

Two common strategies for adding or removing predictors in a multiple regression model are called backward elimination and forward selection.
These techniques are often referred to as **stepwise selection**\index{stepwise selection} strategies, because they add or delete one variable at a time as they "step" through the candidate predictors.

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "stepwise selection")
```

\clearpage

**Backward elimination**\index{backward elimination} starts with the full model -- the model that includes all potential predictor variables. Predictors are eliminated one-at-a-time from the model until we cannot improve the model any further.

**Forward selection**\index{forward selection} is the reverse of the backward elimination technique.
Instead, of eliminating predictors one-at-a-time, we add predictors one-at-a-time until we cannot find any predictors that improve the model any further.

```{r}
#| include: false
terms_chp_08 <- c(terms_chp_08, "backward elimination", "forward selection")
```

An important consideration in implementing either of these stepwise selection strategies is the criterion used to decide whether to eliminate or add a predictors.
One commonly used decision criterion is adjusted $R^2$.
When using adjusted $R^2$ as the decision criterion, we seek to eliminate or add predictors depending on whether they lead to the largest improvement in adjusted $R^2$ and we stop when adding or elimination of another predictor does not lead to further improvement in adjusted $R^2$.

Adjusted $R^2$ describes the strength of a model fit, and it is a useful tool for evaluating which predictors are adding value to the model, where *adding value* means they are (likely) improving the accuracy in predicting future outcomes.

Let's consider two models, which are shown in @tbl-loans-full-for-model-selection and @tbl-loans-full-except-issue-month.
The first table summarizes the full model since it includes all predictors, while the second does not include the `issue_month` variable.

```{r}
#| label: tbl-loans-full-for-model-selection
#| tbl-cap: |
#|   The fit for the full regression model, including the adjusted $R^2$.
#| tbl-pos: H
options(digits = 6) # to get more digits
m_full_r_sq_adj <- glance(m_full)$adj.r.squared |> round(4)
options(digits = 3) # to get back to default set in _common.R
m_full_df_residual <- glance(m_full)$df.residual

m_full_w_rsq <- m_full |>
  tidy() |>
  mutate(p.value = ifelse(p.value < 0.001, "<0.0001", round(p.value, 4))) |>
  add_row(term = glue("Adjusted R-sq = {m_full_r_sq_adj}")) |>
  add_row(term = glue("df = {m_full_df_residual}"))

m_full_w_rsq |>
  kbl(
    linesep = "", booktabs = TRUE,
    digits = 2, align = "lrrrr"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  column_spec(1, width = "17em") |>
  column_spec(1, monospace = ifelse(as.numeric(rownames(m_full_w_rsq)) < 11, TRUE, FALSE)) |>
  column_spec(2:5, width = "5em") |>
  pack_rows("", 11, 12) |>
  add_indent(11:12) |>
  row_spec(11:12, italic = TRUE)
```

```{r}
#| label: tbl-loans-full-except-issue-month
#| tbl-cap: |
#|   The fit for the regression model after dropping issue month,
#|   including the adjusted $R^2$.
#| tbl-pos: H
m_full_minus_issue_month <- lm(interest_rate ~ . - issue_month, data = loans)

options(digits = 6) # to get more digits
m_full_minus_issue_month_r_sq_adj <- glance(m_full_minus_issue_month)$adj.r.squared |> round(4)
options(digits = 3) # to get back to default set in _common.R
m_full_minus_issue_month_df_residual <- glance(m_full_minus_issue_month)$df.residual

m_full_minus_issue_month_w_rsq <- m_full_minus_issue_month |>
  tidy() |>
  mutate(p.value = ifelse(p.value < 0.001, "<0.0001", round(p.value, 4))) |>
  add_row(term = glue("Adjusted R-sq = {m_full_minus_issue_month_r_sq_adj}")) |>
  add_row(term = glue("df = {m_full_minus_issue_month_df_residual}"))

m_full_minus_issue_month_w_rsq |>
  kbl(
    linesep = "", booktabs = TRUE,
    digits = 2, align = "lrrrr"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  column_spec(1, width = "17em") |>
  column_spec(1, monospace = ifelse(as.numeric(rownames(m_full_minus_issue_month_w_rsq)) < 9, TRUE, FALSE)) |>
  column_spec(2:5, width = "5em") |>
  pack_rows("", 9, 10) |>
  add_indent(9:10) |>
  row_spec(9:10, italic = TRUE)
```

::: {.workedexample data-latex=""}
Which of the two models is better?

------------------------------------------------------------------------

We compare the adjusted $R^2$ of each model to determine which to choose.
Since the second model has a higher adjusted $R^2$ compared to the first model, we prefer the second model to the first.
We cannot know for sure, but based on the adjusted $R^2$, this is our best assessment.
:::

::: {.workedexample data-latex=""}
Results corresponding to the full model for the `loans` data are shown in @tbl-loans-full-for-model-selection.
How should we proceed under the backward elimination strategy?

------------------------------------------------------------------------

Our baseline adjusted $R^2$ from the full model is 0.2597, and we need to determine whether dropping a predictor will improve the adjusted $R^2$.
To check, we fit models that each drop a different predictor, and we record the adjusted $R^2$:

-   Excluding `verified_income`: 0.2238
-   Excluding `debt_to_income`: 0.2557
-   Excluding `credit_util`: 0.1916
-   Excluding `bankruptcy`: 0.2589
-   Excluding `term`: 0.1468
-   Excluding `credit_checks`: 0.2484
-   Excluding `issue_month`: 0.2598

The model without `issue_month` has the highest adjusted $R^2$ of 0.2598, higher than the adjusted $R^2$ for the full model; therefore, we drop `issue_month` from the model.

Since we eliminated a predictor from the model in the first step, we see whether we should eliminate any additional predictors.
Our baseline adjusted $R^2$ is now $R^2_{adj} = 0.2598$.
We now fit new models, which consider eliminating each of the remaining predictors in addition to `issue_month`:

-   Excluding `issue_month` and `verified_income`: 0.22395
-   Excluding `issue_month` and `debt_to_income`: 0.25579
-   Excluding `issue_month` and `credit_util`: 0.19174
-   Excluding `issue_month` and `bankruptcy`: 0.25898
-   Excluding `issue_month` and `term`: 0.14692
-   Excluding `issue_month` and `credit_checks`: 0.24801

None of these models lead to an improvement in adjusted $R^2$, so we do not eliminate any of the remaining predictors.
That is, after backward elimination, we are left with the model that keeps all predictors except `issue_month`, which we can summarize using the coefficients from @tbl-loans-full-except-issue-month.

\vspace{-5mm}

$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} &= 1.90 \\
&+ 1.00 \times \texttt{verified\_income}_\texttt{Source only} + 2.56 \times \texttt{verified\_income}_\texttt{Verified} \\
&+ 0.02 \times \texttt{debt\_to\_income} + 4.90 \times \texttt{credit\_util} \\
&+ 0.39 \times \texttt{bankruptcy} + 0.15 \times \texttt{term} \\
&+ 0.23 \times \texttt{credit\_checks}
\end{aligned}
$$
:::

::: {.workedexample data-latex=""}
Construct a model for predicting `interest_rate` from the `loans` data using forward selection.

------------------------------------------------------------------------

We start with the model that includes no predictors.
Then we fit each of the possible models with just one predictor.
Then we examine the adjusted $R^2$ for each of these models:

-   Including `verified_income`: 0.05926
-   Including `debt_to_income`: 0.01946
-   Including `credit_util`: 0.06452
-   Including `bankruptcy`: 0.00222
-   Including `term`: 0.12855
-   Including `credit_checks`: -0.0001
-   Including `issue_month`: 0.01711

In this first step, we compare the adjusted $R^2$ against a baseline model that has no predictors, which always has $R_{adj}^2 = 0$.
The model with one predictor that has the largest adjusted $R^2$ is the model with the `term` predictor, so we will add this variable to our model.

We repeat the process again, this time considering 2-predictor models where one of the predictors is `term` and with a new baseline of $R^2_{adj} = 0.12855:$

-   Including `term` and `verified_income`: 0.16851
-   Including `term` and `debt_to_income`: 0.14368
-   Including `term` and `credit_util`: 0.20046
-   Including `term` and `bankruptcy`: 0.13070
-   Including `term` and `credit_checks`: 0.12840
-   Including `term` and `issue_month`: 0.14294

The model including `credit_util` has the largest increase in adjusted $R^2$ (0.20046) from the baseline (0.12855),
Thus, we will also add `credit_util` to the model as a predictor.

Now we have a new baseline adjusted $R^2$ of 0.20046.
We can continue on and see whether it would be beneficial to add a third predictor:

-   Including `term`, `credit_util`, and `verified_income`: 0.24183
-   Including `term`, `credit_util`, and `debt_to_income`: 0.20810
-   Including `term`, `credit_util`, and `bankruptcy`: 0.20169
-   Including `term`, `credit_util`, and `credit_checks`: 0.20031
-   Including `term`, `credit_util`, and `issue_month`: 0.21629

The model including `verified_income` has the largest increase in adjusted $R^2$ (0.24183) from the baseline (0.20046), so we add `verified_income` to the model as a predictor as well.

We continue in this way, next adding `debt_to_income`, then `credit_checks`, and `bankruptcy`.
At this point, we come again to the `issue_month` variable: adding this as a predictor leads to $R_{adj}^2 = 0.25843$, while keeping all the other predictors but excluding `issue_month` has a higher $R_{adj}^2 = 0.25854$.
This means we do not add `issue_month` to the model as a predictor.
In this example, we have arrived at the same model that we identified with backward elimination.
:::

::: {.important data-latex=""}
**Stepwise selection strategies.**

Backward elimination begins with the model having the largest number of predictors and eliminates predictors one-by-one until we are satisfied that all remaining predictors are important to the model.
Forward selection starts with no predictors included in the model, then it adds in predictors according to their importance until no other important predictors are found.
Notice that, for both methods, we have always chosen to retain the model with the largest adjusted $R^2$ value, even if the difference is less than half a percent (e.g., 0.2597 versus 0.2598).
One could argue that the difference between these two models is negligible, as they both explain nearly the same amount of variability in the `interest_rate`.
These negligible differences are an important aspect to model selection.
It is highly advised that *before* you begin the model selection process, you decide what a "meaningful" difference in adjusted $R^2$ is for the context of your data.
Maybe this difference is 1% or maybe it is 5%.
This "threshold" is what you will then use to decide if one model is "better" than another model.
Using meaningful thresholds in model selection requires more critical thinking about what the adjusted $R^2$ values mean.

Additionally, backward elimination and forward selection can arrive at different final models, because the decision for whether to include a given predictor or not depends on the other predictors that are already in the model.
With forward selection, you start with a model that includes no predictors, and add predictors one at a time.
In backward elimination, you start with a model that includes all of the potential predictors, and remove predictors one at a time.
How much a given predictor changes the percentage of the variability in the outcome that is explained by the model depends on the other predictors in the model, particularly if the predictor variables are correlated with each other.
:::

There is no "one size fits all" model selection strategy, which is why there are so many different model selection methods.
We hope you walk away from this exploration understanding how stepwise selection is carried out and the considerations that should be made when using stepwise selection with regression models.

### Other model selection strategies

Stepwise selection using adjusted $R^2$ as the decision criteria is one of many commonly used model selection strategies.
Stepwise selection can also be carried out with decision criteria other than adjusted $R^2$, such as p-values, which you'll learn about in @sec-inf-model-slr onward, or AIC (Akaike information criterion) or BIC (Bayesian information criterion), which you might learn about in more advanced courses.

Alternatively, one could choose to include or exclude predictors from a model based on expert opinion or due to research focus.
In fact, many statisticians discourage the use of stepwise regression *alone* for model selection and advocate, instead, for a more thoughtful approach that carefully considers the research focus and features of the data.

\clearpage

## Chapter review {#sec-chp8-review}

### Summary

With real data, there is often a need to describe how multiple variables can be modeled together.
In this chapter, we have presented one approach using multiple linear regression.
Each coefficient represents how the model predicts the outcome might change with one unit increase of that predictor *given* the rest of the predictor variables in the model.
Working with and interpreting multivariable models can be tricky, especially when the predictor variables show multicollinearity.
There is often no perfect or "right" final model, however, using the adjusted $R^2$ value is one way to identify important predictor variables for a final regression model.
In later chapters we will generalize multiple linear regression models to a larger population of interest from which the dataset was sampled.

### Terms

The terms introduced in this chapter are presented in @tbl-terms-chp-08.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.

```{r}
#| label: tbl-terms-chp-08
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_08)
```

\clearpage

## Exercises {#sec-chp8-exercises}

Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-08].

::: {.exercises data-latex=""}
{{< include exercises/_08-ex-model-mlr.qmd >}}
:::