# Multiple Regression {#sec-multipleRegression}
## Getting Started {#sec-multipleRegressionGettingStarted}
### Load Packages {#sec-multipleRegressionLoadPackages}
```{r}
library("petersenlab")
library("tidyverse")
library("knitr")
```
## Overview of Multiple Regression {#sec-multipleRegressionOverview}
Multiple regression examines the association between multiple [predictor variables](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy).
It allows obtaining a more accurate estimate of the unique contribution of a given [predictor variable](#sec-correlationalStudy), by controlling for other variables ([covariates](#sec-covariates)).
Regression with one [predictor variable](#sec-correlationalStudy) takes the form of @eq-regression:
$$
y = \beta_0 + \beta_1x_1 + \epsilon
$$ {#eq-regression}
where $y$ is the [outcome variable](#sec-correlationalStudy), $\beta_0$ is the intercept, $\beta_1$ is the slope, $x_1$ is the [predictor variable](#sec-correlationalStudy), and $\epsilon$ is the error term.
A regression line is depicted in @fig-regression.
```{r}
#| include: false
#| eval: false
set.seed(52242)

regression <- data.frame(outcome = rnorm(40, mean = 5, sd = 2))
regression$predictor <- complement(regression$outcome, .5)
regression$predictor <- regression$predictor + abs(min(regression$predictor))

lm(
  outcome ~ predictor,
  data = regression)

ggplot2::ggplot(
  data = regression,
  aes(
    x = predictor,
    y = outcome
  )
) +
  geom_point() +
  geom_smooth(
    method = "lm",
    linewidth = 2,
    se = FALSE,
    fullrange = TRUE) +
  scale_x_continuous(
    limits = c(0, 8),
    breaks = seq(from = 0, to = 8, by = 2),
    expand = c(0, 0)
  ) +
  scale_y_continuous(
    limits = c(0, 8),
    breaks = seq(from = 0, to = 8, by = 2),
    expand = c(0, 0)
  ) +
  labs(
    x = "Predictor Variable",
    y = "Outcome Variable",
    title = "Regression Best-Fit Line"
  ) +
  theme_classic(
    base_size = 16) +
  theme(legend.title = element_blank())

ggsave("./images/regression.pdf", width = 6, height = 6)
```
::: {#fig-regression}
![](images/regression.png){fig-alt="A Regression Best-Fit Line."}
A Regression Best-Fit Line.
:::
Regression with multiple predictors—i.e., multiple regression—takes the form of @eq-multipleRegression:
$$
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon
$$ {#eq-multipleRegression}
where $p$ is the number of [predictor variables](#sec-correlationalStudy).
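To make the notation concrete, here is a minimal sketch of fitting a multiple regression model in R with `lm()`; the data are simulated for illustration and are not from this book's datasets.

```{r}
# Simulated (hypothetical) data: two predictors and one outcome
set.seed(1)
exampleData <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100))
exampleData$y <- 2 + 0.5 * exampleData$x1 + 1.5 * exampleData$x2 + rnorm(100)

# Multiple regression: y = b0 + b1*x1 + b2*x2 + error
exampleModel <- lm(
  y ~ x1 + x2,
  data = exampleData)

coef(exampleModel) # estimated intercept (b0) and slopes (b1, b2)
```

Each additional predictor enters the model as another term on the right-hand side of the `~` in the formula.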
## Components {#sec-multipleRegressionComponents}
- $B$ = unstandardized coefficient: direction and magnitude of the estimate (original scale)
- $\beta$ (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
- $SE$ = standard error: uncertainty of unstandardized estimate
The unstandardized regression coefficient ($B$) is interpreted such that, for every one-unit change in the [predictor variable](#sec-correlationalStudy), there is a $B$-unit change in the [outcome variable](#sec-correlationalStudy).
For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age.
(In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.)
Unstandardized regression coefficients are tied to the metric of the raw data.
Thus, a large unstandardized regression coefficient for two variables may mean completely different things.
Holding the strength of the association constant, you tend to see larger unstandardized regression coefficients for variables with smaller units and smaller unstandardized regression coefficients for variables with larger units.
Standardized regression coefficients can be obtained by standardizing the variables to [*z*-scores](#sec-zScores) so they all have a mean of zero and standard deviation of one.
The standardized regression coefficient ($\beta$) is interpreted such that, for every one standard deviation change in the [predictor variable](#sec-correlationalStudy), there is a $\beta$ standard deviation change in the [outcome variable](#sec-correlationalStudy).
For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviations more in fantasy points for each additional standard deviation of age.
Standardized regression coefficients tend to fall between −1 and 1, though that is not the case in all instances.
Thus, standardized regression coefficients tend to be more comparable across variables and models compared to unstandardized regression coefficients.
In this way, standardized regression coefficients provide a meaningful index of [effect size](#sec-practicalSignificance).
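As a rough sketch of the distinction, the unstandardized coefficient comes from a model fit to the raw variables, whereas a standardized coefficient can be obtained by fitting the same model to [*z*-scored](#sec-zScores) variables; the data below are simulated for illustration (the coefficient of 2.3 echoes the age example above but is not from real players).

```{r}
# Simulated (hypothetical) example: age (years) predicting fantasy points
set.seed(52242)
exampleAge <- rnorm(100, mean = 26, sd = 3)
examplePoints <- 100 + 2.3 * exampleAge + rnorm(100, sd = 30)

# Unstandardized coefficient (B): points per additional year of age
coef(lm(examplePoints ~ exampleAge))["exampleAge"]

# Standardized coefficient (beta): SDs of points per SD of age
pointsZ <- as.numeric(scale(examplePoints))
ageZ <- as.numeric(scale(exampleAge))
coef(lm(pointsZ ~ ageZ))["ageZ"]
```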
## Assumptions of Multiple Regression {#sec-assumptionsRegression}
Linear regression models make the following assumptions (a brief example of checking them follows the list):
- there is a linear association between the predictor variables and the outcome variable
- there is homoscedasticity of the residuals; the residuals do not differ as a function of the predictor variables or as a function of the outcome variable
- the residuals are independent; they are uncorrelated with each other
- the residuals are normally distributed
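Here is a minimal sketch of checking these assumptions for a fitted `lm()` model using base R's diagnostic plots; the model and data are simulated for illustration.

```{r}
# Simulated (hypothetical) data and model for illustration
set.seed(1)
assumptionData <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100))
assumptionData$y <- 1 + assumptionData$x1 + assumptionData$x2 + rnorm(100)

assumptionModel <- lm(
  y ~ x1 + x2,
  data = assumptionData)

# Residuals vs. fitted values: check linearity and homoscedasticity
plot(assumptionModel, which = 1)

# Normal Q-Q plot of the residuals: check normality
plot(assumptionModel, which = 2)
```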
## Coefficient of Determination ($R^2$) {#sec-multipleRegressionRSquared}
The coefficient of determination ($R^2$) reflects the proportion of variance in the [outcome (dependent) variable](#sec-correlationalStudy) that is explained by the model predictions: $R^2 = \frac{\text{variance explained in }Y}{\text{total variance in }Y}$.
Various formulas for $R^2$ are in @eq-rSquared.
Larger $R^2$ values indicate greater accuracy.
Multiple regression can be conceptualized with overlapping circles (similar to a Venn diagram), where the non-overlapping portions of the circles reflect nonshared variance and the overlapping portions of the circles reflect shared variance, as in @fig-multipleRegressionRSquared.
::: {#fig-multipleRegressionRSquared}
![](images/multipleRegressionRSquared.png){width=50% fig-alt="Conceptual Depiction of Proportion of Variance Explained ($R^2$) in an Outcome Variable ($Y$) by Multiple Predictors ($X1$ and $X2$) in Multiple Regression. The size of each circle represents the variable's variance. The proportion of variance in $Y$ that is explained by the predictors is depicted by the areas in orange. The dark orange space ($G$) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome ($G$) will not be recovered in the regression coefficients when both predictors are included in the regression model. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."}
Conceptual Depiction of Proportion of Variance Explained ($R^2$) in an Outcome Variable ($Y$) by Multiple Predictors ($X1$ and $X2$) in Multiple Regression. The size of each circle represents the variable's variance. The proportion of variance in $Y$ that is explained by the predictors is depicted by the areas in orange. The dark orange space ($G$) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome ($G$) will not be recovered in the regression coefficients when both predictors are included in the regression model. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
One issue with $R^2$ is that it increases as the number of predictors increases, which can lead to [overfitting](#sec-overfitting) if using $R^2$ as an index to compare models for purposes of selecting the "best-fitting" model.
Consider the following example (adapted from @PetersenPrinciplesPsychAssessment) in which you have one [predictor variable](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy), as shown in @tbl-regression1.
```{r}
#| echo: false
regression1 <- data.frame(
  "y" = c(7, 13, 29, 10),
  "x1" = c(1, 2, 7, 2))

regression2 <- data.frame(
  "y" = c(7, 13, 29, 10),
  "x1" = c(1, 2, 7, 2),
  "x2" = c(3, 5, 1, 2))

regression1_model <- lm(
  y ~ x1,
  data = regression1)

regression1_intercept <- apa(regression1_model$coefficients[[1]], decimals = 2)
regression1_slope <- apa(regression1_model$coefficients[[2]], decimals = 2)
regression1_rsquare <- apa(summary(regression1_model)$r.squared, decimals = 2)

regression2_model <- lm(y ~ x1 + x2, data = regression2)

regression2_intercept <- apa(regression2_model$coefficients[[1]], decimals = 2)
regression2_slope1 <- apa(regression2_model$coefficients[[2]], decimals = 2)
regression2_slope2 <- apa(regression2_model$coefficients[[3]], decimals = 2)
regression2_rsquare <- apa(summary(regression2_model)$r.squared, decimals = 2)
```
```{r}
#| label: tbl-regression1
#| tbl-cap: "Example Data of Predictor (x1) and Outcome (y) Used for Regression Model."
#| echo: false
kable(
  regression1,
  col.names = c("y", "x1"),
  booktabs = TRUE)
```
Using the data, the best fitting regression model is: $y =$ `{r} regression1_intercept` $+$ `{r} regression1_slope` $\cdot x_1$.
In this example, the $R^2$ is `{r} regression1_rsquare`.
The equation is not a perfect prediction, but with a single [predictor variable](#sec-correlationalStudy), it captures the majority of the variance in the outcome.
Now consider the following example where you add a second [predictor variable](#sec-correlationalStudy) to the data above, as shown in @tbl-regression2.
```{r}
#| label: tbl-regression2
#| tbl-cap: "Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model."
#| echo: false
kable(
  regression2,
  col.names = c("y", "x1", "x2"),
  booktabs = TRUE)
```
With the second [predictor variable](#sec-correlationalStudy), the best fitting regression model is: $y =$ `{r} regression2_intercept` + `{r} regression2_slope1` $\cdot x_1 +$ `{r} regression2_slope2` $\cdot x_2$.
In this example, the $R^2$ is `{r} regression2_rsquare`.
The equation with the second [predictor variable](#sec-correlationalStudy) provides a perfect prediction of the outcome.
Providing perfect prediction with the right set of [predictor variables](#sec-correlationalStudy) is the dream of multiple regression.
So, using multiple regression, we often add [predictor variables](#sec-correlationalStudy) to incrementally improve prediction.
The amount of variance that would be accounted for by random chance follows @eq-predictionByChance:
$$
E(R^2) = \frac{K}{n-1}
$$ {#eq-predictionByChance}
where $E(R^2)$ is the expected value of $R^2$ (the proportion of variance explained), $K$ is the number of [predictor variables](#sec-correlationalStudy), and $n$ is the sample size.
The formula demonstrates that the more [predictor variables](#sec-correlationalStudy) in the regression model, the more variance will be accounted for by chance.
With many [predictor variables](#sec-correlationalStudy) and a small sample, you can account for a large share of the variance merely by chance.
As an example, consider that we have 13 [predictor variables](#sec-correlationalStudy) to predict fantasy performance for 43 players.
Assume that, with 13 [predictor variables](#sec-correlationalStudy), we explain 38% of the variance ($R^2 = .38; r = .62$).
We explained a lot of the variance in the outcome, but it is important to consider how much variance could have been explained by random chance: $E(R^2) = \frac{K}{n-1} = \frac{13}{43 - 1} = .31$.
We expect to explain 31% of the variance, by chance, in the outcome.
So, 82% of the variance explained was likely spurious (i.e., $\frac{.31}{.38} = .82$).
As the sample size increases, the spuriousness decreases.
To account for the number of [predictor variables](#sec-correlationalStudy) in the model, we can use a modified version of $R^2$ called adjusted $R^2$ ($R^2_{adj}$).
Adjusted $R^2$ ($R^2_{adj}$) accounts for the number of [predictor variables](#sec-correlationalStudy) in the model, based on how much variance would be expected to be accounted for by chance, in order to penalize [overfitting](#sec-overfitting).
Adjusted $R^2$ ($R^2_{adj}$) reflects the proportion of variance in the [outcome (dependent) variable](#sec-correlationalStudy) that is explained by the model predictions over and above what would be expected to be accounted for by chance, given the number of [predictor variables](#sec-correlationalStudy) in the model.
The formula for adjusted $R^2$ ($R^2_{adj}$) is in @eq-adjustedRSquared:
$$
R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}
$$ {#eq-adjustedRSquared}
where $p$ is the number of [predictor variables](#sec-correlationalStudy) in the model, and $n$ is the sample size.
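As a quick check of the arithmetic from the example above (13 predictors, 43 players, $R^2 = .38$), the expected chance $R^2$ and the adjusted $R^2$ can be computed directly:

```{r}
# Values from the example above
K <- 13    # number of predictor variables (p in the adjusted R-squared formula)
n <- 43    # sample size
R2 <- .38  # observed R-squared

# Expected R-squared by chance: K / (n - 1)
K / (n - 1)

# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - K - 1)
1 - (1 - R2) * (n - 1) / (n - K - 1)
```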
## Overfitting {#sec-overfitting}
Statistical models applied to big data (e.g., data with many [predictor variables](#sec-correlationalStudy)) can *overfit* the data, meaning that the statistical model accounts for error variance that will not generalize to future samples.
So, even though an overfitting statistical model appears to be accurate because it accounts for more variance, it is not actually that accurate: it will predict new data less accurately than it accounts for the data with which it was built.
In the case of fantasy football analytics, this is especially relevant because there are hundreds if not thousands of variables we could consider for inclusion and many, many players when considering historical data.
Consider an example where you develop an algorithm to predict players' fantasy performance based on 2023 data using hundreds of [predictor variables](#sec-correlationalStudy).
To some extent, these [predictor variables](#sec-correlationalStudy) will likely account for true variance (i.e., signal) and error variance (i.e., noise).
If we were to apply the same algorithm based on the 2023 prediction model to 2024 data, the prediction model would likely predict less accurately than with 2023 data.
The regression coefficients in the 2023 model would partly reflect noise that is specific to the 2023 data, so they would not generalize fully to new data.
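As an illustrative sketch of this point (with simulated data rather than actual fantasy data), the code below fits a simple model and an overly flexible model to one "season" of training data and then evaluates both on a new "season" drawn from the same process; the flexible model typically fits the training data better but predicts the new data worse.

```{r}
set.seed(52242)

# Simulate a training "season" and a new "season" from the same process
simulateSeason <- function(n = 50) {
  x <- rnorm(n)
  data.frame(x = x, y = 2 * x + rnorm(n))
}

trainData <- simulateSeason()
testData <- simulateSeason()

# A simple model and an overly flexible model (10th-degree polynomial)
simpleModel <- lm(y ~ x, data = trainData)
flexibleModel <- lm(y ~ poly(x, 10), data = trainData)

# Root mean squared error (lower = more accurate)
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

c(simpleTrain = rmse(simpleModel, trainData),
  simpleTest = rmse(simpleModel, testData),
  flexibleTrain = rmse(flexibleModel, trainData),
  flexibleTest = rmse(flexibleModel, testData))
```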
In @fig-overfittingModel, the blue line represents the true distribution of the data, and the red line is an overfitting model:
```{r}
#| label: fig-overfittingModel
#| fig-cap: "Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."
#| fig-alt: "Over-fitting Model in Red Relative to the True Distribution of the Data in Blue. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."
#| code-fold: true
set.seed(52242)

sampleSize <- 200

# Simulate data with a true quadratic association plus noise
quadraticX <- rnorm(sampleSize)
quadraticY <- quadraticX ^ 2 + rnorm(sampleSize)

quadraticData <- cbind(quadraticX, quadraticY) %>%
  data.frame() %>%
  arrange(quadraticX)

# Quadratic model: approximates the true distribution of the data (blue line)
quadraticModel <- lm(
  quadraticY ~ quadraticX + I(quadraticX ^ 2),
  data = quadraticData)

quadraticNewData <- data.frame(
  quadraticX = seq(
    from = min(quadraticData$quadraticX),
    to = max(quadraticData$quadraticX),
    length.out = sampleSize))

quadraticNewData$quadraticY <- predict(
  quadraticModel,
  newdata = quadraticNewData)

# Loess model with a very small span: overfits the data (red line)
loessFit <- loess(
  quadraticY ~ quadraticX,
  data = quadraticData,
  span = 0.01,
  degree = 1)

quadraticNewData$loessY <- predict(
  loessFit,
  newdata = quadraticNewData)

plot(
  x = quadraticData$quadraticX,
  y = quadraticData$quadraticY,
  xlab = "",
  ylab = "")

lines(
  quadraticNewData$quadraticY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "blue")

lines(
  quadraticNewData$loessY ~ quadraticNewData$quadraticX,
  lwd = 2,
  col = "red")
```
## Covariates {#sec-covariates}
Covariates are variables that you include in the statistical model to try to control for them so you can better isolate the unique contribution of the [predictor variable](#sec-correlationalStudy)(s) in relation to the [outcome variable](#sec-correlationalStudy).
Including covariates examines the association between the [predictor variable](#sec-correlationalStudy) and the [outcome variable](#sec-correlationalStudy) while holding people's levels on the covariates constant.
Including confounds as covariates can yield a more accurate estimate of the causal effect of the [predictor variable](#sec-correlationalStudy) on the [outcome variable](#sec-correlationalStudy).
Ideally, you want to include any and all confounds as covariates.
As described in @sec-correlationCausation, confounds are third variables that influence both the [predictor variable](#sec-correlationalStudy) and the [outcome variable](#sec-correlationalStudy) and explain their association.
Covariates are potentially (but not necessarily) confounds.
For instance, you might include a player's age as a covariate in a model that examines whether the player's 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year; age is a covariate in that model, but it is not necessarily a confound.
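In a regression framework, a covariate is simply another term in the model. A minimal sketch, assuming hypothetical columns named `fantasyPoints`, `fortyTime`, and `age` in a data frame called `rookieData` (not an actual dataset from this book), so the chunk is not evaluated:

```{r}
#| eval: false

# Association of 40-yard dash time with rookie-year fantasy points,
# controlling for age (hypothetical variable and data frame names)
rookieModel <- lm(
  fantasyPoints ~ fortyTime + age,
  data = rookieData)

summary(rookieModel)
```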
## Multicollinearity {#sec-multipleRegressionMulticollinearity}
*Multicollinearity* occurs when two or more [predictor variables](#sec-correlationalStudy) in a regression model are highly correlated.
The problem of having multiple [predictor variables](#sec-correlationalStudy) that are highly correlated is that it makes it challenging to estimate the regression coefficients accurately.
Multicollinearity in multiple regression is depicted conceptually in @fig-multipleRegressionMulticollinearity.
::: {#fig-multipleRegressionMulticollinearity}
![](images/multipleRegressionMulticollinearity.png){width=50% fig-alt="Conceptual Depiction of Multicollinearity in Multiple Regression. From @Petersen2024a and @PetersenPrinciplesPsychAssessment."}
Conceptual Depiction of Multicollinearity in Multiple Regression. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
Consider the following example adapted from @PetersenPrinciplesPsychAssessment where you have two [predictor variables](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy), as shown in @tbl-regression3.
```{r}
#| echo: false
regression3 <- data.frame(
  "y" = c(9, 11, 17, 3, 21, 13),
  "x1" = c(2, 3, 4, 1, 5, 3.5),
  "x2" = c(4, 6, 8, 2, 10, 7))
```
```{r}
#| label: tbl-regression3
#| tbl-cap: "Example Data of Predictors (x1 and x2) and Outcome (y) Used for Regression Model."
#| echo: false
kable(
  regression3,
  col.names = c("y", "x1", "x2"),
  booktabs = TRUE)
```
The second [predictor variable](#sec-correlationalStudy) is not very good—it is exactly twice the value of the first [predictor variable](#sec-correlationalStudy); thus, the two [predictor variables](#sec-correlationalStudy) are perfectly correlated (i.e., $r = 1.0$).
This means that there are different prediction equations that fit the data equally well, as shown in @eq-multicollinearity:
$$
\begin{aligned}
2x_2 &= y \\
0x_1 + 2x_2 &= y \\
4x_1 &= y \\
4x_1 + 0x_2 &= y \\
2x_1 + 1x_2 &= y \\
5x_1 - 0.5x_2 &= y \\
...
&= y
\end{aligned}
$$ {#eq-multicollinearity}
Then, what are the regression coefficients?
We do not know what the correct regression coefficients are because each of the possibilities fits the data equally well.
Thus, when estimating the regression model, we could obtain arbitrary estimates of the regression coefficients with an enormous standard error around each estimate.
In general, multicollinearity increases the uncertainty (i.e., standard errors and confidence intervals) around the parameter estimates.
Any [predictor variables](#sec-correlationalStudy) that have a correlation above ~ $r = .30$ with each other could have an impact on the confidence interval of the regression coefficient.
As the correlations among the [predictor variables](#sec-correlationalStudy) increase, the chance of getting an arbitrary answer increases, sometimes called "bouncing betas."
So, it is important to examine a correlation matrix of the [predictor variables](#sec-correlationalStudy) before putting them in the same regression model.
You can also examine indices such as the variance inflation factor (VIF).
To address multicollinearity, you can drop a redundant predictor, or you can use principal component analysis or factor analysis to reduce the predictors to a smaller number of meaningful components or factors.
For a meaningful answer in a regression framework that is precise and confident, you need a low level of intercorrelation among predictors, unless you have a very large sample size; a brief example of checking for multicollinearity follows.
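Using the example data above (`regression3`), here is a brief sketch of checking for multicollinearity: the correlation matrix shows that `x1` and `x2` are perfectly correlated, and `lm()` consequently drops one of the redundant predictors (its coefficient is `NA`). For predictors that are highly but not perfectly correlated, the VIF could be examined (e.g., with `car::vif()`, assuming the `car` package is available).

```{r}
# Correlation matrix of the predictors: x1 and x2 are perfectly correlated
cor(regression3[, c("x1", "x2")])

# With perfectly collinear predictors, lm() returns NA for one coefficient
multicollinearModel <- lm(
  y ~ x1 + x2,
  data = regression3)

coef(multicollinearModel)

# For highly (but not perfectly) correlated predictors, variance inflation
# factors could be examined, e.g.: car::vif(multicollinearModel)
```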
## Impact of Outliers {#sec-multipleRegressionOutliers}
[As with correlation](#sec-correlationOutliers), multiple regression can be strongly impacted by outliers.
## Moderated Multiple Regression {#sec-moderatedMultipleRegression}
When examining moderation in multiple regression, several steps are important (a brief example follows the list):
- When computing the interaction term, first mean-center the predictor variables.
Calculate the interaction term as the multiplication of the mean-centered predictor variables.
Mean-centering the predictor variables when computing the interaction term is important for addressing issues regarding [multicollinearity](#sec-multipleRegressionMulticollinearity) [@Iacobucci2016].
- When including an interaction term in the model, make sure also to include the main effects.
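Here is a minimal sketch of these steps with simulated data: mean-center the predictors, compute the interaction term, and include both main effects along with the interaction (the `*` operator in the `lm()` formula includes the main effects automatically).

```{r}
# Simulated (hypothetical) data for illustration
set.seed(52242)
moderationData <- data.frame(
  x1 = rnorm(100, mean = 10, sd = 2),
  x2 = rnorm(100, mean = 50, sd = 10))
moderationData$y <- moderationData$x1 + moderationData$x2 +
  0.1 * moderationData$x1 * moderationData$x2 + rnorm(100)

# Mean-center the predictors before computing the interaction term
moderationData$x1_centered <- moderationData$x1 - mean(moderationData$x1)
moderationData$x2_centered <- moderationData$x2 - mean(moderationData$x2)

# Include the main effects and the interaction term
# (x1_centered * x2_centered expands to both main effects plus their product)
moderationModel <- lm(
  y ~ x1_centered * x2_centered,
  data = moderationData)

summary(moderationModel)
```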
## Mediation {#sec-multipleRegressionMediation}
## Bayesian Multiple Regression {#sec-multipleRegressionBayesian}
## Conclusion {#sec-multipleRegressionConclusion}
Multiple regression allows examining the association between multiple [predictor variables](#sec-correlationalStudy) and one [outcome variable](#sec-correlationalStudy).
Inclusion of multiple predictors in the model allows for potentially greater predictive accuracy and identification of the extent to which each variable uniquely contributes to the outcome variable.
As with [correlation](#sec-correlation), an association does not imply causation.
However, identifying associations is important because associations are a necessary (but insufficient) condition for causality.
When developing a multiple regression model, it is important to pay attention to potential [multicollinearity](#sec-multipleRegressionMulticollinearity); with highly correlated predictors, it may become difficult to detect a given [predictor variable](#sec-correlationalStudy) as [statistically significant](#sec-statisticalSignificance) due to the greater uncertainty around the parameter estimates.
::: {.content-visible when-format="html"}
## Session Info {#sec-multipleRegressionSessionInfo}
```{r}
sessionInfo()
```
:::