model-slr.qmd

# Linear regression with a single predictor {#sec-model-slr}

```{r}
#| include: false

source("_common.R")
```

\vspace{-5mm}

::: {.chapterintro data-latex=""}
Linear regression is a very powerful statistical technique.
Many people have some familiarity with regression models just from reading the news, where straight lines are overlaid on scatterplots.
Linear models can be used for prediction or to describe the relationship between two numerical variables, assuming there is a linear relationship between them.
:::

## Fitting a line, residuals, and correlation {#sec-fit-line-res-cor}

When considering linear regression, it's helpful to think deeply about the line fitting process.
In this section, we define the form of a linear model, explore criteria for what makes a good fit, and introduce a new statistic called *correlation*.

### Fitting a line to data

@fig-perfLinearModel shows two variables whose relationship can be modeled perfectly with a straight line.
The equation for the line is $y = 5 + 64.96 x.$ Consider what a perfect linear relationship means: we know the exact value of $y$ just by knowing the value of $x.$ A perfect linear relationship is unrealistic in almost any natural process.
For example, if we took family income $(x),$ this value would provide some useful information about how much financial support a college may offer a prospective student $(y.)$
However, the prediction would be far from perfect, since other factors play a role in financial support beyond a family's finances.

\vspace{-5mm}

```{r}
#| label: fig-perfLinearModel
#| fig-cap: |
#|   Requests from twelve separate buyers were simultaneously placed with a 
#|   trading company to purchase Target Corporation stock (ticker TGT, December 
#|   28th, 2018), and the total cost of the shares were reported. Because the 
#|   cost is computed using a linear formula, the linear fit is perfect.
#| fig-alt: |
#|   A scatterplot showing a perfect linear relationship between number of 
#|   stocks to purchase on the x-axis and total cost of the share purchase 
#|   on the y-axis.
#| fig-asp: 0.4
target <- simulated_scatter |>
  filter(group == 4)

ggplot(target, aes(x = x, y = y)) +
  geom_smooth(method = "lm") +
  geom_point(size = 3) +
  scale_y_continuous(labels = label_dollar(scale = 0.001, suffix = "K", accuracy = 1), breaks = c(0, 1000, 2000)) +
  labs(
    x = "Number of Target Corporation stocks to purchase",
    y = "Total cost of the\nshare purchase"
  )
```

\clearpage

Linear regression is the statistical method for fitting a line to data where the relationship between two variables, $x$ and $y,$ can be modeled by a straight line with some error:

$$
y = b_0 + b_1 \ x + e
$$

The values $b_0$ and $b_1$ represent the model's intercept and slope, respectively, and the error is represented by $e.$
These values are calculated based on the data, i.e., they are sample statistics.
If the observed data is a random sample from a target population that we are interested in making inferences about, these values are considered to be point estimates for the population parameters $\beta_0$ and $\beta_1.$
We will discuss how to make inferences about parameters of a linear model based on sample statistics in @sec-inf-model-slr.

::: {.content-visible when-format="html"}
::: {.pronunciation data-latex=""}
The Greek letter $\beta$ is pronounced *beta*, listen to the pronunciation [here](https://youtu.be/PStgY5AcEIw?t=7).
:::
:::

::: {.content-visible when-format="pdf"}
::: {.pronunciation data-latex=""}
The Greek letter $\beta$ is pronounced *beta*.
:::
:::

When we use $x$ to predict $y,$ we usually call $x$ the **predictor**\index{variable!predictor}\index{predictor variable} variable and we call $y$ the **outcome**\index{variable!outcome}\index{outcome variable}.
We also often drop the $e$ term when writing down the model since our main focus is often on the prediction of the average outcome.

```{r}
#| include: false
terms_chp_07 <- c("predictor", "outcome")
```

It is rare for all of the data to fall perfectly on a straight line.
Instead, it's more common for data to appear as a *cloud of points*, such as those examples shown in @fig-imperfLinearModel.
In each case, the data fall around a straight line, even if none of the observations fall exactly on the line.
The first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between $x$ and $y.$ The second plot shows an upward trend that, while evident, is not as strong as the first.
The last plot shows a very weak downward trend in the data, so slight we can hardly notice it.
In each of these examples, we will have some uncertainty regarding our estimates of the model parameters, $\beta_0$ and $\beta_1.$ For instance, we might wonder, should we move the line up or down a little, or should we tilt it more or less?
As we move forward in this chapter, we will learn about criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.

```{r}
#| label: fig-imperfLinearModel
#| fig-cap: |
#|   Three datasets where a linear model may be useful even though the data do
#|   not all fall exactly on the line.
#| fig-alt: |
#|   Three scatterplots with fabricated data. The first panel shows a 
#|   strong negative linear relationship. The second panel shows a moderate 
#|   positive linear relationship. The last panel shows no relationship 
#|   between the x and y variables.
#| out-width: 100%
#| fig-asp: 0.25
neg <- simulated_scatter |> filter(group == 1)
pos <- simulated_scatter |> filter(group == 2)
ran <- simulated_scatter |> filter(group == 3)

p_neg <- ggplot(neg, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = NULL, y = NULL)

p_pos <- ggplot(pos, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = NULL, y = NULL)

p_ran <- ggplot(ran, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = NULL, y = NULL)

p_neg + p_pos + p_ran
```

\vspace{-5mm}

There are also cases where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful.
One such case is shown in @fig-notGoodAtAllForALinearModel where there is a very clear relationship between the variables even though the trend is not linear.
We discuss nonlinear trends in this chapter and the next, but details of fitting nonlinear models are saved for a later course.

```{r}
#| label: fig-notGoodAtAllForALinearModel
#| fig-cap: |
#|   The best fitting line for these data is flat, which is not a useful way 
#|   to describe the non-linear relationship. These data are from a physics 
#|   experiment.
#| fig-alt: |
#|   A scatterplot showing a perfect quadratic relationship between angle of 
#|   incline on the x-axis and distance traveled on the y-axis. The line of 
#|   best fit is superimposed as a perfectly horizontal line. That is to say,
#|   the variables are clearly related, but they do not have a linear 
#|   relationship.
#| fig-asp: 0.25
bad <- simulated_scatter |> filter(group == 5)

ggplot(bad, aes(x = x, y = y)) +
  geom_point(size = 2.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Angle of Incline (Degrees)",
    y = "Distance Traveled (m)"
  )
```

\clearpage

### Using linear regression to predict possum head lengths

Brushtail possums are marsupials that live in Australia, and a photo of one is shown in @fig-brushtail-possum.
Researchers captured 104 of these animals and took body measurements before releasing the animals back into the wild.
We consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum's head.

```{r}
#| label: fig-brushtail-possum
#| fig-alt: Photograph of a common brushtail possum of Australia.
#| fig-cap: |
#|   The common brushtail possum of Australia. Photo by Greg Schecter,
#|   [flic.kr/p/9BAFbR](https://flic.kr/p/9BAFbR), CC BY 2.0 license.
#| out-width: 50%

knitr::include_graphics("images/brushtail-possum/brushtail-possum.jpg")
```

::: {.data data-latex=""}
The [`possum`](http://openintrostat.github.io/openintro/reference/possum.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::

@fig-scattHeadLTotalL shows a scatterplot for the head length (mm) and total length (cm) of the possums.
Each point represents a single possum from the data.
The head and total length variables are associated: possums with an above average total length also tend to have above average head lengths.
While the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.

```{r}
#| label: fig-scattHeadLTotalL
#| fig-cap: |
#|   A scatterplot showing head length against total length for 104 brushtail
#|   possums. A point representing a possum with head length 86.7 mm and total 
#|   length 84 cm is highlighted.
#| fig-alt: |
#|   A scatterplot with total length on the x-axis and head length on the 
#|   y-axis. The variables show a moderately strong positive linear 
#|   relationship. A single observation is circled in red with coordinates 
#|   of approximately 84cm of total length and 87mm of head length.
#| fig-asp: 0.5
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)"
  ) +
  geom_point(
    data = tibble(x = 84, y = 86.7), 
    aes(x = x, y = y), 
    color = IMSCOL["red", "full"], 
    size = 5, shape = "circle open", stroke = 2
  )
```

We want to describe the relationship between head and total length of possum's with a line.
In this example, we will use the total length as the predictor variable, $x,$ to predict a possum's head length, $y.$ We could fit the linear relationship by eye, as in @fig-scattHeadLTotalLLine.

\clearpage

```{r}
#| label: fig-scattHeadLTotalLLine
#| fig-cap: |
#|   A reasonable linear model was fit to represent the relationship between 
#|   head length and total length.
#| fig-alt: |
#|   A scatterplot with total length on the x-axis and head length on the 
#|   y-axis. The variables show a moderately strong positive linear relationship.
#|   A least squares line is superimposed.
#| fig-asp: 0.5
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)"
  ) +
  geom_smooth(method = "lm", se = FALSE)
```

The equation for this line is

\vspace{-5mm}

$$
\hat{y} = 41 + 0.59x
$$

A "hat" on $y$ is used to signify that this is an estimate.
We can use this line to discuss properties of possums.
For instance, the equation predicts a possum with a total length of 80 cm will have a head length of

\vspace{-5mm}

$$
\hat{y} = 41 + 0.59 \times 80 = 88.2
$$

The estimate may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of 88.2 mm.
Absent further information about an 80 cm possum, the prediction for head length that uses the average is a reasonable estimate.

There may be other variables that could help us predict the head length of a possum besides its length.
Perhaps the relationship would be a little different for male possums than female possums, or perhaps it would differ for possums from one region of Australia versus another region.
@fig-scattHeadLTotalL-sex-age-1 shows the relationship between total length and head length of brushtail possums, taking into consideration their sex.
Male possums (represented by blue triangles) seem to be larger in terms of total length and head length than female possums (represented by red circles).
@fig-scattHeadLTotalL-sex-age-2 shows the same relationship, taking into consideration their age.
It's harder to tell if age changes the relationship between total length and head length for these possums.

```{r}
#| label: fig-scattHeadLTotalL-sex-age
#| fig-cap: |
#|   Relationship between total length and head length of brushtail possums, 
#|   taking into consideration their sex or age.
#| fig-subcap: 
#|   - By sex
#|   - By age
#| fig-alt: |
#|   Two scatterplots, both with total length on the x-axis and head 
#|   length on the y-axis. The top plot colors the points by sex where the female
#|   possums seem slightly longer with slightly smaller head lengths. The bottom
#|   plot is colored by age with no obvious trends between age and lengths.
#| layout-ncol: 2
#| fig-width: 5
#| out-width: 100%
p_sex <- ggplot(possum, aes(x = total_l, y = head_l, shape = sex, color = sex)) +
  geom_point(alpha = 0.8, size = 2) +
  scale_color_manual(values = c(IMSCOL["red", "full"], IMSCOL["blue", "full"])) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)",
    color = "Sex", shape = "Sex"
  )

p_age <- ggplot(possum, aes(x = total_l, y = head_l, color = age)) +
  geom_point(size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)",
    color = "Age"
  ) +
  scale_color_gradient(
    low = IMSCOL["green", "f4"],
    high = IMSCOL["green", "full"]
  )

p_sex
p_age
```

In @sec-model-mlr, we'll learn about how we can include more than one predictor in our model.
Before we get there, we first need to better understand how to best build a linear model with one predictor.

\clearpage

### Residuals {#sec-resids}

**Residuals**\index{residuals} are the leftover variation in the data after accounting for the model fit:

$$
\text{Data} = \text{Fit} + \text{Residual}
$$

Each observation will have a residual, and three of the residuals for the linear model we fit for the possum data are shown in @fig-scattHeadLTotalLLine-highlighted.
If an observation is above the regression line, then its residual, the vertical distance from the observation to the line, is positive.
Observations below the line have negative residuals.
One goal in picking the right linear model is for residuals to be as small as possible.

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "residuals")
```

@fig-scattHeadLTotalLLine-highlighted is almost a replica of @fig-scattHeadLTotalLLine, with three points from the data highlighted.
The observation marked by a red circle has a small, negative residual of about -1; the observation marked by a gray diamond has a large positive residual of about +7; and the observation marked by a pink triangle has a moderate negative residual of about -4.
The size of a residual is usually discussed in terms of its absolute value.
For example, the residual for the observation marked by a pink triangle is larger than that of the observation marked by a red circle because $|-4|$ is larger than $|-1|.$

```{r}
#| label: fig-scattHeadLTotalLLine-highlighted
#| fig-cap: |
#|   A reasonable linear model was fit to represent the relationship between 
#|   head length and total length, with three points highlighted.
#| fig-alt: |
#|   A scatterplot with total length on the x-axis and head length on the 
#|   y-axis. A least squares line is superimposed onto the scatterplot.
#|   Three individual observations are circled to indicate their vertical 
#|   distance from the least square line.
mod <- lm(head_l ~ total_l, data = possum)
preds <- predict(mod, data.frame(total_l = c(76, 85, 95.5)))
obs <- c(85.1, 98.6, 94)
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(alpha = 0.8, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)"
  ) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(
    data = possum |> filter(total_l == 76),
    shape = "circle open", stroke = 2,
    size = 4, color = IMSCOL["red", "full"]
  ) +
  geom_segment(aes(x = 76, y = preds[1], xend = 76, yend = obs[1] + 0.4),
    color = IMSCOL["red", "full"], inherit.aes = FALSE
  ) +
  geom_point(
    data = possum |> filter(total_l == 85, head_l == 98.6),
    shape = "diamond open", stroke = 2, size = 5,
    color = IMSCOL["gray", "full"]
  ) +
  geom_segment(aes(x = 85, y = preds[2], xend = 85, yend = obs[2] - 0.5),
    color = IMSCOL["gray", "full"], inherit.aes = FALSE
  ) +
  geom_point(
    data = possum |> filter(total_l == 95.5, head_l == 94),
    shape = "triangle open", stroke = 2, size = 5,
    color = IMSCOL["pink", "full"]
  ) +
  geom_segment(aes(x = 95.5, y = preds[3], xend = 95.5, yend = obs[3] + 0.5),
    color = IMSCOL["pink", "full"], inherit.aes = FALSE
  )
```

::: {.important data-latex=""}
**Residual: Difference between observed and expected.**

The residual of the $i^{th}$ observation $(x_i, y_i)$ is the difference of the observed outcome $(y_i)$ and the outcome we would predict based on the model fit $(\hat{y}_i):$

$$
e_i = y_i - \hat{y}_i
$$

We typically identify $\hat{y}_i$ by plugging $x_i$ into the model.
:::

::: {.workedexample data-latex=""}
The linear fit shown in @fig-scattHeadLTotalLLine-highlighted is given as $\hat{y} = 41 + 0.59x.$ Based on this line, compute the residual of the observation $(76.0, 85.1).$ This observation is marked by a red circle in @fig-scattHeadLTotalLLine-highlighted.
Check it against the earlier visual estimate, -1.

------------------------------------------------------------------------

We first compute the predicted value of the observation marked by a red circle based on the model: $\hat{y} = 41+0.59x = 41+0.59\times 76.0 = 85.84$.
Next we compute the difference of the actual head length and the predicted head length: $e = y - \hat{y} = 85.1 -  85.84 = -0.74$.
The model's error is $e = -0.74$ mm, which is very close to the visual estimate of -1 mm.
The negative residual indicates that the linear model overpredicted head length for this possum.
:::

::: {.guidedpractice data-latex=""}
If a model underestimates an observation, will the residual be positive or negative?
What about if it overestimates the observation?[^07-model-slr-1]
:::

[^07-model-slr-1]: If a model underestimates an observation, then the model estimate is below the actual.
    The residual, which is the actual observation value minus the model estimate, must then be positive.
    The opposite is true when the model overestimates the observation: the residual is negative.

::: {.guidedpractice data-latex=""}
Compute the residuals for the observation marked by a blue diamond, $(85.0, 98.6),$ and the observation marked by a pink triangle, $(95.5, 94.0),$ in the figure using the linear relationship $\hat{y} = 41 + 0.59x.$[^07-model-slr-2]
:::

[^07-model-slr-2]: Gray diamond: $\hat{y} = 41+0.59x = 41+0.59\times 85.0 = 91.15 \rightarrow e = y - \hat{y} = 98.6-91.15=7.45.$ This is close to the earlier estimate of 7.
    pink triangle: $\hat{y} = 41+0.59x = 97.3 \rightarrow e = -3.3.$ This is also close to the estimate of -4.

Residuals are helpful in evaluating how well a linear model fits a dataset.
We often display them in a scatterplot such as the one shown in @fig-scattHeadLTotalLResidualPlot for the regression line in @fig-scattHeadLTotalLLine-highlighted.
The residuals are plotted with their predicted outcome variable value as the horizontal coordinate, and the vertical coordinate as the residual.
For instance, the point $(85.0, 98.6)$ (marked by the blue diamond) had a predicted value of 91.4 mm and had a residual of 7.45 mm, so in the residual plot it is placed at $(91.4, 7.45).$ Creating a residual plot is sort of like tipping the scatterplot over so the regression line is horizontal, as indicated by the dashed line.

```{r}
#| label: fig-scattHeadLTotalLResidualPlot
#| fig-cap: |
#|   Residual plot for the model predicting head length from total length for
#|   brushtail possums.
#| fig-alt: |
#|   A residual plot based on @fig-scattHeadLTotalLLine-highlighted 
#|   displaying predicted values on the x-axis and residual values on the y-axis. 
#|   The same three points that were circled in 
#|   @fig-scattHeadLTotalLLine-highlighted are still circled, demonstrating the 
#|   vertical distance from the least squares line is the residual.
m_head_total <- lm(head_l ~ total_l, data = possum)
m_head_total_aug <- augment(m_head_total)

ggplot(m_head_total_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.8, size = 2) +
  labs(
    x = "Predicted values of head length (mm)",
    y = "Residuals"
  ) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(
    data = m_head_total_aug |>
      filter(total_l == 76),
    shape = "circle open", stroke = 2, size = 4,
    color = IMSCOL["red", "full"]
  ) +
  geom_segment(
    aes(
      x = preds[1], y = obs[1] - preds[1] + 0.2,
      xend = preds[1], yend = 0
    ),
    color = IMSCOL["red", "full"], inherit.aes = FALSE
  ) +
  geom_point(
    data = m_head_total_aug |>
      filter(total_l == 85, head_l == 98.6),
    shape = "diamond open", stroke = 2, size = 5,
    color = IMSCOL["gray", "full"]
  ) +
  geom_segment(
    aes(
      x = preds[2], y = obs[2] - preds[2] - 0.3,
      xend = preds[2], yend = 0
    ),
    color = IMSCOL["gray", "full"], inherit.aes = FALSE
  ) +
  geom_point(
    data = m_head_total_aug |>
      filter(total_l == 95.5, head_l == 94),
    shape = "triangle open", stroke = 2, size = 5,
    color = IMSCOL["pink", "full"]
  ) +
  geom_segment(
    aes(
      x = preds[3], y = obs[3] - preds[3] + 0.3,
      xend = preds[3], yend = 0
    ),
    color = IMSCOL["pink", "full"], inherit.aes = FALSE
  )
```

\clearpage

::: {.workedexample data-latex=""}
One purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model.
The figure below shows three scatterplots with linear models in the first row and residual plots in the second row.
Can you identify any patterns in the residuals?

```{r}
#| label: sampleLinesAndResPlots
#| fig-alt: |
#|   A grid of 2 by 3 scatterplots with fabricated data. The top row of 
#|   plots contains original x-y data plots with a least squares regression line. 
#|   The bottom row of plots is a series of residual plot with predicted value on 
#|   the x-axis and residual on the y-axis. The first column of plots gives an 
#|   example of points that are well described by a linear model. The second 
#|   column of plots gives an example where the correct model seems to be quadratic 
#|   insead of linear. The third column of points gives an example where there is 
#|   no visual relationship between x and y.
#| fig-align: center
#| fig-asp: 0.5
neg_lin <- simulated_scatter |> filter(group == 6)
neg_cur <- simulated_scatter |> filter(group == 7)
random <- simulated_scatter |> filter(group == 8)

neg_lin_mod <- augment(lm(y ~ x, data = neg_lin))
neg_cur_mod <- augment(lm(y ~ x, data = neg_cur))
random_mod <- augment(lm(y ~ x, data = random))

p_neg_lin <- ggplot(neg_lin, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  labs(title = "Dataset 1")

p_neg_cur <- ggplot(neg_cur, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))+
  labs(title = "Dataset 2")

p_random <- ggplot(random, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  labs(title = "Dataset 3")

p_neg_lin_res <- ggplot(neg_lin_mod, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_neg_cur_res <- ggplot(neg_cur_mod, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_random_res <- ggplot(random_mod, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_neg_lin + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
  p_neg_cur + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_random +
  p_neg_lin_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
  p_neg_cur_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_random_res +
  plot_layout(ncol = 3, heights = c(2, 1))
```

------------------------------------------------------------------------

Dataset 1: the residuals show no obvious patterns.
The residuals are scattered randomly around 0, represented by the dashed line.

Dataset 2: The second dataset shows a pattern in the residuals.
There is some curvature in the scatterplot, which is more obvious in the residual plot.
We should not use a straight line to model these data.
Instead, a more advanced technique should be used to model the curved relationship, such as the variable transformations discussed in @sec-transforming-data.

Dataset 3: The last plot shows very little upwards trend, and the residuals also show no obvious patterns.
It is reasonable to try to fit a linear model to the data.
However, it is unclear whether there is evidence that the slope parameter is different from zero.
The point estimate of the slope parameter is not zero, but we might wonder if this could just be due to chance.
We will address this scenario in @sec-inf-model-slr.
:::

### Describing linear relationships with correlation

We've seen plots with strong linear relationships and others with very weak linear relationships.
It would be useful if we could quantify the strength of these linear relationships with a statistic.

::: {.important data-latex=""}
**Correlation: strength of a linear relationship.**

**Correlation**\index{correlation} which always takes values between -1 and 1, describes the strength and direction of the linear relationship between two variables.
We denote the correlation by $r.$

The correlation value has no units and will not be affected by a linear change in the units (e.g., going from inches to centimeters).
:::

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "correlation")
```

We can compute the correlation using a formula, just as we did with the sample mean and standard deviation.
The formula for correlation, however, is rather complex[^07-model-slr-3], and like with other statistics, we generally perform the calculations on a computer or calculator.

[^07-model-slr-3]: Formally, we can compute the correlation for observations $(x_1, y_1),$ $(x_2, y_2),$ ..., $(x_n, y_n)$ using the formula

$$
r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}
$$

where $\bar{x},$ $\bar{y},$ $s_x,$ and $s_y$ are the sample means and standard deviations for each variable.

\clearpage

@fig-posNegCorPlots shows eight plots and their corresponding correlations.
Only when the relationship is perfectly linear is the correlation either -1 or +1.
If the relationship is strong and positive, the correlation will be near +1.
If it is strong and negative, it will be near -1.
If there is no apparent linear relationship between the variables, then the correlation will be near zero.

```{r}
#| label: fig-posNegCorPlots
#| fig-cap: |
#|   Sample scatterplots and their correlations. The first row shows variables
#|   with a positive relationship, represented by the trend up and to the right. 
#|   The second row shows variables with a negative trend, where a large value 
#|   in one variable is associated with a lower value in the other.
#| fig-alt: |
#|   Eight scatterplots on fabricated data. The first seven plots show 
#|   linear trends with correlations ranging from -1 to +1. The eighth plot shows 
#|   a quadratic relationship whih produces a correlation of -0.28.
#| fig-asp: 0.5
#| out-width: 100%
library(ggpubr) # Adding here instead of _common.R to avoid collision with ggimage

simulated_scatter |>
  filter(group %in% c(9:12, 14:17)) |>
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_void() +
  facet_wrap(~group, nrow = 2, scales = "free") +
  theme(
    panel.border = element_rect(colour = "gray", fill = NA, linewidth = 1),
    strip.background = element_blank(),
    strip.text.x = element_blank()
  ) +
  stat_cor(
    aes(label = paste("r", ..r.., sep = "~`=`~")), 
    geom = "label"
  )
```

The correlation is intended to quantify the strength of a linear trend.
Nonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship; see three such examples in @fig-corForNonLinearPlots.

```{r}
#| label: fig-corForNonLinearPlots
#| fig-cap: |
#|   Sample scatterplots and their correlations. In each case, there is a strong
#|   relationship between the variables. However, because the relationship is 
#|   not linear, the correlation is relatively weak.
#| fig-alt: |
#|   Three scatterplots on fabricated data which demonstrate strong
#|   patterns between the x and y variables. The first plot shows a 
#|   quadratic trend with a correlation of -0.23. The second plot shows a cyclic
#|   trend (like a sin wave) with a correlation of 0.31. The third plot shows
#|   a distinct relationship that is not obviously functional and has a 
#|   correlation of 0.5.
#| fig-asp: 0.25
#| out-width: 100%
simulated_scatter |>
  filter(group %in% 17:19) |>
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_void() +
  facet_wrap(~group, nrow = 1, scales = "free") +
  theme(
    panel.border = element_rect(colour = "gray", fill = NA, size = 1),
    strip.background = element_blank(),
    strip.text.x = element_blank()
  ) +
  stat_cor(
    aes(label = paste("r", ..r.., sep = "~`=`~")),
    geom = "label"
  )
```

::: {.guidedpractice data-latex=""}
No straight line is a good fit for any of the datasets represented in @fig-corForNonLinearPlots.
Try drawing nonlinear curves on each plot.
Once you create a curve for each, describe what is important in your fit.[^07-model-slr-4]
:::

[^07-model-slr-4]: We'll leave it to you to draw the lines.
    In general, the lines you draw should be close to most points and reflect overall trends in the data.

\clearpage

::: {.workedexample data-latex=""}
The plot below displays the relationships between various crop yields in countries.
In the plots, each point represents a different country.
The x and y variables represent the proportion of total yield in the last 50 years which is due to that crop type.
If a country did not produce a particular crop, it has been removed from the plot (so different plots may have different numbers of dots, each corresponding to one country).

Order the six scatterplots from strongest negative to strongest positive linear relationship.

```{r}
#| label: crop-yields-af-prep
# from tidytuesday: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv

crops_country <- read_csv("data/key_crop_yields.csv")

crops_country <- crops_country |>
  rename_with(tolower) |>
  rename_with(str_remove, contains("tonnes"), " \\(tonnes per hectare\\)") |>
  rename(cocoa = `cocoa beans`) |>
  filter(!is.na(code)) |>
  pivot_longer(
    cols = wheat:bananas,
    names_to = "crop",
    values_to = "yield"
  ) |>
  group_by(code, crop) |>
  summarise(total = sum(yield, na.rm = TRUE), .groups = "drop_last") |>
  mutate(prop = (total / sum(total)) * 100) |>
  ungroup() |>
  filter(prop > 0) |> 
  pivot_wider(
    names_from = crop,
    values_from = c(total, prop),
    values_fill = NA
  )
```

```{r}
#| label: crop-yields-af
#| fig-alt: |
#|   From a real dataset, we display scatterplots of the relationship 
#|   between percent of crop which is each of the following -- bananas, potatoes, 
#|   cassava, soybeans, maize, cocoa, barley, peas, and wheat for each country. For 
#|   example, potatoes and bananas are negatively correlated and bananas and cocoa 
#|   do not seem correlated at all.
#| fig-asp: 0.8
#| out-width: 90%
#| fig-align: center
sb <- ggplot(crops_country) +
  geom_point(aes(x = prop_soybeans, y = prop_bananas), alpha = 0.7) +
  scale_x_continuous(
    limits = c(0, 6),
    labels = label_percent(scale = 1)
  ) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "% Soybeans",
    y = "% Bananas"
  )

sc <- ggplot(crops_country) +
  geom_point(aes(x = prop_soybeans, y = prop_cassava), alpha = 0.7) +
  scale_x_continuous(
    limits = c(0, 6),
    labels = label_percent(scale = 1)
  ) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "% Soybeans",
    y = "% Cassava"
  )

mc <- ggplot(crops_country) +
  geom_point(aes(x = prop_maize, y = prop_cassava), alpha = 0.7) +
  scale_x_continuous(
    limits = c(0, 15),
    labels = label_percent(scale = 1)
  ) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "% Maize",
    y = "% Cassava"
  )

peb <- ggplot(crops_country) +
  geom_point(aes(x = prop_potatoes, y = prop_bananas), alpha = 0.7) +
  scale_x_continuous(
    limits = c(0, 60),
    labels = label_percent(scale = 1)
  ) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "% Potatoes",
    y = "% Bananas"
  )

cb <- ggplot(crops_country) +
  geom_point(aes(x = prop_cocoa, y = prop_bananas), alpha = 0.7) +
  scale_x_continuous(labels = label_percent(scale = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "% Cocoa",
    y = "% Bananas"
  )

wb <- ggplot(crops_country) +
  geom_point(aes(x = prop_wheat, y = prop_barley), alpha = 0.7) +
  scale_x_continuous(labels = label_percent(scale = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "% Wheat",
    y = "% Barley"
  )

pob <- ggplot(crops_country) +
  geom_point(aes(x = prop_peas, y = prop_barley), alpha = 0.7) +
  scale_x_continuous(labels = label_percent(scale = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    x = "% Peas",
    y = "% Barley"
  )

peb + sc + mc + cb + pob + wb +
  plot_layout(ncol = 2) +
  plot_annotation(tag_levels = "A")
```

------------------------------------------------------------------------

The order of most negative correlation to most positive correlation is:

\vspace{-2mm}

$$
A \rightarrow D \rightarrow B \rightarrow C \rightarrow E \rightarrow F
$$

-   Plot A - bananas vs. potatoes: `r round(cor(crops_country$prop_potatoes, crops_country$prop_bananas, use = "pairwise.complete.obs"), digits = 2)`
-   Plot B - cassava vs. soybeans: `r round(cor(crops_country$prop_soybeans, crops_country$prop_cassava, use = "pairwise.complete.obs"), digits = 2)`
-   Plot C - cassava vs. maize: `r round(cor(crops_country$prop_maize, crops_country$prop_cassava, use = "pairwise.complete.obs"), digits = 2)`
-   Plot D - cocoa vs. bananas: `r round(cor(crops_country$prop_cocoa, crops_country$prop_bananas, use = "pairwise.complete.obs"), digits = 2)`
-   Plot E - peas vs. barley: `r round(cor(crops_country$prop_peas, crops_country$prop_barley, use = "pairwise.complete.obs"), digits = 2)`
-   Plot F - wheat vs. barley: `r round(cor(crops_country$prop_wheat, crops_country$prop_barley, use = "pairwise.complete.obs"), digits = 2)`
:::

\vspace{-3mm}

One important aspect of the correlation is that it's *unitless*.
That is, unlike a measurement of the slope of a line (see the next section) which provides an increase in the y-coordinate for a one unit increase in the x-coordinate (in units of the x and y variable), there are no units associated with the correlation of x and y.
@fig-bdims-units shows the relationship between weights and heights of 507 physically active individuals.
In @fig-bdims-units-1, weight is measured in kilograms (kg) and height in centimeters (cm).
In @fig-bdims-units-2, weight has been converted to pounds (lbs) and height to inches (in).
The correlation coefficient ($r = 0.72$) is also noted on both plots.
We can see that the shape of the relationship has not changed, and neither has the correlation coefficient.
The only visual change to the plot is the axis *labeling* of the points.

\clearpage

```{r}
#| label: fig-bdims-units
#| fig-cap: |
#|   Two scatterplots, both displaying the relationship between weights and 
#|   heights of 507 physically healthy adults and the correlation coefficient, 
#|   $r = 0.72$.
#| fig-subcap:
#|   - The units are kilograms and centimeters.
#|   - The units are pounds and inches. 
#| fig-alt: |
#|   Two scatterplots, both displaying the relationship between weights and 
#|   heights of 507 physically healthy adults. In the first plot height is
#|   measured in cm and weight is measured in kg. In the second plot height 
#|   is measured in inches and weight is measured in pounds. The images 
#|   look identical, except for the axes tick marks.
#| fig-asp: 0.5
p_1 <- ggplot(bdims, aes(x = hgt, y = wgt)) +
  geom_point(alpha = 0.8) +
  labs(x = "Height (cm)", y = "Weight (kg)") +
  stat_cor(
    aes(label = paste("r", ..r.., sep = "~`=`~")),
    geom = "label"
  )

p_2 <- bdims |>
  mutate(
    hgt = hgt * 0.393701,
    wgt = wgt * 2.20462
  ) |>
  ggplot(aes(x = hgt, y = wgt)) +
  geom_point(alpha = 0.8) +
  labs(x = "Height (in)", y = "Weight (lbs)") +
  stat_cor(
    aes(label = paste("r", ..r.., sep = "~`=`~")),
    geom = "label"
  )

p_1
p_2
```

## Least squares regression {#sec-least-squares-regression}

Fitting linear models by eye is open to criticism since it is based on an individual's preference.
In this section, we use *least squares regression* as a more rigorous approach to fitting a line to a scatterplot.

### Gift aid for first-year at Elmhurst College

This section considers a dataset on family income and gift aid data from a random sample of fifty students in the first-year class of Elmhurst College in Illinois.
Gift aid is financial aid that does not need to be paid back, as opposed to a loan.
A scatterplot of these data is shown in @fig-elmhurstScatterWLine along with a linear fit.
The line follows a negative trend in the data; students who have higher family incomes tended to have lower gift aid from the university.

::: {.guidedpractice data-latex=""}
Is the correlation positive or negative in @fig-elmhurstScatterWLine?[^07-model-slr-5]
:::

[^07-model-slr-5]: Larger family incomes are associated with lower amounts of aid, so the correlation will be negative.
    Using a computer, the correlation can be computed: -0.499.

```{r}
#| label: fig-elmhurstScatterWLine
#| fig-cap: |
#|   Gift aid and family income for a random sample of 50 first-year 
#|   students from Elmhurst College.
#| fig-alt: |
#|   Scatterplot with family income on the x-axis and gift aid on the
#|   y-axis. The relationship is moderate negative and linear.
#| fig-asp: 0.6
ggplot(elmhurst, aes(x = family_income, y = gift_aid)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(labels = label_dollar(scale = 1, suffix = "K", accuracy = 1)) +
  scale_x_continuous(labels = label_dollar(scale = 1, suffix = "K", accuracy = 1)) +
  labs(
    x = "Family income",
    y = "Gift aid from university"
  )
```

\vspace{-5mm}

### An objective measure for finding the best line

We begin by thinking about what we mean by the "best" line.
Mathematically, we want a line that has small residuals.
But beyond the mathematical reasons, hopefully it also makes sense intuitively that whatever line we fit, the residuals should be small (i.e., the points should be close to the line).
The first option that may come to mind is to minimize the sum of the residual magnitudes:

$$
|e_1| + |e_2| + \dots + |e_n|
$$

which we could accomplish with a computer program.
The resulting dashed line shown in @fig-elmhurstScatterW2Lines demonstrates this fit can be quite reasonable.

```{r}
#| label: fig-elmhurstScatterW2Lines
#| fig-cap: |
#|   Gift aid and family income for a random sample of 50 first-year Elmhurst 
#|   College students. The dashed line is the line that minimizes 
#|   the sum of the absolute value of residuals, the solid line is the 
#|   line that minimizes the sum of squared residuals, i.e., the least squares line.
#| fig-alt: |
#|   Scatterplot with family income on the x-axis and gift aid on the
#|   y-axis. The relationship is moderate negative and linear. Two lines are
#|   superimposed on the scatterplot. One line is fit to the data by minimizing
#|   the sum of squared residuals, i.e., the least squares line. The other line
#|   is fit to the data by minimizing the sum of the abolute value of the residuals.
#| fig-asp: 0.6
ggplot(elmhurst, aes(x = family_income, y = gift_aid)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = quantreg::rq, formula = y ~ x, se = FALSE, linetype = "dashed") +
  scale_y_continuous(labels = label_dollar(scale = 1, suffix = "K", accuracy = 1)) +
  scale_x_continuous(labels = label_dollar(scale = 1, suffix = "K", accuracy = 1)) +
  labs(
    x = "Family income",
    y = "Gift aid from university"
  )
```

However, a more common practice is to choose the line that minimizes the sum of the squared residuals:

$$
e_{1}^2 + e_{2}^2 + \dots + e_{n}^2
$$

\clearpage

The line that minimizes this least squares criterion is represented as the solid line in @fig-elmhurstScatterW2Lines and is commonly called the **least squares line**\index{least squares line}.
The following are three possible reasons to choose the least squares option instead of trying to minimize the sum of residual magnitudes without any squaring:

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "least squares line")
```

1.  It is the most commonly used method.
2.  Computing the least squares line is widely supported in statistical software.
3.  In many applications, a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.
4.  The analyses which link the model to inference about a population are most straightforward when the line is fit through least squares.

The first two reasons are largely for tradition and convenience; the third and fourth reasons explain why the least squares criterion is typically most helpful when working with real data.[^07-model-slr-6]

[^07-model-slr-6]: There are applications where the sum of residual magnitudes may be more useful, and there are plenty of other criteria we might consider.
    However, this book only applies the least squares criterion.

### Finding and interpreting the least squares line

For the Elmhurst data, we could write the equation of the least squares regression line as

$$
\widehat{\texttt{aid}} = \beta_0 + \beta_{1}\times \texttt{family\_income}
$$

Here the equation is set up to predict gift aid based on a student's family income, which would be useful to students considering Elmhurst.
These two values, $\beta_0$ and $\beta_1,$ are the parameters of the regression line.

The parameters are estimated using the observed data.
In practice, this estimation is done using a computer in the same way that other estimates, like a sample mean, can be estimated using a computer or calculator.

The dataset where these data are stored is called `elmhurst`.
The first 5 rows of this dataset are given in @tbl-elmhurst-data.

```{r}
#| label: tbl-elmhurst-data
#| tbl-cap: First five rows of the `elmhurst` dataset.
#| tbl-pos: H
elmhurst |>
  slice_head(n = 5) |>
  kbl(linesep = "", booktabs = TRUE) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped"), full_width = FALSE
  ) |>
  column_spec(1:3, width = "8em")
```

We can see that family income is recorded in a variable called `family_income` and gift aid from university is recorded in a variable called `gift_aid`.
For now, we won't worry about the `price_paid` variable.
We should also note that these data are from the 2011-2012 academic year, and all monetary amounts are given in \$1,000s, i.e., the family income of the first student in the data shown in @tbl-elmhurst-data is \$92,920 and they received a gift aid of \$21,700.
(The data source states that all numbers have been rounded to the nearest whole dollar.)

Statistical software is usually used to compute the least squares line and the typical output generated as a result of fitting regression models looks like the one shown in @tbl-rOutputForIncomeAidLSRLine.
For now we will focus on the first column of the output, which lists ${b}_0$ and ${b}_1.$ In @sec-inf-model-slr we will dive deeper into the remaining columns which give us information on how accurate and precise these values of intercept and slope that are calculated from a sample of 50 students are in estimating the population parameters of intercept and slope for *all* students.

```{r}
m_ga_fi <- lm(gift_aid ~ family_income, data = elmhurst) |> tidy()
m_ga_fi_int <- m_ga_fi |>
  filter(term == "(Intercept)") |>
  pull(estimate)
m_ga_fi_slope <- m_ga_fi |>
  filter(term == "family_income") |>
  pull(estimate)
```

```{r}
#| label: tbl-rOutputForIncomeAidLSRLine
#| tbl-cap: Summary of least squares fit for the Elmhurst data.
#| tbl-pos: H
m_ga_fi |>
  mutate(p.value = ifelse(p.value < .0001, "<0.0001", round(p.value, 4))) |>
  kbl(
    linesep = "", booktabs = TRUE,
    digits = 2, align = "lrrrr"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  column_spec(1, width = "15em", monospace = TRUE) |>
  column_spec(2:5, width = "5em")
```

The model output tells us that the intercept is approximately `r m_ga_fi_int` and the slope on `family_income` is approximately `r m_ga_fi_slope`.

But what do these values mean?
Interpreting parameters in a regression model is often one of the most important steps in the analysis.

::: {.workedexample data-latex=""}
The intercept and slope estimates for the Elmhurst data are $b_0$ = `r m_ga_fi_int` and $b_1$ = `r m_ga_fi_slope`.
What do these numbers really mean?

------------------------------------------------------------------------

Interpreting the slope parameter is helpful in almost any application.
For each additional \$1,000 of family income, we would expect a student to receive a net difference of 1,000 $\times$ (-0.0431) = -\$43.10 in aid on average, i.e., \$43.10 *less*.
Note that a higher family income corresponds to less aid because the coefficient of family income is negative in the model.
We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational.
That is, increasing a particular student's family income may not cause the student's aid to drop.
(Although it would be reasonable to contact the college and ask if the relationship is causal, i.e., if Elmhurst College's aid decisions are partially based on students' family income.)

The estimated intercept $b_0$ = `r m_ga_fi_int` describes the average aid if a student's family had no income, \$`r format(round(m_ga_fi_int*1000,0), scientific = FALSE, big.mark = ",")`.
The meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \$0.
In other applications, the intercept may have little or no practical value if there are no observations where $x$ is near zero.
:::

::: {.important data-latex=""}
**Interpreting parameters estimated by least squares.**

The slope describes the estimated difference in the predicted average outcome of $y$ if the predictor variable $x$ happened to be one unit larger.
The intercept describes the average outcome of $y$ if $x = 0$ *and* the linear model is valid all the way to $x = 0$ (values of $x = 0$ are not observed or relevant in many applications).
:::

If you would like to learn more about using R to fit linear models, see @sec-model-tutorials for the interactive R tutorials.
An alternative way of calculating the values of intercept and slope of a least squares line is manual calculations using formulas.
While manual calculations are not commonly used by practicing statisticians and data scientists, it is useful to work through the first time you're learning about the least squares line and modeling in general.
Calculating the values by hand leverages two properties of the least squares line:

1.  The slope of the least squares line can be estimated by

$$
b_1 = \frac{s_y}{s_x} r
$$

where $r$ is the correlation between the two variables, and $s_x$ and $s_y$ are the sample standard deviations of the predictor and outcome, respectively.

2.  If $\bar{x}$ is the sample mean of the predictor variable and $\bar{y}$ is the sample mean of the outcome variable, then the point $(\bar{x}, \bar{y})$ falls on the least squares line.

@tbl-summaryStatsElmhurstRegr shows the sample means for the family income and gift aid as \$101,780 and \$19,940, respectively.
We could plot the point $(102, 19.9)$ on @fig-elmhurstScatterWLine to verify it falls on the least squares line (the solid line).

```{r}
#| label: tbl-summaryStatsElmhurstRegr
#| tbl-cap: Summary statistics for family income and gift aid.
#| tbl-pos: H
elmhurst |>
  select(family_income, gift_aid) |>
  summarise(
    fi_m = mean(family_income),
    fi_s = sd(family_income),
    ga_m = mean(gift_aid),
    ga_s = sd(gift_aid),
    r    = cor(family_income, gift_aid)
  ) |>
  kbl(
    linesep = "", booktabs = TRUE,
    col.names = c("mean", "sd", "mean", "sd", "r"),
    align = "ccccc"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped"), full_width = FALSE
  ) |>
  add_header_above(c("Family income, x" = 2, "Gift aid, y" = 2, " " = 1)) |>
  column_spec(1:5, width = "6em")
```

Next, we find the point estimates $b_0$ and $b_1$ of the parameters $\beta_0$ and $\beta_1.$

::: {.workedexample data-latex=""}
Using the summary statistics in @tbl-summaryStatsElmhurstRegr, compute the slope for the regression line of gift aid against family income.

------------------------------------------------------------------------

Compute the slope using the summary statistics from @tbl-summaryStatsElmhurstRegr:

$$
b_1 = \frac{s_y}{s_x} r = \frac{5.46}{63.2}(-0.499) = -0.0431
$$
:::

You might recall the form of a line from math class, which we can use to find the model fit, including the estimate of $b_0.$ Given the slope of a line and a point on the line, $(x_0, y_0),$ the equation for the line can be written as

$$
y - y_0 = slope\times (x - x_0)
$$

::: {.important data-latex=""}
**Identifying the least squares line from summary statistics.**

To identify the least squares line from summary statistics:

-   Estimate the slope parameter, $b_1 = (s_y / s_x) r.$
-   Note that the point $(\bar{x}, \bar{y})$ is on the least squares line, use $x_0 = \bar{x}$ and $y_0 = \bar{y}$ with the point-slope equation: $y - \bar{y} = b_1 (x - \bar{x}).$
-   Simplify the equation, we get $y = \bar{y} - b_1 \bar{x} + b_1 x,$ which reveals that $b_0 = \bar{y} - b_1 \bar{x}.$
:::

::: {.workedexample data-latex=""}
Using the point (102, 19.9) from the sample means and the slope estimate $b_1 = -0.0431,$ find the least-squares line for predicting aid based on family income.

------------------------------------------------------------------------

Apply the point-slope equation using $(102, 19.9)$ and the slope $b_1 = -0.0431$:

\vspace{-5mm}

$$
\begin{aligned}
y - y_0  &= b_1 (x - x_0) \\
y - 19.9 &= -0.0431 (x - 102)
\end{aligned}
$$

Expanding the right side and then adding 19.9 to each side, the equation simplifies:

\vspace{-5mm}

$$
\widehat{\texttt{aid}} = 24.3 - 0.0431 \times \texttt{family\_income}
$$

Here we have replaced $y$ with $\widehat{\texttt{aid}}$ and $x$ with $\texttt{family\_income}$ to put the equation in context.
The final least squares equation should always include a "hat" on the variable being predicted, whether it is a generic $``y"$ or a named variable like $``aid"$.
:::

\clearpage

::: {.workedexample data-latex=""}
Suppose a high school senior is considering Elmhurst College.
Can they simply use the linear equation that we have estimated to calculate her financial aid from the university?

------------------------------------------------------------------------

She may use it as an estimate, though some qualifiers on this approach are important.
First, all data come from one first-year class, and the way aid is determined by the university may change from year to year.
Second, the equation will provide an imperfect estimate.
While the linear equation is good at modeling the trend in the data, no individual student's aid will be perfectly predicted (as can be seen from the individual data points around the line).
:::

### Extrapolation is treacherous

Linear models can be used to approximate the relationship between two variables.
However, like any model, they have real limitations.
Linear regression is simply a modeling framework.
The truth is almost always much more complex than a simple line.
For example, we do not know how the data outside of our limited window will behave.

::: {.workedexample data-latex=""}
Use the model $\widehat{\texttt{aid}} = 24.3 - 0.0431 \times \texttt{family\_income}$ to estimate the aid of another first-year student whose family had income of \$1 million.

------------------------------------------------------------------------

We want to calculate the aid for a family with \$1 million income.
Note that in our model this will be represented as 1,000 since the data are in \$1,000s.

\vspace{-5mm}

$$
24.3 - 0.0431 \times 1000 = -18.8
$$

The model predicts this student will have -\$18,800 in aid (!).
However, Elmhurst College does not offer *negative aid* where they select some students to pay extra on top of tuition to attend.
:::

\vspace{-5mm}

Applying a model estimate to values outside of the realm of the original data is called **extrapolation**\index{extrapolation}.
Generally, a linear model is only an approximation of the real relationship between two variables.
If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "extrapolation")
```

### Describing the strength of a fit {#sec-r-squared}

We evaluated the strength of the linear relationship between two variables earlier using the correlation, $r.$ However, it is more common to explain the strength of a linear fit using $R^2,$ called **R-squared**\index{R-squared}.
If provided with a linear model, we might like to describe how closely the data cluster around the linear fit.

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "R-squared")
```

The $R^2$ of a linear model describes the amount of variation in the outcome variable that is explained by the least squares line.
For example, consider the Elmhurst data, shown in @fig-elmhurstScatterWLine).
The variance of the outcome variable, aid received, is about $s_{aid}^2 \approx 29.8$ million (calculated from the data, some of which is shown in @tbl-elmhurst-data).
However, if we apply our least squares line, then this model reduces our uncertainty in predicting aid using a student's family income.
The variability in the residuals describes how much variation remains after using the model: $s_{_{RES}}^2 \approx 22.4$ million.
In short, there was a reduction of

$$
\frac{s_{aid}^2 - s_{_{RES}}^2}{s_{aid}^2}
  = \frac{29800 - 22400}{29800}
  = \frac{7500}{29800}
  \approx 0.25,
$$

or about 25%, of the outcome variable's variation by using information about family income for predicting aid using a linear model.
It turns out that $R^2$ corresponds exactly to the squared value of the correlation:

$$
r = -0.499 \rightarrow R^2 = 0.25
$$

\clearpage

::: {.guidedpractice data-latex=""}
If a linear model has a very strong negative relationship with a correlation of -0.97, how much of the variation in the outcome is explained by the predictor?[^07-model-slr-8]
:::

[^07-model-slr-8]: About $R^2 = (-0.97)^2 = 0.94$ or 94% of the variation in the outcome variable is explained by the linear model.

$R^2$ is also called the **coefficient of determination**\index{coefficient of determination}.

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "coefficient of determination")
```

::: {.important data-latex=""}
**Coefficient of determination: proportion of variability in the outcome variable explained by the model.**

Since $r$ is always between -1 and 1, $R^2$ will always be between 0 and 1.
This statistic is called the **coefficient of determination**, and it measures the proportion of variation in the outcome variable, $y,$ that can be explained by the linear model with predictor $x.$
:::

More generally, $R^2$ can be calculated as a ratio of a measure of variability around the line divided by a measure of total variability.

::: {.important data-latex=""}
**Sums of squares to measure variability in** $y.$

We can measure the variability in the $y$ values by how far they tend to fall from their mean, $\bar{y}.$ We define this value as the **total sum of squares**\index{total sum of squares}, calculated using the formula below, where $y_i$ represents each $y$ value in the sample, and $\bar{y}$ represents the mean of the $y$ values in the sample.

\vspace{-2mm}

$$
SST = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2.
$$

Left-over variability in the $y$ values if we know $x$ can be measured by the **sum of squared errors**\index{sum of squared errors}, or sum of squared residuals, calculated using the formula below, where $\hat{y}_i$ represents the predicted value of $y_i$ based on the least squares regression.[^07-model-slr-9],

\vspace{-2mm}

$$
\begin{aligned}
SSE &= (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2\\
&= e_{1}^2 + e_{2}^2 + \dots + e_{n}^2
\end{aligned}
$$

The coefficient of determination can then be calculated as

$$
R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}
$$
:::

[^07-model-slr-9]: The difference $SST - SSE$ is called the **regression sum of squares**\index{regression sum of squares}, $SSR,$ and can also be calculated as $SSR = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + \cdots + (\hat{y}_n - \bar{y})^2.$ $SSR$ represents the variation in $y$ that was accounted for in our model.

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "total sum of squares", "sum of squared error", "regression sum of squares")
```

::: {.workedexample data-latex=""}
Among 50 students in the `elmhurst` dataset, the total variability in gift aid is $SST = 1461$.[^07-model-slr-10]
The sum of squared residuals is $SSE = 1098.$ Find $R^2.$

------------------------------------------------------------------------

Since we know $SSE$ and $SST,$ we can calculate $R^2$ as

$$
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1098}{1461} = 0.25,
$$

the same value we found when we squared the correlation: $R^2 = (-0.499)^2 = 0.25.$
:::

[^07-model-slr-10]: $SST$ can be calculated by finding the sample variance of the outcome variable, $s^2$ and multiplying by $n-1.$

\clearpage

### Categorical predictors with two levels {#sec-categorical-predictor-two-levels}

Categorical variables are also useful in predicting outcomes.
Here we consider a categorical predictor with two levels (recall that a *level* is the same as a *category*).
We'll consider Ebay auctions for a video game, *Mario Kart* for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded.
Here we want to predict total price based on game condition, which takes values `used` and `new`.

::: {.data data-latex=""}
The [`mariokart`](http://openintrostat.github.io/openintro/reference/mariokart.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::

```{r}
mariokart_lt100 <- mariokart |>
  filter(total_pr < 100) |>
  mutate(cond = fct_relevel(cond, "used", "new"))
```

A plot of the auction data is shown in @fig-marioKartNewUsed.
Note that the original dataset contains some Mario Kart games being sold at prices above \$100 but for this analysis we have limited our focus to the `r nrow(mariokart_lt100)` Mario Kart games that were sold below \$100.

```{r}
#| label: fig-marioKartNewUsed
#| fig-cap: |
#|   Total auction prices for the video game Mario Kart, divided into used 
#|   ($x = 0$) and new ($x = 1$) condition games. The least squares regression 
#|   line is also shown.
#| fig-alt: |
#|   Scatterplot of Mario Kart data with condion (used or new) on the 
#|   x-axis and total price on the y-axis. Despite the x-axis being categorical, a 
#|   least squares line is fit to the data with the value of used set to 0 and the 
#|   value of new set to 1.
mariokart_lt100 |>
  filter(total_pr < 100) |>
  ggplot(aes(x = as.numeric(cond) - 1, y = total_pr)) +
  geom_jitter(alpha = 0.8, position = position_jitter(width = 0.05)) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  scale_x_continuous(
    breaks = c(0, 1),
    minor_breaks = c(0, 1),
    limits = c(-0.2, 1.2),
    labels = c("0\n(used)", "1\n(new)")
  ) +
  labs(
    x = "Condition",
    y = "Total price"
  )
```

To incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form.
We will do so using an **indicator variable**\index{variable!indicator}\index{indicator variable} called `condnew`, which takes value 1 when the game is new and 0 when the game is used.
Using this indicator variable, the linear model may be written as

$$
\widehat{\texttt{price}} = b_0 + b_1 \times \texttt{condnew}
$$

The parameter estimates are given in @tbl-marioKartNewUsedRegrSummary.

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "indicator variable")
```

```{r}
#| label: tbl-marioKartNewUsedRegrSummary
#| tbl-cap: |
#|   Least squares regression summary for the final auction price against the
#|   condition of the game.
#| tbl-pos: H
m_total_pr_cond <- lm(total_pr ~ cond, data = mariokart_lt100)
m_total_pr_cond_tidy <- tidy(m_total_pr_cond)

m_total_pr_cond_int <- m_total_pr_cond_tidy |>
  filter(term == "(Intercept)") |>
  pull(estimate) |>
  round(2)
m_total_pr_cond_slope <- m_total_pr_cond_tidy |>
  filter(term == "condnew") |>
  pull(estimate) |>
  round(2)

m_total_pr_cond_tidy |>
  mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
  kbl(
    linesep = "", booktabs = TRUE,
    digits = 2, align = "lrrrr"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  column_spec(1, width = "15em", monospace = TRUE) |>
  column_spec(2:5, width = "5em")
```

Using values from @tbl-marioKartNewUsedRegrSummary, the model equation can be summarized as

$$
\widehat{\texttt{price}} = `r m_total_pr_cond_int` + `r m_total_pr_cond_slope` \times \texttt{condnew}
$$

::: {.workedexample data-latex=""}
Interpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions.

------------------------------------------------------------------------

The intercept is the estimated price when `condnew` has a value 0, i.e., when the game is in used condition.
That is, the average selling price of a used version of the game is \$42.9.
The slope indicates that, on average, new games sell for about \$10.9 more than used games.
:::

::: {.important data-latex=""}
**Interpreting model estimates for categorical predictors.**

The estimated intercept is the value of the outcome variable for the first category (i.e., the category corresponding to an indicator value of 0).
The estimated slope is the average change in the outcome variable between the two categories.
:::

Note that, fundamentally, the intercept and slope interpretations do not change when modeling categorical variables with two levels.
However, when the predictor variable is binary, the coefficient estimates ($b_0$ and $b_1$) are directly interpretable with respect to the dataset at hand.

We'll elaborate further on modeling categorical predictors in @sec-model-mlr, where we examine the influence of many predictor variables simultaneously using multiple regression.

## Outliers in linear regression {#sec-outliers-in-regression}

In this section, we discuss when outliers are important and influential.
Outliers in a regression model with one predictor and one outcome are observations that fall far from the cloud of points.
These points are especially important because they can have a strong influence on the least squares line.
Note that there are times when observations are outlying in the $x$ direction, the $y$ direction, or both.
However, being outlying in a univariate sense (either $x$ or $y$ or both) is not outlying from the bivariate model.
If the points are in-line with the bivariate model, they will not influence the least squares regression line (even if the observations are outlying in the $x$ or $y$ or both directions!).

::: {.workedexample data-latex=""}
There are three plots shown in @fig-outlier-plots-1 along with the corresponding least squares line and residual plots.
For each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.
Recall that an outlier is any point that does not appear to belong with the vast majority of the other points.

------------------------------------------------------------------------

A: There is one outlier far from the other points (in the $y$ direction and it is an outlier of the bivariate model), though it only appears to slightly influence the line.

B: There is one outlier on the right (in the $x$ and $y$ direction although it is not an outlier of the bivariate model), though it is quite close to the least squares line, which suggests it wasn't very influential.

C: There is one point far away from the cloud (in the $x$ and $y$ direction and an outlier of the bivariate model), and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud does not appear to fit very well.
:::

::: {.workedexample data-latex=""}
There are three plots shown in @fig-outlier-plots-2 along with the least squares line and residual plots.
As you did in previous exercise, for each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.
Recall that an outlier is any point that does not appear to belong with the vast majority of the other points. A point can be outlying in the $x$ direction, in the $y$ direction, or in relation to the bivariate model.

------------------------------------------------------------------------

D: There is a primary cloud and then a small secondary cloud of four outliers (with respect to both $x$ and the bivariate model).
The secondary cloud appears to be influencing the line somewhat strongly, making the least square line fit poorly almost everywhere.
There might be an interesting explanation for the dual clouds, which is something that could be investigated.

E: There is no obvious trend in the main cloud of points and the outlier on the right (with respect to both $x$ and $y$) appears to largely (and problematically) control the slope of the least squares line.
The point creates a bivariate model when seemingly there is none. 

F: There is one outlier far from the cloud (with respect to both $x$ and $y$).
However, it falls quite close to the least squares line and does not appear to be very influential (it is not outlying with respect to the bivariate model).
:::

```{r}
#| label: fig-outlier-plots
#| fig-cap: |
#|   Plots of six datasets, each with a least squares line and 
#|   corresponding residual plot. Each dataset has at least one outlier.
#| fig-subcap: 
#|   - Plots A, B, and C
#|   - Plots D, E, and F
#| fig-alt: |
#|   A grid of 2 by 3 scatterplots with fabricated data, all of which 
#|   contain at least one outlying observation. The top row of plots contains
#|   original x-y data plots with a least squares regression line. The bottom row 
#|   of plots is a series of residual plot with predicted value on the x-axis and 
#|   residual on the y-axis. The first column of plots gives an example of 
#|   an outlier in the bivariate model which does not impact the least squares
#|   regression line. The second column gives an example of a point which is
#|   outlying in both the x and y directions but it is not a model outlier.
#|   The third column gives an example of a point that pulls the regression
#|   line toward it, impacting the least squares line. The fourth column of 
#|   plots gives an example of a small cloud of points that drags the 
#|   regression line toward it. The fifth column gives an example of a point 
#|   which is outlying enough in both the x and y direction that it changes 
#|   a model which has no linearity into what seems like a strong positive
#|   relationship. The sixth column gives an example of a point that is far 
#|   in both the x and y directions but is in-line with the regression model 
#|   and has very little impact on the least squares regression line.
#| out-width: 100%
#| fig-asp: 0.4
# abc
d1 <- simulated_scatter |>
  filter(group == 24) |>
  mutate(outlier = if_else(y == min(y), TRUE, FALSE))
d2 <- simulated_scatter |>
  filter(group == 25) |>
  mutate(outlier = if_else(y == min(y), TRUE, FALSE))
d3 <- simulated_scatter |>
  filter(group == 26) |>
  mutate(outlier = if_else(y == max(y), TRUE, FALSE))

m1_aug <- augment(lm(y ~ x, data = d1)) |>
  mutate(outlier = if_else(y == min(y), TRUE, FALSE))

m2_aug <- augment(lm(y ~ x, data = d2)) |>
  mutate(outlier = if_else(y == min(y), TRUE, FALSE))

m3_aug <- augment(lm(y ~ x, data = d3)) |>
  mutate(outlier = if_else(y == max(y), TRUE, FALSE))

p_1 <- ggplot(d1, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(
    data = d1 |> filter(outlier),
    size = 5, shape = "circle open",
    color = IMSCOL["red", "full"], stroke = 2
  ) +
  labs(title = "A") +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    panel.border = element_rect(colour = "gray", fill = NA, size = 1)
  ) +
  scale_x_continuous(expand = expansion(mult = 0.12)) +
  scale_y_continuous(expand = expansion(mult = 0.12))


p_1_res <- ggplot(m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(
    data = m1_aug |> filter(outlier),
    size = 5, shape = "circle open",
    color = IMSCOL["red", "full"], stroke = 2
  ) +
  labs(x = "Predicted y", y = "Residual") +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    panel.border = element_rect(colour = "gray", fill = NA, size = 1)
  ) +
  scale_x_continuous(expand = expansion(mult = 0.12)) +
  scale_y_continuous(limits = c(-8, 8), expand = expansion(mult = 0.12))

p_2 <- ggplot(d2, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(
    data = d2 |> filter(outlier),
    size = 5, shape = "circle open",
    color = IMSCOL["green", "full"], stroke = 2
  ) +
  labs(title = "B") +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    panel.border = element_rect(colour = "gray", fill = NA, size = 1)
  ) +
  scale_x_continuous(expand = expansion(mult = 0.12)) +
  scale_y_continuous(expand = expansion(mult = 0.12))

p_2_res <- ggplot(m2_aug, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(
    data = m2_aug |> filter(outlier),
    size = 5, shape = "circle open",
    color = IMSCOL["green", "full"], stroke = 2
  ) +
  labs(x = "Predicted y", y = "Residual") +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    panel.border = element_rect(colour = "gray", fill = NA, size = 1)
  ) +
  scale_x_continuous(expand = expansion(mult = 0.12)) +
  scale_y_continuous(limits = c(-8, 8), expand = expansion(mult = 0.12))

p_3 <- ggplot(d3, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(
    data = d3 |> filter(outlier),
    size = 5, shape = "circle open",
    color = IMSCOL["pink", "full"], stroke = 2
  ) +
  labs(title = "C") +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    panel.border = element_rect(colour = "gray", fill = NA, size = 1)
  ) +
  scale_x_continuous(expand = expansion(mult = 0.12)) +
  scale_y_continuous(expand = expansion(mult = 0.12))

p_3_res <- ggplot(m3_aug, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(
    data = m3_aug |> filter(outlier),
    size = 5, shape = "circle open",
    color = IMSCOL["pink", "full"], stroke = 2
  ) +
  labs(x = "Predicted y", y = "Residual") +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    panel.border = element_rect(colour = "gray", fill = NA, size = 1)
  ) +
  scale_x_continuous(expand = expansion(mult = 0.12)) +
  scale_y_continuous(limits = c(-8, 8), expand = expansion(mult = 0.12))

p_1 + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
  p_2 + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_3 +
  p_1_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
  p_2_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_3_res +
  plot_layout(ncol = 3, heights = c(2, 1))

# def
d4 <- simulated_scatter |>
  filter(group == 27)

d5 <- simulated_scatter |> filter(group == 28)
d6 <- simulated_scatter |> filter(group == 29)

m4_aug <- augment(lm(y ~ x, data = d4))
m5_aug <- augment(lm(y ~ x, data = d5))
m6_aug <- augment(lm(y ~ x, data = d6))

p_4 <- ggplot(d4, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(
        panel.grid = element_blank(),
        axis.text = element_blank(),
        panel.border = element_rect(colour = "gray", fill = NA, size = 1)
    ) +
  labs(title = "D")

p_4_res <- ggplot(m4_aug, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted y", y = "Residual") +
  theme(
        panel.grid = element_blank(),
        axis.text = element_blank(),
        panel.border = element_rect(colour = "gray", fill = NA, size = 1)
    ) +
  ylim(-4, 4)

p_5 <- ggplot(d5, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(
        panel.grid = element_blank(),
        axis.text = element_blank(),
        panel.border = element_rect(colour = "gray", fill = NA, size = 1)
    ) +
  labs(title = "E")

p_5_res <- ggplot(m5_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted y", y = "Residual") +
  theme(
        panel.grid = element_blank(),
        axis.text = element_blank(),
        panel.border = element_rect(colour = "gray", fill = NA, size = 1)
    ) +
  ylim(-4, 4)

p_6 <- ggplot(d6, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(
        panel.grid = element_blank(),
        axis.text = element_blank(),
        panel.border = element_rect(colour = "gray", fill = NA, size = 1)
    ) +
  labs(title = "F")

p_6_res <- ggplot(m6_aug, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted y", y = "Residual") +
  theme(
        panel.grid = element_blank(),
        axis.text = element_blank(),
        panel.border = element_rect(colour = "gray", fill = NA, size = 1)
    ) +
  ylim(-4, 4)

p_4 + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
  p_5 + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_6 +
  p_4_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
  p_5_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_6_res +
  plot_layout(ncol = 3, heights = c(2, 1))
```

\clearpage

Examine the residual plots in @fig-outlier-plots-1 and @fig-outlier-plots-2.
In Plots C, D, and E, you will probably find that there are a few observations which are both away from the remaining points along the x-axis and not in the trajectory of the trend in the rest of the data.
In these cases, the outliers influenced the slope of the least squares lines.
In Plot E, the bulk of the data show no clear trend, but if we fit a line to these data, we impose a trend where there isn't really one.

A good practice for dealing with outlying observations is to produce two analyses: one with and one without the outlying observations. Presenting both analyses to a client and discussing the role of the outlying observations should lead you to a more holistic understanding of the appropriate model for the data.

::: {.important data-latex=""}
**Leverage.**

Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with **high leverage**\index{high leverage} or **leverage points**\index{leverage point}.
:::

Points that fall horizontally far from the line are points of high leverage; these points can strongly influence the slope of the least squares line.
If one of these high leverage points does appear to actually invoke its influence on the slope of the line -- as in Plots C, D, and E of @fig-outlier-plots-1 and @fig-outlier-plots-2 -- then we call it an **influential point**\index{influential point}.
Usually we can say a point is influential if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.

```{r}
#| include: false
terms_chp_07 <- c(terms_chp_07, "high leverage", "influential point", "outlier", "leverage point")
```

::: {.important data-latex=""}
**Types of outliers.**\index{outlier}

A point (or a group of points) that stands out from the rest of the data is called an outlier.
Outliers that fall horizontally away from the center of the cloud of points are called leverage points.
Outliers that influence on the slope of the line are called influential points.
:::

It is tempting to remove outliers.
Don't do this without a very good reason.
Models that ignore exceptional (and interesting) cases often perform poorly.
For instance, if a financial firm ignored the largest market swings -- the "outliers" -- they would soon go bankrupt by making poorly thought-out investments.

\clearpage

## Chapter review {#sec-chp7-review}

### Summary

Throughout this chapter, the nuances of the linear model have been described.
You have learned how to create a linear model with explanatory variables that are numerical (e.g., total possum length) and those that are categorical (e.g., whether a video game was new).
The residuals in a linear model are an important metric used to understand how well a model fits; high leverage points, influential points, and other types of outliers can impact the fit of a model.
Correlation is a measure of the strength and direction of the linear relationship of two variables, without specifying which variable is the explanatory and which is the outcome.
Future chapters will focus on generalizing the linear model from the sample of data to claims about the population of interest.

### Terms

The terms introduced in this chapter are presented in @tbl-terms-chp-07.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.

```{r}
#| label: tbl-terms-chp-07
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_07)
```

\clearpage

## Exercises {#sec-chp7-exercises}

Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-07].

::: {.exercises data-latex=""}
{{< include exercises/_07-ex-model-slr.qmd >}}
:::