cienciadedatos · beatrizmilz · Nov 20, 2023 · Nov 25, 2023 · Nov 30, 2023 · Jan 19, 2024
diff --git a/.github/workflows/build_book.yaml b/.github/workflows/build_book.yaml
@@ -23,8 +23,8 @@ jobs:
     steps:
       - uses: actions/checkout@v2
 
-      - name: Install Quarto
-        uses: quarto-dev/quarto-actions/install-quarto@v1
+      - name: Set up Quarto
+        uses: quarto-dev/quarto-actions/setup@v2
         with:
           # To install LaTeX to build PDF book
           tinytex: true

diff --git a/EDA.qmd b/EDA.qmd
@@ -73,7 +73,7 @@ You can see variation easily in real life; if you measure any continuous variabl
 This is true even if you measure quantities that are constant, like the speed of light.
 Each of your measurements will include a small amount of error that varies from measurement to measurement.
 Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments).
-Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations.
+Every variable has its own pattern of variation, which can reveal interesting information about how it varies between measurements on the same observation as well as across observations.
 The best way to understand that pattern is to visualize the distribution of the variable's values, which you've learned about in @sec-data-visualization.
 
 We'll start our exploration by visualizing the distribution of weights (`carat`) of \~54,000 diamonds from the `diamonds` dataset.
@@ -597,7 +597,7 @@ ggplot(smaller, aes(x = carat, y = price)) +
 ```
 
 `cut_width(x, width)`, as used above, divides `x` into bins of width `width`.
-By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summaries a different number of points.
+By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarizes a different number of points.
 One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
 
 #### Exercises
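
Reviewer note on the `varwidth` sentence in the EDA.qmd hunk above: a minimal sketch of what the corrected sentence describes, assuming `smaller` is `diamonds` restricted to `carat < 3` (that definition lies outside this hunk, so it is an assumption here) and a bin width of 0.1 chosen for illustration.

```r
library(ggplot2)

# Assumed definition of `smaller`; the hunk only shows it being used.
smaller <- subset(diamonds, carat < 3)

# Bin carat into intervals of width 0.1 and draw one boxplot per bin.
# varwidth = TRUE scales each box's width with the number of points in it,
# so sparsely populated bins are visibly narrower.
p <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

p
```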

diff --git a/communication.qmd b/communication.qmd
@@ -185,9 +185,10 @@ This useful package will automatically adjust labels so that they don't overlap:
 
 ```{r}
 #| fig-alt: |
-#|   Scatterplot of highway fuel efficiency versus engine size of cars, where 
-#|   points are colored according to the car class. Some points are labelled 
-#|   with the car's name. The labels are box with white, transparent background 
+#|   Scatterplot of highway mileage versus engine size where points are colored 
+#|   by drive type. Smooth curves for each drive type are overlaid. 
+#|   Text labels identify the curves as front-wheel, rear-wheel, and 4-wheel.
+#|   The labels are boxed with a white background 
 #|   and positioned to not overlap.
 
 ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
@@ -364,7 +365,7 @@ ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
 You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether.
 This can be useful for maps, or for publishing plots where you can't share the absolute numbers.
 You can also use `breaks` and `labels` to control the appearance of legends.
-For discrete scales for categorical variables, `labels` can be a named list of the existing levels names and the desired labels for them.
+For discrete scales for categorical variables, `labels` can be a named list of the existing level names and the desired labels for them.
 
 ```{r}
 #| fig-alt: |
@@ -390,7 +391,7 @@ Note that `breaks` is in the original scale of the data.
 #| fig-alt: |
 #|   Two side-by-side box plots of price versus cut of diamonds. The outliers 
 #|   are transparent. On both plots the x-axis labels are formatted as dollars.
-#|   The x-axis labels on the plot start at $0 and go to $15,000, increasing 
+#|   The x-axis labels on the left plot start at $0 and go to $15,000, increasing 
 #|   by $5,000. The x-axis labels on the right plot start at $1K and go to 
 #|   $19K, increasing by $6K. 
 
@@ -461,7 +462,7 @@ The theme setting `legend.position` controls where the legend is drawn:
 #| fig-alt: |
 #|   Four scatterplots of highway fuel efficiency versus engine size of cars 
 #|   where points are colored based on class of car. Clockwise, the legend 
-#|   is placed on the right, left, top, and bottom of the plot.
+#|   is placed on the right, left, bottom, and top of the plot.
 
 base <- ggplot(mpg, aes(x = displ, y = hwy)) +
   geom_point(aes(color = class))
@@ -575,7 +576,7 @@ This will also help ensure your plot is interpretable in black and white.
 
 ```{r}
 #| fig-alt: |
-#|   Two scatterplots of highway mileage versus engine size where both color 
+#|   Scatterplot of highway mileage versus engine size where both color 
 #|   and shape of points are based on drive type. The color palette is not 
 #|   the default ggplot2 palette.
 
@@ -686,8 +687,9 @@ Subsetting the data has affected the x and y scales as well as the smooth curve.
 #| fig-width: 4
 #| message: false
 #| fig-alt: |
-#|   On the left, scatterplot of highway mileage vs. displacement, with 
-#|   displacement. The smooth curve overlaid shows a decreasing, and then 
+#|   On the left, scatterplot of highway mileage vs. displacement 
+#|   where points are colored by drive type. 
+#|   The smooth curve overlaid shows a decreasing, and then 
 #|   increasing trend, like a hockey stick. On the right, same variables 
 #|   are plotted with displacement ranging only from 5 to 6 and highway 
 #|   mileage ranging only from 10 to 25. The smooth curve overlaid shows a 
@@ -969,10 +971,9 @@ In the following, `|` places the `p1` and `p3` next to each other and `/` moves
 #| fig-alt: |
 #|   Three plots laid out such that first and third plot are next to each other 
 #|   and the second plot stretched beneath them. The first plot is a 
-#|   scatterplot of highway mileage versus engine size, third plot is a 
-#|   scatterplot of highway mileage versus city mileage, and the third plot is 
-#|   side-by-side boxplots of highway mileage versus drive train) placed next 
-#|   to each other.
+#|   scatterplot of highway mileage versus engine size, the third plot is a 
+#|   scatterplot of highway mileage versus city mileage, and the second plot is 
+#|   side-by-side boxplots of highway mileage versus drive train. 
 
 p3 <- ggplot(mpg, aes(x = cty, y = hwy)) + 
   geom_point() + 

diff --git a/data-import.qmd b/data-import.qmd
@@ -56,7 +56,7 @@ read_csv("data/students.csv") |>
 
 We can read this file into R using `read_csv()`.
 The first argument is the most important: the path to the file.
-You can think about the path as the address of the file: the file is called `students.csv` and that it lives in the `data` folder.
+You can think about the path as the address of the file: the file is called `students.csv` and it lives in the `data` folder.
 
 ```{r}
 #| message: true
@@ -88,7 +88,7 @@ students
 
 In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
 This is something we can address using the `na` argument.
-By default, `read_csv()` only recognizes empty strings (`""`) in this dataset as `NA`s, we want it to also recognize the character string `"N/A"`.
+By default, `read_csv()` only recognizes empty strings (`""`) in this dataset as `NA`s, and we want it to also recognize the character string `"N/A"`.
 
 ```{r}
 #| message: false
@@ -131,7 +131,7 @@ students |>
 Note that the values in the `meal_plan` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
 You'll learn more about factors in @sec-factors.
 
-Before you analyze these data, you'll probably want to fix the `age` and `id` columns.
+Before you analyze these data, you'll probably want to fix the `age` column.
 Currently, `age` is a character variable because one of the observations is typed out as `five` instead of a numeric `5`.
 We discuss the details of fixing this issue in @sec-import-spreadsheets.
 

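Reviewer note on the `na` sentence in the data-import.qmd hunk above: a minimal sketch of the behavior the reworded sentence describes, assuming readr 2.0 or later (for `I()` literal input). The inline CSV and its column names are made up for illustration and are not the book's `students.csv`.

```r
library(readr)

# Hypothetical three-row file: one real value, one "N/A" string, one empty field.
csv_text <- I("student_id,food\n1,Pizza\n2,N/A\n3,\n")

# Treat both the string "N/A" and the empty string as missing values.
students_demo <- read_csv(csv_text, na = c("N/A", ""), show_col_types = FALSE)

students_demo
```
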
diff --git a/data-tidy.qmd b/data-tidy.qmd
@@ -397,7 +397,7 @@ household
 ```
 
 This dataset contains data about five families, with the names and dates of birth of up to two children.
-The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name)` and the values of another (`child,` with values 1 or 2).
+The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name`) and the values of another (`child`, with values 1 or 2).
 To solve this problem we again need to supply a vector to `names_to` but this time we use the special `".value"` sentinel; this isn't the name of a variable but a unique value that tells `pivot_longer()` to do something different.
 This overrides the usual `values_to` argument to use the first component of the pivoted column name as a variable name in the output.
 
@@ -456,7 +456,7 @@ cms_patient_experience |>
 Neither of these columns will make particularly great variable names: `measure_cd` doesn't hint at the meaning of the variable and `measure_title` is a long sentence containing spaces.
 We'll use `measure_cd` as the source for our new column names for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.
 
-`pivot_wider()` has the opposite interface to `pivot_longer()`: instead of choosing new column names, we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from)`:
+`pivot_wider()` has the opposite interface to `pivot_longer()`: instead of choosing new column names, we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from`):
 
 ```{r}
 cms_patient_experience |>
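
Reviewer note on the `".value"` paragraph in the data-tidy.qmd hunk above: a minimal sketch of how the sentinel splits column names such as `dob_child1` into a variable name (`dob`) and a value (`child1`). The two-family tibble `household_demo` is a hypothetical stand-in for the book's `household` data, and `names_sep = "_"` assumes the underscore-separated names described in the text.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for `household`: column names combine a variable
# name (dob, name) with a child identifier (child1, child2).
household_demo <- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~name_child1, ~name_child2,
  1,       "1998-11-26", "2000-01-29", "Susan",      "Jose",
  2,       "1996-06-22", NA,           "Mark",       NA
)

# ".value" sends the first name component to output column names (dob, name);
# the second component becomes the values of the new `child` column.
long <- household_demo |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )

long
```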