diff --git a/.nojekyll b/.nojekyll index 5341b5df3..82c2e4066 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -3956ceaa \ No newline at end of file +7604c4b1 \ No newline at end of file diff --git a/EDA.html b/EDA.html new file mode 100644 index 000000000..4215b2053 --- /dev/null +++ b/EDA.html @@ -0,0 +1,1114 @@ R para Ciência de Dados (2ª edição) - 10  Exploratory data analysis
10  Exploratory data analysis
+10.1 Introduction

+

This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:

+
  1. Generate questions about your data.

  2. Search for answers by visualizing, transforming, and modelling your data.

  3. Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive insights that you’ll eventually write up and communicate to others.

+

EDA is an important part of any data analysis, even if the primary research questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.

+

+10.1.1 Prerequisites

+

In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.

library(tidyverse)

+10.2 Questions

+
+

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

+
+
+

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

+
+

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

+

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.

+

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

+
  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

The rest of this chapter will look at these two questions. We’ll explain what variation and covariation are, and we’ll show you several ways to answer each question.

+

+10.3 Variation

+

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values, which you’ve learned about in Chapter 1.

+

We’ll start our exploration by visualizing the distribution of weights (carat) of ~54,000 diamonds from the diamonds dataset. Since carat is a numerical variable, we can use a histogram:

+
+
ggplot(diamonds, aes(x = carat)) +
+  geom_histogram(binwidth = 0.5)
+
+

A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail.

+
+
+

Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).

+

+10.3.1 Typical values

+

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

+
  • Which values are the most common? Why?

  • Which values are rare? Why? Does that match your expectations?

  • Can you see any unusual patterns? What might explain them?

Let’s take a look at the distribution of carat for smaller diamonds.

+
+
smaller <- diamonds |> 
+  filter(carat < 3)
+
+ggplot(smaller, aes(x = carat)) +
+  geom_histogram(binwidth = 0.01)
+
+

A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak.

+
+
+

This histogram suggests several interesting questions:

+
  • Why are there more diamonds at whole carats and common fractions of carats?

  • Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?

Visualizations can also reveal clusters, which suggest that subgroups exist in your data. To understand the subgroups, ask:

+
  • How are the observations within each subgroup similar to each other?

  • How are the observations in separate clusters different from each other?

  • How can you explain or describe the clusters?

  • Why might the appearance of clusters be misleading?

Some of these questions can be answered with the data while some will require domain expertise about the data. Many of them will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable. We’ll get to that shortly.

+

+10.3.2 Unusual values

+

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors, sometimes they are simply values at the extremes that happened to be observed in this data collection, and other times they suggest important new discoveries. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.

+
+
ggplot(diamonds, aes(x = y)) + 
+  geom_histogram(binwidth = 0.5)
+
+

A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak.

+
+
+

There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian():

+
+
ggplot(diamonds, aes(x = y)) + 
+  geom_histogram(binwidth = 0.5) +
+  coord_cartesian(ylim = c(0, 50))
+
+

A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1.

+
+
+

coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.
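
For example, here’s a minimal sketch of the difference (the limits here are hypothetical, chosen just for illustration): coord_cartesian() zooms the view while keeping every row, whereas xlim() filters the data first, so the histogram itself is recomputed from fewer rows:

# Zoom only: all rows still feed the binning; the bars are unchanged
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(xlim = c(0, 10))

# Filter first: rows with y outside [0, 10] are dropped (with a warning),
# so the remaining bars are recomputed from less data
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  xlim(0, 10)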

+

This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:

+
+
unusual <- diamonds |> 
+  filter(y < 3 | y > 20) |> 
+  select(price, x, y, z) |>
+  arrange(y)
+unusual
+#> # A tibble: 9 × 4
+#>   price     x     y     z
+#>   <int> <dbl> <dbl> <dbl>
+#> 1  5139  0      0    0   
+#> 2  6381  0      0    0   
+#> 3 12800  0      0    0   
+#> 4 15686  0      0    0   
+#> 5 18034  0      0    0   
+#> 6  2130  0      0    0   
+#> 7  2130  0      0    0   
+#> 8  2075  5.15  31.8  5.12
+#> 9 12210  8.09  58.9  8.06
+
+

The y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. By doing EDA, we have discovered missing data that was coded as 0, which we never would have found by simply searching for NAs. Going forward we might choose to re-code these values as NAs in order to prevent misleading calculations. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!

+

It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g., a data entry error) and disclose that you removed them in your write-up.
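
As a minimal sketch of that practice, you could compare a few summaries computed with and without the nine unusual values, inverting the filter we used to pluck them out:

diamonds |> 
  summarize(mean_y = mean(y), sd_y = sd(y), max_y = max(y))

# Same summaries after excluding the unusual values
diamonds |> 
  filter(!(y < 3 | y > 20)) |> 
  summarize(mean_y = mean(y), sd_y = sd(y), max_y = max(y))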

+

+10.3.3 Exercises

+
  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

  2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

  3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

  4. Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

+10.4 Unusual values

+

If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.

+
  1. Drop the entire row with the strange values:

     diamonds2 <- diamonds |> 
       filter(between(y, 3, 20))

     We don’t recommend this option because one invalid value doesn’t imply that all the other values for that observation are also invalid. Additionally, if you have low quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!

  2. Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the if_else() function to replace unusual values with NA:

     diamonds2 <- diamonds |> 
       mutate(y = if_else(y < 3 | y > 20, NA, y))

It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:

+
+
ggplot(diamonds2, aes(x = x, y = y)) + 
+  geom_point()
+#> Warning: Removed 9 rows containing missing values (`geom_point()`).
+
+

A scatterplot of widths vs. lengths of diamonds. There is a strong, linear association between the two variables. All but one of the diamonds has length greater than 3. The one outlier has a length of 0 and a width of about 6.5.

+
+
+

To suppress that warning, set na.rm = TRUE:

+
+
ggplot(diamonds2, aes(x = x, y = y)) + 
+  geom_point(na.rm = TRUE)
+
+

Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights¹, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable, using is.na() to check if dep_time is missing.

+
+
nycflights13::flights |> 
+  mutate(
+    cancelled = is.na(dep_time),
+    sched_hour = sched_dep_time %/% 100,
+    sched_min = sched_dep_time %% 100,
+    sched_dep_time = sched_hour + (sched_min / 60)
+  ) |> 
+  ggplot(aes(x = sched_dep_time)) + 
+  geom_freqpoly(aes(color = cancelled), binwidth = 1/4)
+
+

A frequency polygon of scheduled departure times of flights. Two lines represent flights that are cancelled and not cancelled. The x-axis, showing the hour of scheduled departure, ranges from 0 to 25, and the y-axis ranges from 0 to 10000. The number of flights not cancelled is much higher than those cancelled.

+
+
+

However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.

+

+10.4.1 Exercises

+
  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?

  2. What does na.rm = TRUE do in mean() and sum()?

  3. Recreate the frequency plot of sched_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.

+10.5 Covariation

+

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.

+

+10.5.1 A categorical and a numerical variable

+

For example, let’s explore how the price of a diamond varies with its quality (measured by cut) using geom_freqpoly():

+
+
ggplot(diamonds, aes(x = price)) + 
+  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
+
+

A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500.

+
+
+

Note that ggplot2 uses an ordered color scale for cut because it’s defined as an ordered factor variable in the data. You’ll learn more about these in Section 16.6.
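
If you want to check this for yourself, a quick sketch in base R confirms that cut is stored as an ordered factor with the quality levels in increasing order:

# cut is an ordered factor, so its levels have a defined order
class(diamonds$cut)
#> [1] "ordered" "factor"
levels(diamonds$cut)
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"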

+

The default appearance of geom_freqpoly() is not that useful here because the height, determined by the overall count, differs so much across cuts, making it hard to see the differences in the shapes of their distributions.

+

To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.

+
+
ggplot(diamonds, aes(x = price, y = after_stat(density))) + 
+  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
+
+

A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others.

+
+
+

Note that we’re mapping the density to y, but since density is not a variable in the diamonds dataset, we need to first calculate it. We use the after_stat() function to do so.

+

There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot.

+

A visually simpler plot for exploring this relationship is using side-by-side boxplots.

+
+
ggplot(diamonds, aes(x = cut, y = price)) +
+  geom_boxplot()
+
+

Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest.

+
+
+

We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are typically cheaper! In the exercises, you’ll be challenged to figure out why.

+

cut is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with fct_reorder(). You’ll learn more about that function in Section 16.4, but we want to give you a quick preview here because it’s so useful. For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:

+
+
ggplot(mpg, aes(x = class, y = hwy)) +
+  geom_boxplot()
+
+

Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv).

+
+
+

To make the trend easier to see, we can reorder class based on the median value of hwy:

+
+
ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +
+  geom_boxplot()
+
+

Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize).

+
+
+

If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.

+
+
ggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) +
+  geom_boxplot()
+
+

Side-by-side boxplots of highway mileages of cars by class. Classes are on the y-axis and ordered by increasing median highway mileage.

+
+
+

+10.5.1.1 Exercises

+
  1. Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.

  2. Based on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

  3. Instead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?

  4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?

  5. Create a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?

  6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

+10.5.2 Two categorical variables

+

To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in geom_count():

+
+
ggplot(diamonds, aes(x = cut, y = color)) +
+  geom_count()
+
+

A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000.

+
+
+

The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.

+

Another approach for exploring the relationship between these variables is computing the counts with dplyr:

+
+
diamonds |> 
+  count(color, cut)
+#> # A tibble: 35 × 3
+#>   color cut           n
+#>   <ord> <ord>     <int>
+#> 1 D     Fair        163
+#> 2 D     Good        662
+#> 3 D     Very Good  1513
+#> 4 D     Premium    1603
+#> 5 D     Ideal      2834
+#> 6 E     Fair        224
+#> # ℹ 29 more rows
+
+

Then visualize with geom_tile() and the fill aesthetic:

+
+
diamonds |> 
+  count(color, cut) |>  
+  ggplot(aes(x = color, y = cut)) +
+  geom_tile(aes(fill = n))
+
+

A tile plot of cut vs. color of diamonds. Each tile represents a cut/color combination and tiles are colored according to the number of observations in each tile. There are more Ideal diamonds than other cuts, with the highest number being Ideal diamonds with color G. Fair diamonds and diamonds with color I are the lowest in frequency.

+
+
+

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.
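
If you want to try the interactive option, here’s a hypothetical sketch with heatmaply (assuming the package is installed; it isn’t loaded elsewhere in this chapter). By default it also clusters and reorders the rows and columns, which is one simple way to surface patterns:

library(heatmaply)

counts <- diamonds |> 
  count(color, cut) |> 
  tidyr::pivot_wider(names_from = cut, values_from = n)

# Convert to a plain numeric matrix with color as the row names
m <- as.matrix(counts[, -1])
rownames(m) <- as.character(counts$color)

heatmaply(m)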

+

+10.5.2.1 Exercises

+
  1. How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?

  2. What different data insights do you get with a segmented bar chart if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic? Calculate the counts that fall into each of the segments.

  3. Use geom_tile() together with dplyr to explore how average flight departure delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

+10.5.3 Two numerical variables

+

You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. For example, you can see a positive relationship between the carat size and price of a diamond: diamonds with more carats have a higher price. The relationship is exponential.

+
+
ggplot(smaller, aes(x = carat, y = price)) +
+  geom_point()
+
+

A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential.

+
+
+

(In this section we’ll use the smaller dataset to stay focused on the bulk of the diamonds that are smaller than 3 carats.)

+

Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black, making it hard to judge differences in the density of the data across the 2-dimensional space as well as making it hard to spot the trend. You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency.

+
+
ggplot(smaller, aes(x = carat, y = price)) + 
+  geom_point(alpha = 1 / 100)
+
+

A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than in other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats.

+
+
+

But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.

+

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to install the hexbin package to use geom_hex().

+
+
ggplot(smaller, aes(x = carat, y = price)) +
+  geom_bin2d()
+
+# install.packages("hexbin")
+ggplot(smaller, aes(x = carat, y = price)) +
+  geom_hex()
+
+
+
+

Plot 1: A binned density plot of price vs. carat. Plot 2: A hexagonal bin plot of price vs. carat. Both plots show that the highest density of diamonds have low carats and low prices.

+
+
+
+
+

Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then for each group, display a boxplot:

+
+
ggplot(smaller, aes(x = carat, y = price)) + 
+  geom_boxplot(aes(group = cut_width(carat, 0.1)))
+
+

Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end.

+
+
+

cut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.
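
For example, a minimal sketch reusing the plot above:

ggplot(smaller, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)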

+

+10.5.3.1 Exercises

+
  1. Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs. cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

  2. Visualize the distribution of carat, partitioned by price.

  3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

  4. Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

  5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the following plot have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately. Why is a scatterplot a better display than a binned plot for this case?

     diamonds |> 
       filter(x >= 4) |> 
       ggplot(aes(x = x, y = y)) +
       geom_point() +
       coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

  6. Instead of creating boxes of equal width with cut_width(), we could create boxes that contain roughly equal number of points with cut_number(). What are the advantages and disadvantages of this approach?

     ggplot(smaller, aes(x = carat, y = price)) + 
       geom_boxplot(aes(group = cut_number(carat, 20)))

+10.6 Patterns and models

+

If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

+
  • Could this pattern be due to coincidence (i.e. random chance)?

  • How can you describe the relationship implied by the pattern?

  • How strong is the relationship implied by the pattern?

  • What other variables might affect the relationship?

  • Does the relationship change if you look at individual subgroups of the data?

Patterns in your data provide clues about relationships, i.e., they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
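
For example, a one-line sketch of the prediction idea using base R’s cor():

# The correlation between carat and price is strongly positive
# (roughly 0.92), so carat is very informative about price
cor(diamonds$carat, diamonds$price)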

+

Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of price and carat, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back in the scale of raw prices.

+
+
library(tidymodels)
+
+diamonds <- diamonds |>
+  mutate(
+    log_price = log(price),
+    log_carat = log(carat)
+  )
+
+diamonds_fit <- linear_reg() |>
+  fit(log_price ~ log_carat, data = diamonds)
+
+diamonds_aug <- augment(diamonds_fit, new_data = diamonds) |>
+  mutate(.resid = exp(.resid))
+
+ggplot(diamonds_aug, aes(x = carat, y = .resid)) + 
+  geom_point()
+
+

A scatterplot of residuals vs. carat of diamonds. The x-axis ranges from 0 to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered around low values of carat and residuals. There is a clear, curved pattern showing decrease in residuals as carat increases.

+
+
+

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

+
+
ggplot(diamonds_aug, aes(x = cut, y = .resid)) + 
+  geom_boxplot()
+
+

Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end.

+
+
+

We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.

+

+10.7 Summary

+

In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.

+

In the next chapter, we’ll focus on the tools we can use to communicate our results.

+ + +

+
  1. Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function() or package::dataset.↩︎
+ + + + \ No newline at end of file diff --git a/EDA_files/figure-html/unnamed-chunk-12-1.png b/EDA_files/figure-html/unnamed-chunk-12-1.png new file mode 100644 index 000000000..5974dace7 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-14-1.png b/EDA_files/figure-html/unnamed-chunk-14-1.png new file mode 100644 index 000000000..e6225bc63 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-15-1.png b/EDA_files/figure-html/unnamed-chunk-15-1.png new file mode 100644 index 000000000..76ebee1c7 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-15-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-16-1.png b/EDA_files/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 000000000..aa0625dbe Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-17-1.png b/EDA_files/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 000000000..564faa2a1 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-18-1.png b/EDA_files/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 000000000..2ab9501df Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-19-1.png b/EDA_files/figure-html/unnamed-chunk-19-1.png new file mode 100644 index 000000000..5bbd4093f Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-19-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-20-1.png b/EDA_files/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 000000000..0b6478e03 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-21-1.png b/EDA_files/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 000000000..29c5c3ae6 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-21-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-23-1.png b/EDA_files/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 000000000..3fde7f791 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-23-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-24-1.png b/EDA_files/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 000000000..5247a573f Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-24-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-25-1.png b/EDA_files/figure-html/unnamed-chunk-25-1.png new file mode 100644 index 000000000..2c59e3086 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-25-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-26-1.png b/EDA_files/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 000000000..03d9f3377 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-26-2.png b/EDA_files/figure-html/unnamed-chunk-26-2.png new file mode 100644 index 000000000..5573e7b79 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-26-2.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-27-1.png b/EDA_files/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 000000000..fb8be3970 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-27-1.png differ diff 
--git a/EDA_files/figure-html/unnamed-chunk-3-1.png b/EDA_files/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 000000000..a331afc44 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-30-1.png b/EDA_files/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 000000000..3066c713a Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-30-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-31-1.png b/EDA_files/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 000000000..459e15bde Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-31-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-4-1.png b/EDA_files/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 000000000..7b2740a6f Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-5-1.png b/EDA_files/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 000000000..2e9099dd0 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/EDA_files/figure-html/unnamed-chunk-6-1.png b/EDA_files/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 000000000..ee699a9c5 Binary files /dev/null and b/EDA_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/arrow.html b/arrow.html new file mode 100644 index 000000000..0516141b4 --- /dev/null +++ b/arrow.html @@ -0,0 +1,955 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 22  Arrow + + + + + + + + + + + + + + + + + + + + + + + + +
22  Arrow

+22.1 Introduction

+

CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the parquet format, an open standards-based format widely used by big data systems.

+

We’ll pair parquet files with Apache Arrow, a multi-language toolbox designed for efficient analysis and transport of large datasets. We’ll use Apache Arrow via the arrow package, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.

+

Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works the best for you.

+

(A big thanks to Danielle Navarro who contributed the initial version of this chapter.)

+

+22.1.1 Prerequisites

+

In this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.

library(tidyverse)
library(arrow)

Later in the chapter, we’ll also see some connections between arrow and duckdb, so we’ll also need dbplyr and duckdb.

+
+
library(dbplyr, warn.conflicts = FALSE)
+library(duckdb)
+#> Loading required package: DBI
+
+

+22.2 Getting the data

+

We begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6. This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.

+

The following code will get you a cached copy of the data. The data is a 9GB CSV file, so it will take some time to download. I highly recommend using curl::multi_download() to get very large files as it’s built for exactly this purpose: it gives you a progress bar and it can resume the download if it’s interrupted.

+
+
dir.create("data", showWarnings = FALSE)
+
+curl::multi_download(
+  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
+  "data/seattle-library-checkouts.csv",
+  resume = TRUE
+)
+#> # A tibble: 1 × 10
+#>   success status_code resumefrom url                    destfile        error
+#>   <lgl>         <int>      <dbl> <chr>                  <chr>           <chr>
+#> 1 TRUE            200          0 https://r4ds.s3.us-we… data/seattle-l… <NA> 
+#> # ℹ 4 more variables: type <chr>, modified <dttm>, time <dbl>,
+#> #   headers <list>
+
+

+22.3 Opening a dataset

+

Let’s start by taking a look at the data. At 9 GB, this file is large enough that we probably don’t want to load the whole thing into memory. A good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB. This means we want to avoid read_csv() and instead use arrow::open_dataset():

+
+
seattle_csv <- open_dataset(
+  sources = "data/seattle-library-checkouts.csv", 
+  col_types = schema(ISBN = string()),
+  format = "csv"
+)
+
+

What happens when this code is run? open_dataset() will scan a few thousand rows to figure out the structure of the dataset. The ISBN column contains blank values for the first 80,000 rows, so we have to specify the column type to help arrow work out the data structure. Once the data has been scanned by open_dataset(), it records what it’s found and stops; it will only read further rows as you specifically request them. This metadata is what we see if we print seattle_csv:

+
+
seattle_csv
+#> FileSystemDataset with 1 csv file
+#> UsageClass: string
+#> CheckoutType: string
+#> MaterialType: string
+#> CheckoutYear: int64
+#> CheckoutMonth: int64
+#> Checkouts: int64
+#> Title: string
+#> ISBN: string
+#> Creator: string
+#> Subjects: string
+#> Publisher: string
+#> PublicationYear: string
+
+

The first line in the output tells you that seattle_csv is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed. The remainder of the output tells you the column type that arrow has imputed for each column.

+

We can see what’s actually in it with glimpse(). This reveals that there are ~41 million rows and 12 columns, and shows us a few values.

+
+
seattle_csv |> glimpse()
+#> FileSystemDataset with 1 csv file
+#> 41,389,465 rows x 12 columns
+#> $ UsageClass      <string> "Physical", "Physical", "Digital", "Physical", "Ph…
+#> $ CheckoutType    <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
+#> $ MaterialType    <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
+#> $ CheckoutYear     <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
+#> $ CheckoutMonth    <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
+#> $ Checkouts        <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
+#> $ Title           <string> "Super rich : a guide to having it all / Russell S…
+#> $ ISBN            <string> "", "", "", "", "", "", "", "", "", "", "", "", ""…
+#> $ Creator         <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim …
+#> $ Subjects        <string> "Self realization, Conduct of life, Attitude Psych…
+#> $ Publisher       <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
+#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…
+
+

We can start to use this dataset with dplyr verbs, using collect() to force arrow to perform the computation and return some data. For example, this code tells us the total number of checkouts per year:

+
+
seattle_csv |> 
+  group_by(CheckoutYear) |> 
+  summarise(Checkouts = sum(Checkouts)) |> 
+  arrange(CheckoutYear) |> 
+  collect()
+#> # A tibble: 18 × 2
+#>   CheckoutYear Checkouts
+#>          <int>     <int>
+#> 1         2005   3798685
+#> 2         2006   6599318
+#> 3         2007   7126627
+#> 4         2008   8438486
+#> 5         2009   9135167
+#> 6         2010   8608966
+#> # ℹ 12 more rows
+
+

Thanks to arrow, this code will work regardless of how large the underlying dataset is. But it’s currently rather slow: on Hadley’s computer, it took ~10s to run. That’s not terrible given how much data we have, but we can make it much faster by switching to a better format.

+

+22.4 The parquet format

+

To make this data easier to work with, let’s switch to the parquet file format and split it up into multiple files. The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.

+

+22.4.1 Advantages of parquet

+

Like CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it’s a custom binary format designed specifically for the needs of big data. This means that:

+
  • Parquet files are usually smaller than the equivalent CSV file. Parquet relies on efficient encodings to keep file size down, and supports file compression. This helps make parquet files fast because there’s less data to move from disk to memory. (See the size-comparison sketch after this list.)

  • Parquet files have a rich type system. As we talked about in Section 7.3, a CSV file does not provide any information about column types. For example, a CSV reader has to guess whether "08-10-2022" should be parsed as a string or a date. In contrast, parquet files store data in a way that records the type along with the data.

  • Parquet files are “column-oriented”. This means that they’re organized column-by-column, much like R’s data frame. This typically leads to better performance for data analysis tasks compared to CSV files, which are organized row-by-row.

  • Parquet files are “chunked”, which makes it possible to work on different parts of the file at the same time, and, if you’re lucky, to skip some chunks altogether.

There’s one primary disadvantage to parquet files: they are no longer “human readable”, i.e. if you look at a parquet file using readr::read_file(), you’ll just see a bunch of gibberish.

+

+22.4.2 Partitioning

+

As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it’s often useful to split large datasets across many files. When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.

+

There are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data. You’re likely to need to do some experimentation before you find the ideal partitioning for your situation. As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files. You should also try to partition by variables that you filter by; as you’ll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.
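
As a preview of the rewriting step, note that write_dataset() can also take the partition variables directly via its partitioning argument, an alternative to the group_by() approach shown in the next section. A hypothetical sketch, with a made-up output path:

seattle_csv |> 
  arrow::write_dataset(
    path = "data/seattle-partitioned",  # hypothetical output directory
    format = "parquet",
    partitioning = "CheckoutYear"
  )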

+

+22.4.3 Rewriting the Seattle library data

+

Let’s apply these ideas to the Seattle library data to see how they play out in practice. We’re going to partition by CheckoutYear, since it’s likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.

+

To rewrite the data we define the partition using dplyr::group_by() and then save the partitions to a directory with arrow::write_dataset(). write_dataset() has two important arguments: a directory where we’ll create the files and the format we’ll use.

+
+
pq_path <- "data/seattle-library-checkouts"
+
+
+
seattle_csv |>
+  group_by(CheckoutYear) |>
+  write_dataset(path = pq_path, format = "parquet")
+
+

This takes about a minute to run; as we’ll see shortly this is an initial investment that pays off by making future operations much much faster.

+

Let’s take a look at what we just produced:

+
+
tibble(
+  files = list.files(pq_path, recursive = TRUE),
+  size_MB = file.size(file.path(pq_path, files)) / 1024^2
+)
+#> # A tibble: 18 × 2
+#>   files                            size_MB
+#>   <chr>                              <dbl>
+#> 1 CheckoutYear=2005/part-0.parquet    109.
+#> 2 CheckoutYear=2006/part-0.parquet    164.
+#> 3 CheckoutYear=2007/part-0.parquet    178.
+#> 4 CheckoutYear=2008/part-0.parquet    195.
+#> 5 CheckoutYear=2009/part-0.parquet    214.
+#> 6 CheckoutYear=2010/part-0.parquet    222.
+#> # ℹ 12 more rows
+
+

Our single 9GB CSV file has been rewritten into 18 parquet files. The file names use a “self-describing” convention used by the Apache Hive project. Hive-style partitions name folders with a “key=value” convention, so as you might guess, the CheckoutYear=2005 directory contains all the data where CheckoutYear is 2005. Each file is between 100 and 300 MB and the total size is now around 4 GB, a little under half the size of the original CSV file. This is as we expect since parquet is a much more efficient format.

+

+22.5 Using dplyr with arrow

+

Now we’ve created these parquet files, we’ll need to read them in again. We use open_dataset() again, but this time we give it a directory:

+
+
seattle_pq <- open_dataset(pq_path)
+
+

Now we can write our dplyr pipeline. For example, we could count the total number of books checked out in each month for the last five years:

+
+
query <- seattle_pq |> 
+  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
+  group_by(CheckoutYear, CheckoutMonth) |>
+  summarize(TotalCheckouts = sum(Checkouts)) |>
+  arrange(CheckoutYear, CheckoutMonth)
+
+

Writing dplyr code for arrow data is conceptually similar to dbplyr, covered in Chapter 21: you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call collect(). If we print out the query object we can see a little information about what we expect Arrow to return when the execution takes place:

+
+
query
+#> FileSystemDataset (query)
+#> CheckoutYear: int32
+#> CheckoutMonth: int64
+#> TotalCheckouts: int64
+#> 
+#> * Grouped by CheckoutYear
+#> * Sorted by CheckoutYear [asc], CheckoutMonth [asc]
+#> See $.data for the source Arrow object
+
+

And we can get the results by calling collect():

+
+
query |> collect()
+#> # A tibble: 58 × 3
+#> # Groups:   CheckoutYear [5]
+#>   CheckoutYear CheckoutMonth TotalCheckouts
+#>          <int>         <int>          <int>
+#> 1         2018             1         355101
+#> 2         2018             2         309813
+#> 3         2018             3         344487
+#> 4         2018             4         330988
+#> 5         2018             5         318049
+#> 6         2018             6         341825
+#> # ℹ 52 more rows
+
+

Like dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would. However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in ?acero.

+

+22.5.1 Performance

+

Let’s take a quick look at the performance impact of switching from CSV to parquet. First, let’s time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large csv:

+
+
seattle_csv |> 
+  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
+  group_by(CheckoutMonth) |>
+  summarize(TotalCheckouts = sum(Checkouts)) |>
+  arrange(desc(CheckoutMonth)) |>
+  collect() |> 
+  system.time()
+#>    user  system elapsed 
+#>  11.951   1.297  11.387
+
+

Now let’s use our new version of the dataset in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:

+
+
seattle_pq |> 
+  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
+  group_by(CheckoutMonth) |>
+  summarize(TotalCheckouts = sum(Checkouts)) |>
+  arrange(desc(CheckoutMonth)) |>
+  collect() |> 
+  system.time()
+#>    user  system elapsed 
+#>   0.263   0.058   0.063
+
+

The ~100x speedup in performance is attributable to two factors: the multi-file partitioning, and the format of individual files:

+
  • Partitioning improves performance because this query uses CheckoutYear == 2021 to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.

  • The parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata means that arrow only needs to read the four columns actually used in the query (CheckoutYear, MaterialType, CheckoutMonth, and Checkouts).

This massive difference in performance is why it pays off to convert large CSVs to parquet!

+

+22.5.2 Using duckdb with arrow

+

There’s one last advantage of parquet and arrow: it’s very easy to turn an arrow dataset into a DuckDB database (Chapter 21) by calling arrow::to_duckdb():

+
+
seattle_pq |> 
+  to_duckdb() |>
+  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
+  group_by(CheckoutYear) |>
+  summarize(TotalCheckouts = sum(Checkouts)) |>
+  arrange(desc(CheckoutYear)) |>
+  collect()
+#> Warning: Missing values are always removed in SQL aggregation functions.
+#> Use `na.rm = TRUE` to silence this warning
+#> This warning is displayed once every 8 hours.
+#> # A tibble: 5 × 2
+#>   CheckoutYear TotalCheckouts
+#>          <int>          <dbl>
+#> 1         2022        2431502
+#> 2         2021        2266438
+#> 3         2020        1241999
+#> 4         2019        3931688
+#> 5         2018        3987569
+
+

The neat thing about to_duckdb() is that the transfer doesn’t involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.

+

+22.5.3 Exercises

+
  1. Figure out the most popular book each year.

  2. Which author has the most books in the Seattle library system?

  3. How have checkouts of books vs. ebooks changed over the last 10 years?

+22.6 Summary

+

In this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. It can work with CSV files, and it’s much much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.

+

Next up you’ll learn about your first non-rectangular data source, which you’ll handle using tools provided by the tidyr package. We’ll focus on data that comes from JSON files, but the general principles apply to tree-like data regardless of its source.

27  A field guide to base R


27.1 Introduction

To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code you’ll encounter in the wild.


This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, increasing the consistency across functions, and making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.


After you read this book, you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll undoubtedly encounter these other approaches when you start reading R code written by others, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!


In this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two essential plotting functions.


27.1.1 Prerequisites

This chapter focuses on base R so it doesn’t have any real prerequisites, but we’ll load the tidyverse in order to explain some of the differences.
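
A minimal setup chunk (the chapter states it loads only the tidyverse):

library(tidyverse)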


27.2 Selecting multiple elements with [

[ is used to extract sub-components from vectors and data frames, and is called like x[i] or x[i, j]. In this section, we’ll introduce you to the power of [, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of [.


27.2.1 Subsetting vectors

There are five main types of things that you can subset a vector with, i.e., that can be the i in x[i]:

  1. A vector of positive integers. Subsetting with positive integers keeps the elements at those positions:

    x <- c("one", "two", "three", "four", "five")
    x[c(3, 2, 5)]
    #> [1] "three" "two"   "five"

    By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.

    x[c(1, 1, 5, 5, 5, 2)]
    #> [1] "one"  "one"  "five" "five" "five" "two"
  2. A vector of negative integers. Negative values drop the elements at the specified positions:

    x[c(-1, -3, -5)]
    #> [1] "two"  "four"
  3. A logical vector. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.

    x <- c(10, 3, NA, 5, 8, 1, NA)
    
    # All non-missing values of x
    x[!is.na(x)]
    #> [1] 10  3  5  8  1
    
    # All even (or missing!) values of x
    x[x %% 2 == 0]
    #> [1] 10 NA  8 NA

    Unlike filter(), NA indices will be included in the output as NAs.

  4. A character vector. If you have a named vector, you can subset it with a character vector:

    x <- c(abc = 1, def = 2, xyz = 5)
    x[c("xyz", "def")]
    #> xyz def 
    #>   5   2

    As with subsetting with positive integers, you can use a character vector to duplicate individual entries.

  5. Nothing. The final type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but as we’ll see shortly, it is useful when subsetting 2d structures like tibbles.

27.2.2 Subsetting data frames

There are quite a few different ways¹ that you can use [ with a data frame, but the most important way is to select rows and columns independently with df[rows, cols]. Here rows and cols are vectors as described above. For example, df[rows, ] and df[, cols] select just rows or just columns, using the empty subset to preserve the other dimension.


Here are a couple of examples:

df <- tibble(
  x = 1:3, 
  y = c("a", "e", "f"), 
  z = runif(3)
)

# Select first row and second column
df[1, 2]
#> # A tibble: 1 × 1
#>   y    
#>   <chr>
#> 1 a

# Select all rows and columns x and y
df[, c("x", "y")]
#> # A tibble: 3 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 e    
#> 3     3 f

# Select rows where `x` is greater than 1 and all columns
df[df$x > 1, ]
#> # A tibble: 2 × 3
#>       x y         z
#>   <int> <chr> <dbl>
#> 1     2 e     0.834
#> 2     3 f     0.601

We’ll come back to $ shortly, but you should be able to guess what df$x does from the context: it extracts the x variable from df. We need to use it here because [ doesn’t use tidy evaluation, so you need to be explicit about the source of the x variable.


There’s an important difference between tibbles and data frames when it comes to [. In this book, we’ve mainly used tibbles, which are data frames, but they tweak some behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write data.frame. If df is a data.frame, then df[, cols] will return a vector if col selects a single column and a data frame if it selects more than one column. If df is a tibble, then [ will always return a tibble.

df1 <- data.frame(x = 1:3)
df1[, "x"]
#> [1] 1 2 3

df2 <- tibble(x = 1:3)
df2[, "x"]
#> # A tibble: 3 × 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3

One way to avoid this ambiguity with data.frames is to explicitly specify drop = FALSE:

df1[, "x", drop = FALSE]
#>   x
#> 1 1
#> 2 2
#> 3 3

27.2.3 dplyr equivalents

Several dplyr verbs are special cases of [:

  • filter() is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:

    df <- tibble(
      x = c(2, 3, 1, 1, NA), 
      y = letters[1:5], 
      z = runif(5)
    )
    df |> filter(x > 1)
    
    # same as
    df[!is.na(df$x) & df$x > 1, ]

    Another common technique in the wild is to use which() for its side-effect of dropping missing values: df[which(df$x > 1), ].

  • arrange() is equivalent to subsetting the rows with an integer vector, usually created with order():

    df |> arrange(x, y)
    
    # same as
    df[order(df$x, df$y), ]

    You can use order(decreasing = TRUE) to sort all columns in descending order, or -rank(col) to sort individual columns in descending order.

  • Both select() and relocate() are similar to subsetting the columns with a character vector:

    df |> select(x, z)
    
    # same as
    df[, c("x", "z")]

Base R also provides a function that combines the features of filter() and select()² called subset():

df |> 
  filter(x > 1) |> 
  select(y, z)
#> # A tibble: 2 × 2
#>   y           z
#>   <chr>   <dbl>
#> 1 a     0.157  
#> 2 b     0.00740

# same as
df |> subset(x > 1, c(y, z))

This function was the inspiration for much of dplyr’s syntax.

27.2.4 Exercises

  1. Create functions that take a vector as input and return:

      1. The elements at even-numbered positions.
      2. Every element except the last value.
      3. Only even values (and no missing values).

  2. Why is x[-which(x > 0)] not the same as x[x <= 0]? Read the documentation for which() and do some experiments to figure it out.

27.3 Selecting a single element with $ and [[

[, which selects many elements, is paired with [[ and $, which extract a single element. In this section, we’ll show you how to use [[ and $ to pull columns out of data frames, discuss a couple more differences between data.frames and tibbles, and emphasize some important differences between [ and [[ when used with lists.


27.3.1 Data frames

[[ and $ can be used to extract columns out of a data frame. [[ can access by position or by name, and $ is specialized for access by name:

tb <- tibble(
  x = 1:4,
  y = c(10, 4, 1, 21)
)

# by position
tb[[1]]
#> [1] 1 2 3 4

# by name
tb[["x"]]
#> [1] 1 2 3 4
tb$x
#> [1] 1 2 3 4

They can also be used to create new columns, the base R equivalent of mutate():

tb$z <- tb$x + tb$y
tb
#> # A tibble: 4 × 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1    10    11
#> 2     2     4     6
#> 3     3     1     4
#> 4     4    21    25

There are several other base R approaches to creating new columns including with transform(), with(), and within(). Hadley collected a few examples at https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf.


Using $ directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of cut, there’s no need to use summarize():

max(diamonds$carat)
#> [1] 5.01

levels(diamonds$cut)
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

dplyr also provides an equivalent to [[/$ that we didn’t mention in Chapter 3: pull(). pull() takes either a variable name or variable position and returns just that column. That means we could rewrite the above code to use the pipe:

diamonds |> pull(carat) |> max()
#> [1] 5.01

diamonds |> pull(cut) |> levels()
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

27.3.2 Tibbles

There are a couple of important differences between tibbles and base data.frames when it comes to $. Data frames match the prefix of any variable names (so-called partial matching) and don’t complain if a column doesn’t exist:

df <- data.frame(x1 = 1)
df$x
#> [1] 1
df$z
#> NULL

Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:

tb <- tibble(x1 = 1)

tb$x
#> Warning: Unknown or uninitialised column: `x`.
#> NULL
tb$z
#> Warning: Unknown or uninitialised column: `z`.
#> NULL

For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.


27.3.3 Lists

[[ and $ are also really important for working with lists, and it’s important to understand how they differ from [. Let’s illustrate the differences with a list named l:

l <- list(
  a = 1:3, 
  b = "a string", 
  c = pi, 
  d = list(-1, -5)
)
  • [ extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.

    str(l[1:2])
    #> List of 2
    #>  $ a: int [1:3] 1 2 3
    #>  $ b: chr "a string"
    
    str(l[1])
    #> List of 1
    #>  $ a: int [1:3] 1 2 3
    
    str(l[4])
    #> List of 1
    #>  $ d:List of 2
    #>   ..$ : num -1
    #>   ..$ : num -5

    Like with vectors, you can subset with a logical, integer, or character vector.

  • [[ and $ extract a single component from a list. They remove a level of hierarchy from the list.

    str(l[[1]])
    #>  int [1:3] 1 2 3
    
    str(l[[4]])
    #> List of 2
    #>  $ : num -1
    #>  $ : num -5
    
    str(l$a)
    #>  int [1:3] 1 2 3

The difference between [ and [[ is particularly important for lists because [[ drills down into the list while [ returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in Figure 27.1. If this pepper shaker is your list pepper, then pepper[1] is a pepper shaker containing a single pepper packet. pepper[2] would look the same, but would contain the second packet. pepper[1:2] would be a pepper shaker containing two pepper packets. pepper[[1]] would extract the pepper packet itself.

Figure 27.1: (Left) A pepper shaker that Hadley once found in his hotel room. (Middle) pepper[1]. (Right) pepper[[1]].

This same principle applies when you use 1d [ with a data frame: df["x"] returns a one-column data frame and df[["x"]] returns a vector.
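
A minimal illustration of that difference:

df <- tibble(x = 1:3)

df["x"]    # 1d [ with a data frame: a one-column tibble
df[["x"]]  # [[: the underlying integer vector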


27.3.4 Exercises

  1. What happens when you use [[ with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?

  2. What would pepper[[1]][1] be? What about pepper[[1]][[1]]?

27.4 Apply family

In Chapter 26, you learned tidyverse techniques for iteration like dplyr::across() and the map family of functions. In this section, you’ll learn about their base equivalents, the apply family. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.


The most important member of this family is lapply(), which is very similar to purrr::map()³. In fact, because we haven’t used any of map()’s more advanced features, you can replace every map() call in Chapter 26 with lapply().
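
For example, these two calls return the same list (a minimal sketch):

library(purrr)

# Both apply the function to each element and return a list.
map(1:3, \(x) x * 2)
lapply(1:3, \(x) x * 2)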


There’s no exact base R equivalent to across() but you can get close by using [ with lapply(). This works because under the hood, data frames are lists of columns, so calling lapply() on a data frame applies the function to each column.

df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)

# First find numeric columns
num_cols <- sapply(df, is.numeric)
num_cols
#>     a     b     c     d     e 
#>  TRUE  TRUE FALSE FALSE  TRUE

# Then transform each column with lapply() then replace the original values
df[, num_cols] <- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
df
#> # A tibble: 1 × 5
#>       a     b c     d         e
#>   <dbl> <dbl> <chr> <chr> <dbl>
#> 1     2     4 a     b         8

The code above uses a new function, sapply(). It’s similar to lapply() but it always tries to simplify the result, hence the s in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called map_vec() that we didn’t mention in Chapter 26.


Base R provides a stricter version of sapply() called vapply(), short for vector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the sapply() call above with this vapply() where we specify that we expect is.numeric() to return a logical vector of length 1:

vapply(df, is.numeric, logical(1))
#>     a     b     c     d     e 
#>  TRUE  TRUE FALSE FALSE  TRUE

The distinction between sapply() and vapply() is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.
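
To see the difference, consider an empty input (a minimal sketch):

# With nothing to simplify, sapply() falls back to a list...
sapply(list(), is.numeric)
#> list()

# ...while vapply() still returns the promised type.
vapply(list(), is.numeric, logical(1))
#> logical(0)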


Another important member of the apply family is tapply() which computes a single grouped summary:

diamonds |> 
  group_by(cut) |> 
  summarize(price = mean(price))
#> # A tibble: 5 × 2
#>   cut       price
#>   <ord>     <dbl>
#> 1 Fair      4359.
#> 2 Good      3929.
#> 3 Very Good 3982.
#> 4 Premium   4584.
#> 5 Ideal     3458.

tapply(diamonds$price, diamonds$cut, mean)
#>      Fair      Good Very Good   Premium     Ideal 
#>  4358.758  3928.864  3981.760  4584.258  3457.542

Unfortunately, tapply() returns its results in a named vector, which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free-floating vectors, but in our experience that just delays the work). If you want to see how you might use tapply() or other base techniques to perform other grouped summaries, Hadley has collected a few techniques in a gist.


The final member of the apply family is the titular apply(), which works with matrices and arrays. In particular, watch out for apply(df, 2, something), which is a slow and potentially dangerous way of doing lapply(df, something). This rarely comes up in data science because we usually work with data frames and not matrices.
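
For completeness, here is apply() in its natural habitat, a matrix (the second argument picks the margin: 1 = rows, 2 = columns):

m <- matrix(1:6, nrow = 2)

apply(m, 1, sum)   # row sums
#> [1]  9 12
apply(m, 2, mean)  # column means
#> [1] 1.5 3.5 5.5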


27.5 for loops

for loops are the fundamental building block of iteration that both the apply and map families use under the hood. for loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:

for (element in vector) {
  # do something with element
}

The most straightforward use of for loops is to achieve the same effect as walk(): call some function with a side-effect on each element of a list. For example, in Section 26.4.1 instead of using walk():

paths |> walk(append_file)

We could have used a for loop:

for (path in paths) {
  append_file(path)
}

Things get a little trickier if you want to save the output of the for loop, for example reading all of the Excel files in a directory like we did in Chapter 26:

paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files <- map(paths, readxl::read_excel)

There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as paths, which we can create with vector():

files <- vector("list", length(paths))

Then instead of iterating over the elements of paths, we’ll iterate over their indices, using seq_along() to generate one index for each element of paths:

seq_along(paths)
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12

Using the indices is important because it lets us link each position in the input to the corresponding position in the output:

for (i in seq_along(paths)) {
  files[[i]] <- readxl::read_excel(paths[[i]])
}

To combine the list of tibbles into a single tibble you can use do.call() + rbind():

do.call(rbind, files)
#> # A tibble: 1,704 × 5
#>   country     continent lifeExp      pop gdpPercap
#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
#> 1 Afghanistan Asia         28.8  8425333      779.
#> 2 Albania     Europe       55.2  1282697     1601.
#> 3 Algeria     Africa       43.1  9279525     2449.
#> 4 Angola      Africa       30.0  4232095     3521.
#> 5 Argentina   Americas     62.5 17876956     5911.
#> 6 Australia   Oceania      69.1  8691212    10040.
#> # ℹ 1,698 more rows

Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:

out <- NULL
for (path in paths) {
  out <- rbind(out, readxl::read_excel(path))
}

We recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that for loops are slow: they’re not, but iteratively growing a vector is.


27.6 Plots

Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise — it takes very little typing to do a basic exploratory plot.


There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with plot() and hist() respectively. Here’s a quick example from the diamonds dataset:

# Left
hist(diamonds$carat)

# Right
plot(diamonds$carat, diamonds$price)

[Plots: left, a histogram of diamond carat (0 to 5), unimodal and right-skewed; right, a scatterplot of price vs. carat showing a positive relationship that fans out, with few diamonds above 3 carats.]

Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using $ or some other technique.
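
If pulling columns out with $ feels clunky, plot() also accepts a formula plus a data argument (a minimal sketch):

# Same scatterplot via the formula interface
plot(price ~ carat, data = diamonds)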


27.7 Summary

In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.


This chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can program in R. We hope these chapters have sparked your interest in programming and that you’re looking forward to learning more outside of this book.

  1. Read https://adv-r.hadley.nz/subsetting.html#subset-multiple to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.

  2. But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like starts_with().

  3. It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.
  • In Chapter 28, you will learn about Quarto, a tool for integrating text, code, and results. You can use Quarto both for communication between analysts and for communication between analysts and decision-makers. Thanks to the power of Quarto formats, you can even use the same document for both purposes.

  • In Chapter 29, you will learn a little about the many other kinds of output you can produce with Quarto, including dashboards, websites, and books.

  • These chapters focus mostly on the technical side of communication, not on the genuinely hard problems of communicating your thinking to other humans. However, there are many other great books about communication, which we will point you to at the end of each chapter.


    11  Communication


    11.1 Introduction

    In Chapter 10, you learned how to use plots as tools for exploration. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.


    Now that you understand your data, you need to communicate your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.


    This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like The Truthful Art, by Alberto Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.


    11.1.1 Prerequisites

    In this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, scales to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including ggrepel (https://ggrepel.slowkow.com) by Kamil Slowikowski and patchwork (https://patchwork.data-imaginist.com) by Thomas Lin Pedersen. Don’t forget that you’ll need to install those packages with install.packages() if you don’t already have them.
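
    A setup chunk consistent with that list might look like this (a sketch; the chapter’s own chunk isn’t shown here):

    library(tidyverse)
    library(scales)
    library(ggrepel)
    library(patchwork)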


    11.2 Labels

    The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the labs() function.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      labs(
        x = "Engine displacement (L)",
        y = "Highway fuel economy (mpg)",
        color = "Car type",
        title = "Fuel efficiency generally decreases with engine size",
        subtitle = "Two seaters (sports cars) are an exception because of their light weight",
        caption = "Data from fueleconomy.gov"
      )

    [Plot: the fully labelled scatterplot described by the labs() call above: highway fuel economy vs. engine displacement, colored by car type, with title, subtitle, and caption.]

    The purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g., “A scatterplot of engine displacement vs. fuel economy”.


    If you need to add more text, there are two other useful labels: subtitle adds additional detail in a smaller font beneath the title and caption adds text at the bottom right of the plot, often used to describe the source of the data. You can also use labs() to replace the axis and legend titles. It’s usually a good idea to replace short variable names with more detailed descriptions, and to include the units.


    It’s possible to use mathematical equations instead of text strings. Just switch "" out for quote() and read about the available options in ?plotmath:

    df <- tibble(
      x = 1:10,
      y = cumsum(x^2)
    )
    
    ggplot(df, aes(x, y)) +
      geom_point() +
      labs(
        x = quote(x[i]),
        y = quote(sum(x[i] ^ 2, i == 1, n))
      )

    [Plot: scatterplot whose axis labels render as math: x_i on the x-axis and the sum of x_i squared for i from 1 to n on the y-axis.]

    11.2.1 Exercises

      1. Create one plot on the fuel economy data with customized title, subtitle, caption, x, y, and color labels.

      2. Recreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.

         [Plot to recreate: scatterplot of highway vs. city fuel efficiency; the shapes and colors of points encode drive train.]

      3. Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.

    11.3 Annotations

    In addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is geom_text(). geom_text() is similar to geom_point(), but it has an additional aesthetic: label. This makes it possible to add textual labels to your plots.


    There are two possible sources of labels. First, you might have a tibble that provides labels. In the following plot we pull out the cars with the highest engine size in each drive type and save their information as a new data frame called label_info.

    label_info <- mpg |>
      group_by(drv) |>
      arrange(desc(displ)) |>
      slice_head(n = 1) |>
      mutate(
        drive_type = case_when(
          drv == "f" ~ "front-wheel drive",
          drv == "r" ~ "rear-wheel drive",
          drv == "4" ~ "4-wheel drive"
        )
      ) |>
      select(displ, hwy, drv, drive_type)
    
    label_info
    #> # A tibble: 3 × 4
    #> # Groups:   drv [3]
    #>   displ   hwy drv   drive_type       
    #>   <dbl> <int> <chr> <chr>            
    #> 1   6.5    17 4     4-wheel drive    
    #> 2   5.3    25 f     front-wheel drive
    #> 3   7      24 r     rear-wheel drive

    Then, we use this new data frame to label the three groups directly, replacing the legend with labels placed right on the plot. Using the fontface and size arguments we can customize the look of the text labels. They’re larger than the rest of the text on the plot and bolded. (theme(legend.position = "none") turns all the legends off — we’ll talk about it more shortly.)

    ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
      geom_point(alpha = 0.3) +
      geom_smooth(se = FALSE) +
      geom_text(
        data = label_info, 
        aes(x = displ, y = hwy, label = drive_type),
        fontface = "bold", size = 5, hjust = "right", vjust = "bottom"
      ) +
      theme(legend.position = "none")
    #> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

    [Plot: the scatterplot with smooth curves per drive type, each labelled directly as front-wheel, rear-wheel, or 4-wheel drive.]

    Note the use of hjust (horizontal justification) and vjust (vertical justification) to control the alignment of the label.


    However, the annotated plot we made above is hard to read because the labels overlap with each other, and with the points. We can use the geom_label_repel() function from the ggrepel package to address both of these issues. This useful package will automatically adjust labels so that they don’t overlap:

    ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
      geom_point(alpha = 0.3) +
      geom_smooth(se = FALSE) +
      geom_label_repel(
        data = label_info, 
        aes(x = displ, y = hwy, label = drive_type),
        fontface = "bold", size = 5, nudge_y = 2
      ) +
      theme(legend.position = "none")
    #> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

    [Plot: the same scatterplot, with the drive-type labels repositioned by geom_label_repel() so they no longer overlap the points or each other.]

    You can also use the same idea to highlight certain points on a plot with geom_text_repel() from the ggrepel package. Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points.

    potential_outliers <- mpg |>
      filter(hwy > 40 | (hwy > 20 & displ > 5))
    
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      geom_text_repel(data = potential_outliers, aes(label = model)) +
      geom_point(data = potential_outliers, color = "red") +
      geom_point(
        data = potential_outliers,
        color = "red", size = 3, shape = "circle open"
      )

    [Plot: scatterplot of highway fuel efficiency vs. engine size; the potential outliers are red, ringed with hollow red circles, and labelled with the car’s model name.]

    Remember, in addition to geom_text() and geom_label(), you have many other geoms in ggplot2 available to help annotate your plot. A couple ideas:

    • Use geom_hline() and geom_vline() to add reference lines. We often make them thick (linewidth = 2) and white (color = white), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.

    • Use geom_rect() to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics xmin, xmax, ymin, ymax. Alternatively, look into the ggforce package, specifically geom_mark_hull(), which allows you to annotate subsets of points with hulls.

    • Use geom_segment() with the arrow argument to draw attention to a point with an arrow. Use aesthetics x and y to define the starting location, and xend and yend to define the end location.

    Another handy function for adding annotations to plots is annotate(). As a rule of thumb, geoms are generally useful for highlighting a subset of the data while annotate() is useful for adding one or a few annotation elements to a plot.


    To demonstrate using annotate(), let’s create some text to add to our plot. The text is a bit long, so we’ll use stringr::str_wrap() to automatically add line breaks to it given the number of characters you want per line:

    trend_text <- "Larger engine sizes tend to have lower fuel economy." |>
      str_wrap(width = 30)
    trend_text
    #> [1] "Larger engine sizes tend to\nhave lower fuel economy."

    Then, we add two layers of annotation: one with a label geom and the other with a segment geom. The x and y aesthetics in both define where the annotation should start, and the xend and yend aesthetics in the segment annotation define the end location of the segment. Note also that the segment is styled as an arrow.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      annotate(
        geom = "label", x = 3.5, y = 38,
        label = trend_text,
        hjust = "left", color = "red"
      ) +
      annotate(
        geom = "segment",
        x = 3, y = 35, xend = 5, yend = 25, color = "red",
        arrow = arrow(type = "closed")
      )

    [Plot: the scatterplot with a red arrow following the downward trend and a red label reading “Larger engine sizes tend to have lower fuel economy”.]

    Annotation is a powerful tool for communicating main takeaways and interesting features of your visualizations. The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!


    11.3.1 Exercises

      1. Use geom_text() with infinite positions to place text at the four corners of the plot.

      2. Use annotate() to add a point geom in the middle of your last plot without having to create a tibble. Customize the shape, size, or color of the point.

      3. How do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the dataset that is being passed to geom_text().)

      4. What arguments to geom_label() control the appearance of the background box?

      5. What are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options.


    11.4 Scales

    The third way you can make your plot better for communication is to adjust the scales. Scales control how the aesthetic mappings manifest visually.


    11.4.1 Default scales

    Normally, ggplot2 automatically adds scales for you. For example, when you type:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = class))

    ggplot2 automatically adds default scales behind the scenes:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = class)) +
      scale_x_continuous() +
      scale_y_continuous() +
      scale_color_discrete()

    Note the naming scheme for scales: scale_ followed by the name of the aesthetic, then _, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. scale_x_continuous() puts the numeric values from displ on a continuous number line on the x-axis, scale_color_discrete() chooses colors for each of the class of car, etc. There are lots of non-default scales which you’ll learn about below.


    The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:

    • You might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.

    • You might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.

    11.4.2 Axis ticks and legend keys

    Collectively axes and legends are called guides. Axes are used for x and y aesthetics; legends are used for everything else.


    There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: breaks and labels. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of breaks is to override the default choice:

    ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
      geom_point() +
      scale_y_continuous(breaks = seq(15, 40, by = 5)) 

    [Plot: the drive-type scatterplot with y-axis breaks at 15, 20, 25, 30, 35, and 40.]

    You can use labels in the same way (a character vector the same length as breaks), but you can also set it to NULL to suppress the labels altogether. This can be useful for maps, or for publishing plots where you can’t share the absolute numbers. You can also use breaks and labels to control the appearance of legends. For discrete scales for categorical variables, labels can be a named list of the existing levels names and the desired labels for them.

    ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
      geom_point() +
      scale_x_continuous(labels = NULL) +
      scale_y_continuous(labels = NULL) +
      scale_color_discrete(labels = c("4" = "4-wheel", "f" = "front", "r" = "rear"))

    [Plot: the same scatterplot with unlabelled axis ticks and a legend relabelled 4-wheel, front, and rear.]

    The labels argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc. The plot on the left shows default labelling with label_dollar(), which adds a dollar sign as well as a thousand separator comma. The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix “K” (for “thousands”) as well as adding custom breaks. Note that breaks is in the original scale of the data.

    # Left
    ggplot(diamonds, aes(x = price, y = cut)) +
      geom_boxplot(alpha = 0.05) +
      scale_x_continuous(labels = label_dollar())
    
    # Right
    ggplot(diamonds, aes(x = price, y = cut)) +
      geom_boxplot(alpha = 0.05) +
      scale_x_continuous(
        labels = label_dollar(scale = 1/1000, suffix = "K"), 
        breaks = seq(1000, 19000, by = 6000)
      )

    [Plots: two box plots of price by cut with dollar-formatted x-axes; the left runs $0 to $15,000 in $5,000 steps, the right runs $1K to $19K in $6K steps.]

    Another handy label function is label_percent():

    ggplot(diamonds, aes(x = cut, fill = clarity)) +
      geom_bar(position = "fill") +
      scale_y_continuous(name = "Percentage", labels = label_percent())

    [Plot: segmented bar chart of cut filled by clarity, with a “Percentage” y-axis running 0% to 100% in 25% steps.]

    Another use of breaks is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.

    presidential |>
      mutate(id = 33 + row_number()) |>
      ggplot(aes(x = start, y = id)) +
      geom_point() +
      geom_segment(aes(xend = end, yend = id)) +
      scale_x_date(name = NULL, breaks = presidential$start, date_labels = "'%y")

    [Plot: each presidency drawn as a point at its start date plus a segment to its end date, with two-digit year labels like '53.]

    Note that for the breaks argument we pulled out the start variable as a vector with presidential$start because we can’t do an aesthetic mapping for this argument. Also note that the specification of breaks and labels for date and datetime scales is a little different:

    • date_labels takes a format specification, in the same form as parse_datetime().

    • date_breaks (not shown here) takes a string like “2 days” or “1 month”; a quick sketch follows this list.
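
    For instance, a minimal sketch of the presidential plot using date_breaks instead of explicit break positions (the "4 years" interval is illustrative):

    presidential |>
      mutate(id = 33 + row_number()) |>
      ggplot(aes(x = start, y = id)) +
      geom_point() +
      geom_segment(aes(xend = end, yend = id)) +
      scale_x_date(name = NULL, date_breaks = "4 years", date_labels = "'%y")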

    11.4.3 Legend layout

    You will most often use breaks and labels to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.


    To control the overall position of the legend, you need to use a theme() setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting legend.position controls where the legend is drawn:

    base <- ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = class))
    
    base + theme(legend.position = "right") # the default
    base + theme(legend.position = "left")
    base + 
      theme(legend.position = "top") +
      guides(color = guide_legend(nrow = 3))
    base + 
      theme(legend.position = "bottom") +
      guides(color = guide_legend(nrow = 3))

    [Plots: four versions of the class-colored scatterplot with the legend on the right, left, top, and bottom.]

    If your plot is short and wide, place the legend at the top or bottom, and if it’s tall and narrow, place the legend at the left or right. You can also use legend.position = "none" to suppress the display of the legend altogether.


    To control the display of individual legends, use guides() along with guide_legend() or guide_colorbar(). The following example shows two important settings: controlling the number of rows the legend uses with nrow, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low alpha to display many points on a plot.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      theme(legend.position = "bottom") +
      guides(color = guide_legend(nrow = 2, override.aes = list(size = 4)))
    #> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

    [Plot: the scatterplot with a two-row bottom legend whose key points are drawn larger than the points in the plot.]

    Note that the name of the argument in guides() matches the name of the aesthetic, just like in labs().


    11.4.4 Replacing a scale

    Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and color scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and color, you’ll be able to quickly pick up other scale replacements.


    It’s very useful to plot transformations of your variable. For example, it’s easier to see the precise relationship between carat and price if we log transform them:

    # Left
    ggplot(diamonds, aes(x = carat, y = price)) +
      geom_bin2d()
    
    # Right
    ggplot(diamonds, aes(x = log10(carat), y = log10(price))) +
      geom_bin2d()

    [Plots: two binned plots of price vs. carat; on the right both variables are log-transformed and the axes show the logged values.]

    However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.

    ggplot(diamonds, aes(x = carat, y = price)) +
      geom_bin2d() + 
      scale_x_log10() + 
      scale_y_log10()

    [Plot: the same binned plot, log-transformed via the scales, so the axes are labelled on the original data scale.]

    Another scale that is frequently customized is color. The default categorical scale picks colors that are evenly spaced around the color wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of color blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness.¹

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = drv))
    
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = drv)) +
      scale_color_brewer(palette = "Set1")

    [Plots: the same drive-type scatterplot with the default palette (left) and the ColorBrewer Set1 palette (right).]

    Don’t forget simpler techniques for improving accessibility. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = drv, shape = drv)) +
      scale_color_brewer(palette = "Set1")

    [Plot: scatterplot where both color and shape encode drive type, using the Set1 palette.]

    The ColorBrewer scales are documented online at https://colorbrewer2.org/ and made available in R via the RColorBrewer package, by Erich Neuwirth. Figure 11.1 shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used cut() to make a continuous variable into a categorical variable.

    [Figure 11.1: All ColorBrewer scales: sequential palettes (light to dark), qualitative palettes, and diverging palettes (dark to light to dark).]

    When you have a predefined mapping between values and colors, use scale_color_manual(). For example, if we map presidential party to color, we want to use the standard mapping of red for Republicans and blue for Democrats. One approach for assigning these colors is using hex color codes:

    presidential |>
      mutate(id = 33 + row_number()) |>
      ggplot(aes(x = start, y = id, color = party)) +
      geom_point() +
      geom_segment(aes(xend = end, yend = id)) +
      scale_color_manual(values = c(Republican = "#E81B23", Democratic = "#00AEF3"))

    [Plot: the presidential timeline with Democratic terms in blue and Republican terms in red.]

    For continuous color, you can use the built-in scale_color_gradient() or scale_fill_gradient(). If you have a diverging scale, you can use scale_color_gradient2(). That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.
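
    For example, a minimal sketch (the data here is illustrative) that colors points by their distance above or below the mean, anchoring the midpoint at zero:

    df <- tibble(x = 1:20, y = rnorm(20))
    
    ggplot(df, aes(x, y, color = y - mean(y))) +
      geom_point() +
      scale_color_gradient2(midpoint = 0)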


    Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (c), discrete (d), and binned (b) palettes in ggplot2.

    df <- tibble(
      x = rnorm(10000),
      y = rnorm(10000)
    )
    
    ggplot(df, aes(x, y)) +
      geom_hex() +
      coord_fixed() +
      labs(title = "Default, continuous", x = NULL, y = NULL)
    
    ggplot(df, aes(x, y)) +
      geom_hex() +
      coord_fixed() +
      scale_fill_viridis_c() +
      labs(title = "Viridis, continuous", x = NULL, y = NULL)
    
    ggplot(df, aes(x, y)) +
      geom_hex() +
      coord_fixed() +
      scale_fill_viridis_b() +
      labs(title = "Viridis, binned", x = NULL, y = NULL)

    [Plots: three hex-bin plots of the same data using the default continuous scale, the continuous viridis scale, and the binned viridis scale.]

    Note that all color scales come in two varieties: scale_color_*() and scale_fill_*() for the color and fill aesthetics respectively (the color scales are available in both UK and US spellings).

11.4.5 Zooming

    There are three ways to control the plot limits:

1. Adjusting what data are plotted.
2. Setting the limits in each scale.
3. Setting xlim and ylim in coord_cartesian().

    We’ll demonstrate these options in a series of plots. The plot on the left shows the relationship between engine size and fuel efficiency, colored by type of drive train. The plot on the right shows the same variables, but subsets the data that are plotted. Subsetting the data has affected the x and y scales as well as the smooth curve.

# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth()

# Right
mpg |>
  filter(displ >= 5 & displ <= 6 & hwy >= 10 & hwy <= 25) |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth()

    On the left, scatterplot of highway mileage vs. displacement, with displacement. The smooth curve overlaid shows a decreasing, and then increasing trend, like a hockey stick. On the right, same variables are plotted with displacement ranging only from 5 to 6 and highway mileage ranging only from 10 to 25. The smooth curve overlaid shows a trend that's slightly increasing first and then decreasing.


    Let’s compare these to the two plots below where the plot on the left sets the limits on individual scales and the plot on the right sets them in coord_cartesian(). We can see that reducing the limits is equivalent to subsetting the data. Therefore, to zoom in on a region of the plot, it’s generally best to use coord_cartesian().

# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth() +
  scale_x_continuous(limits = c(5, 6)) +
  scale_y_continuous(limits = c(10, 25))

# Right
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 6), ylim = c(10, 25))

    On the left, scatterplot of highway mileage vs. displacement, with displacement ranging from 5 to 6 and highway mileage ranging from 10 to 25. The smooth curve overlaid shows a trend that's slightly increasing first and then decreasing. On the right, same variables are plotted with the same limits, however the smooth curve overlaid shows a relatively flat trend with a slight increase at the end.


    On the other hand, setting the limits on individual scales is generally more useful if you want to expand the limits, e.g., to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.

suv <- mpg |> filter(class == "suv")
compact <- mpg |> filter(class == "compact")

# Left
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Right
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

On the left, a scatterplot of highway mileage vs. displacement of SUVs. On the right, a scatterplot of the same variables for compact cars. Points are colored by drive type for both plots. Among SUVs more of the cars are 4-wheel drive and the others are rear-wheel drive, while among compact cars more of the cars are front-wheel drive and the others are 4-wheel drive. The SUV plot shows a clear negative relationship between highway mileage and displacement, while in the compact cars plot the relationship is much flatter.


    One way to overcome this problem is to share scales across multiple plots, training the scales with the limits of the full data.

x_scale <- scale_x_continuous(limits = range(mpg$displ))
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
col_scale <- scale_color_discrete(limits = unique(mpg$drv))

# Left
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  x_scale +
  y_scale +
  col_scale

# Right
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  x_scale +
  y_scale +
  col_scale

    On the left, a scatterplot of highway mileage vs. displacement of SUVs. On the right, a scatterplot of the same variables for compact cars. Points are colored by drive type for both plots. Both plots are plotted on the same scale for highway mileage, displacement, and drive type, resulting in the legend showing all three types (front, rear, and 4-wheel drive) for both plots even though there are no front-wheel drive SUVs and no rear-wheel drive compact cars. Since the x and y scales are the same, and go well beyond minimum or maximum highway mileage and displacement, the points do not take up the entire plotting area.


In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.
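For instance, a quick sketch (not from the text) of that faceting alternative:

ggplot(
  mpg |> filter(class %in% c("suv", "compact")),
  aes(x = displ, y = hwy, color = drv)
) +
  geom_point() +
  facet_wrap(~class)  # facets share the x, y, and color scales automatically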

11.4.6 Exercises

1. Why doesn't the following code override the default scale?

   df <- tibble(
     x = rnorm(10000),
     y = rnorm(10000)
   )

   ggplot(df, aes(x, y)) +
     geom_hex() +
     scale_color_gradient(low = "white", high = "red") +
     coord_fixed()

2. What is the first argument to every scale? How does it compare to labs()?

3. Change the display of the presidential terms by:

   1. Combining the two variants that customize colors and x axis breaks.
   2. Improving the display of the y axis.
   3. Labelling each term with the name of the president.
   4. Adding informative plot labels.
   5. Placing breaks every 4 years (this is trickier than it seems!).

4. First, create the following plot. Then, modify the code using override.aes to make the legend easier to see.

   ggplot(diamonds, aes(x = carat, y = price)) +
     geom_point(aes(color = cut), alpha = 1/20)

11.5 Themes

    Finally, you can customize the non-data elements of your plot with a theme:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  theme_bw()

    Scatterplot of highway mileage vs. displacement of cars, colored by class of car. The plot background is white, with gray grid lines.


    ggplot2 includes the eight themes shown in Figura 11.2, with theme_gray() as the default.2 Many more are included in add-on packages like ggthemes (https://jrnold.github.io/ggthemes), by Jeffrey Arnold. You can also create your own themes, if you are trying to match a particular corporate or journal style.


    Eight barplots created with ggplot2, each with one of the eight built-in themes: theme_bw() - White background with grid lines, theme_light() - Light axes and grid lines, theme_classic() - Classic theme, axes but no grid lines, theme_linedraw() - Only black lines, theme_dark() - Dark background for contrast, theme_minimal() - Minimal theme, no background, theme_gray() - Gray background (default theme), theme_void() - Empty theme, only geoms are visible.

Figura 11.2: The eight themes built-in to ggplot2.
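For instance, a quick sketch (assuming the ggthemes package is installed) applying one of its themes to the earlier scatterplot:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  ggthemes::theme_economist()  # one of many themes offered by ggthemes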

It’s also possible to control individual components of each theme, like the size and color of the font used for the y axis. We’ve already seen that legend.position controls where the legend is drawn. There are many other aspects of the legend that can be customized with theme(). For example, in the plot below we change the direction of the legend as well as put a black border around it. Note that customization of the legend box and plot title elements of the theme are done with element_*() functions. These functions specify the styling of non-data components, e.g., the title text is bolded in the face argument of element_text() and the legend border color is defined in the color argument of element_rect(). The theme elements that control the position of the title and the caption are plot.title.position and plot.caption.position, respectively. In the following plot these are set to "plot" to indicate these elements are aligned to the entire plot area, instead of the plot panel (the default). A few other helpful theme() components are used to change the placement and format of the title and caption text.

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(
    title = "Larger engine sizes tend to have lower fuel economy",
    caption = "Source: https://fueleconomy.gov."
  ) +
  theme(
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal",
    legend.box.background = element_rect(color = "black"),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 0)
  )

    Scatterplot of highway fuel efficiency versus engine size of cars, colored by drive. The plot is titled 'Larger engine sizes tend to have lower fuel economy' with the caption pointing to the source of the data, fueleconomy.gov. The caption and title are left justified, the legend is inside of the plot with a black border.


    For an overview of all theme() components, see help with ?theme. The ggplot2 book is also a great place to go for the full details on theming.


11.5.1 Exercises

1. Pick a theme offered by the ggthemes package and apply it to the last plot you made.
2. Make the axis labels of your plot blue and bolded.

11.6 Layout

    So far we talked about how to create and modify a single plot. What if you have multiple plots you want to lay out in a certain way? The patchwork package allows you to combine separate plots into the same graphic. We loaded this package earlier in the chapter.


    To place two plots next to each other, you can simply add them to each other. Note that you first need to create the plots and save them as objects (in the following example they’re called p1 and p2). Then, you place them next to each other with +.

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Plot 1")
p2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  labs(title = "Plot 2")
p1 + p2

    Two plots (a scatterplot of highway mileage versus engine size and a side-by-side boxplots of highway mileage versus drive train) placed next to each other.


It’s important to note that in the above code chunk we did not use a new function from the patchwork package. Instead, the package added new functionality to the + operator.


You can also create complex plot layouts with patchwork. In the following, | places p1 and p3 next to each other and / moves p2 to the next line.

p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  labs(title = "Plot 3")
(p1 | p3) / p2

Three plots laid out such that the first and third plots are next to each other and the second plot is stretched beneath them. The first plot is a scatterplot of highway mileage versus engine size, the third plot is a scatterplot of highway mileage versus city mileage, and the second plot is side-by-side boxplots of highway mileage versus drive train.


Additionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as the dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots. Below we create 5 plots. We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with & theme(legend.position = "top"). Note the use of the & operator here instead of the usual +. This is because we’re modifying the theme for the patchwork plot as opposed to the individual ggplots. The legend is placed on top, inside the guide_area(). Finally, we have also customized the heights of the various components of our patchwork – the guide has a height of 1, the box plots 3, the density plots 2, and the faceted scatterplot 4. patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly.

p1 <- ggplot(mpg, aes(x = drv, y = cty, color = drv)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Plot 1")

p2 <- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Plot 2")

p3 <- ggplot(mpg, aes(x = cty, color = drv, fill = drv)) +
  geom_density(alpha = 0.5) +
  labs(title = "Plot 3")

p4 <- ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) +
  geom_density(alpha = 0.5) +
  labs(title = "Plot 4")

p5 <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~drv) +
  labs(title = "Plot 5")

(guide_area() / (p1 + p2) / (p3 + p4) / p5) +
  plot_annotation(
    title = "City and highway mileage for cars with different drive trains",
    caption = "Source: https://fueleconomy.gov."
  ) +
  plot_layout(
    guides = "collect",
    heights = c(1, 3, 2, 4)
  ) &
  theme(legend.position = "top")

    Five plots laid out such that first two plots are next to each other. Plots three and four are underneath them. And the fifth plot stretches under them. The patchworked plot is titled "City and highway mileage for cars with different drive trains" and captioned "Source: https://fueleconomy.gov". The first two plots are side-by-side box plots. Plots 3 and 4 are density plots. And the fifth plot is a faceted scatterplot. Each of these plots show geoms colored by drive train, but the patchworked plot has only one legend that applies to all of them, above the plots and beneath the title.


If you’d like to learn more about combining and laying out multiple plots with patchwork, we recommend looking through the guides on the package website: https://patchwork.data-imaginist.com.


11.6.1 Exercises

1. What happens if you omit the parentheses in the following plot layout? Can you explain why this happens?

   p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
     geom_point() +
     labs(title = "Plot 1")
   p2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
     geom_boxplot() +
     labs(title = "Plot 2")
   p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +
     geom_point() +
     labs(title = "Plot 3")

   (p1 | p2) / p3

2. Using the three plots from the previous exercise, recreate the following patchwork.

   Three plots: Plot 1 is a scatterplot of highway mileage versus engine size. Plot 2 is side-by-side box plots of highway mileage versus drive train. Plot 3 is side-by-side box plots of city mileage versus drive train. Plot 1 is on the first row. Plots 2 and 3 are on the next row, each spanning half the width of Plot 1. Plot 1 is labelled "Fig. A", Plot 2 is labelled "Fig. B", and Plot 3 is labelled "Fig. C".

11.7 Summary

In this chapter you’ve learned about adding plot labels such as title, subtitle, and caption, as well as modifying default axis labels; using annotation to add informational text to your plot or to highlight specific data points; customizing the axis scales; and changing the theme of your plot. You’ve also learned about combining multiple plots in a single graph using both simple and complex plot layouts.


    While you’ve so far learned about how to make many different types of plots and how to customize them using a variety of techniques, we’ve barely scratched the surface of what you can create with ggplot2. If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, ggplot2: Elegant Graphics for Data Analysis. Other useful resources are the R Graphics Cookbook by Winston Chang and Fundamentals of Data Visualization by Claus Wilke.

1. You can use a tool like SimDaltonism to simulate color blindness to test these images.↩︎

2. Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the gray background creates a continuous field of color which ensures that the plot is perceived as a single visual entity.↩︎
    + + + \ No newline at end of file diff --git a/communication_files/figure-html/default-scales-1.png b/communication_files/figure-html/default-scales-1.png new file mode 100644 index 000000000..e1957c809 Binary files /dev/null and b/communication_files/figure-html/default-scales-1.png differ diff --git a/communication_files/figure-html/fig-brewer-1.png b/communication_files/figure-html/fig-brewer-1.png new file mode 100644 index 000000000..8d4815313 Binary files /dev/null and b/communication_files/figure-html/fig-brewer-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-11-1.png b/communication_files/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 000000000..e1bb96572 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-13-1.png b/communication_files/figure-html/unnamed-chunk-13-1.png new file mode 100644 index 000000000..e1957c809 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-14-1.png b/communication_files/figure-html/unnamed-chunk-14-1.png new file mode 100644 index 000000000..de22c510a Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-15-1.png b/communication_files/figure-html/unnamed-chunk-15-1.png new file mode 100644 index 000000000..1caced931 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-15-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-16-1.png b/communication_files/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 000000000..6815a5ada Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-16-2.png b/communication_files/figure-html/unnamed-chunk-16-2.png new file mode 100644 index 000000000..5b25c7bd4 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-16-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-17-1.png b/communication_files/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 000000000..9395ffe8f Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-18-1.png b/communication_files/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 000000000..4b91113c8 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-19-1.png b/communication_files/figure-html/unnamed-chunk-19-1.png new file mode 100644 index 000000000..e96e30c2e Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-19-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-19-2.png b/communication_files/figure-html/unnamed-chunk-19-2.png new file mode 100644 index 000000000..38b8e4e1d Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-19-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-19-3.png b/communication_files/figure-html/unnamed-chunk-19-3.png new file mode 100644 index 000000000..64ec57551 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-19-3.png differ diff --git a/communication_files/figure-html/unnamed-chunk-19-4.png 
b/communication_files/figure-html/unnamed-chunk-19-4.png new file mode 100644 index 000000000..a52dbd12a Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-19-4.png differ diff --git a/communication_files/figure-html/unnamed-chunk-20-1.png b/communication_files/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 000000000..307aa46a0 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-21-1.png b/communication_files/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 000000000..b9cdea62b Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-21-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-21-2.png b/communication_files/figure-html/unnamed-chunk-21-2.png new file mode 100644 index 000000000..13ed02e38 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-21-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-22-1.png b/communication_files/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 000000000..7f6359778 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-23-1.png b/communication_files/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 000000000..23f72b1a0 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-23-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-23-2.png b/communication_files/figure-html/unnamed-chunk-23-2.png new file mode 100644 index 000000000..e7a58e1c2 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-23-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-24-1.png b/communication_files/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 000000000..3142b53e5 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-24-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-26-1.png b/communication_files/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 000000000..d97288561 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-27-1.png b/communication_files/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 000000000..225a8ced9 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-27-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-27-2.png b/communication_files/figure-html/unnamed-chunk-27-2.png new file mode 100644 index 000000000..be71bf6c4 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-27-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-27-3.png b/communication_files/figure-html/unnamed-chunk-27-3.png new file mode 100644 index 000000000..db7030678 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-27-3.png differ diff --git a/communication_files/figure-html/unnamed-chunk-28-1.png b/communication_files/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 000000000..7b3fbf9e8 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-28-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-28-2.png b/communication_files/figure-html/unnamed-chunk-28-2.png new file mode 100644 index 000000000..c45dfd024 Binary files 
/dev/null and b/communication_files/figure-html/unnamed-chunk-28-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-29-1.png b/communication_files/figure-html/unnamed-chunk-29-1.png new file mode 100644 index 000000000..a8aecb2de Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-29-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-29-2.png b/communication_files/figure-html/unnamed-chunk-29-2.png new file mode 100644 index 000000000..1ce77434a Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-29-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-3-1.png b/communication_files/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 000000000..e07275b13 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-30-1.png b/communication_files/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 000000000..a5a4d887c Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-30-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-30-2.png b/communication_files/figure-html/unnamed-chunk-30-2.png new file mode 100644 index 000000000..538eaa9a4 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-30-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-31-1.png b/communication_files/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 000000000..99c6f35dd Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-31-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-31-2.png b/communication_files/figure-html/unnamed-chunk-31-2.png new file mode 100644 index 000000000..11cbf4ffc Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-31-2.png differ diff --git a/communication_files/figure-html/unnamed-chunk-32-1.png b/communication_files/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 000000000..b5f495d66 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-32-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-33-1.png b/communication_files/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 000000000..3bee6f46c Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-33-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-34-1.png b/communication_files/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 000000000..f908ce252 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-34-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-36-1.png b/communication_files/figure-html/unnamed-chunk-36-1.png new file mode 100644 index 000000000..e4ad6c027 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-36-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-37-1.png b/communication_files/figure-html/unnamed-chunk-37-1.png new file mode 100644 index 000000000..5bec9b28f Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-37-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-38-1.png b/communication_files/figure-html/unnamed-chunk-38-1.png new file mode 100644 index 000000000..d3abc9fc9 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-38-1.png differ diff --git 
a/communication_files/figure-html/unnamed-chunk-39-1.png b/communication_files/figure-html/unnamed-chunk-39-1.png new file mode 100644 index 000000000..2ec653f3c Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-39-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-4-1.png b/communication_files/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 000000000..a929d6463 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-40-1.png b/communication_files/figure-html/unnamed-chunk-40-1.png new file mode 100644 index 000000000..a2d86e4ab Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-40-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-41-1.png b/communication_files/figure-html/unnamed-chunk-41-1.png new file mode 100644 index 000000000..d8b914489 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-41-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-5-1.png b/communication_files/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 000000000..0c497629c Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-7-1.png b/communication_files/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 000000000..2513a2426 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-8-1.png b/communication_files/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 000000000..9efc1f753 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/communication_files/figure-html/unnamed-chunk-9-1.png b/communication_files/figure-html/unnamed-chunk-9-1.png new file mode 100644 index 000000000..727c882e1 Binary files /dev/null and b/communication_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/data-import.html b/data-import.html new file mode 100644 index 000000000..c7cb2ab14 --- /dev/null +++ b/data-import.html @@ -0,0 +1,1279 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 7  Data import + + + + + + + + + + + + + + + + + + + + + + + + +
    +
    + +
    + + + +
    +

    7  Data import


7.1 Introduction

    Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point. In this chapter, you’ll learn the basics of reading data files into R.


    Specifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.


7.1.1 Prerequisites

    In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.
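In practice that means a one-line setup chunk (readr is attached as part of the core tidyverse):

library(tidyverse)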


7.2 Reading data from a file

    To begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data. The columns are separated, aka delimited, by commas.

Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6

    Tabela 7.1 shows a representation of the same data as a table.

Tabela 7.1: Data from the students.csv file as a table.

Student ID | Full Name        | favourite.food     | mealPlan            | AGE
---------- | ---------------- | ------------------ | ------------------- | ----
1          | Sunil Huffmann   | Strawberry yoghurt | Lunch only          | 4
2          | Barclay Lynn     | French fries       | Lunch only          | 5
3          | Jayendra Lyne    | N/A                | Breakfast and lunch | 7
4          | Leon Rossini     | Anchovies          | Lunch only          | NA
5          | Chidiegwu Dunkel | Pizza              | Breakfast and lunch | five
6          | Güvenç Attila    | Ice cream          | Lunch only          | 6

We can read this file into R using read_csv(). The first argument is the most important: the path to the file. You can think about the path as the address of the file: the file is called students.csv and it lives in the data folder.

students <- read_csv("data/students.csv")
#> Rows: 6 Columns: 5
#> ── Column specification ─────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): Full Name, favourite.food, mealPlan, AGE
#> dbl (1): Student ID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

    The code above will work if you have the students.csv file in a data folder in your project. You can download the students.csv file from https://pos.it/r4ds-students-csv or you can read it directly from that URL with:

students <- read_csv("https://pos.it/r4ds-students-csv")

    When you run read_csv(), it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we’ll return to it in Seção 7.3.
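For example, a quick check (not shown in the text; output approximate) retrieves the guessed specification with spec():

spec(students)
#> cols(
#>   `Student ID` = col_double(),
#>   `Full Name` = col_character(),
#>   favourite.food = col_character(),
#>   mealPlan = col_character(),
#>   AGE = col_character()
#> )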


7.2.1 Practical advice

    Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the students data with that in mind.

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
#> 2            2 Barclay Lynn     French fries       Lunch only          5
#> 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA>
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

In the favourite.food column, there are a bunch of food items, and then the character string N/A, which should have been a real NA that R will recognize as “not available”. This is something we can address using the na argument. By default, read_csv() only recognizes empty strings ("") in this dataset as NAs; we want it to also recognize the character string "N/A".

students <- read_csv("data/students.csv", na = c("N/A", ""))

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
#> 2            2 Barclay Lynn     French fries       Lunch only          5
#> 3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA>
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

    You might also notice that the Student ID and Full Name columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names. To refer to these variables, you need to surround them with backticks, `:

students |>
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
#> # A tibble: 6 × 5
#>   student_id full_name        favourite.food     mealPlan            AGE
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
#> 2          2 Barclay Lynn     French fries       Lunch only          5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA>
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

    An alternative approach is to use janitor::clean_names() to use some heuristics to turn them all into snake case at once1.

students |> janitor::clean_names()
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
#> 2          2 Barclay Lynn     French fries       Lunch only          5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA>
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

    Another common task after reading in data is to consider variable types. For example, meal_plan is a categorical variable with a known set of possible values, which in R should be represented as a factor:

students |>
  janitor::clean_names() |>
  mutate(meal_plan = factor(meal_plan))
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age
#>        <dbl> <chr>            <chr>              <fct>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
#> 2          2 Barclay Lynn     French fries       Lunch only          5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA>
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

    Note that the values in the meal_plan variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<chr>) to factor (<fct>). You’ll learn more about factors in Capítulo 16.


    Before you analyze these data, you’ll probably want to fix the age and id columns. Currently, age is a character variable because one of the observations is typed out as five instead of a numeric 5. We discuss the details of fixing this issue in Capítulo 20.

students <- students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = parse_number(if_else(age == "five", "5", age))
  )

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

    A new function here is if_else(), which has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is FALSE. Here we’re saying if age is the character string "five", make it "5", and if not leave it as age. You will learn more about if_else() and logical vectors in Capítulo 12.
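A tiny sketch (not from the text) of if_else() on its own:

x <- c(1, 5, 2)
if_else(x > 3, "big", "small")  # one output element per element of the test vector
#> [1] "small" "big"   "small"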


7.2.2 Other arguments

    There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: read_csv() can read text strings that you’ve created and formatted like a CSV file:

read_csv(
  "a,b,c
  1,2,3
  4,5,6"
)
#> # A tibble: 2 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

    Usually, read_csv() uses the first line of the data for the column names, which is a very common convention. But it’s not uncommon for a few lines of metadata to be included at the top of the file. You can use skip = n to skip the first n lines or use comment = "#" to drop all lines that start with (e.g.) #:

read_csv(
  "The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3",
  skip = 2
)
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

read_csv(
  "# A comment I want to skip
  x,y,z
  1,2,3",
  comment = "#"
)
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

    In other cases, the data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings and instead label them sequentially from X1 to Xn:

read_csv(
  "1,2,3
  4,5,6",
  col_names = FALSE
)
#> # A tibble: 2 × 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

    Alternatively, you can pass col_names a character vector which will be used as the column names:

read_csv(
  "1,2,3
  4,5,6",
  col_names = c("x", "y", "z")
)
#> # A tibble: 2 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

    These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your .csv file and read the documentation for read_csv()’s many other arguments.)


7.2.3 Other file types

    Once you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:

• read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.

• read_tsv() reads tab-delimited files.

• read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don't specify it.

• read_fwf() reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().

• read_table() reads a common variation of fixed-width files where columns are separated by white space.

• read_log() reads Apache-style log files.
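For instance, a minimal sketch (inline data made up for illustration) of read_csv2() handling both the ; field separator and the , decimal mark:

read_csv2("x;y\n1,5;2\n3;4,5")
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1   1.5   2
#> 2   3     4.5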

7.2.4 Exercises

1. What function would you use to read a file where fields were separated with "|"?

2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

3. What are the most important arguments to read_fwf()?

4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?

   "x,y\n1,'a,b'"

5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

   read_csv("a,b\n1,2,3\n4,5,6")
   read_csv("a,b,c\n1,2\n1,2,3,4")
   read_csv("a,b\n\"1")
   read_csv("a,b\n1,2\na,b")
   read_csv("a;b\n1;3")

6. Practice referring to non-syntactic names in the following data frame by:

   1. Extracting the variable called 1.
   2. Plotting a scatterplot of 1 vs. 2.
   3. Creating a new column called 3, which is 2 divided by 1.
   4. Renaming the columns to one, two, and three.

   annoying <- tibble(
     `1` = 1:10,
     `2` = `1` * 2 + rnorm(length(`1`))
   )

7.3 Controlling column types

    A CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.


7.3.1 Guessing types

readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:

• Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it's a logical.

• Does it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it's a number.

• Does it match the ISO8601 standard? If so, it's a date or date-time. (We'll return to date-times in more detail in Seção 17.2.)

• Otherwise, it must be a string.

    You can see that behavior in action in this simple example:

read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")
#> # A tibble: 3 × 4
#>   logical numeric date       string
#>   <lgl>     <dbl> <date>     <chr>
#> 1 TRUE        1   2021-01-15 abc
#> 2 FALSE       4.5 2021-02-15 def
#> 3 TRUE      Inf   2021-02-16 ghi

    This heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and beautiful failures.


7.3.2 Missing values, column types, and problems

    The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.


Take this simple one-column CSV file as an example:

simple_csv <- "
  x
  10
  .
  20
  30"

    If we read it without any additional arguments, x becomes a character column:

read_csv(simple_csv)
#> # A tibble: 4 × 1
#>   x
#>   <chr>
#> 1 10
#> 2 .
#> 3 20
#> 4 30

    In this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s sprinkled among them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:

df <- read_csv(
  simple_csv,
  col_types = list(x = col_double())
)
#> Warning: One or more parsing issues, call `problems()` on your data frame for
#> details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

    Now read_csv() reports that there was a problem, and tells us we can find out more with problems():

problems(df)
#> # A tibble: 1 × 5
#>     row   col expected actual file
#>   <int> <int> <chr>    <chr>  <chr>
#> 1     3     1 a double .      /tmp/Rtmp7ye2gf/file228416ab4e78

This tells us that there was a problem in row 3, col 1, where readr expected a double but got a .. That suggests this dataset uses . for missing values. So we set na = "." and the automatic guessing succeeds, giving us the numeric column that we want:

read_csv(simple_csv, na = ".")
#> # A tibble: 4 × 1
#>       x
#>   <dbl>
#> 1    10
#> 2    NA
#> 3    20
#> 4    30

7.3.3 Column types

    readr provides a total of nine column types for you to use:

• col_logical() and col_double() read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.

• col_integer() reads integers. We seldom distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.

• col_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies an object but doesn't make sense to apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.

• col_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in Capítulo 16 and Capítulo 17.

• col_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in Capítulo 13.

• col_skip() skips a column so it's not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.
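As an illustration, a minimal sketch (inline data made up for this example) combining a few of these types:

read_csv(
  "id,price,joined
  0042,$1.99,2021-01-15",
  col_types = cols(
    id = col_character(),  # keeps the leading zeros
    price = col_number(),  # ignores the currency symbol
    joined = col_date()    # parses the ISO8601 date
  )
)
#> # A tibble: 1 × 3
#>   id    price joined
#>   <chr> <dbl> <date>
#> 1 0042   1.99 2021-01-15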

It’s also possible to override the default column type by switching from list() to cols() and specifying .default:

another_csv <- "
x,y,z
1,2,3"

read_csv(
  another_csv,
  col_types = cols(.default = col_character())
)
#> # A tibble: 1 × 3
#>   x     y     z
#>   <chr> <chr> <chr>
#> 1 1     2     3

    Another useful helper is cols_only() which will read in only the columns you specify:

read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1

7.4 Reading data from multiple files

    Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.

sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
#> # A tibble: 19 × 6
#>   file              month    year brand  item     n
#>   <chr>             <chr>   <dbl> <dbl> <dbl> <dbl>
#> 1 data/01-sales.csv January  2019     1  1234     3
#> 2 data/01-sales.csv January  2019     1  8721     9
#> 3 data/01-sales.csv January  2019     1  1822     2
#> 4 data/01-sales.csv January  2019     2  3333     1
#> 5 data/01-sales.csv January  2019     2  2156     9
#> 6 data/01-sales.csv January  2019     2  3987     6
#> # ℹ 13 more rows

    Once again, the code above will work if you have the CSV files in a data folder in your project. You can download these files from https://pos.it/r4ds-01-sales, https://pos.it/r4ds-02-sales, and https://pos.it/r4ds-03-sales or you can read them directly with:

sales_files <- c(
  "https://pos.it/r4ds-01-sales",
  "https://pos.it/r4ds-02-sales",
  "https://pos.it/r4ds-03-sales"
)
read_csv(sales_files, id = "file")

    The id argument adds a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.


    If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in Capítulo 15.

sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files
#> [1] "data/01-sales.csv" "data/02-sales.csv" "data/03-sales.csv"

    7.5 Writing to a file

    readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). The most important arguments to these functions are x (the data frame to save) and file (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.

    write_csv(students, "students.csv")
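    The na and append arguments look like this in practice (a minimal sketch, not from the original text; the file name is invented):

    # Write missing values as empty strings instead of the default "NA" ...
    write_csv(students, "students-archive.csv", na = "")
    # ... then append the same rows to the existing file (no header is repeated).
    write_csv(students, "students-archive.csv", append = TRUE)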

    Now let’s read that csv file back in. Note that the variable type information that you just set up is lost when you save to CSV because you’re starting over with reading from a plain text file again:

    students
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    write_csv(students, "students-2.csv")
    read_csv("students-2.csv")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <chr>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6

    This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load the data in. There are two main alternatives:

    1. write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS. This means that when you reload the object, you are loading the exact same R object that you stored.

       write_rds(students, "students.rds")
       read_rds("students.rds")
       #> # A tibble: 6 × 5
       #>   student_id full_name        favourite_food     meal_plan             age
       #>        <dbl> <chr>            <chr>              <fct>               <dbl>
       #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
       #> 2          2 Barclay Lynn     French fries       Lunch only              5
       #> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
       #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
       #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
       #> 6          6 Güvenç Attila    Ice cream          Lunch only              6

    2. The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in Capítulo 22.

       library(arrow)
       write_parquet(students, "students.parquet")
       read_parquet("students.parquet")
       #> # A tibble: 6 × 5
       #>   student_id full_name        favourite_food     meal_plan             age
       #>        <dbl> <chr>            <chr>              <fct>               <dbl>
       #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
       #> 2          2 Barclay Lynn     French fries       Lunch only              5
       #> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
       #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
       #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
       #> 6          6 Güvenç Attila    Ice cream          Lunch only              6

    Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.


    7.6 Data entry

    Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script. There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows. tibble() works by column:

    tibble(
      x = c(1, 2, 5), 
      y = c("h", "m", "g"),
      z = c(0.08, 0.83, 0.60)
    )
    #> # A tibble: 3 × 3
    #>       x y         z
    #>   <dbl> <chr> <dbl>
    #> 1     1 h      0.08
    #> 2     2 m      0.83
    #> 3     5 g      0.6

    Laying out the data by column can make it hard to see how the rows are related, so an alternative is tribble(), short for transposed tibble, which lets you lay out your data row by row. tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:

    tribble(
      ~x, ~y, ~z,
      1, "h", 0.08,
      2, "m", 0.83,
      5, "g", 0.60
    )
    #> # A tibble: 3 × 3
    #>       x y         z
    #>   <dbl> <chr> <dbl>
    #> 1     1 h      0.08
    #> 2     2 m      0.83
    #> 3     5 g      0.6

    7.7 Summary

    In this chapter, you’ve learned how to load CSV files with read_csv() and to do your own data entry with tibble() and tribble(). You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll return to data import a few times in this book: Capítulo 20 from Excel and Google Sheets, Capítulo 21 will show you how to load data from databases, Capítulo 22 from parquet files, Capítulo 23 from JSON, and Capítulo 24 from websites.


    We’re just about at the end of this section of the book, but there’s one important last topic to cover: how to get help. So in the next chapter, you’ll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R.


    1. The janitor package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use |>.↩︎

    2. You can override the default of 1000 with the guess_max argument.↩︎

diff --git a/data-tidy.html b/data-tidy.html new file mode 100644 index 000000000..912802dcf --- /dev/null +++ b/data-tidy.html @@ -0,0 +1,1276 @@
R para Ciência de Dados (2ª edição) - 5  Data tidying

    5  Data tidying


    5.1 Introduction

    “Happy families are all alike; every unhappy family is unhappy in its own way.”
    — Leo Tolstoy

    “Tidy datasets are all alike, but every messy dataset is messy in its own way.”
    — Hadley Wickham

    In this chapter, you will learn a consistent way to organize your data in R using a system called tidy data. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.


    In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values.


    5.1.1 Prerequisites

    In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.

    library(tidyverse)

    From this chapter on, we’ll suppress the loading message from library(tidyverse).


    5.2 Tidy data

    You can represent the same underlying data in multiple ways. The example below shows the same data organized in three different ways. Each dataset shows the same values of four variables: country, year, population, and number of documented cases of TB (tuberculosis), but each dataset organizes the values in a different way.

    table1
    #> # A tibble: 6 × 4
    #>   country      year  cases population
    #>   <chr>       <dbl>  <dbl>      <dbl>
    #> 1 Afghanistan  1999    745   19987071
    #> 2 Afghanistan  2000   2666   20595360
    #> 3 Brazil       1999  37737  172006362
    #> 4 Brazil       2000  80488  174504898
    #> 5 China        1999 212258 1272915272
    #> 6 China        2000 213766 1280428583

    table2
    #> # A tibble: 12 × 4
    #>   country      year type           count
    #>   <chr>       <dbl> <chr>          <dbl>
    #> 1 Afghanistan  1999 cases            745
    #> 2 Afghanistan  1999 population  19987071
    #> 3 Afghanistan  2000 cases           2666
    #> 4 Afghanistan  2000 population  20595360
    #> 5 Brazil       1999 cases          37737
    #> 6 Brazil       1999 population 172006362
    #> # ℹ 6 more rows

    table3
    #> # A tibble: 6 × 3
    #>   country      year rate             
    #>   <chr>       <dbl> <chr>            
    #> 1 Afghanistan  1999 745/19987071     
    #> 2 Afghanistan  2000 2666/20595360    
    #> 3 Brazil       1999 37737/172006362  
    #> 4 Brazil       2000 80488/174504898  
    #> 5 China        1999 212258/1272915272
    #> 6 China        2000 213766/1280428583

    These are all representations of the same underlying data, but they are not equally easy to use. One of them, table1, will be much easier to work with inside the tidyverse because it’s tidy.


    There are three interrelated rules that make a dataset tidy:

    1. Each variable is a column; each column is a variable.
    2. Each observation is a row; each row is an observation.
    3. Each value is a cell; each cell is a single value.

    Figura 5.1 shows the rules visually.

    Three panels, each representing a tidy data frame. The first panel shows that each variable is a column. The second panel shows that each observation is a row. The third panel shows that each value is a cell.

    Figura 5.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.

    Why ensure that your data is tidy? There are two main advantages:

    1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

    2. There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in Seção 3.3.1 and Seção 3.5.2, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

    dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.

    # Compute rate per 10,000
    table1 |>
      mutate(rate = cases / population * 10000)
    #> # A tibble: 6 × 5
    #>   country      year  cases population  rate
    #>   <chr>       <dbl>  <dbl>      <dbl> <dbl>
    #> 1 Afghanistan  1999    745   19987071 0.373
    #> 2 Afghanistan  2000   2666   20595360 1.29 
    #> 3 Brazil       1999  37737  172006362 2.19 
    #> 4 Brazil       2000  80488  174504898 4.61 
    #> 5 China        1999 212258 1272915272 1.67 
    #> 6 China        2000 213766 1280428583 1.67

    # Compute total cases per year
    table1 |> 
      group_by(year) |> 
      summarize(total_cases = sum(cases))
    #> # A tibble: 2 × 2
    #>    year total_cases
    #>   <dbl>       <dbl>
    #> 1  1999      250740
    #> 2  2000      296920

    # Visualize changes over time
    ggplot(table1, aes(x = year, y = cases)) +
      geom_line(aes(group = country), color = "grey50") +
      geom_point(aes(color = country, shape = country)) +
      scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000

    This figure shows the number of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale.


    5.2.1 Exercises

    1. For each of the sample tables, describe what each observation and each column represents.

    2. Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:

       1. Extract the number of TB cases per country per year.
       2. Extract the matching population per country per year.
       3. Divide cases by population, and multiply by 10000.
       4. Store back in the appropriate place.

       You haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.

    5.3 Lengthening data

    The principles of tidy data might seem so obvious that you wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most real data is untidy. There are two main reasons:

    1. Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.

    2. Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

    This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll pivot your data into a tidy form, with variables in the columns and observations in the rows.

    tidyr provides two functions for pivoting data: pivot_longer() and pivot_wider(). We’ll start with pivot_longer() because it’s the most common case. Let’s dive into some examples.

    5.3.1 Data in column names

    The billboard dataset records the billboard rank of songs in the year 2000:

    billboard
    #> # A tibble: 317 × 79
    #>   artist       track               date.entered   wk1   wk2   wk3   wk4   wk5
    #>   <chr>        <chr>               <date>       <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1 2 Pac        Baby Don't Cry (Ke… 2000-02-26      87    82    72    77    87
    #> 2 2Ge+her      The Hardest Part O… 2000-09-02      91    87    92    NA    NA
    #> 3 3 Doors Down Kryptonite          2000-04-08      81    70    68    67    66
    #> 4 3 Doors Down Loser               2000-10-21      76    76    72    69    67
    #> 5 504 Boyz     Wobble Wobble       2000-04-15      57    34    25    17    17
    #> 6 98^0         Give Me Just One N… 2000-08-19      51    39    34    26    26
    #> # ℹ 311 more rows
    #> # ℹ 71 more variables: wk6 <dbl>, wk7 <dbl>, wk8 <dbl>, wk9 <dbl>, …

    In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week¹. Here, the column names are one variable (the week) and the cell values are another (the rank).


    To tidy this data, we’ll use pivot_longer():

    billboard |> 
      pivot_longer(
        cols = starts_with("wk"), 
        names_to = "week", 
        values_to = "rank"
      )
    #> # A tibble: 24,092 × 5
    #>    artist track                   date.entered week   rank
    #>    <chr>  <chr>                   <date>       <chr> <dbl>
    #>  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
    #>  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
    #>  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
    #>  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
    #>  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
    #>  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
    #>  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
    #>  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
    #>  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
    #> 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
    #> # ℹ 24,082 more rows

    After the data, there are three key arguments:

    • cols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select(), so here we could use !c(artist, track, date.entered) or starts_with("wk").

    • names_to names the variable stored in the column names; we named that variable week.

    • values_to names the variable stored in the cell values; we named that variable rank.

    Note that in the code "week" and "rank" are quoted because those are new variables we’re creating; they don’t yet exist in the data when we run the pivot_longer() call.
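    To see why the quotes matter, here is a sketch (not from the original text; the exact error wording may vary by tidyr version). Without quotes, R looks for an existing object called week and fails:

    billboard |> 
      pivot_longer(
        cols = starts_with("wk"), 
        names_to = week,   # oops: unquoted, so R evaluates `week` as an object
        values_to = "rank"
      )
    #> Error: object 'week' not found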


    Now let’s turn our attention to the resulting, longer data frame. What happens if a song is in the top 100 for less than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These NAs don’t really represent unknown observations; they were forced to exist by the structure of the dataset², so we can ask pivot_longer() to get rid of them by setting values_drop_na = TRUE:

    billboard |> 
      pivot_longer(
        cols = starts_with("wk"), 
        names_to = "week", 
        values_to = "rank",
        values_drop_na = TRUE
      )
    #> # A tibble: 5,307 × 5
    #>   artist track                   date.entered week   rank
    #>   <chr>  <chr>                   <date>       <chr> <dbl>
    #> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
    #> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
    #> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
    #> 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
    #> 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
    #> 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
    #> # ℹ 5,301 more rows

    The number of rows is now much lower, indicating that many rows with NAs were dropped.


    You might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns wk77, wk78, … would be added to the dataset.


    This data is now tidy, but we could make future computation a bit easier by converting values of week from character strings to numbers using mutate() and readr::parse_number(). parse_number() is a handy function that will extract the first number from a string, ignoring all other text.
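    A quick illustration of parse_number() (a sketch, not from the original text; the input strings are invented):

    parse_number("wk10")
    #> [1] 10
    parse_number("$1,234.56")
    #> [1] 1234.56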

    billboard_longer <- billboard |> 
      pivot_longer(
        cols = starts_with("wk"), 
        names_to = "week", 
        values_to = "rank",
        values_drop_na = TRUE
      ) |> 
      mutate(
        week = parse_number(week)
      )
    billboard_longer
    #> # A tibble: 5,307 × 5
    #>   artist track                   date.entered  week  rank
    #>   <chr>  <chr>                   <date>       <dbl> <dbl>
    #> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26       1    87
    #> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26       2    82
    #> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26       3    72
    #> 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26       4    77
    #> 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26       5    87
    #> 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26       6    94
    #> # ℹ 5,301 more rows

    Now that we have all the week numbers in one variable and all the rank values in another, we’re in a good position to visualize how song ranks vary over time. The code is shown below and the result is in Figura 5.2. We can see that very few songs stay in the top 100 for more than 20 weeks.

    billboard_longer |> 
      ggplot(aes(x = week, y = rank, group = track)) + 
      geom_line(alpha = 0.25) + 
      scale_y_reverse()

    A line plot with week on the x-axis and rank on the y-axis, where each line represents a song. Most songs appear to start at a high rank, rapidly accelerate to a low rank, and then decay again. There are surprisingly few tracks in the region when week is >20 and rank is >50.

    Figura 5.2: A line plot showing how the rank of a song changes over time.

    5.3.2 How does pivoting work?

    Now that you’ve seen how we can use pivoting to reshape our data, let’s take a little time to gain some intuition about what pivoting does to the data. Let’s start with a very simple dataset to make it easier to see what’s happening. Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient. We’ll create the data with tribble(), a handy function for constructing small tibbles by hand:

    df <- tribble(
      ~id,  ~bp1, ~bp2,
       "A",  100,  120,
       "B",  140,  115,
       "C",  120,  125
    )

    We want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we need to pivot df longer:

    df |> 
      pivot_longer(
        cols = bp1:bp2,
        names_to = "measurement",
        values_to = "value"
      )
    #> # A tibble: 6 × 3
    #>   id    measurement value
    #>   <chr> <chr>       <dbl>
    #> 1 A     bp1           100
    #> 2 A     bp2           120
    #> 3 B     bp1           140
    #> 4 B     bp2           115
    #> 5 C     bp1           120
    #> 6 C     bp2           125

    How does the reshaping work? It’s easier to see if we think about it column by column. As shown in Figura 5.3, the values in a column that was already a variable in the original dataset (id) need to be repeated, once for each column that is pivoted.

    A diagram showing how `pivot_longer()` transforms a simple dataset, using color to highlight how the values in the `id` column ("A", "B", "C") are each repeated twice in the output because there are two columns being pivoted ("bp1" and "bp2").

    Figura 5.3: Columns that are already variables need to be repeated, once for each column that is pivoted.

    The column names become values in a new variable, whose name is defined by names_to, as shown in Figura 5.4. They need to be repeated once for each row in the original dataset.

    A diagram showing how `pivot_longer()` transforms a simple data set, using color to highlight how column names ("bp1" and "bp2") become the values in a new `measurement` column. They are repeated three times because there were three rows in the input.

    Figura 5.4: The column names of pivoted columns become values in a new column. The values need to be repeated once for each row of the original dataset.

    The cell values also become values in a new variable, with a name defined by values_to. They are unwound row by row. Figura 5.5 illustrates the process.

    A diagram showing how `pivot_longer()` transforms data, using color to highlight how the cell values (blood pressure measurements) become the values in a new `value` column. They are unwound row-by-row, so the original rows (100,120), then (140,115), then (120,125), become a column running from 100 to 125.

    Figura 5.5: The number of values is preserved (not repeated), but unwound row-by-row.

    5.3.3 Many variables in column names

    A more challenging situation occurs when you have multiple pieces of information crammed into the column names, and you would like to store these in separate new variables. For example, take the who2 dataset, the source of table1 and friends that you saw above:

    who2
    #> # A tibble: 7,240 × 58
    #>   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
    #>   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
    #> 1 Afghanistan  1980       NA        NA        NA        NA        NA
    #> 2 Afghanistan  1981       NA        NA        NA        NA        NA
    #> 3 Afghanistan  1982       NA        NA        NA        NA        NA
    #> 4 Afghanistan  1983       NA        NA        NA        NA        NA
    #> 5 Afghanistan  1984       NA        NA        NA        NA        NA
    #> 6 Afghanistan  1985       NA        NA        NA        NA        NA
    #> # ℹ 7,234 more rows
    #> # ℹ 51 more variables: sp_m_5564 <dbl>, sp_m_65 <dbl>, sp_f_014 <dbl>, …

    This dataset, collected by the World Health Organisation, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by _. The first piece, sp/rel/ep, describes the method used for the diagnosis; the second piece, m/f, is the gender (coded as a binary variable in this dataset); and the third piece, 014/1524/2534/3544/4554/5564/65, is the age range (014 represents 0-14, for example).


    So in this case we have six pieces of information recorded in who2: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values). To organize these six pieces of information in six separate columns, we use pivot_longer() with a vector of column names for names_to, instructions for splitting the original variable names into pieces for names_sep, and a column name for values_to:

    who2 |> 
      pivot_longer(
        cols = !(country:year),
        names_to = c("diagnosis", "gender", "age"), 
        names_sep = "_",
        values_to = "count"
      )
    #> # A tibble: 405,440 × 6
    #>   country      year diagnosis gender age   count
    #>   <chr>       <dbl> <chr>     <chr>  <chr> <dbl>
    #> 1 Afghanistan  1980 sp        m      014      NA
    #> 2 Afghanistan  1980 sp        m      1524     NA
    #> 3 Afghanistan  1980 sp        m      2534     NA
    #> 4 Afghanistan  1980 sp        m      3544     NA
    #> 5 Afghanistan  1980 sp        m      4554     NA
    #> 6 Afghanistan  1980 sp        m      5564     NA
    #> # ℹ 405,434 more rows

    An alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in Capítulo 15.


    Conceptually, this is only a minor variation on the simpler case you’ve already seen. Figura 5.6 shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that’s faster.

    A diagram that uses color to illustrate how supplying `names_sep` and multiple `names_to` creates multiple variables in the output. The input has variable names "x_1" and "y_2" which are split up by "_" to create name and number columns in the output. This is similar to the case with a single `names_to`, but what would have been a single output variable is now separated into multiple variables.

    Figura 5.6: Pivoting columns with multiple pieces of information in the names means that each column name now fills in values in multiple output columns.

    5.3.4 Data and variable names in the column headers

    The next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the household dataset:

    household
    #> # A tibble: 5 × 5
    #>   family dob_child1 dob_child2 name_child1 name_child2
    #>    <int> <date>     <date>     <chr>       <chr>      
    #> 1      1 1998-11-26 2000-01-29 Susan       Jose       
    #> 2      2 1996-06-22 NA         Mark        <NA>       
    #> 3      3 2002-07-11 2004-04-05 Sam         Seth       
    #> 4      4 2004-10-10 2009-08-27 Craig       Khai       
    #> 5      5 2000-12-05 2005-02-28 Parker      Gracie

    This dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (dob, name) and the values of another (child, with values 1 or 2). To solve this problem we again need to supply a vector to names_to but this time we use the special ".value" sentinel; this isn’t the name of a variable but a unique value that tells pivot_longer() to do something different. This overrides the usual values_to argument to use the first component of the pivoted column name as a variable name in the output.

    household |> 
      pivot_longer(
        cols = !family, 
        names_to = c(".value", "child"), 
        names_sep = "_", 
        values_drop_na = TRUE
      )
    #> # A tibble: 9 × 4
    #>   family child  dob        name 
    #>    <int> <chr>  <date>     <chr>
    #> 1      1 child1 1998-11-26 Susan
    #> 2      1 child2 2000-01-29 Jose 
    #> 3      2 child1 1996-06-22 Mark 
    #> 4      3 child1 2002-07-11 Sam  
    #> 5      3 child2 2004-04-05 Seth 
    #> 6      4 child1 2004-10-10 Craig
    #> # ℹ 3 more rows

    We again use values_drop_na = TRUE, since the shape of the input forces the creation of explicit missing values (e.g., for families with only one child).


    Figura 5.7 illustrates the basic idea with a simpler example. When you use ".value" in names_to, the column names in the input contribute to both values and variable names in the output.

    A diagram that uses color to illustrate how the special ".value" sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2", and we want to use the first component ("x", "y") as a variable name and the second ("1", "2") as the value for a new "num" column.

    Figura 5.7: Pivoting with names_to = c(".value", "num") splits the column names into two components: the first part determines the output column name (x or y), and the second part determines the value of the num column.
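    A runnable version of the toy example in Figura 5.7 (a sketch, not from the original text; an id column is added so the example is self-contained):

    df_xy <- tribble(
      ~id, ~x_1, ~x_2, ~y_1, ~y_2,
      "A",    1,    2,    3,    4,
      "B",    5,    6,    7,    8
    )
    df_xy |> 
      pivot_longer(
        cols = !id,
        names_to = c(".value", "num"),
        names_sep = "_"
      )
    #> # A tibble: 4 × 4
    #>   id    num       x     y
    #>   <chr> <chr> <dbl> <dbl>
    #> 1 A     1         1     3
    #> 2 A     2         2     4
    #> 3 B     1         5     7
    #> 4 B     2         6     8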

    5.4 Widening data

    So far we’ve used pivot_longer() to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to pivot_wider(), which makes datasets wider by increasing columns and reducing rows and helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.


    We’ll start by looking at cms_patient_experience, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:

    cms_patient_experience
    #> # A tibble: 500 × 5
    #>   org_pac_id org_nm                     measure_cd   measure_title   prf_rate
    #>   <chr>      <chr>                      <chr>        <chr>              <dbl>
    #> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS…       63
    #> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS…       87
    #> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS…       86
    #> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS…       57
    #> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS…       85
    #> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS…       24
    #> # ℹ 494 more rows

    The core unit being studied is an organization, but each organization is spread across six rows, with one row for each measurement taken in the survey. We can see the complete set of values for measure_cd and measure_title by using distinct():

    cms_patient_experience |> 
      distinct(measure_cd, measure_title)
    #> # A tibble: 6 × 2
    #>   measure_cd   measure_title                                                 
    #>   <chr>        <chr>                                                         
    #> 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…
    #> 2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate            
    #> 3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider              
    #> 4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education            
    #> 5 CAHPS_GRP_8  CAHPS for MIPS SSM: Courteous and Helpful Office Staff        
    #> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources

    Neither of these columns will make particularly great variable names: measure_cd doesn’t hint at the meaning of the variable and measure_title is a long sentence containing spaces. We’ll use measure_cd as the source for our new column names for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.


    pivot_wider() has the opposite interface to pivot_longer(): instead of choosing new column names, we need to provide the existing columns that define the values (values_from) and the column name (names_from):

    cms_patient_experience |> 
      pivot_wider(
        names_from = measure_cd,
        values_from = prf_rate
      )
    #> # A tibble: 500 × 9
    #>   org_pac_id org_nm                   measure_title   CAHPS_GRP_1 CAHPS_GRP_2
    #>   <chr>      <chr>                    <chr>                 <dbl>       <dbl>
    #> 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          63          NA
    #> 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          87
    #> 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
    #> 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
    #> 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
    #> 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
    #> # ℹ 494 more rows
    #> # ℹ 4 more variables: CAHPS_GRP_3 <dbl>, CAHPS_GRP_5 <dbl>, …

    The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because we also need to tell pivot_wider() which column or columns have values that uniquely identify each row; in this case those are the variables starting with "org":

    cms_patient_experience |> 
      pivot_wider(
        id_cols = starts_with("org"),
        names_from = measure_cd,
        values_from = prf_rate
      )
    #> # A tibble: 95 × 8
    #>   org_pac_id org_nm           CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5
    #>   <chr>      <chr>                  <dbl>       <dbl>       <dbl>       <dbl>
    #> 1 0446157747 USC CARE MEDICA…          63          87          86          57
    #> 2 0446162697 ASSOCIATION OF …          59          85          83          63
    #> 3 0547164295 BEAVER MEDICAL …          49          NA          75          44
    #> 4 0749333730 CAPE PHYSICIANS…          67          84          85          65
    #> 5 0840104360 ALLIANCE PHYSIC…          66          87          87          64
    #> 6 0840109864 REX HOSPITAL INC          73          87          84          67
    #> # ℹ 89 more rows
    #> # ℹ 2 more variables: CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl>

    This gives us the output that we’re looking for.
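    As noted above, the CAHPS_GRP_* codes aren’t very descriptive column names. A possible follow-up step (a sketch, not from the original text; the short names are invented) is to rename them after pivoting:

    cms_patient_experience |> 
      pivot_wider(
        id_cols = starts_with("org"),
        names_from = measure_cd,
        values_from = prf_rate
      ) |> 
      rename(
        timely_care   = CAHPS_GRP_1,  # hypothetical short name
        communication = CAHPS_GRP_2   # hypothetical short name
      )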


    5.4.1 How does pivot_wider() work?

    To understand how pivot_wider() works, let’s again start with a very simple dataset. This time we have two patients with ids A and B, and we have three blood pressure measurements on patient A and two on patient B:

    df <- tribble(
      ~id, ~measurement, ~value,
      "A",        "bp1",    100,
      "B",        "bp1",    140,
      "B",        "bp2",    115, 
      "A",        "bp2",    120,
      "A",        "bp3",    105
    )

    We’ll take the values from the value column and the names from the measurement column:

    df |> 
      pivot_wider(
        names_from = measurement,
        values_from = value
      )
    #> # A tibble: 2 × 4
    #>   id      bp1   bp2   bp3
    #>   <chr> <dbl> <dbl> <dbl>
    #> 1 A       100   120   105
    #> 2 B       140   115    NA

    To begin the process, pivot_wider() needs to first figure out what will go in the rows and columns. The new column names will be the unique values of measurement.

    df |> 
      distinct(measurement) |> 
      pull()
    #> [1] "bp1" "bp2" "bp3"

    By default, the rows in the output are determined by all the variables that aren’t going into the new names or values. These are called the id_cols. Here there is only one column, but in general there can be any number.

    df |> 
      select(-measurement, -value) |> 
      distinct()
    #> # A tibble: 2 × 1
    #>   id   
    #>   <chr>
    #> 1 A    
    #> 2 B

    pivot_wider() then combines these results to generate an empty data frame:

    df |> 
      select(-measurement, -value) |> 
      distinct() |> 
      mutate(bp1 = NA, bp2 = NA, bp3 = NA)
    #> # A tibble: 2 × 4
    #>   id    bp1   bp2   bp3  
    #>   <chr> <lgl> <lgl> <lgl>
    #> 1 A     NA    NA    NA   
    #> 2 B     NA    NA    NA

    It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no third blood pressure measurement for patient B, so that cell remains missing. We’ll come back to this idea that pivot_wider() can “make” missing values in Capítulo 18.


    You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and measurement “bp1”:

    df <- tribble(
      ~id, ~measurement, ~value,
      "A",        "bp1",    100,
      "A",        "bp1",    102,
      "A",        "bp2",    120,
      "B",        "bp1",    140, 
      "B",        "bp2",    115
    )

    If we attempt to pivot this we get an output that contains list-columns, which you’ll learn more about in Capítulo 23:

    df |>
      pivot_wider(
        names_from = measurement,
        values_from = value
      )
    #> Warning: Values from `value` are not uniquely identified; output will contain
    #> list-cols.
    #> • Use `values_fn = list` to suppress this warning.
    #> • Use `values_fn = {summary_fun}` to summarise duplicates.
    #> • Use the following dplyr code to identify duplicates.
    #>   {data} %>%
    #>   dplyr::group_by(id, measurement) %>%
    #>   dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
    #>   dplyr::filter(n > 1L)
    #> # A tibble: 2 × 3
    #>   id    bp1       bp2      
    #>   <chr> <list>    <list>   
    #> 1 A     <dbl [2]> <dbl [1]>
    #> 2 B     <dbl [1]> <dbl [1]>

    Since you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:

    df |> 
      group_by(id, measurement) |> 
      summarize(n = n(), .groups = "drop") |> 
      filter(n > 1)
    #> # A tibble: 1 × 3
    #>   id    measurement     n
    #>   <chr> <chr>       <int>
    #> 1 A     bp1             2

    It’s then up to you to figure out what’s gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row.
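    One way to do that, sketched below (this code is not from the original text), is to collapse the duplicates to a summary value, here their mean, before pivoting:

    df |> 
      group_by(id, measurement) |> 
      summarize(value = mean(value), .groups = "drop") |> 
      pivot_wider(
        names_from = measurement,
        values_from = value
      )
    #> # A tibble: 2 × 3
    #>   id      bp1   bp2
    #>   <chr> <dbl> <dbl>
    #> 1 A       101   120
    #> 2 B       140   115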


    5.5 Summary

    In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier because it’s a consistent structure understood by most functions. The main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about pivot_longer() and pivot_wider(), which allow you to tidy up many untidy datasets. The examples we presented here are a selection of those from vignette("pivot", package = "tidyr"), so if you encounter a problem that this chapter doesn’t help you with, that vignette is a good place to try next.


    Another challenge is that, for a given dataset, it can be impossible to label the longer or the wider version as the “tidy” one. This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn’t actually define what a variable is (and it’s surprisingly hard to do so). It’s totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest. So if you’re stuck figuring out how to do some computation, consider switching up the organisation of your data; don’t be afraid to untidy, transform, and re-tidy as needed!


    If you enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the Tidy Data paper published in the Journal of Statistical Software.


    Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.


    1. The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.↩︎

    2. We’ll come back to this idea in Capítulo 18.↩︎

diff --git a/data-tidy_files/figure-html/fig-billboard-ranks-1.png b/data-tidy_files/figure-html/fig-billboard-ranks-1.png new file mode 100644 index 000000000..f7c641d66 Binary files /dev/null and b/data-tidy_files/figure-html/fig-billboard-ranks-1.png differ
diff --git a/data-tidy_files/figure-html/unnamed-chunk-5-1.png b/data-tidy_files/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 000000000..2b589d193 Binary files /dev/null and b/data-tidy_files/figure-html/unnamed-chunk-5-1.png differ
diff --git a/data-transform.html b/data-transform.html new file mode 100644 index 000000000..2de77e139 --- /dev/null +++ b/data-transform.html @@ -0,0 +1,1635 @@
R para Ciência de Dados (2ª edição) - 3  Data transformation

    3  Data transformation


    3.1 Introduction

    Visualization is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need to make the graph you want. Often you’ll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed from New York City in 2013.


    The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action and we’ll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g., numbers, strings, dates).


    3.1.1 Prerequisites

    In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.

    library(nycflights13)
    library(tidyverse)
    #> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
    #> ✔ dplyr     1.1.3     ✔ readr     2.1.4
    #> ✔ forcats   1.0.0     ✔ stringr   1.5.1
    #> ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
    #> ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
    #> ✔ purrr     1.0.2     
    #> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
    #> ✖ dplyr::filter() masks stats::filter()
    #> ✖ dplyr::lag()    masks stats::lag()
    #> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

    Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag(). So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: packagename::functionname().
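    For example, a one-line sketch (not from the original text) of calling a masked base function explicitly:

    # Base R's filter() smooths a numeric sequence; dplyr's filter() subsets rows.
    stats::filter(1:10, rep(1/3, 3))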


    3.1.2 nycflights13

    To explore the basic dplyr verbs, we’re going to use nycflights13::flights. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.

    flights
    #> # A tibble: 336,776 × 19
    #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    #> 1  2013     1     1      517            515         2      830            819
    #> 2  2013     1     1      533            529         4      850            830
    #> 3  2013     1     1      542            540         2      923            850
    #> 4  2013     1     1      544            545        -1     1004           1022
    #> 5  2013     1     1      554            600        -6      812            837
    #> 6  2013     1     1      554            558        -4      740            728
    #> # ℹ 336,770 more rows
    #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

    flights is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably View(flights), which will open an interactive scrollable and filterable view. Otherwise you can use print(flights, width = Inf) to show all columns, or use glimpse():

    glimpse(flights)
    #> Rows: 336,776
    #> Columns: 19
    #> $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
    #> $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
    #> $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
    #> $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…
    #> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…
    #> $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…
    #> $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…
    #> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…
    #> $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…
    #> $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"…
    #> $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…
    #> $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N…
    #> $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG…
    #> $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA…
    #> $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…
    #> $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…
    #> $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…
    #> $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…
    #> $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…

    In both views, the variable names are followed by abbreviations that tell you the type of each variable: <int> is short for integer, <dbl> is short for double (aka real numbers), <chr> for character (aka strings), and <dttm> for date-time. These are important because the operations you can perform on a column depend so much on its “type”.


    3.1.3 dplyr basics

    You’re about to learn the primary dplyr verbs (functions) which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:

    1. The first argument is always a data frame.

    2. The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).

    3. The output is always a new data frame.

    Because each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we’ll do so with the pipe, |>. We’ll discuss the pipe more in Seção 3.4, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that x |> f(y) is equivalent to f(x, y), and x |> f(y) |> g(z) is equivalent to g(f(x, y), z). The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:

    flights |>
      filter(dest == "IAH") |> 
      group_by(year, month, day) |> 
      summarize(
        arr_delay = mean(arr_delay, na.rm = TRUE)
      )

    dplyr’s verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to the join verbs that work on tables in Capítulo 19. Let’s dive in!


    3.2 Rows

    The most important verbs that operate on rows of a dataset are filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss distinct() which finds rows with unique values but unlike arrange() and filter() it can also optionally modify the columns.


    3.2.1 filter()

    filter() allows you to keep rows based on the values of the columns¹. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that departed more than 120 minutes (two hours) late:

    flights |> 
      filter(dep_delay > 120)
    #> # A tibble: 9,723 × 19
    #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    #> 1  2013     1     1      848           1835       853     1001           1950
    #> 2  2013     1     1      957            733       144     1056            853
    #> 3  2013     1     1     1114            900       134     1447           1222
    #> 4  2013     1     1     1540           1338       122     2020           1825
    #> 5  2013     1     1     1815           1325       290     2120           1542
    #> 6  2013     1     1     1842           1422       260     1958           1535
    #> # ℹ 9,717 more rows
    #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

    As well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also combine conditions with & or , to indicate “and” (check for both conditions) or with | to indicate “or” (check for either condition):

    # Flights that departed on January 1
    flights |> 
      filter(month == 1 & day == 1)
    #> # A tibble: 842 × 19
    #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    #> 1  2013     1     1      517            515         2      830            819
    #> 2  2013     1     1      533            529         4      850            830
    #> 3  2013     1     1      542            540         2      923            850
    #> 4  2013     1     1      544            545        -1     1004           1022
    #> 5  2013     1     1      554            600        -6      812            837
    #> 6  2013     1     1      554            558        -4      740            728
    #> # ℹ 836 more rows
    #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

    # Flights that departed in January or February
    flights |> 
      filter(month == 1 | month == 2)
    #> # A tibble: 51,955 × 19
    #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    #> 1  2013     1     1      517            515         2      830            819
    #> 2  2013     1     1      533            529         4      850            830
    #> 3  2013     1     1      542            540         2      923            850
    #> 4  2013     1     1      544            545        -1     1004           1022
    #> 5  2013     1     1      554            600        -6      812            837
    #> 6  2013     1     1      554            558        -4      740            728
    #> # ℹ 51,949 more rows
    #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

    There’s a useful shortcut when you’re combining | and ==: %in%. It keeps rows where the variable equals one of the values on the right:

    +
    +
    # A shorter way to select flights that departed in January or February
    +flights |> 
    +  filter(month %in% c(1, 2))
    +#> # A tibble: 51,955 × 19
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 51,949 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    We’ll come back to these comparisons and logical operators in more detail in Capítulo 12.

    +

    When you run filter() dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-:

    +
    +
    jan1 <- flights |> 
    +  filter(month == 1 & day == 1)
    +
    +

    +3.2.2 Common mistakes

    +

    When you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. filter() will let you know when this happens:

    +
    +
    flights |> 
    +  filter(month = 1)
    +#> Error in `filter()`:
    +#> ! We detected a named input.
    +#> ℹ This usually means that you've used `=` instead of `==`.
    +#> ℹ Did you mean `month == 1`?
    +
    +

Another mistake is writing “or” statements like you would in English:

    +
    +
    flights |> 
    +  filter(month == 1 | 2)
    +
    +

    This “works”, in the sense that it doesn’t throw an error, but it doesn’t do what you want because | first checks the condition month == 1 and then checks the condition 2, which is not a sensible condition to check. We’ll learn more about what’s happening here and why in Seção 15.6.2.
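To make the fix concrete, here is a minimal sketch of the two correct spellings, both of which appeared earlier in this section:

# Both keep flights that departed in January or February
flights |> 
  filter(month == 1 | month == 2)

flights |> 
  filter(month %in% c(1, 2))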

    +

    +3.2.3 arrange() +

    +

    arrange() changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns. We get the earliest years first, then within a year the earliest months, etc.

    +
    +
    flights |> 
    +  arrange(year, month, day, dep_time)
    +#> # A tibble: 336,776 × 19
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 336,770 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    You can use desc() on a column inside of arrange() to re-order the data frame based on that column in descending (big-to-small) order. For example, this code orders flights from most to least delayed:

    +
    +
    flights |> 
    +  arrange(desc(dep_delay))
    +#> # A tibble: 336,776 × 19
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     9      641            900      1301     1242           1530
    +#> 2  2013     6    15     1432           1935      1137     1607           2120
    +#> 3  2013     1    10     1121           1635      1126     1239           1810
    +#> 4  2013     9    20     1139           1845      1014     1457           2210
    +#> 5  2013     7    22      845           1600      1005     1044           1815
    +#> 6  2013     4    10     1100           1900       960     1342           2211
    +#> # ℹ 336,770 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    Note that the number of rows has not changed – we’re only arranging the data, we’re not filtering it.

    +

    +3.2.4 distinct() +

    +

    distinct() finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:

    +
    +
    # Remove duplicate rows, if any
    +flights |> 
    +  distinct()
    +#> # A tibble: 336,776 × 19
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 336,770 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +# Find all unique origin and destination pairs
    +flights |> 
    +  distinct(origin, dest)
    +#> # A tibble: 224 × 2
    +#>   origin dest 
    +#>   <chr>  <chr>
    +#> 1 EWR    IAH  
    +#> 2 LGA    IAH  
    +#> 3 JFK    MIA  
    +#> 4 JFK    BQN  
    +#> 5 LGA    ATL  
    +#> 6 EWR    ORD  
    +#> # ℹ 218 more rows
    +
    +

Alternatively, if you want to keep the other columns when filtering for unique rows, you can use the .keep_all = TRUE option.

    +
    +
    flights |> 
    +  distinct(origin, dest, .keep_all = TRUE)
    +#> # A tibble: 224 × 19
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 218 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    It’s not a coincidence that all of these distinct flights are on January 1: distinct() will find the first occurrence of a unique row in the dataset and discard the rest.

    +

    If you want to find the number of occurrences instead, you’re better off swapping distinct() for count(), and with the sort = TRUE argument you can arrange them in descending order of number of occurrences. You’ll learn more about count in Seção 13.3.

    +
    +
    flights |>
    +  count(origin, dest, sort = TRUE)
    +#> # A tibble: 224 × 3
    +#>   origin dest      n
    +#>   <chr>  <chr> <int>
    +#> 1 JFK    LAX   11262
    +#> 2 LGA    ATL   10263
    +#> 3 LGA    ORD    8857
    +#> 4 JFK    SFO    8204
    +#> 5 LGA    CLT    6168
    +#> 6 EWR    ORD    6100
    +#> # ℹ 218 more rows
    +
    +

    +3.2.5 Exercises

1. In a single pipeline for each condition, find all flights that meet the condition:

   • Had an arrival delay of two or more hours
   • Flew to Houston (IAH or HOU)
   • Were operated by United, American, or Delta
   • Departed in summer (July, August, and September)
   • Arrived more than two hours late, but didn’t leave late
   • Were delayed by at least an hour, but made up over 30 minutes in flight

2. Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

3. Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

4. Was there a flight on every day of 2013?

5. Which flights traveled the farthest distance? Which traveled the least distance?

6. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

    +3.3 Columns

    +

    There are four important verbs that affect the columns without changing the rows: mutate() creates new columns that are derived from the existing columns, select() changes which columns are present, rename() changes the names of the columns, and relocate() changes the positions of the columns.

    +

    +3.3.1 mutate() +

    +

    The job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:

    +
    +
    flights |> 
    +  mutate(
    +    gain = dep_delay - arr_delay,
    +    speed = distance / air_time * 60
    +  )
    +#> # A tibble: 336,776 × 21
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 336,770 more rows
    +#> # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the .before argument to instead add the variables to the left hand side2:

    +
    +
    flights |> 
    +  mutate(
    +    gain = dep_delay - arr_delay,
    +    speed = distance / air_time * 60,
    +    .before = 1
    +  )
    +#> # A tibble: 336,776 × 21
    +#>    gain speed  year month   day dep_time sched_dep_time dep_delay arr_time
    +#>   <dbl> <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
    +#> 1    -9  370.  2013     1     1      517            515         2      830
    +#> 2   -16  374.  2013     1     1      533            529         4      850
    +#> 3   -31  408.  2013     1     1      542            540         2      923
    +#> 4    17  517.  2013     1     1      544            545        -1     1004
    +#> 5    19  394.  2013     1     1      554            600        -6      812
    +#> 6   -16  288.  2013     1     1      554            558        -4      740
    +#> # ℹ 336,770 more rows
    +#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …
    +
    +

    The . is a sign that .before is an argument to the function, not the name of a third new variable we are creating. You can also use .after to add after a variable, and in both .before and .after you can use the variable name instead of a position. For example, we could add the new variables after day:

    +
    +
    flights |> 
    +  mutate(
    +    gain = dep_delay - arr_delay,
    +    speed = distance / air_time * 60,
    +    .after = day
    +  )
    +
    +

    Alternatively, you can control which variables are kept with the .keep argument. A particularly useful argument is "used" which specifies that we only keep the columns that were involved or created in the mutate() step. For example, the following output will contain only the variables dep_delay, arr_delay, air_time, gain, hours, and gain_per_hour.

    +
    +
    flights |> 
    +  mutate(
    +    gain = dep_delay - arr_delay,
    +    hours = air_time / 60,
    +    gain_per_hour = gain / hours,
    +    .keep = "used"
    +  )
    +
    +

    Note that since we haven’t assigned the result of the above computation back to flights, the new variables gain, hours, and gain_per_hour will only be printed but will not be stored in a data frame. And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to flights, overwriting the original data frame with many more variables, or to a new object. Often, the right answer is a new object that is named informatively to indicate its contents, e.g., delay_gain, but you might also have good reasons for overwriting flights.
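For instance, a minimal sketch of that advice, using the delay_gain name suggested above:

# Save the result to a new, informatively named object
# rather than overwriting flights
delay_gain <- flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )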

    +

    +3.3.2 select() +

    +

    It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables:

• Select columns by name:

  flights |> 
    select(year, month, day)

• Select all columns between year and day (inclusive):

  flights |> 
    select(year:day)

• Select all columns except those from year to day (inclusive):

  flights |> 
    select(!year:day)

  Historically this operation was done with - instead of !, so you’re likely to see that in the wild (see the sketch after this list). These two operators serve the same purpose but with subtle differences in behavior. We recommend using ! because it reads as “not” and combines well with & and |.

• Select all columns that are characters:

  flights |> 
    select(where(is.character))
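Here is the sketch promised above; a minimal example of the older - spelling, which should select the same columns as !year:day:

# Equivalent older spelling you may still see in the wild
flights |> 
  select(-(year:day))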

    There are a number of helper functions you can use within select():

• starts_with("abc"): matches names that begin with “abc”.
• ends_with("xyz"): matches names that end with “xyz”.
• contains("ijk"): matches names that contain “ijk”.
• num_range("x", 1:3): matches x1, x2 and x3.

    See ?select for more details. Once you know regular expressions (the topic of Capítulo 15) you’ll also be able to use matches() to select variables that match a pattern.
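As a small illustration, a sketch using one of these helpers (the others work analogously):

# Select every column whose name starts with "dep":
# here, dep_time and dep_delay
flights |> 
  select(starts_with("dep"))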

    +

    You can rename variables as you select() them by using =. The new name appears on the left hand side of the =, and the old variable appears on the right hand side:

    +
    +
    flights |> 
    +  select(tail_num = tailnum)
    +#> # A tibble: 336,776 × 1
    +#>   tail_num
    +#>   <chr>   
    +#> 1 N14228  
    +#> 2 N24211  
    +#> 3 N619AA  
    +#> 4 N804JB  
    +#> 5 N668DN  
    +#> 6 N39463  
    +#> # ℹ 336,770 more rows
    +
    +

    +3.3.3 rename() +

    +

    If you want to keep all the existing variables and just want to rename a few, you can use rename() instead of select():

    +
    +
    flights |> 
    +  rename(tail_num = tailnum)
    +#> # A tibble: 336,776 × 19
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 336,770 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names() which provides some useful automated cleaning.
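A minimal sketch, assuming janitor is installed and the tidyverse is loaded; the messy tibble is invented for illustration, and the cleaned names reflect clean_names()’s snake_case default:

# A hypothetical tibble with inconsistently named columns
messy <- tibble(
  `Dep Time` = c(517, 533),
  `ARR-DELAY` = c(11, 20)
)

messy |> 
  janitor::clean_names()
# Column names become dep_time and arr_delay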

    +

    +3.3.4 relocate() +

    +

    Use relocate() to move variables around. You might want to collect related variables together or move important variables to the front. By default relocate() moves variables to the front:

    +
    +
    flights |> 
    +  relocate(time_hour, air_time)
    +#> # A tibble: 336,776 × 19
    +#>   time_hour           air_time  year month   day dep_time sched_dep_time
    +#>   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
    +#> 1 2013-01-01 05:00:00      227  2013     1     1      517            515
    +#> 2 2013-01-01 05:00:00      227  2013     1     1      533            529
    +#> 3 2013-01-01 05:00:00      160  2013     1     1      542            540
    +#> 4 2013-01-01 05:00:00      183  2013     1     1      544            545
    +#> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
    +#> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
    +#> # ℹ 336,770 more rows
    +#> # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, …
    +
    +

    You can also specify where to put them using the .before and .after arguments, just like in mutate():

    +
    +
    flights |> 
    +  relocate(year:dep_time, .after = time_hour)
    +flights |> 
    +  relocate(starts_with("arr"), .before = dep_time)
    +
    +

    +3.3.5 Exercises

1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

2. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

3. What happens if you specify the name of the same variable multiple times in a select() call?

4. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

   variables <- c("year", "month", "day", "dep_delay", "arr_delay")

5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

   flights |> select(contains("TIME"))

6. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

7. Why doesn’t the following work, and what does the error mean?

   flights |> 
     select(tailnum) |> 
     arrange(arr_delay)
   #> Error in `arrange()`:
   #> ℹ In argument: `..1 = arr_delay`.
   #> Caused by error:
   #> ! object 'arr_delay' not found

    +3.4 The pipe

    +

    We’ve shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs. For example, imagine that you wanted to find the fast flights to Houston’s IAH airport: you need to combine filter(), mutate(), select(), and arrange():

    +
    +
    flights |> 
    +  filter(dest == "IAH") |> 
    +  mutate(speed = distance / air_time * 60) |> 
    +  select(year:day, dep_time, carrier, flight, speed) |> 
    +  arrange(desc(speed))
    +#> # A tibble: 7,198 × 7
    +#>    year month   day dep_time carrier flight speed
    +#>   <int> <int> <int>    <int> <chr>    <int> <dbl>
    +#> 1  2013     7     9      707 UA         226  522.
    +#> 2  2013     8    27     1850 UA        1128  521.
    +#> 3  2013     8    28      902 UA        1711  519.
    +#> 4  2013     8    28     2122 UA        1022  519.
    +#> 5  2013     6    11     1628 UA        1178  515.
    +#> 6  2013     8    27     1017 UA         333  515.
    +#> # ℹ 7,192 more rows
    +
    +

    Even though this pipeline has four steps, it’s easy to skim because the verbs come at the start of each line: start with the flights data, then filter, then mutate, then select, then arrange.

    +

    What would happen if we didn’t have the pipe? We could nest each function call inside the previous call:

    +
    +
    arrange(
    +  select(
    +    mutate(
    +      filter(
    +        flights, 
    +        dest == "IAH"
    +      ),
    +      speed = distance / air_time * 60
    +    ),
    +    year:day, dep_time, carrier, flight, speed
    +  ),
    +  desc(speed)
    +)
    +
    +

    Or we could use a bunch of intermediate objects:

    +
    +
    flights1 <- filter(flights, dest == "IAH")
    +flights2 <- mutate(flights1, speed = distance / air_time * 60)
    +flights3 <- select(flights2, year:day, dep_time, carrier, flight, speed)
    +arrange(flights3, desc(speed))
    +
    +

    While both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.

    +

    To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |> instead of %>% as shown in Figura 3.1; more on %>% shortly.

Screenshot showing the “Use native pipe operator” option which can be found on the “Editing” panel of the “Code” options.

Figura 3.1: To insert |>, make sure the “Use native pipe operator” option is checked.

magrittr

    If you’ve been using the tidyverse for a while, you might be familiar with the %>% pipe provided by the magrittr package. The magrittr package is included in the core tidyverse, so you can use %>% whenever you load the tidyverse:

    +
    +
    library(tidyverse)
    +
    +mtcars %>% 
    +  group_by(cyl) %>%
    +  summarize(n = n())
    +
    +

    For simple cases, |> and %>% behave identically. So why do we recommend the base pipe? Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse. Secondly, |> is quite a bit simpler than %>%: in the time between the invention of %>% in 2014 and the inclusion of |> in R 4.1.0 in 2021, we gained a better understanding of the pipe. This allowed the base implementation to jettison infrequently used and less important features.

    +
    +
    +

    +3.5 Groups

    +

    So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: group_by(), summarize(), and the slice family of functions.

    +

    +3.5.1 group_by() +

    +

    Use group_by() to divide your dataset into groups meaningful for your analysis:

    +
    +
    flights |> 
    +  group_by(month)
    +#> # A tibble: 336,776 × 19
    +#> # Groups:   month [12]
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 336,770 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    group_by() doesn’t change the data but, if you look closely at the output, you’ll notice that the output indicates that it is “grouped by” month (Groups: month [12]). This means subsequent operations will now work “by month”. group_by() adds this grouped feature (referred to as class) to the data frame, which changes the behavior of the subsequent verbs applied to the data.
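You can see this class for yourself; a quick sketch:

flights |> 
  group_by(month) |> 
  class()
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"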

    +

    +3.5.2 summarize() +

    +

    The most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group. In dplyr, this operation is performed by summarize()3, as shown by the following example, which computes the average departure delay by month:

    +
    +
    flights |> 
    +  group_by(month) |> 
    +  summarize(
    +    avg_delay = mean(dep_delay)
    +  )
    +#> # A tibble: 12 × 2
    +#>   month avg_delay
    +#>   <int>     <dbl>
    +#> 1     1        NA
    +#> 2     2        NA
    +#> 3     3        NA
    +#> 4     4        NA
    +#> 5     5        NA
    +#> 6     6        NA
    +#> # ℹ 6 more rows
    +
    +

Uhoh! Something has gone wrong and all of our results are NAs (pronounced “N-A”), R’s symbol for a missing value. This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an NA result. We’ll come back to discuss missing values in detail in Capítulo 18, but for now we’ll tell the mean() function to ignore all missing values by setting the argument na.rm to TRUE:

    +
    +
    flights |> 
    +  group_by(month) |> 
    +  summarize(
    +    delay = mean(dep_delay, na.rm = TRUE)
    +  )
    +#> # A tibble: 12 × 2
    +#>   month delay
    +#>   <int> <dbl>
    +#> 1     1  10.0
    +#> 2     2  10.8
    +#> 3     3  13.2
    +#> 4     4  13.9
    +#> 5     5  13.0
    +#> 6     6  20.8
    +#> # ℹ 6 more rows
    +
    +

    You can create any number of summaries in a single call to summarize(). You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is n(), which returns the number of rows in each group:

    +
    +
    flights |> 
    +  group_by(month) |> 
    +  summarize(
    +    delay = mean(dep_delay, na.rm = TRUE), 
    +    n = n()
    +  )
    +#> # A tibble: 12 × 3
    +#>   month delay     n
    +#>   <int> <dbl> <int>
    +#> 1     1  10.0 27004
    +#> 2     2  10.8 24951
    +#> 3     3  13.2 28834
    +#> 4     4  13.9 28330
    +#> 5     5  13.0 28796
    +#> 6     6  20.8 28243
    +#> # ℹ 6 more rows
    +
    +

    Means and counts can get you a surprisingly long way in data science!

    +

    +3.5.3 The slice_ functions

    +

There are five handy functions that allow you to extract specific rows within each group:

• df |> slice_head(n = 1) takes the first row from each group.
• df |> slice_tail(n = 1) takes the last row in each group.
• df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
• df |> slice_max(x, n = 1) takes the row with the largest value of column x.
• df |> slice_sample(n = 1) takes one random row.

    You can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group. For example, the following code finds the flights that are most delayed upon arrival at each destination:

    +
    +
    flights |> 
    +  group_by(dest) |> 
    +  slice_max(arr_delay, n = 1) |>
    +  relocate(dest)
    +#> # A tibble: 108 × 19
    +#> # Groups:   dest [105]
    +#>   dest   year month   day dep_time sched_dep_time dep_delay arr_time
    +#>   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
    +#> 1 ABQ    2013     7    22     2145           2007        98      132
    +#> 2 ACK    2013     7    23     1139            800       219     1250
    +#> 3 ALB    2013     1    25      123           2000       323      229
    +#> 4 ANC    2013     8    17     1740           1625        75     2042
    +#> 5 ATL    2013     7    22     2257            759       898      121
    +#> 6 AUS    2013     7    10     2056           1505       351     2347
    +#> # ℹ 102 more rows
    +#> # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, …
    +
    +

    Note that there are 105 destinations but we get 108 rows here. What’s up? slice_min() and slice_max() keep tied values so n = 1 means give us all rows with the highest value. If you want exactly one row per group you can set with_ties = FALSE.
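For example, a minimal sketch that returns exactly one (arbitrarily chosen among ties) most-delayed arrival per destination:

flights |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1, with_ties = FALSE) |> 
  relocate(dest)
# Now exactly 105 rows: one per destination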

    +

    This is similar to computing the max delay with summarize(), but you get the whole corresponding row (or rows if there’s a tie) instead of the single summary statistic.

    +

    +3.5.4 Grouping by multiple variables

    +

    You can create groups using more than one variable. For example, we could make a group for each date.

    +
    +
    daily <- flights |>  
    +  group_by(year, month, day)
    +daily
    +#> # A tibble: 336,776 × 19
    +#> # Groups:   year, month, day [365]
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 336,770 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:

    +
    +
    daily_flights <- daily |> 
    +  summarize(n = n())
    +#> `summarise()` has grouped output by 'year', 'month'. You can override using
    +#> the `.groups` argument.
    +
    +

    If you’re happy with this behavior, you can explicitly request it in order to suppress the message:

    +
    +
    daily_flights <- daily |> 
    +  summarize(
    +    n = n(), 
    +    .groups = "drop_last"
    +  )
    +
    +

    Alternatively, change the default behavior by setting a different value, e.g., "drop" to drop all grouping or "keep" to preserve the same groups.
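A short sketch of those two alternatives:

# Drop all grouping from the result
daily |> 
  summarize(n = n(), .groups = "drop")

# Keep the full year/month/day grouping
daily |> 
  summarize(n = n(), .groups = "keep")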

    +

    +3.5.5 Ungrouping

    +

    You might also want to remove grouping from a data frame without using summarize(). You can do this with ungroup().

    +
    +
    daily |> 
    +  ungroup()
    +#> # A tibble: 336,776 × 19
    +#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    +#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    +#> 1  2013     1     1      517            515         2      830            819
    +#> 2  2013     1     1      533            529         4      850            830
    +#> 3  2013     1     1      542            540         2      923            850
    +#> 4  2013     1     1      544            545        -1     1004           1022
    +#> 5  2013     1     1      554            600        -6      812            837
    +#> 6  2013     1     1      554            558        -4      740            728
    +#> # ℹ 336,770 more rows
    +#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
    +
    +

    Now let’s see what happens when you summarize an ungrouped data frame.

    +
    +
    daily |> 
    +  ungroup() |>
    +  summarize(
    +    avg_delay = mean(dep_delay, na.rm = TRUE), 
    +    flights = n()
    +  )
    +#> # A tibble: 1 × 2
    +#>   avg_delay flights
    +#>       <dbl>   <int>
    +#> 1      12.6  336776
    +
    +

    You get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.

    +

    +3.5.6 .by +

    +

dplyr 1.1.0 includes a new, experimental syntax for per-operation grouping, the .by argument. group_by() and ungroup() aren’t going away, but you can now also use the .by argument to group within a single operation:

    +
    +
    flights |> 
    +  summarize(
    +    delay = mean(dep_delay, na.rm = TRUE), 
    +    n = n(),
    +    .by = month
    +  )
    +
    +

    Or if you want to group by multiple variables:

    +
    +
    flights |> 
    +  summarize(
    +    delay = mean(dep_delay, na.rm = TRUE), 
    +    n = n(),
    +    .by = c(origin, dest)
    +  )
    +
    +

    .by works with all verbs and has the advantage that you don’t need to use the .groups argument to suppress the grouping message or ungroup() when you’re done.
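For example, a minimal sketch of .by with mutate(); month_avg is an illustrative name:

# Add each month's average departure delay to every row;
# the result comes back ungrouped
flights |> 
  mutate(month_avg = mean(dep_delay, na.rm = TRUE), .by = month)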

    +

    We didn’t focus on this syntax in this chapter because it was very new when we wrote the book. We did want to mention it because we think it has a lot of promise and it’s likely to be quite popular. You can learn more about it in the dplyr 1.1.0 blog post.

    +

    +3.5.7 Exercises

1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

2. Find the flights that are most delayed upon departure from each destination.

3. How do delays vary over the course of the day? Illustrate your answer with a plot.

4. What happens if you supply a negative n to slice_min() and friends?

5. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

6. Suppose we have the following tiny data frame:

   df <- tibble(
     x = 1:5,
     y = c("a", "b", "a", "a", "b"),
     z = c("K", "K", "L", "L", "K")
   )

   a. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

      df |>
        group_by(y)

   b. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a).

      df |>
        arrange(y)

   c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

      df |>
        group_by(y) |>
        summarize(mean_x = mean(x))

   d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

      df |>
        group_by(y, z) |>
        summarize(mean_x = mean(x))

   e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?

      df |>
        group_by(y, z) |>
        summarize(mean_x = mean(x), .groups = "drop")

   f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

      df |>
        group_by(y, z) |>
        summarize(mean_x = mean(x))

      df |>
        group_by(y, z) |>
        mutate(mean_x = mean(x))

    +3.6 Case study: aggregates and sample size

    +

    Whenever you do any aggregation, it’s always a good idea to include a count (n()). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. We’ll demonstrate this with some baseball data from the Lahman package. Specifically, we will compare what proportion of times a player gets a hit (H) vs. the number of times they try to put the ball in play (AB):

    +
    +
    batters <- Lahman::Batting |> 
    +  group_by(playerID) |> 
    +  summarize(
    +    performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    +    n = sum(AB, na.rm = TRUE)
    +  )
    +batters
    +#> # A tibble: 20,469 × 3
    +#>   playerID  performance     n
    +#>   <chr>           <dbl> <int>
    +#> 1 aardsda01      0          4
    +#> 2 aaronha01      0.305  12364
    +#> 3 aaronto01      0.229    944
    +#> 4 aasedo01       0          5
    +#> 5 abadan01       0.0952    21
    +#> 6 abadfe01       0.111      9
    +#> # ℹ 20,463 more rows
    +
    +

    When we plot the skill of the batter (measured by the batting average, performance) against the number of opportunities to hit the ball (measured by times at bat, n), you see two patterns:

1. The variation in performance is larger among players with fewer at-bats. The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you’ll see that the variation decreases as the sample size increases4.

2. There’s a positive correlation between skill (performance) and opportunities to hit the ball (n) because teams want to give their best batters the most opportunities to hit the ball.
    +
    +
    batters |> 
    +  filter(n > 100) |> 
    +  ggplot(aes(x = n, y = performance)) +
    +  geom_point(alpha = 1 / 10) + 
    +  geom_smooth(se = FALSE)
    +
    +

A scatterplot of batting performance vs. batting opportunities overlaid with a smoothed line. Average performance increases sharply from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope reaching ~0.3 when n is ~15,000.

    +
    +
    +

    Note the handy pattern for combining ggplot2 and dplyr. You just have to remember to switch from |>, for dataset processing, to + for adding layers to your plot.

    +

This also has important implications for ranking. If you naively sort on desc(performance), the people with the best batting averages are clearly the ones who tried to put the ball in play very few times and happened to get a hit; they’re not necessarily the most skilled players:

    +
    +
    batters |> 
    +  arrange(desc(performance))
    +#> # A tibble: 20,469 × 3
    +#>   playerID  performance     n
    +#>   <chr>           <dbl> <int>
    +#> 1 abramge01           1     1
    +#> 2 alberan01           1     1
    +#> 3 banisje01           1     1
    +#> 4 bartocl01           1     1
    +#> 5 bassdo01            1     1
    +#> 6 birasst01           1     2
    +#> # ℹ 20,463 more rows
    +
    +

    You can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.

    +

    +3.7 Summary

    +

In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like filter() and arrange()), those that manipulate the columns (like select() and mutate()), and those that manipulate groups (like group_by() and summarize()). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.

    +

    In the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code well organized in order to make it easy for you and others to read and understand your code.

1. Later, you’ll learn about the slice_*() family which allows you to choose rows based on their positions.↩︎

2. Remember that in RStudio, the easiest way to see a dataset with many columns is View().↩︎

3. Or summarise(), if you prefer British English.↩︎

4. *cough* the law of large numbers *cough*.↩︎
\ No newline at end of file
diff --git a/data-transform_files/figure-html/unnamed-chunk-58-1.png b/data-transform_files/figure-html/unnamed-chunk-58-1.png new file mode 100644 index 000000000..4c1320e7b Binary files /dev/null and b/data-transform_files/figure-html/unnamed-chunk-58-1.png differ
diff --git a/data-visualize.html b/data-visualize.html index 6e9157833..588c4e168 100644 --- a/data-visualize.html +++ b/data-visualize.html @@ -167,29 +167,226 @@
    @@ -375,7 +572,7 @@

geom_bar()), line charts use line geoms (geom_line()), boxplots use boxplot geoms (geom_boxplot()), scatterplots use point geoms (geom_point()), and so on.

-The geom_point() function adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions, each of which adds a different type of layer to a plot. You’ll learn about several geoms throughout the book, especially in ?sec-layers.
+The geom_point() function adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions, each of which adds a different type of layer to a plot. You’ll learn about several geoms throughout the book, especially in Capítulo 9.

    ggplot(
       data = pinguins,
    @@ -392,7 +589,7 @@ 

    Removed 2 rows containing missing values (geom_point()).

-We’re seeing this message because there are two penguins in our dataset with missing values (NA) for body mass and/or flipper length, and ggplot2 has no way to represent them on the plot without those two values. Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. This kind of warning is probably one of the most common types of warnings you’ll see when working with real data; missing values are a very common issue and you’ll learn more about them throughout the book, especially in ?sec-missing-values. For the remaining plots in this chapter we will suppress this warning so that it’s not shown alongside every plot we make.
+We’re seeing this message because there are two penguins in our dataset with missing values (NA) for body mass and/or flipper length, and ggplot2 has no way to represent them on the plot without those two values. Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. This kind of warning is probably one of the most common types of warnings you’ll see when working with real data; missing values are a very common issue and you’ll learn more about them throughout the book, especially in Capítulo 18. For the remaining plots in this chapter we will suppress this warning so that it’s not shown alongside every plot we make.

1.2.4 Adding aesthetics and layers

Scatterplots are useful for displaying the relationship between two numerical variables, but it’s always a good idea to be skeptical of any apparent relationship between two variables and to ask whether there may be other variables that explain or change the nature of this apparent relationship. For example, does the relationship between flipper length and body mass differ by species? Let’s include species in our plot and see if it reveals any additional insights into the apparent relationship between these variables. We will do this by representing species with points of different colors.

    @@ -532,7 +729,7 @@

    ) + geom_point()

-Typically, the first one or two arguments of a function are so important that you’ll soon know them by heart. The first two arguments of ggplot() are data and mapping; in the remainder of the book, we won’t write those names. That saves typing and, by reducing the amount of extra text, makes it easier to see the differences between plots. That’s a really important programming concern that we’ll come back to in ?sec-functions.
+Typically, the first one or two arguments of a function are so important that you’ll soon know them by heart. The first two arguments of ggplot() are data and mapping; in the remainder of the book, we won’t write those names. That saves typing and, by reducing the amount of extra text, makes it easier to see the differences between plots. That’s a really important programming concern that we’ll come back to in Capítulo 25.

Rewriting the previous plot more concisely, we have:

    ggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +
    @@ -565,7 +762,7 @@ 

-You’ll learn more about factors and functions for dealing with factors (like fct_infreq() shown above) in ?sec-factors.
+You’ll learn more about factors and functions for dealing with factors (like fct_infreq() shown above) in Capítulo 16.

1.4.2 A numerical variable

A variable is numerical (or quantitative) if it can take on a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. Numerical variables can be continuous or discrete.

    @@ -718,7 +915,7 @@

-You will learn about many other geoms for visualizing distributions of variables and relationships between them in ?sec-layers.
+You will learn about many other geoms for visualizing distributions of variables and relationships between them in Capítulo 9.

1.5.5 Exercises

      @@ -757,9 +954,9 @@

geom_point()
ggsave(filename = "penguin-plot.png")

-This will save your plot to your working directory, a concept you’ll learn more about in ?sec-workflow-scripts-projects.
+This will save your plot to your working directory, a concept you’ll learn more about in Capítulo 6.

If you don’t specify the width and height, they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about the ggsave() function in the documentation.

-Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your reports. You’ll learn more about Quarto in ?sec-quarto.
+Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your reports. You’ll learn more about Quarto in Capítulo 28.

1.6.1 Exercises

        @@ -788,7 +985,7 @@

1.8 Summary

In this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic attributes like position, color, size, and shape. You then learned about increasing the complexity and improving the presentation of your plots layer by layer. You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by using additional aesthetic mappings and/or splitting your plot into small multiples using facets.

-We’ll use visualizations again and again throughout this book, introducing new techniques as we need them, and we’ll dive deeper into creating visualizations with ggplot2 from ?sec-layers through ?sec-communication.
+We’ll use visualizations again and again throughout this book, introducing new techniques as we need them, and we’ll dive deeper into creating visualizations with ggplot2 from Capítulo 9 through Capítulo 11.

With the basics of visualization under your belt, in the next chapter we’re going to switch gears a bit and give you some practical workflow advice. We interleave workflow advice with data science tools throughout this part of the book because it will help you stay organized as you write increasing amounts of R code.

diff --git a/databases.html b/databases.html new file mode 100644 index 000000000..28df27ac5 --- /dev/null +++ b/databases.html @@ -0,0 +1,1355 @@
+R para Ciência de Dados (2ª edição) - 21  Databases
21  Databases

        +21.1 Introduction

        +

        A huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.

        +

In this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL1 query. SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.

        +

        +21.1.1 Prerequisites

        +

        In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.

        + +

        +21.2 Database basics

        +

        At the simplest level, you can think about a database as a collection of data frames, called tables in database terminology. Like a data frame, a database table is a collection of named columns, where every value in the column is the same type. There are three high level differences between data frames and database tables:

• Database tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).

• Database tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.

• Most classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of column-oriented databases that make analyzing the existing data much faster.

        Databases are run by database management systems (DBMS’s for short), which come in three basic forms:

• Client-server DBMS’s run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS’s include PostgreSQL, MariaDB, SQL Server, and Oracle.

• Cloud DBMS’s, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client-server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.

• In-process DBMS’s, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.

        +21.3 Connecting to a database

        +

        To connect to the database from R, you’ll use a pair of packages:

• You’ll always use DBI (database interface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.

• You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for PostgreSQL and RMariaDB for MySQL.

If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMSs. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.
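As a rough sketch (the DSN is hypothetical and depends entirely on your local ODBC configuration):

con <- DBI::dbConnect(
  odbc::odbc(),
  dsn = "my-database"  # hypothetical data source name
)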

        +

        Concretely, you create a database connection using DBI::dbConnect(). The first argument selects the DBMS2, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:

        +
        +
        con <- DBI::dbConnect(
        +  RMariaDB::MariaDB(), 
        +  username = "foo"
        +)
        +con <- DBI::dbConnect(
        +  RPostgres::Postgres(), 
        +  hostname = "databases.mycompany.com", 
        +  port = 1234
        +)
        +
        +
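If you’re connecting through odbc instead, the setup might look something like the sketch below. The driver name, server, and credentials are placeholders you’d replace with your own details (the exact connection-string keys vary between drivers):

con <- DBI::dbConnect(
  odbc::odbc(),
  Driver   = "PostgreSQL Driver",  # name of the ODBC driver you installed
  Server   = "databases.mycompany.com",
  Database = "sales",
  UID      = "foo",
  PWD      = rstudioapi::askForPassword("Database password"),
  Port     = 5432
)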

        The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (database administrator). The initial setup will often take a little fiddling (and maybe some googling) to get it right, but you’ll generally only need to do it once.

        +

        +21.3.1 In this book

        +

        Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.

        +

        Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:

        +
        +
        con <- DBI::dbConnect(duckdb::duckdb())
        +
        +

        duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the dbdir argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (Capítulo 6), it’s reasonable to store it in the duckdb directory of the current project:

        +
        +
        con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
        +
        +
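When you’re completely done with a connection (we’ll keep using this one for the rest of the chapter), it’s good practice to close it with DBI::dbDisconnect(). For duckdb you can also request a clean shutdown of the database:

DBI::dbDisconnect(con, shutdown = TRUE)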

        +21.3.2 Load some data

        +

Since this is a new database, we need to start by adding some data. Here we’ll add the mpg and diamonds datasets from ggplot2 using DBI::dbWriteTable(). The simplest usage of dbWriteTable() needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.

        +
        +
        dbWriteTable(con, "mpg", ggplot2::mpg)
        +dbWriteTable(con, "diamonds", ggplot2::diamonds)
        +
        +

        If you’re using duckdb in a real project, we highly recommend learning about duckdb_read_csv() and duckdb_register_arrow(). These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R. We’ll also show off a useful technique for loading multiple files into a database in Seção 26.4.1.

        +
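For example, a minimal sketch (the file path and table name here are hypothetical):

# Load a CSV file straight into duckdb, without reading it into R first
duckdb::duckdb_read_csv(con, "diamonds_csv", "data/diamonds.csv")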

        +21.3.3 DBI basics

        +

        You can check that the data is loaded correctly by using a couple of other DBI functions: dbListTables() lists all tables in the database3 and dbReadTable() retrieves the contents of a table.

        +
        +
        dbListTables(con)
        +#> [1] "diamonds" "mpg"
        +
        +con |> 
        +  dbReadTable("diamonds") |> 
        +  as_tibble()
        +#> # A tibble: 53,940 × 10
        +#>   carat cut       color clarity depth table price     x     y     z
        +#>   <dbl> <fct>     <fct> <fct>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
        +#> 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
        +#> 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
        +#> 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
        +#> 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
        +#> 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
        +#> 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
        +#> # ℹ 53,934 more rows
        +
        +

        dbReadTable() returns a data.frame so we use as_tibble() to convert it into a tibble so that it prints nicely.

        +

        If you already know SQL, you can use dbGetQuery() to get the results of running a query on the database:

        +
        +
        sql <- "
        +  SELECT carat, cut, clarity, color, price 
        +  FROM diamonds 
        +  WHERE price > 15000
        +"
        +as_tibble(dbGetQuery(con, sql))
        +#> # A tibble: 1,655 × 5
        +#>   carat cut       clarity color price
        +#>   <dbl> <fct>     <fct>   <fct> <int>
        +#> 1  1.54 Premium   VS2     E     15002
        +#> 2  1.19 Ideal     VVS1    F     15005
        +#> 3  2.1  Premium   SI1     I     15007
        +#> 4  1.69 Ideal     SI1     D     15011
        +#> 5  1.5  Very Good VVS2    G     15013
        +#> 6  1.73 Very Good VS1     G     15014
        +#> # ℹ 1,649 more rows
        +
        +

        If you’ve never seen SQL before, don’t worry! You’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where price is greater than 15,000.

        +
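If the query needs a value from R, DBI also supports parameterized queries, which are safer than pasting the value into the SQL string yourself. Here’s a sketch (duckdb uses ? as the placeholder; the syntax varies between DBMS’s):

sql <- "SELECT carat, cut, clarity, color, price FROM diamonds WHERE price > ?"
as_tibble(dbGetQuery(con, sql, params = list(15000)))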

        +21.4 dbplyr basics

        +

Now that we’ve connected to a database and loaded up some data, we can start to learn about dbplyr. dbplyr is a dplyr backend, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates your code to SQL; other backends include dtplyr, which translates to data.table, and multidplyr, which executes your code on multiple cores.

        +

        To use dbplyr, you must first use tbl() to create an object that represents a database table:

        +
        +
        diamonds_db <- tbl(con, "diamonds")
        +diamonds_db
        +#> # Source:   table<diamonds> [?? x 10]
        +#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]
        +#>   carat cut       color clarity depth table price     x     y     z
        +#>   <dbl> <fct>     <fct> <fct>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
        +#> 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
        +#> 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
        +#> 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
        +#> 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
        +#> 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
        +#> 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
        +#> # ℹ more rows
        +
        +
        +
        +
        +
        +

        There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organized. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:

        +
        +
        diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
        +diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
        +
        +

        Other times you might want to use your own SQL query as a starting point:

        +
        +
        diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))
        +
        +
        +
        +
        +

        This object is lazy; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:

        +
        +
        big_diamonds_db <- diamonds_db |> 
        +  filter(price > 15000) |> 
        +  select(carat:clarity, price)
        +
        +big_diamonds_db
        +#> # Source:   SQL [?? x 5]
        +#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]
        +#>   carat cut       color clarity price
        +#>   <dbl> <fct>     <fct> <fct>   <int>
        +#> 1  1.54 Premium   E     VS2     15002
        +#> 2  1.19 Ideal     F     VVS1    15005
        +#> 3  2.1  Premium   I     SI1     15007
        +#> 4  1.69 Ideal     D     SI1     15011
        +#> 5  1.5  Very Good G     VVS2    15013
        +#> 6  1.73 Very Good G     VS1     15014
        +#> # ℹ more rows
        +
        +

        You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.

        +

        You can see the SQL code generated by the dplyr function show_query(). If you know dplyr, this is a great way to learn SQL! Write some dplyr code, get dbplyr to translate it to SQL, and then try to figure out how the two languages match up.

        +
        +
        big_diamonds_db |>
        +  show_query()
        +#> <SQL>
        +#> SELECT carat, cut, color, clarity, price
        +#> FROM diamonds
        +#> WHERE (price > 15000.0)
        +
        +

        To get all the data back into R, you call collect(). Behind the scenes, this generates the SQL, calls dbGetQuery() to get the data, then turns the result into a tibble:

        +
        +
        big_diamonds <- big_diamonds_db |> 
        +  collect()
        +big_diamonds
        +#> # A tibble: 1,655 × 5
        +#>   carat cut       color clarity price
        +#>   <dbl> <fct>     <fct> <fct>   <int>
        +#> 1  1.54 Premium   E     VS2     15002
        +#> 2  1.19 Ideal     F     VVS1    15005
        +#> 3  2.1  Premium   I     SI1     15007
        +#> 4  1.69 Ideal     D     SI1     15011
        +#> 5  1.5  Very Good G     VVS2    15013
        +#> 6  1.73 Very Good G     VS1     15014
        +#> # ℹ 1,649 more rows
        +
        +

        Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll collect() the data to get an in-memory tibble, and continue your work with pure R code.

        +
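Putting that together, a typical pipeline might look something like this sketch: aggregate in the database, collect the small result, then continue with plotting or modelling in R:

diamonds_db |> 
  group_by(cut) |> 
  summarize(avg_price = mean(price, na.rm = TRUE)) |> 
  collect() |> 
  ggplot(aes(x = cut, y = avg_price)) +
  geom_col()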

        +21.5 SQL

        +

        The rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.

        +

        We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: flights and planes. These datasets are easy to get into our learning database because dbplyr comes with a function that copies the tables from nycflights13 to our database:

        +
        +
        dbplyr::copy_nycflights13(con)
        +#> Creating table: airlines
        +#> Creating table: airports
        +#> Creating table: flights
        +#> Creating table: planes
        +#> Creating table: weather
        +flights <- tbl(con, "flights")
        +planes <- tbl(con, "planes")
        +
        +

        +21.5.1 SQL basics

        +

        The top-level components of SQL are called statements. Common statements include CREATE for defining new tables, INSERT for adding data, and SELECT for retrieving data. We will focus on SELECT statements, also called queries, because they are almost exclusively what you’ll use as a data scientist.

        +
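For completeness, here’s a sketch of what the other statement types look like, run from R with DBI::dbExecute() (the table and values are invented for illustration):

dbExecute(con, "CREATE TABLE beatles (name VARCHAR, born INTEGER)")
dbExecute(con, "INSERT INTO beatles VALUES ('John', 1940), ('Paul', 1942)")
dbGetQuery(con, "SELECT * FROM beatles")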

A query is made up of clauses. There are five important clauses: SELECT, FROM, WHERE, ORDER BY, and GROUP BY. Every query must have the SELECT4 and FROM5 clauses and the simplest query is SELECT * FROM table, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:

        +
        +
        flights |> show_query()
        +#> <SQL>
        +#> SELECT *
        +#> FROM flights
        +planes |> show_query()
        +#> <SQL>
        +#> SELECT *
        +#> FROM planes
        +
        +

        WHERE and ORDER BY control which rows are included and how they are ordered:

        +
        +
        flights |> 
        +  filter(dest == "IAH") |> 
        +  arrange(dep_delay) |>
        +  show_query()
        +#> <SQL>
        +#> SELECT flights.*
        +#> FROM flights
        +#> WHERE (dest = 'IAH')
        +#> ORDER BY dep_delay
        +
        +

        GROUP BY converts the query to a summary, causing aggregation to happen:

        +
        +
        flights |> 
        +  group_by(dest) |> 
        +  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT dest, AVG(dep_delay) AS dep_delay
        +#> FROM flights
        +#> GROUP BY dest
        +
        +

        There are two important differences between dplyr verbs and SELECT clauses:

        +
          +
• In SQL, case doesn’t matter: you can write select, SELECT, or even SeLeCt. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.
        • +
• In SQL, order matters: you must always write the clauses in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first FROM, then WHERE, GROUP BY, SELECT, and ORDER BY.
        • +
        +

        The following sections explore each clause in more detail.

        +
        +
        +
        +
        +

        Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMS’s, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue on GitHub to help us do better.

        +
        +
        +
        +

        +21.5.2 SELECT

        +

        The SELECT clause is the workhorse of queries and performs the same job as select(), mutate(), rename(), relocate(), and, as you’ll learn in the next section, summarize().

        +

        select(), rename(), and relocate() have very direct translations to SELECT as they just affect where a column appears (if at all) along with its name:

        +
        +
        planes |> 
        +  select(tailnum, type, manufacturer, model, year) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT tailnum, "type", manufacturer, model, "year"
        +#> FROM planes
        +
        +planes |> 
        +  select(tailnum, type, manufacturer, model, year) |> 
        +  rename(year_built = year) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT tailnum, "type", manufacturer, model, "year" AS year_built
        +#> FROM planes
        +
        +planes |> 
        +  select(tailnum, type, manufacturer, model, year) |> 
        +  relocate(manufacturer, model, .before = type) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT tailnum, manufacturer, model, "type", "year"
        +#> FROM planes
        +
        +

        This example also shows you how SQL does renaming. In SQL terminology renaming is called aliasing and is done with AS. Note that unlike mutate(), the old name is on the left and the new name is on the right.

        +
        +
        +
        +
        +

        In the examples above note that "year" and "type" are wrapped in double quotes. That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.

        +

When working with other databases you’re likely to see every variable name quoted, because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.

        +
        SELECT "tailnum", "type", "manufacturer", "model", "year"
        +FROM "planes"
        +

        Some other database systems use backticks instead of quotes:

        +
        SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
        +FROM `planes`
        +
        +
        +
        +

        The translations for mutate() are similarly straightforward: each variable becomes a new expression in SELECT:

        +
        +
        flights |> 
        +  mutate(
        +    speed = distance / (air_time / 60)
        +  ) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT flights.*, distance / (air_time / 60.0) AS speed
        +#> FROM flights
        +
        +

        We’ll come back to the translation of individual components (like /) in Seção 21.6.

        +

        +21.5.3 FROM

        +

        The FROM clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.

        +

        +21.5.4 GROUP BY

        +

        group_by() is translated to the GROUP BY6 clause and summarize() is translated to the SELECT clause:

        +
        +
        diamonds_db |> 
        +  group_by(cut) |> 
        +  summarize(
        +    n = n(),
        +    avg_price = mean(price, na.rm = TRUE)
        +  ) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price
        +#> FROM diamonds
        +#> GROUP BY cut
        +
        +

We’ll come back to what’s happening with the translation of n() and mean() in Seção 21.6.

        +

        +21.5.5 WHERE

        +

        filter() is translated to the WHERE clause:

        +
        +
        flights |> 
        +  filter(dest == "IAH" | dest == "HOU") |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT flights.*
        +#> FROM flights
        +#> WHERE (dest = 'IAH' OR dest = 'HOU')
        +
        +flights |> 
        +  filter(arr_delay > 0 & arr_delay < 20) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT flights.*
        +#> FROM flights
        +#> WHERE (arr_delay > 0.0 AND arr_delay < 20.0)
        +
        +

        There are a few important details to note here:

        +
          +
• | becomes OR and & becomes AND.
        • +
        • SQL uses = for comparison, not ==. SQL doesn’t have assignment, so there’s no potential for confusion there.
        • +
        • SQL uses only '' for strings, not "". In SQL, "" is used to identify variables, like R’s ``.
        • +
        +

        Another useful SQL operator is IN, which is very close to R’s %in%:

        +
        +
        flights |> 
        +  filter(dest %in% c("IAH", "HOU")) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT flights.*
        +#> FROM flights
        +#> WHERE (dest IN ('IAH', 'HOU'))
        +
        +

        SQL uses NULL instead of NA. NULLs behave similarly to NAs. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:

        +
        +
        flights |> 
        +  group_by(dest) |> 
        +  summarize(delay = mean(arr_delay))
        +#> Warning: Missing values are always removed in SQL aggregation functions.
        +#> Use `na.rm = TRUE` to silence this warning
        +#> This warning is displayed once every 8 hours.
        +#> # Source:   SQL [?? x 2]
        +#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]
        +#>   dest  delay
        +#>   <chr> <dbl>
        +#> 1 SFO    2.67
        +#> 2 SJU    2.52
        +#> 3 SNA   -7.87
        +#> 4 SRQ    3.08
        +#> 5 CHS   10.6 
        +#> 6 SAN    3.14
        +#> # ℹ more rows
        +
        +

        If you want to learn more about how NULLs work, you might enjoy “Three valued logic” by Markus Winand.

        +

        In general, you can work with NULLs using the functions you’d use for NAs in R:

        +
        +
        flights |> 
        +  filter(!is.na(dep_delay)) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT flights.*
        +#> FROM flights
        +#> WHERE (NOT((dep_delay IS NULL)))
        +
        +

        This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:

        +
        WHERE "dep_delay" IS NOT NULL
        +

Note that if you filter() a variable that you created using summarize(), dbplyr will generate a HAVING clause, rather than a WHERE clause. This is one of the idiosyncrasies of SQL: WHERE is evaluated before SELECT and GROUP BY, so SQL needs another clause that’s evaluated afterwards.

        +
        +
        diamonds_db |> 
        +  group_by(cut) |> 
        +  summarize(n = n()) |> 
        +  filter(n > 100) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT cut, COUNT(*) AS n
        +#> FROM diamonds
        +#> GROUP BY cut
        +#> HAVING (COUNT(*) > 100.0)
        +
        +

        +21.5.6 ORDER BY

        +

        Ordering rows involves a straightforward translation from arrange() to the ORDER BY clause:

        +
        +
        flights |> 
        +  arrange(year, month, day, desc(dep_delay)) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT flights.*
        +#> FROM flights
        +#> ORDER BY "year", "month", "day", dep_delay DESC
        +
        +

        Notice how desc() is translated to DESC: this is one of the many dplyr functions whose name was directly inspired by SQL.

        +

        +21.5.7 Subqueries

        +

        Sometimes it’s not possible to translate a dplyr pipeline into a single SELECT statement and you need to use a subquery. A subquery is just a query used as a data source in the FROM clause, instead of the usual table.

        +

        dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the SELECT clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes year1 and then the second (outer) query can compute year2.

        +
        +
        flights |> 
        +  mutate(
        +    year1 = year + 1,
        +    year2 = year1 + 1
        +  ) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT q01.*, year1 + 1.0 AS year2
        +#> FROM (
        +#>   SELECT flights.*, "year" + 1.0 AS year1
        +#>   FROM flights
        +#> ) q01
        +
        +

You’ll also see this if you attempt to filter() a variable that you just created. Remember, even though WHERE is written after SELECT, it’s evaluated before it, so we need a subquery in this (silly) example:

        +
        +
        flights |> 
        +  mutate(year1 = year + 1) |> 
        +  filter(year1 == 2014) |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT q01.*
        +#> FROM (
        +#>   SELECT flights.*, "year" + 1.0 AS year1
        +#>   FROM flights
        +#> ) q01
        +#> WHERE (year1 = 2014.0)
        +
        +

        Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.

        +

        +21.5.8 Joins

        +

        If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:

        +
        +
        flights |> 
        +  left_join(planes |> rename(year_built = year), by = "tailnum") |> 
        +  show_query()
        +#> <SQL>
        +#> SELECT
        +#>   flights.*,
        +#>   planes."year" AS year_built,
        +#>   "type",
        +#>   manufacturer,
        +#>   model,
        +#>   engines,
        +#>   seats,
        +#>   speed,
        +#>   engine
        +#> FROM flights
        +#> LEFT JOIN planes
        +#>   ON (flights.tailnum = planes.tailnum)
        +
        +

        The main thing to notice here is the syntax: SQL joins use sub-clauses of the FROM clause to bring in additional tables, using ON to define how the tables are related.

        +

        dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for inner_join(), right_join(), and full_join():

        +
        SELECT flights.*, "type", manufacturer, model, engines, seats, speed
        +FROM flights
        +INNER JOIN planes ON (flights.tailnum = planes.tailnum)
        +
        +SELECT flights.*, "type", manufacturer, model, engines, seats, speed
        +FROM flights
        +RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)
        +
        +SELECT flights.*, "type", manufacturer, model, engines, seats, speed
        +FROM flights
        +FULL JOIN planes ON (flights.tailnum = planes.tailnum)
        +

        You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the dm package, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.

        +
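We don’t use dm in this chapter, but a first step might look something like this sketch (assuming dm is installed; our duckdb database has no key constraints, so dm would find the tables but not their relationships):

library(dm)

db_dm <- dm_from_con(con)  # build a dm object from the tables in the database
dm_draw(db_dm)             # draw a diagram of the tables and their keys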

        +21.5.9 Other verbs

        +

        dbplyr also translates other verbs like distinct(), slice_*(), and intersect(), and a growing selection of tidyr functions like pivot_longer() and pivot_wider(). The easiest way to see the full set of what’s currently available is to visit the dbplyr website: https://dbplyr.tidyverse.org/reference/.

        +

        +21.5.10 Exercises

        +
          +
1. What is distinct() translated to? How about head()?

2. Explain what each of the following SQL queries does and try to recreate it using dbplyr.

   SELECT * 
   FROM flights
   WHERE dep_delay < arr_delay

   SELECT *, distance / (air_time / 60) AS speed
   FROM flights

        +21.6 Function translations

        +

        So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g., what happens when you use mean(x) in a summarize()?

        +

        To help see what’s going on, we’ll use a couple of little helper functions that run a summarize() or mutate() and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.

        +
        +
        summarize_query <- function(df, ...) {
        +  df |> 
        +    summarize(...) |> 
        +    show_query()
        +}
        +mutate_query <- function(df, ...) {
        +  df |> 
        +    mutate(..., .keep = "none") |> 
        +    show_query()
        +}
        +
        +

        Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like mean(), have a relatively simple translation while others, like median(), are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.

        +
        +
        flights |> 
        +  group_by(year, month, day) |>  
        +  summarize_query(
        +    mean = mean(arr_delay, na.rm = TRUE),
        +    median = median(arr_delay, na.rm = TRUE)
        +  )
        +#> `summarise()` has grouped output by "year" and "month". You can override
        +#> using the `.groups` argument.
        +#> <SQL>
        +#> SELECT
        +#>   "year",
        +#>   "month",
        +#>   "day",
        +#>   AVG(arr_delay) AS mean,
        +#>   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY arr_delay) AS median
        +#> FROM flights
        +#> GROUP BY "year", "month", "day"
        +
        +

        The translation of summary functions becomes more complicated when you use them inside a mutate() because they have to turn into so-called window functions. In SQL, you turn an ordinary aggregation function into a window function by adding OVER after it:

        +
        +
        flights |> 
        +  group_by(year, month, day) |>  
        +  mutate_query(
        +    mean = mean(arr_delay, na.rm = TRUE),
        +  )
        +#> <SQL>
        +#> SELECT
        +#>   "year",
        +#>   "month",
        +#>   "day",
        +#>   AVG(arr_delay) OVER (PARTITION BY "year", "month", "day") AS mean
        +#> FROM flights
        +
        +

In SQL, the GROUP BY clause is used exclusively for summaries, so here you can see that the grouping has moved from the GROUP BY clause to the PARTITION BY argument of OVER.

        +

Window functions include all functions that look forwards or backwards, like lead() and lag(), which look at the “next” or “previous” value respectively:

        +
        +
        flights |> 
        +  group_by(dest) |>  
        +  arrange(time_hour) |> 
        +  mutate_query(
        +    lead = lead(arr_delay),
        +    lag = lag(arr_delay)
        +  )
        +#> <SQL>
        +#> SELECT
        +#>   dest,
        +#>   LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,
        +#>   LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag
        +#> FROM flights
        +#> ORDER BY time_hour
        +
        +

Here it’s important to arrange() the data, because SQL tables have no intrinsic order. In fact, if you don’t use arrange() you might get the rows back in a different order every time! Notice that for window functions, the ordering information is repeated: the ORDER BY clause of the main query doesn’t automatically apply to window functions.

        +

        Another important SQL function is CASE WHEN. It’s used as the translation of if_else() and case_when(), the dplyr function that it directly inspired. Here are a couple of simple examples:

        +
        +
        flights |> 
        +  mutate_query(
        +    description = if_else(arr_delay > 0, "delayed", "on-time")
        +  )
        +#> <SQL>
        +#> SELECT CASE WHEN (arr_delay > 0.0) THEN 'delayed' WHEN NOT (arr_delay > 0.0) THEN 'on-time' END AS description
        +#> FROM flights
        +flights |> 
        +  mutate_query(
        +    description = 
        +      case_when(
        +        arr_delay < -5 ~ "early", 
        +        arr_delay < 5 ~ "on-time",
        +        arr_delay >= 5 ~ "late"
        +      )
        +  )
        +#> <SQL>
        +#> SELECT CASE
        +#> WHEN (arr_delay < -5.0) THEN 'early'
        +#> WHEN (arr_delay < 5.0) THEN 'on-time'
        +#> WHEN (arr_delay >= 5.0) THEN 'late'
        +#> END AS description
        +#> FROM flights
        +
        +

        CASE WHEN is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is cut():

        +
        +
        flights |> 
        +  mutate_query(
        +    description =  cut(
        +      arr_delay, 
        +      breaks = c(-Inf, -5, 5, Inf), 
        +      labels = c("early", "on-time", "late")
        +    )
        +  )
        +#> <SQL>
        +#> SELECT CASE
        +#> WHEN (arr_delay <= -5.0) THEN 'early'
        +#> WHEN (arr_delay <= 5.0) THEN 'on-time'
        +#> WHEN (arr_delay > 5.0) THEN 'late'
        +#> END AS description
        +#> FROM flights
        +
        +

        dbplyr also translates common string and date-time manipulation functions, which you can learn about in vignette("translation-function", package = "dbplyr"). dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.

        +
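For example, here’s a sketch of one string translation, using the mutate_query() helper from above; we’d expect dbplyr to generate SQL’s LOWER():

flights |> 
  mutate_query(dest_lower = str_to_lower(dest))
# should generate SQL along the lines of:
# SELECT LOWER(dest) AS dest_lower
# FROM flights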

        +21.7 Summary

        +

In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s the most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:

        +
          +
• SQL for Data Scientists by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organizations.
        • +
• Practical SQL by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.
        • +
        +

        In the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases.


        +
          +
1. SQL is either pronounced “s”-“q”-“l” or “sequel”.↩︎

2. Typically, this is the only function you’ll use from the client package, so we recommend using :: to pull out that one function, rather than loading the complete package with library().↩︎

3. At least, all the tables that you have permission to see.↩︎

4. Confusingly, depending on the context, SELECT is either a statement or a clause. To avoid this confusion, we’ll generally use SELECT query instead of SELECT statement.↩︎

5. Ok, technically, only the SELECT is required, since you can write queries like SELECT 1+1 to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a FROM clause.↩︎

6. This is no coincidence: the dplyr function name was inspired by the SQL clause.↩︎
        +
        +
        +
        + + + \ No newline at end of file diff --git a/datetimes.html b/datetimes.html new file mode 100644 index 000000000..731ff7750 --- /dev/null +++ b/datetimes.html @@ -0,0 +1,1456 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 17  Dates and times + + + + + + + + + + + + + + + + + + + + + + + + +
        +
        +

        17  Dates and times

        +

        +17.1 Introduction

        +

        This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get!

        +

        To warm up think about how many days there are in a year, and how many hours there are in a day. You probably remembered that most years have 365 days, but leap years have 366. Do you know the full rule for determining if a year is a leap year1? The number of hours in a day is a little less obvious: most days have 24 hours, but in places that use daylight saving time (DST), one day each year has 23 hours and another has 25.

        +

        Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won’t teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.

        +

        We’ll begin by showing you how to create date-times from various inputs, and then once you’ve got a date-time, how you can extract components like year, month, and day. We’ll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you’re trying to do. We’ll conclude with a brief discussion of the additional challenges posed by time zones.

        +

        +17.1.1 Prerequisites

        +

        This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. As of the latest tidyverse release, lubridate is part of core tidyverse. We will also need nycflights13 for practice data.

library(tidyverse)
library(nycflights13)

        +17.2 Creating date/times

        +

        There are three types of date/time data that refer to an instant in time:

        +
          +
        • A date. Tibbles print this as <date>.

        • +
        • A time within a day. Tibbles print this as <time>.

        • +
• A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but that name doesn’t exactly trip off the tongue.

        • +
        +

        In this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.

        +
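For example, hms can build a time-of-day from its components (a quick sketch):

hms::hms(seconds = 56, minutes = 34, hours = 12)
#> 12:34:56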

        You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.

        +

        To get the current date or date-time you can use today() or now():

        +
        +
        today()
        +#> [1] "2023-11-17"
        +now()
        +#> [1] "2023-11-17 17:43:56 UTC"
        +
        +

        Otherwise, the following sections describe the four ways you’re likely to create a date/time:

        +
          +
        • While reading a file with readr.
        • +
        • From a string.
        • +
        • From individual date-time components.
        • +
        • From an existing date/time object.
        • +
        +

        +17.2.1 During import

        +

        If your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:

        +
        +
        csv <- "
        +  date,datetime
        +  2022-01-02,2022-01-02 05:12
        +"
        +read_csv(csv)
        +#> # A tibble: 1 × 2
        +#>   date       datetime           
        +#>   <date>     <dttm>             
        +#> 1 2022-01-02 2022-01-02 05:12:00
        +
        +

        If you haven’t heard of ISO8601 before, it’s an international standard2 for writing dates where the components of a date are organized from biggest to smallest separated by -. For example, in ISO8601 May 3 2022 is 2022-05-03. ISO8601 dates can also include times, where hour, minute, and second are separated by :, and the date and time components are separated by either a T or a space. For example, you could write 4:26pm on May 3 2022 as either 2022-05-03 16:26 or 2022-05-03T16:26.

        +

For other date-time formats, you’ll need to use col_types plus col_date() or col_datetime() along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a % followed by a single character. For example, %Y-%m-%d specifies a date that’s a year, -, month (as number), -, day. Tabela 17.1 lists all the options.

        +
Tabela 17.1: All date formats understood by readr

Type    Code   Meaning                          Example
Year    %Y     4 digit year                     2021
        %y     2 digit year                     21
Month   %m     Number                           2
        %b     Abbreviated name                 Feb
        %B     Full name                        February
Day     %d     One or two digits                2
        %e     Two digits                       02
Time    %H     24-hour hour                     13
        %I     12-hour hour                     1
        %p     AM/PM                            pm
        %M     Minutes                          35
        %S     Seconds                          45
        %OS    Seconds with decimal component   45.35
        %Z     Time zone name                   America/Chicago
        %z     Offset from UTC                  +0800
Other   %.     Skip one non-digit               :
        %*     Skip any number of non-digits
        +

        And this code shows a few options applied to a very ambiguous date:

        +
        +
        csv <- "
        +  date
        +  01/02/15
        +"
        +
        +read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))
        +#> # A tibble: 1 × 1
        +#>   date      
        +#>   <date>    
        +#> 1 2015-01-02
        +
        +read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
        +#> # A tibble: 1 × 1
        +#>   date      
        +#>   <date>    
        +#> 1 2015-02-01
        +
        +read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
        +#> # A tibble: 1 × 1
        +#>   date      
        +#>   <date>    
        +#> 1 2001-02-15
        +
        +

        Note that no matter how you specify the date format, it’s always displayed the same way once you get it into R.

        +

If you’re using %b or %B and working with non-English dates, you’ll also need to provide a locale(). See the list of built-in languages in date_names_langs(), or create your own with date_names().

        +
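For example, here’s a sketch of parsing a French date with readr’s parse_date():

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"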

        +17.2.2 From strings

        +

The date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers, which attempt to automatically determine the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:

        +
        +
        ymd("2017-01-31")
        +#> [1] "2017-01-31"
        +mdy("January 31st, 2017")
        +#> [1] "2017-01-31"
        +dmy("31-Jan-2017")
        +#> [1] "2017-01-31"
        +
        +

        ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:

        +
        +
        ymd_hms("2017-01-31 20:11:59")
        +#> [1] "2017-01-31 20:11:59 UTC"
        +mdy_hm("01/31/2017 08:01")
        +#> [1] "2017-01-31 08:01:00 UTC"
        +
        +

        You can also force the creation of a date-time from a date by supplying a timezone:

        +
        +
        ymd("2017-01-31", tz = "UTC")
        +#> [1] "2017-01-31 UTC"
        +
        +

Here I use the UTC3 timezone, which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude4. It doesn’t use daylight saving time, making it a bit easier to compute with.

        +

        +17.2.3 From individual components

        +

        Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

        +
        +
        flights |> 
        +  select(year, month, day, hour, minute)
        +#> # A tibble: 336,776 × 5
        +#>    year month   day  hour minute
        +#>   <int> <int> <int> <dbl>  <dbl>
        +#> 1  2013     1     1     5     15
        +#> 2  2013     1     1     5     29
        +#> 3  2013     1     1     5     40
        +#> 4  2013     1     1     5     45
        +#> 5  2013     1     1     6      0
        +#> 6  2013     1     1     5     58
        +#> # ℹ 336,770 more rows
        +
        +

        To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:

        +
        +
        flights |> 
        +  select(year, month, day, hour, minute) |> 
        +  mutate(departure = make_datetime(year, month, day, hour, minute))
        +#> # A tibble: 336,776 × 6
        +#>    year month   day  hour minute departure          
        +#>   <int> <int> <int> <dbl>  <dbl> <dttm>             
        +#> 1  2013     1     1     5     15 2013-01-01 05:15:00
        +#> 2  2013     1     1     5     29 2013-01-01 05:29:00
        +#> 3  2013     1     1     5     40 2013-01-01 05:40:00
        +#> 4  2013     1     1     5     45 2013-01-01 05:45:00
        +#> 5  2013     1     1     6      0 2013-01-01 06:00:00
        +#> 6  2013     1     1     5     58 2013-01-01 05:58:00
        +#> # ℹ 336,770 more rows
        +
        +

        Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.

        +
        +
        make_datetime_100 <- function(year, month, day, time) {
        +  make_datetime(year, month, day, time %/% 100, time %% 100)
        +}
        +
        +flights_dt <- flights |> 
        +  filter(!is.na(dep_time), !is.na(arr_time)) |> 
        +  mutate(
        +    dep_time = make_datetime_100(year, month, day, dep_time),
        +    arr_time = make_datetime_100(year, month, day, arr_time),
        +    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
        +    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
        +  ) |> 
        +  select(origin, dest, ends_with("delay"), ends_with("time"))
        +
        +flights_dt
        +#> # A tibble: 328,063 × 9
        +#>   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
        +#>   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
        +#> 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
        +#> 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
        +#> 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
        +#> 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
        +#> 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
        +#> 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
        +#> # ℹ 328,057 more rows
        +#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …
        +
        +

        With this data, we can visualize the distribution of departure times across the year:

        +
        +
        flights_dt |> 
        +  ggplot(aes(x = dep_time)) + 
        +  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
        +
        +

A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. There are a few days that stand out as having surprisingly few flights in early February, early July, late November, and late December.

        +
        +
        +

        Or within a single day:

        +
        +
        flights_dt |> 
        +  filter(dep_time < ymd(20130102)) |> 
        +  ggplot(aes(x = dep_time)) + 
        +  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
        +
        +

        A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm.

        +
        +
        +

        Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.

        +
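A quick sketch of the difference:

ymd("2023-01-01") + 1                # for dates, + 1 day
#> [1] "2023-01-02"
ymd_hms("2023-01-01 00:00:00") + 1   # for date-times, + 1 second
#> [1] "2023-01-01 00:00:01 UTC"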

        +17.2.4 From other types

        +

        You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():

        +
        +
        as_datetime(today())
        +#> [1] "2023-11-17 UTC"
        +as_date(now())
        +#> [1] "2023-11-17"
        +
        +

        Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().

        +
        +
        as_datetime(60 * 60 * 10)
        +#> [1] "1970-01-01 10:00:00 UTC"
        +as_date(365 * 10 + 2)
        +#> [1] "1980-01-01"
        +
        +

        +17.2.5 Exercises

        +
          +
1. What happens if you parse a string that contains invalid dates?

   ymd(c("2010-10-10", "bananas"))

2. What does the tzone argument to today() do? Why is it important?

3. For each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.

   d1 <- "January 1, 2010"
   d2 <- "2015-Mar-07"
   d3 <- "06-Jun-2017"
   d4 <- c("August 19 (2015)", "July 1 (2015)")
   d5 <- "12/30/14" # Dec 30, 2014
   t1 <- "1705"
   t2 <- "11:15:10.12 PM"

        +17.3 Date-time components

        +

        Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.

        +

        +17.3.1 Getting components

        +

        You can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second(). These are effectively the opposites of make_datetime().

        +
        +
        datetime <- ymd_hms("2026-07-08 12:34:56")
        +
        +year(datetime)
        +#> [1] 2026
        +month(datetime)
        +#> [1] 7
        +mday(datetime)
        +#> [1] 8
        +
        +yday(datetime)
        +#> [1] 189
        +wday(datetime)
        +#> [1] 4
        +
        +

        For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.

        +
        +
        month(datetime, label = TRUE)
        +#> [1] Jul
        +#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
        +wday(datetime, label = TRUE, abbr = FALSE)
        +#> [1] Wednesday
        +#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
        +
        +

        We can use wday() to see that more flights depart during the week than on the weekend:

        +
        +
        flights_dt |> 
        +  mutate(wday = wday(dep_time, label = TRUE)) |> 
        +  ggplot(aes(x = wday)) +
        +  geom_bar()
        +
        +

A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000).

        +
        +
        +

        We can also look at the average departure delay by minute within the hour. There’s an interesting pattern: flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!

        +
        +
        flights_dt |> 
        +  mutate(minute = minute(dep_time)) |> 
        +  group_by(minute) |> 
        +  summarize(
        +    avg_delay = mean(dep_delay, na.rm = TRUE),
        +    n = n()
        +  ) |> 
        +  ggplot(aes(x = minute, y = avg_delay)) +
        +  geom_line()
        +
        +

A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour with ~9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9).

        +
        +
        +

        Interestingly, if we look at the scheduled departure time we don’t see such a strong pattern:

        +
        +
        sched_dep <- flights_dt |> 
        +  mutate(minute = minute(sched_dep_time)) |> 
        +  group_by(minute) |> 
        +  summarize(
        +    avg_delay = mean(arr_delay, na.rm = TRUE),
        +    n = n()
        +  )
        +
        +ggplot(sched_dep, aes(x = minute, y = avg_delay)) +
        +  geom_line()
        +
        +

A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16) on the y-axis. There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour.

        +
        +
        +

        So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times, as Figura 17.1 shows. Always be alert for this sort of pattern whenever you work with data that involves human judgement!

        +
        +
        +
        +

A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes.

        +
        Figura 17.1: A frequency polygon showing the number of flights scheduled to depart each hour. You can see a strong preference for round numbers like 0 and 30 and generally for numbers that are a multiple of five.
        +
        +
        +
        +

        +17.3.2 Rounding

        +

        An alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:

        +
        +
        flights_dt |> 
        +  count(week = floor_date(dep_time, "week")) |> 
        +  ggplot(aes(x = week, y = n)) +
        +  geom_line() + 
        +  geom_point()
        +
        +

        A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights).

        +
        +
        +
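To see how the three rounding functions differ, here’s a quick sketch applying each to the same date-time:

datetime <- ymd_hms("2026-07-08 12:34:56")

floor_date(datetime, "hour")
#> [1] "2026-07-08 12:00:00 UTC"
round_date(datetime, "hour")
#> [1] "2026-07-08 13:00:00 UTC"
ceiling_date(datetime, "hour")
#> [1] "2026-07-08 13:00:00 UTC"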

        You can use rounding to show the distribution of flights across the course of a day by computing the difference between dep_time and the earliest instant of that day:

        +
        +
        flights_dt |> 
        +  mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |> 
        +  ggplot(aes(x = dep_hour)) +
        +  geom_freqpoly(binwidth = 60 * 30)
        +#> Don't know how to automatically pick scale for object of type <difftime>.
        +#> Defaulting to continuous.
        +
        +

A line plot with departure time on the x-axis. The units are seconds since midnight, so it's hard to interpret.

        +
        +
        +

Computing the difference between a pair of date-times yields a difftime (more on that in Section 17.4.3). We can convert that to an hms object to get a more useful x-axis:

        +
        +
        flights_dt |> 
        +  mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |> 
        +  ggplot(aes(x = dep_hour)) +
        +  geom_freqpoly(binwidth = 60 * 30)
        +
        +

A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (<100) flights before 5am. The number of flights then rises rapidly to 12,000 per hour, peaking at 15,000 at 9am, before falling to around 8,000 per hour from 10am to 2pm. The number of flights then increases to around 12,000 per hour until 8pm, when it drops rapidly again.

        +
        +
        +

        +17.3.3 Modifying components

        +

        You can also use each accessor function to modify the components of a date/time. This doesn’t come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.

        +
        +
        (datetime <- ymd_hms("2026-07-08 12:34:56"))
        +#> [1] "2026-07-08 12:34:56 UTC"
        +
        +year(datetime) <- 2030
        +datetime
        +#> [1] "2030-07-08 12:34:56 UTC"
        +month(datetime) <- 01
        +datetime
        +#> [1] "2030-01-08 12:34:56 UTC"
        +hour(datetime) <- hour(datetime) + 1
        +datetime
        +#> [1] "2030-01-08 13:34:56 UTC"
        +
        +

        Alternatively, rather than modifying an existing variable, you can create a new date-time with update(). This also allows you to set multiple values in one step:

        +
        +
        update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
        +#> [1] "2030-02-02 02:34:56 UTC"
        +
        +

        If values are too big, they will roll-over:

        +
        +
        update(ymd("2023-02-01"), mday = 30)
        +#> [1] "2023-03-02"
        +update(ymd("2023-02-01"), hour = 400)
        +#> [1] "2023-02-17 16:00:00 UTC"
        +
        +

        +17.3.4 Exercises

        +
1. How does the distribution of flight times within a day change over the course of the year?

2. Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.

3. Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)

4. How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?

5. On what day of the week should you leave if you want to minimise the chance of a delay?

6. What makes the distribution of diamonds$carat and flights$sched_dep_time similar?

7. Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.

        +17.4 Time spans

        +

        Next you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:

        +
• Durations, which represent an exact number of seconds.

• Periods, which represent human units like weeks and months.

• Intervals, which represent a starting and ending point.

How do you pick between durations, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
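To make the contrast concrete, here’s a minimal sketch; the date is invented, and chosen precisely because DST starts then in this time zone, so a duration and a period give different answers:

start <- ymd_hms("2026-03-08 00:00:00", tz = "America/New_York")

start + ddays(1)  # duration: exactly 86,400 seconds later
#> [1] "2026-03-09 01:00:00 EDT"
start + days(1)   # period: the same clock time on the next day
#> [1] "2026-03-09 00:00:00 EDT"
(start %--% (start + days(1))) / dhours(1)  # interval: the true length of this day
#> [1] 23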

        +

        +17.4.1 Durations

        +

        In R, when you subtract two dates, you get a difftime object:

        +
        +
        # How old is Hadley?
        +h_age <- today() - ymd("1979-10-14")
        +h_age
        +#> Time difference of 16105 days
        +
        +

        A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.
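You can see that ambiguity in action: the unit a difftime reports depends on the size of the span (dates invented for illustration):

ymd("2023-01-02") - ymd("2023-01-01")
#> Time difference of 1 days
ymd_hms("2023-01-01 00:01:00") - ymd_hms("2023-01-01 00:00:00")
#> Time difference of 1 mins

A duration side-steps this by always counting in seconds: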

        +
        +
        as.duration(h_age)
        +#> [1] "1391472000s (~44.09 years)"
        +
        +

        Durations come with a bunch of convenient constructors:

        +
        +
        dseconds(15)
        +#> [1] "15s"
        +dminutes(10)
        +#> [1] "600s (~10 minutes)"
        +dhours(c(12, 24))
        +#> [1] "43200s (~12 hours)" "86400s (~1 days)"
        +ddays(0:5)
        +#> [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
        +#> [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
        +dweeks(3)
        +#> [1] "1814400s (~3 weeks)"
        +dyears(1)
        +#> [1] "31557600s (~1 years)"
        +
        +

        Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.
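Since durations are just seconds underneath, you can verify these conversion factors by dividing one duration by another:

dminutes(1) / dseconds(1)
#> [1] 60
ddays(1) / dhours(1)
#> [1] 24
dyears(1) / ddays(1)
#> [1] 365.25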

        +

        You can add and multiply durations:

        +
        +
        2 * dyears(1)
        +#> [1] "63115200s (~2 years)"
        +dyears(1) + dweeks(12) + dhours(15)
        +#> [1] "38869200s (~1.23 years)"
        +
        +

You can add and subtract durations to and from dates:

        +
        +
        tomorrow <- today() + ddays(1)
        +last_year <- today() - dyears(1)
        +
        +

        However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:

        +
        +
        one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
        +
        +one_am
        +#> [1] "2026-03-08 01:00:00 EST"
        +one_am + ddays(1)
        +#> [1] "2026-03-09 02:00:00 EDT"
        +
        +

Why is one day after 1am on March 8 2am on March 9? If you look carefully at the date you might also notice that the time zone has changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.
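You can verify that March 8 really has only 23 hours in this time zone with a quick sketch:

midnight_mar_8 <- ymd_hms("2026-03-08 00:00:00", tz = "America/New_York")
midnight_mar_9 <- ymd_hms("2026-03-09 00:00:00", tz = "America/New_York")
difftime(midnight_mar_9, midnight_mar_8, units = "hours")
#> Time difference of 23 hours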

        +

        +17.4.2 Periods

        +

To solve this problem, lubridate provides periods. Periods are time spans that don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:

        +
        +
        one_am
        +#> [1] "2026-03-08 01:00:00 EST"
        +one_am + days(1)
        +#> [1] "2026-03-09 01:00:00 EDT"
        +
        +

        Like durations, periods can be created with a number of friendly constructor functions.

        +
        +
        hours(c(12, 24))
        +#> [1] "12H 0M 0S" "24H 0M 0S"
        +days(7)
        +#> [1] "7d 0H 0M 0S"
        +months(1:6)
        +#> [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
        +#> [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
        +
        +

        You can add and multiply periods:

        +
        +
        10 * (months(6) + days(1))
        +#> [1] "60m 10d 0H 0M 0S"
        +days(50) + hours(25) + minutes(2)
        +#> [1] "50d 25H 2M 0S"
        +
        +

        And of course, add them to dates. Compared to durations, periods are more likely to do what you expect:

        +
        +
        # A leap year
        +ymd("2024-01-01") + dyears(1)
        +#> [1] "2024-12-31 06:00:00 UTC"
        +ymd("2024-01-01") + years(1)
        +#> [1] "2025-01-01"
        +
        +# Daylight saving time
        +one_am + ddays(1)
        +#> [1] "2026-03-09 02:00:00 EDT"
        +one_am + days(1)
        +#> [1] "2026-03-09 01:00:00 EDT"
        +
        +

        Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.

        +
        +
        flights_dt |> 
        +  filter(arr_time < dep_time) 
        +#> # A tibble: 10,633 × 9
        +#>   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
        +#>   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
        +#> 1 EWR    BQN           9        -4 2013-01-01 19:29:00 2013-01-01 19:20:00
        +#> 2 JFK    DFW          59        NA 2013-01-01 19:39:00 2013-01-01 18:40:00
        +#> 3 EWR    TPA          -2         9 2013-01-01 20:58:00 2013-01-01 21:00:00
        +#> 4 EWR    SJU          -6       -12 2013-01-01 21:02:00 2013-01-01 21:08:00
        +#> 5 EWR    SFO          11       -14 2013-01-01 21:08:00 2013-01-01 20:57:00
        +#> 6 LGA    FLL         -10        -2 2013-01-01 21:20:00 2013-01-01 21:30:00
        +#> # ℹ 10,627 more rows
        +#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …
        +
        +

        These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.

        +
        +
        flights_dt <- flights_dt |> 
        +  mutate(
        +    overnight = arr_time < dep_time,
        +    arr_time = arr_time + days(overnight),
        +    sched_arr_time = sched_arr_time + days(overnight)
        +  )
        +
        +

        Now all of our flights obey the laws of physics.

        +
        +
        flights_dt |> 
        +  filter(arr_time < dep_time) 
        +#> # A tibble: 0 × 10
        +#> # ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,
        +#> #   arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>, …
        +
        +

        +17.4.3 Intervals

        +

        What does dyears(1) / ddays(365) return? It’s not quite one, because dyears() is defined as the number of seconds per average year, which is 365.25 days.

        +

        What does years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:

        +
        +
        years(1) / days(1)
        +#> [1] 365.25
        +
        +

        If you want a more accurate measurement, you’ll have to use an interval. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.

        +

        You can create an interval by writing start %--% end:

        +
        +
        y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
        +y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")
        +
        +y2023
        +#> [1] 2023-01-01 UTC--2024-01-01 UTC
        +y2024
        +#> [1] 2024-01-01 UTC--2025-01-01 UTC
        +
        +

        You could then divide it by days() to find out how many days fit in the year:

        +
        +
        y2023 / days(1)
        +#> [1] 365
        +y2024 / days(1)
        +#> [1] 366
        +
        +

        +17.4.4 Exercises

        +
1. Explain days(!overnight) and days(overnight) to someone who has just started learning R. What is the key fact you need to know?

2. Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.

3. Write a function that given your birthday (as a date), returns how old you are in years.

4. Why can’t (today() %--% (today() + years(1))) / months(1) work?

        +17.5 Time zones

        +

        Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don’t need to dig into all the details as they’re not all important for data analysis, but there are a few challenges we’ll need to tackle head on.

        + +

        The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme {area}/{location}, typically in the form {continent}/{city} or {ocean}/{city}. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.

        +

You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades’ worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at https://www.iana.org/time-zones) just to read some of these stories!

        +

        You can find out what R thinks your current time zone is with Sys.timezone():

        +
        +
        Sys.timezone()
        +#> [1] "UTC"
        +
        +

        (If R doesn’t know, you’ll get an NA.)

        +

        And see the complete list of all time zone names with OlsonNames():

        +
        +
        length(OlsonNames())
        +#> [1] 597
        +head(OlsonNames())
        +#> [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
        +#> [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"
        +
        +
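Since OlsonNames() returns a plain character vector, you can search it with the usual string tools. A small sketch (the pattern is arbitrary, and the exact result depends on your tzdata version):

str_subset(OlsonNames(), "Australia") |> head(3)
#> [1] "Australia/ACT"      "Australia/Adelaide" "Australia/Brisbane"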

        In R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:

        +
        +
        x1 <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
        +x1
        +#> [1] "2024-06-01 12:00:00 EDT"
        +
        +x2 <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
        +x2
        +#> [1] "2024-06-01 18:00:00 CEST"
        +
        +x3 <- ymd_hms("2024-06-02 04:00:00", tz = "Pacific/Auckland")
        +x3
        +#> [1] "2024-06-02 04:00:00 NZST"
        +
        +

        You can verify that they’re the same time using subtraction:

        +
        +
        x1 - x2
        +#> Time difference of 0 secs
        +x1 - x3
        +#> Time difference of 0 secs
        +
        +

Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like c(), will often drop the time zone. In that case, the date-times will display in the time zone of the first element:

        +
        +
        x4 <- c(x1, x2, x3)
        +x4
        +#> [1] "2024-06-01 12:00:00 EDT" "2024-06-01 12:00:00 EDT"
        +#> [3] "2024-06-01 12:00:00 EDT"
        +
        +

        You can change the time zone in two ways:

        +
• Keep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.

  x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
  x4a
  #> [1] "2024-06-02 02:30:00 +1030" "2024-06-02 02:30:00 +1030"
  #> [3] "2024-06-02 02:30:00 +1030"
  x4a - x4
  #> Time differences in secs
  #> [1] 0 0 0

  (This also illustrates another challenge of time zones: they’re not all integer hour offsets!)

• Change the underlying instant in time. Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.

  x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
  x4b
  #> [1] "2024-06-01 12:00:00 +1030" "2024-06-01 12:00:00 +1030"
  #> [3] "2024-06-01 12:00:00 +1030"
  x4b - x4
  #> Time differences in hours
  #> [1] -14.5 -14.5 -14.5

        +17.6 Summary

        +

This chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why — date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a daylight saving time boundary or involves a leap year, the functions need to be able to handle it.

        +

The next chapter gives a round-up of missing values. You’ve seen them in a few places and have no doubt encountered them in your own analyses; it’s now time to provide a grab bag of useful techniques for dealing with them.

        + + +

        +
          +
1. A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there are 97 leap years.↩︎

2. https://xkcd.com/1179/↩︎

3. You might wonder what UTC stands for. It’s a compromise between the English “Coordinated Universal Time” and French “Temps Universel Coordonné”.↩︎

4. No prizes for guessing which country came up with the longitude system.↩︎
        +
        +
        +
        + + + \ No newline at end of file diff --git a/datetimes_files/figure-html/fig-human-rounding-1.png b/datetimes_files/figure-html/fig-human-rounding-1.png new file mode 100644 index 000000000..79ecbc2fa Binary files /dev/null and b/datetimes_files/figure-html/fig-human-rounding-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-12-1.png b/datetimes_files/figure-html/unnamed-chunk-12-1.png new file mode 100644 index 000000000..07c7dc00d Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-13-1.png b/datetimes_files/figure-html/unnamed-chunk-13-1.png new file mode 100644 index 000000000..521aac0bf Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-20-1.png b/datetimes_files/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 000000000..6d3ba9773 Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-21-1.png b/datetimes_files/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 000000000..f1d83d6b2 Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-21-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-22-1.png b/datetimes_files/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 000000000..2189452a5 Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-24-1.png b/datetimes_files/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 000000000..79f4a359e Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-24-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-25-1.png b/datetimes_files/figure-html/unnamed-chunk-25-1.png new file mode 100644 index 000000000..e7c8612aa Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-25-1.png differ diff --git a/datetimes_files/figure-html/unnamed-chunk-26-1.png b/datetimes_files/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 000000000..4c99fdb0c Binary files /dev/null and b/datetimes_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/diagrams/join/anti.png b/diagrams/join/anti.png new file mode 100644 index 000000000..150115721 Binary files /dev/null and b/diagrams/join/anti.png differ diff --git a/diagrams/join/closest.png b/diagrams/join/closest.png new file mode 100644 index 000000000..dfbc32ab2 Binary files /dev/null and b/diagrams/join/closest.png differ diff --git a/diagrams/join/cross.png b/diagrams/join/cross.png new file mode 100644 index 000000000..15fccc6fb Binary files /dev/null and b/diagrams/join/cross.png differ diff --git a/diagrams/join/full.png b/diagrams/join/full.png new file mode 100644 index 000000000..b0c63c1bc Binary files /dev/null and b/diagrams/join/full.png differ diff --git a/diagrams/join/gte.png b/diagrams/join/gte.png new file mode 100644 index 000000000..fdca9166a Binary files /dev/null and b/diagrams/join/gte.png differ diff --git a/diagrams/join/inner-both.png b/diagrams/join/inner-both.png new file mode 100644 index 000000000..1cc660459 Binary files /dev/null and b/diagrams/join/inner-both.png differ diff --git a/diagrams/join/inner.png b/diagrams/join/inner.png new file mode 100644 index 000000000..7c6f9a89d Binary files /dev/null and b/diagrams/join/inner.png differ diff --git 
a/diagrams/join/left.png b/diagrams/join/left.png new file mode 100644 index 000000000..4efb093f8 Binary files /dev/null and b/diagrams/join/left.png differ diff --git a/diagrams/join/lt.png b/diagrams/join/lt.png new file mode 100644 index 000000000..7c8b6a79d Binary files /dev/null and b/diagrams/join/lt.png differ diff --git a/diagrams/join/match-types.png b/diagrams/join/match-types.png new file mode 100644 index 000000000..1f9fe5386 Binary files /dev/null and b/diagrams/join/match-types.png differ diff --git a/diagrams/join/right.png b/diagrams/join/right.png new file mode 100644 index 000000000..5d8c6cdf2 Binary files /dev/null and b/diagrams/join/right.png differ diff --git a/diagrams/join/semi.png b/diagrams/join/semi.png new file mode 100644 index 000000000..b76f2115f Binary files /dev/null and b/diagrams/join/semi.png differ diff --git a/diagrams/join/setup.png b/diagrams/join/setup.png new file mode 100644 index 000000000..00332168d Binary files /dev/null and b/diagrams/join/setup.png differ diff --git a/diagrams/join/setup2.png b/diagrams/join/setup2.png new file mode 100644 index 000000000..cb0d82e33 Binary files /dev/null and b/diagrams/join/setup2.png differ diff --git a/diagrams/join/venn.png b/diagrams/join/venn.png new file mode 100644 index 000000000..c9d558f0b Binary files /dev/null and b/diagrams/join/venn.png differ diff --git a/diagrams/new-project.png b/diagrams/new-project.png new file mode 100644 index 000000000..9bcec1d9a Binary files /dev/null and b/diagrams/new-project.png differ diff --git a/diagrams/pepper.png b/diagrams/pepper.png new file mode 100644 index 000000000..effbfe027 Binary files /dev/null and b/diagrams/pepper.png differ diff --git a/diagrams/relational.png b/diagrams/relational.png new file mode 100644 index 000000000..40cc9b1c7 Binary files /dev/null and b/diagrams/relational.png differ diff --git a/diagrams/rstudio/clean-slate.png b/diagrams/rstudio/clean-slate.png new file mode 100644 index 000000000..b617b1807 Binary files /dev/null and b/diagrams/rstudio/clean-slate.png differ diff --git a/diagrams/rstudio/script.png b/diagrams/rstudio/script.png new file mode 100644 index 000000000..3ff427ace Binary files /dev/null and b/diagrams/rstudio/script.png differ diff --git a/diagrams/tidy-data/cell-values.png b/diagrams/tidy-data/cell-values.png new file mode 100644 index 000000000..0e1533082 Binary files /dev/null and b/diagrams/tidy-data/cell-values.png differ diff --git a/diagrams/tidy-data/column-names.png b/diagrams/tidy-data/column-names.png new file mode 100644 index 000000000..0b384de59 Binary files /dev/null and b/diagrams/tidy-data/column-names.png differ diff --git a/diagrams/tidy-data/multiple-names.png b/diagrams/tidy-data/multiple-names.png new file mode 100644 index 000000000..1dc13376d Binary files /dev/null and b/diagrams/tidy-data/multiple-names.png differ diff --git a/diagrams/tidy-data/names-and-values.png b/diagrams/tidy-data/names-and-values.png new file mode 100644 index 000000000..b17416eb3 Binary files /dev/null and b/diagrams/tidy-data/names-and-values.png differ diff --git a/diagrams/tidy-data/variables.png b/diagrams/tidy-data/variables.png new file mode 100644 index 000000000..72664ff3e Binary files /dev/null and b/diagrams/tidy-data/variables.png differ diff --git a/diagrams/transform.png b/diagrams/transform.png new file mode 100644 index 000000000..75032a269 Binary files /dev/null and b/diagrams/transform.png differ diff --git a/factors.html b/factors.html new file mode 100644 index 000000000..844f520bd --- 
/dev/null +++ b/factors.html @@ -0,0 +1,1076 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 16  Factors + + + + + + + + + + + + + + + + + + + + + + + + +
        +
        + +
        + + + +
        +

        16  Factors

        +
        + + + +
        + + + + +
        + + +

        +16.1 Introduction

        +

        Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

        +

        We’ll start by motivating why factors are needed for data analysis1 and how you can create them with factor(). We’ll then introduce you to the gss_cat dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.

        +

        +16.1.1 Prerequisites

        +

Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse and provides a wide range of helpers for dealing with categorical variables (and it’s an anagram of factors!).

        + +

        +16.2 Factor basics

        +

        Imagine that you have a variable that records month:

        +
        +
        x1 <- c("Dec", "Apr", "Jan", "Mar")
        +
        +

        Using a string to record this variable has two problems:

        +
1. There are only twelve possible months, and there’s nothing saving you from typos:

   x2 <- c("Dec", "Apr", "Jam", "Mar")

2. It doesn’t sort in a useful way:

   sort(x1)
   #> [1] "Apr" "Dec" "Jan" "Mar"

        You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid levels:

        +
        +
        month_levels <- c(
        +  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
        +  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
        +)
        +
        +

        Now you can create a factor:

        +
        +
        y1 <- factor(x1, levels = month_levels)
        +y1
        +#> [1] Dec Apr Jan Mar
        +#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
        +
        +sort(y1)
        +#> [1] Jan Mar Apr Dec
        +#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
        +
        +

And any values not in the levels will be silently converted to NA:

        +
        +
        y2 <- factor(x2, levels = month_levels)
        +y2
        +#> [1] Dec  Apr  <NA> Mar 
        +#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
        +
        +

        This seems risky, so you might want to use forcats::fct() instead:

        +
        +
        y2 <- fct(x2, levels = month_levels)
        +#> Error in `fct()`:
        +#> ! All values of `x` must appear in `levels` or `na`
        +#> ℹ Missing level: "Jam"
        +
        +

        If you omit the levels, they’ll be taken from the data in alphabetical order:

        +
        +
        factor(x1)
        +#> [1] Dec Apr Jan Mar
        +#> Levels: Apr Dec Jan Mar
        +
        +

        Sorting alphabetically is slightly risky because not every computer will sort strings in the same way. So forcats::fct() orders by first appearance:

        +
        +
        fct(x1)
        +#> [1] Dec Apr Jan Mar
        +#> Levels: Dec Apr Jan Mar
        +
        +

        If you ever need to access the set of valid levels directly, you can do so with levels():

        +
        +
        levels(y2)
        +#>  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
        +
        +

        You can also create a factor when reading your data with readr with col_factor():

        +
        +
        csv <- "
        +month,value
        +Jan,12
        +Feb,56
        +Mar,12"
        +
        +df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
        +df$month
        +#> [1] Jan Feb Mar
        +#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
        +
        +

        +16.3 General Social Survey

        +

        For the rest of this chapter, we’re going to use forcats::gss_cat. It’s a sample of data from the General Social Survey, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.

        +
        +
        gss_cat
        +#> # A tibble: 21,483 × 9
        +#>    year marital         age race  rincome        partyid           
        +#>   <int> <fct>         <int> <fct> <fct>          <fct>             
        +#> 1  2000 Never married    26 White $8000 to 9999  Ind,near rep      
        +#> 2  2000 Divorced         48 White $8000 to 9999  Not str republican
        +#> 3  2000 Widowed          67 White Not applicable Independent       
        +#> 4  2000 Never married    39 White Not applicable Ind,near rep      
        +#> 5  2000 Divorced         25 White Not applicable Not str democrat  
        +#> 6  2000 Married          25 White $20000 - 24999 Strong democrat   
        +#> # ℹ 21,477 more rows
        +#> # ℹ 3 more variables: relig <fct>, denom <fct>, tvhours <int>
        +
        +

        (Remember, since this dataset is provided by a package, you can get more information about the variables with ?gss_cat.)

        +

        When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with count():

        +
        +
        gss_cat |>
        +  count(race)
        +#> # A tibble: 3 × 2
        +#>   race      n
        +#>   <fct> <int>
        +#> 1 Other  1959
        +#> 2 Black  3129
        +#> 3 White 16395
        +
        +

        When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.

        +

+16.3.1 Exercises

        +
1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

2. What is the most common relig in this survey? What’s the most common partyid?

3. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?

        +16.4 Modifying factor order

        +

        It’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:

        +
        +
        relig_summary <- gss_cat |>
        +  group_by(relig) |>
        +  summarize(
        +    tvhours = mean(tvhours, na.rm = TRUE),
        +    n = n()
        +  )
        +
        +ggplot(relig_summary, aes(x = tvhours, y = relig)) + 
        +  geom_point()
        +
        +

A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of the overall pattern.

        +
        +
        +

        It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:

        +
• f, the factor whose levels you want to modify.

• x, a numeric vector that you want to use to reorder the levels.

• Optionally, fun, a function that’s used if there are multiple values of x for each value of f. The default value is median.
        +
        +
        ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
        +  geom_point()
        +
        +

The same scatterplot as above, but now religion is displayed in increasing order of tvhours. "Other eastern" has the fewest tvhours (under 2), and "Don't know" has the highest (over 5).

        +
        +
        +

        Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.

        +

        As you start making more complicated transformations, we recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:

        +
        +
        relig_summary |>
        +  mutate(
        +    relig = fct_reorder(relig, tvhours)
        +  ) |>
        +  ggplot(aes(x = tvhours, y = relig)) +
        +  geom_point()
        +
        +

        What if we create a similar plot looking at how average age varies across reported income level?

        +
        +
        rincome_summary <- gss_cat |>
        +  group_by(rincome) |>
        +  summarize(
        +    age = mean(age, na.rm = TRUE),
        +    n = n()
        +  )
        +
        +ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) + 
        +  geom_point()
        +
        +

        A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then <$1000, then $8000-9999.

        +
        +
        +

        Here, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.

        +

        However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel(). It takes a factor, f, and then any number of levels that you want to move to the front of the line.

        +
        +
        ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
        +  geom_point()
        +
        +

The same scatterplot but now "Not Applicable" is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is "Not applicable".

        +
        +
        +

        Why do you think the average age for “Not applicable” is so high?

        +

        Another type of reordering is useful when you are coloring the lines on a plot. fct_reorder2(f, x, y) reorders the factor f by the y values associated with the largest x values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.

        +
        +
        by_age <- gss_cat |>
        +  filter(!is.na(age)) |> 
        +  count(age, marital) |>
        +  group_by(age) |>
        +  mutate(
        +    prop = n / sum(n)
        +  )
        +
        +ggplot(by_age, aes(x = age, y = prop, color = marital)) +
        +  geom_line(linewidth = 1) + 
        +  scale_color_brewer(palette = "Set1")
        +
        +ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
        +  geom_line(linewidth = 1) +
        +  scale_color_brewer(palette = "Set1") + 
        +  labs(color = "marital") 
        +
        +
        +
        +

        A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60.

        +
        +
        +


        +
        +
        +
        +
        +

        Finally, for bar plots, you can use fct_infreq() to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with fct_rev() if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.

        +
        +
        gss_cat |>
        +  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
        +  ggplot(aes(x = marital)) +
        +  geom_bar()
        +
        +

A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000).

        +
        +
        +

        +16.4.1 Exercises

        +
1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?

2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

3. Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?

        +16.5 Modifying factor levels

        +

        More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level. For example, take the partyid variable from the gss_cat data frame:

        +
        +
        gss_cat |> count(partyid)
        +#> # A tibble: 10 × 2
        +#>   partyid                n
        +#>   <fct>              <int>
        +#> 1 No answer            154
        +#> 2 Don't know             1
        +#> 3 Other party          393
        +#> 4 Strong republican   2314
        +#> 5 Not str republican  3032
        +#> 6 Ind,near rep        1791
        +#> # ℹ 4 more rows
        +
        +

        The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:

        +
        +
        gss_cat |>
        +  mutate(
        +    partyid = fct_recode(partyid,
        +      "Republican, strong"    = "Strong republican",
        +      "Republican, weak"      = "Not str republican",
        +      "Independent, near rep" = "Ind,near rep",
        +      "Independent, near dem" = "Ind,near dem",
        +      "Democrat, weak"        = "Not str democrat",
        +      "Democrat, strong"      = "Strong democrat"
        +    )
        +  ) |>
        +  count(partyid)
        +#> # A tibble: 10 × 2
        +#>   partyid                   n
        +#>   <fct>                 <int>
        +#> 1 No answer               154
        +#> 2 Don't know                1
        +#> 3 Other party             393
        +#> 4 Republican, strong     2314
        +#> 5 Republican, weak       3032
        +#> 6 Independent, near rep  1791
        +#> # ℹ 4 more rows
        +
        +

        fct_recode() will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
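For example, here’s a sketch of what that warning looks like; the misspelled level is deliberate:

gss_cat |>
  mutate(partyid = fct_recode(partyid, "Republican, strong" = "Strong republicann")) |>
  count(partyid)
#> Warning: Unknown levels in `f`: Strong republicann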

        +

        To combine groups, you can assign multiple old levels to the same new level:

        +
        +
        gss_cat |>
        +  mutate(
        +    partyid = fct_recode(partyid,
        +      "Republican, strong"    = "Strong republican",
        +      "Republican, weak"      = "Not str republican",
        +      "Independent, near rep" = "Ind,near rep",
        +      "Independent, near dem" = "Ind,near dem",
        +      "Democrat, weak"        = "Not str democrat",
        +      "Democrat, strong"      = "Strong democrat",
        +      "Other"                 = "No answer",
        +      "Other"                 = "Don't know",
        +      "Other"                 = "Other party"
        +    )
        +  )
        +
        +

        Use this technique with care: if you group together categories that are truly different you will end up with misleading results.

        +

        If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new variable, you can provide a vector of old levels:

        +
        +
        gss_cat |>
        +  mutate(
        +    partyid = fct_collapse(partyid,
        +      "other" = c("No answer", "Don't know", "Other party"),
        +      "rep" = c("Strong republican", "Not str republican"),
        +      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
        +      "dem" = c("Not str democrat", "Strong democrat")
        +    )
        +  ) |>
        +  count(partyid)
        +#> # A tibble: 4 × 2
        +#>   partyid     n
        +#>   <fct>   <int>
        +#> 1 other     548
        +#> 2 rep      5346
        +#> 3 ind      8409
        +#> 4 dem      7180
        +
        +

Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the fct_lump_*() family of functions. fct_lump_lowfreq() is a simple starting point that progressively lumps the smallest categories into “Other”, always keeping “Other” as the smallest category.

        +
        +
        gss_cat |>
        +  mutate(relig = fct_lump_lowfreq(relig)) |>
        +  count(relig)
        +#> # A tibble: 2 × 2
        +#>   relig          n
        +#>   <fct>      <int>
        +#> 1 Protestant 10846
        +#> 2 Other      10637
        +
        +

In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more detail! Instead, we can use fct_lump_n() to specify that we want exactly 10 groups:

        +
        +
        gss_cat |>
        +  mutate(relig = fct_lump_n(relig, n = 10)) |>
        +  count(relig, sort = TRUE)
        +#> # A tibble: 10 × 2
        +#>   relig          n
        +#>   <fct>      <int>
        +#> 1 Protestant 10846
        +#> 2 Catholic    5124
        +#> 3 None        3523
        +#> 4 Christian    689
        +#> 5 Other        458
        +#> 6 Jewish       388
        +#> # ℹ 4 more rows
        +
        +

        Read the documentation to learn about fct_lump_min() and fct_lump_prop() which are useful in other cases.
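As a rough sketch of how they’re called (the thresholds here are invented for illustration, not recommendations):

gss_cat |>
  mutate(relig = fct_lump_min(relig, min = 1000)) |>   # lump levels with fewer than 1000 observations
  count(relig, sort = TRUE)

gss_cat |>
  mutate(relig = fct_lump_prop(relig, prop = 0.10)) |> # lump levels with less than 10% of observations
  count(relig, sort = TRUE)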

        +

        +16.5.1 Exercises

        +
1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

2. How could you collapse rincome into a small set of categories?

3. Notice there are 9 groups (excluding other) in the fct_lump example above. Why not 10? (Hint: type ?fct_lump, and find that the default for the argument other_level is “Other”.)

        +16.6 Ordered factors

        +

        Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with ordered(), imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use < between the factor levels:

        +
        +
        ordered(c("a", "b", "c"))
        +#> [1] a b c
        +#> Levels: a < b < c
        +
        +

        In practice, ordered() factors behave very similarly to regular factors. There are only two places where you might notice different behavior:

        +
• If you map an ordered factor to color or fill in ggplot2, it will default to scale_color_viridis()/scale_fill_viridis(), a color scale that implies a ranking (see the sketch after this list).

• If you use an ordered factor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend vignette("contrasts", package = "faux") by Lisa DeBruine.
        +
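As a quick illustration of the first difference, here’s a toy sketch (the data is invented):

tibble(
  x = 1:3,
  y = 1:3,
  rating = ordered(c("low", "medium", "high"), levels = c("low", "medium", "high"))
) |>
  ggplot(aes(x = x, y = y, color = rating)) +
  geom_point(size = 4)
# The legend uses the ordinal (viridis) color scale, visually implying a ranking.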

        Given the arguable utility of these differences, we don’t generally recommend using ordered factors.

        +

        +16.7 Summary

        +

        This chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, I highly recommend skimming the reference index to see if there’s a canned function that can help solve your problem.

        +

        If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, Wrangling categorical data in R. This paper lays out some of the history discussed in stringsAsFactors: An unauthorized biography and stringsAsFactors = <sigh>, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!

        +

        In the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!

        + + +

        +
          +
        1. They’re also really important for modelling.↩︎

        +
        +
        +
        + + + \ No newline at end of file diff --git a/factors_files/figure-html/unnamed-chunk-16-1.png b/factors_files/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 000000000..54b59096d Binary files /dev/null and b/factors_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/factors_files/figure-html/unnamed-chunk-17-1.png b/factors_files/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 000000000..1f239e074 Binary files /dev/null and b/factors_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/factors_files/figure-html/unnamed-chunk-19-1.png b/factors_files/figure-html/unnamed-chunk-19-1.png new file mode 100644 index 000000000..b0709cba6 Binary files /dev/null and b/factors_files/figure-html/unnamed-chunk-19-1.png differ diff --git a/factors_files/figure-html/unnamed-chunk-20-1.png b/factors_files/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 000000000..85faabe4c Binary files /dev/null and b/factors_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/factors_files/figure-html/unnamed-chunk-21-1.png b/factors_files/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 000000000..9261d1ebf Binary files /dev/null and b/factors_files/figure-html/unnamed-chunk-21-1.png differ diff --git a/factors_files/figure-html/unnamed-chunk-21-2.png b/factors_files/figure-html/unnamed-chunk-21-2.png new file mode 100644 index 000000000..0e4d8e41d Binary files /dev/null and b/factors_files/figure-html/unnamed-chunk-21-2.png differ diff --git a/factors_files/figure-html/unnamed-chunk-22-1.png b/factors_files/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 000000000..27c38932a Binary files /dev/null and b/factors_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/functions.html b/functions.html new file mode 100644 index 000000000..ff675a319 --- /dev/null +++ b/functions.html @@ -0,0 +1,1486 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 25  Functions + + + + + + + + + + + + + + + + + + + + + + + + +
        +
        + +
        + + + +
        +

        25  Functions

        +
        + + + +
        + + + + +
        + + +

        +25.1 Introduction

        +

        One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has four big advantages over using copy-and-paste:

        +
1. You can give a function an evocative name that makes your code easier to understand.

2. As requirements change, you only need to update code in one place, instead of many.

3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

4. It makes it easier to reuse work from project-to-project, increasing your productivity over time.
        +

        A good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). In this chapter, you’ll learn about three useful types of functions:

        +
• Vector functions take one or more vectors as input and return a vector as output.

• Data frame functions take a data frame as input and return a data frame as output.

• Plot functions take a data frame as input and return a plot as output.

Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for general functions and plotting functions to see even more functions.

        +

        +25.1.1 Prerequisites

        +

        We’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.

        + +

        +25.2 Vector functions

        +

        We’ll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?

        +
        +
        df <- tibble(
        +  a = rnorm(5),
        +  b = rnorm(5),
        +  c = rnorm(5),
        +  d = rnorm(5),
        +)
        +
        +df |> mutate(
        +  a = (a - min(a, na.rm = TRUE)) / 
        +    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
        +  b = (b - min(b, na.rm = TRUE)) / 
        +    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
        +  c = (c - min(c, na.rm = TRUE)) / 
        +    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
        +  d = (d - min(d, na.rm = TRUE)) / 
        +    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
        +)
        +#> # A tibble: 5 × 4
        +#>       a     b     c     d
        +#>   <dbl> <dbl> <dbl> <dbl>
        +#> 1 0.339  2.59 0.291 0    
        +#> 2 0.880  0    0.611 0.557
        +#> 3 0      1.37 1     0.752
        +#> 4 0.795  1.37 0     1    
        +#> 5 1      1.34 0.580 0.394
        +
        +

        You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an a to a b. Preventing this type of mistake is one very good reason to learn how to write functions.

        +

        +25.2.1 Writing a function

        +

To write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary. If we take the code above and pull it outside of mutate(), it’s a little easier to see the pattern because each repetition is now one line:

        +
        +
        (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
        +(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
        +(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
        +(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))  
        +
        +

To make this a bit clearer we can replace the bit that varies with █:

        +
        +
        (█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
        +
        +

        To turn this into a function you need three things:

        +
1. A name. Here we’ll use rescale01 because this function rescales a vector to lie between 0 and 1.

2. The arguments. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it x because this is the conventional name for a numeric vector.

3. The body. The body is the code that’s repeated across all the calls.
        +

        Then you create a function by following the template:

        +
        +
        name <- function(arguments) {
        +  body
        +}
        +
        +

        For this case that leads to:

        +
        +
        rescale01 <- function(x) {
        +  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
        +}
        +
        +

        At this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:

        +
        +
        rescale01(c(-10, 0, 10))
        +#> [1] 0.0 0.5 1.0
        +rescale01(c(1, 2, 3, NA, 5))
        +#> [1] 0.00 0.25 0.50   NA 1.00
        +
        +

        Then you can rewrite the call to mutate() as:

        +
        +
        df |> mutate(
        +  a = rescale01(a),
        +  b = rescale01(b),
        +  c = rescale01(c),
        +  d = rescale01(d),
        +)
        +#> # A tibble: 5 × 4
        +#>       a     b     c     d
        +#>   <dbl> <dbl> <dbl> <dbl>
        +#> 1 0.339 1     0.291 0    
        +#> 2 0.880 0     0.611 0.557
        +#> 3 0     0.530 1     0.752
        +#> 4 0.795 0.531 0     1    
        +#> 5 1     0.518 0.580 0.394
        +
        +

        (In Capítulo 26, you’ll learn how to use across() to reduce the duplication even further so all you need is df |> mutate(across(a:d, rescale01))).

        +

        +25.2.2 Improving our function

        +

You might notice that the rescale01() function does some unnecessary work — instead of computing min() twice and max() once, we could compute both the minimum and maximum in one step with range():

        +
        +
        rescale01 <- function(x) {
        +  rng <- range(x, na.rm = TRUE)
        +  (x - rng[1]) / (rng[2] - rng[1])
        +}
        +
        +

        Or you might try this function on a vector that includes an infinite value:

        +
        +
        x <- c(1:10, Inf)
        +rescale01(x)
        +#>  [1]   0   0   0   0   0   0   0   0   0   0 NaN
        +
        +

        That result is not particularly useful so we could ask range() to ignore infinite values:

        +
        +
        rescale01 <- function(x) {
        +  rng <- range(x, na.rm = TRUE, finite = TRUE)
        +  (x - rng[1]) / (rng[2] - rng[1])
        +}
        +
        +rescale01(x)
        +#>  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
        +#>  [8] 0.7777778 0.8888889 1.0000000       Inf
        +
        +

        These changes illustrate an important benefit of functions: because we’ve moved the repeated code into a function, we only need to make the change in one place.

        +

        +25.2.3 Mutate functions

        +

Now that you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, i.e. functions that work well inside of mutate() and filter() because they return an output of the same length as the input.

        +

        Let’s start with a simple variation of rescale01(). Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:

        +
        +
        z_score <- function(x) {
        +  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
        +}
        +
        +
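As a quick check (our own example, not from the original), z_score() should map a symmetric sequence to values centered on zero:

z_score(c(1, 2, 3, 4, 5))
#> [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111
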

Or maybe you want to wrap up a straightforward case_when() and give it a useful name. For example, this clamp() function ensures all values of a vector lie between a minimum and a maximum:

        +
        +
        clamp <- function(x, min, max) {
        +  case_when(
        +    x < min ~ min,
        +    x > max ~ max,
        +    .default = x
        +  )
        +}
        +
        +clamp(1:10, min = 3, max = 7)
        +#>  [1] 3 3 3 4 5 6 7 7 7 7
        +
        +

        Of course functions don’t just need to work with numeric variables. You might want to do some repeated string manipulation. Maybe you need to make the first character upper case:

        +
        +
        first_upper <- function(x) {
        +  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
        +  x
        +}
        +
        +first_upper("hello")
        +#> [1] "Hello"
        +
        +

        Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:

        +
        +
        # https://twitter.com/NVlabormarket/status/1571939851922198530
        +clean_number <- function(x) {
        +  is_pct <- str_detect(x, "%")
        +  num <- x |> 
        +    str_remove_all("%") |> 
        +    str_remove_all(",") |> 
        +    str_remove_all(fixed("$")) |> 
        +    as.numeric()
        +  if_else(is_pct, num / 100, num)
        +}
        +
        +clean_number("$12,300")
        +#> [1] 12300
        +clean_number("45%")
        +#> [1] 0.45
        +
        +

        Sometimes your functions will be highly specialized for one data analysis step. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with NA:

        +
        +
        fix_na <- function(x) {
        +  if_else(x %in% c(997, 998, 999), NA, x)
        +}
        +
        +
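For example (our own illustration, not in the original), applied to a vector containing those codes:

fix_na(c(1, 997, 42, 999))
#> [1]  1 NA 42 NA
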

        We’ve focused on examples that take a single vector because we think they’re the most common. But there’s no reason that your function can’t take multiple vector inputs.
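For example, here’s a hypothetical two-input mutate function (our own sketch, not from the original; the name pct_change is made up) that computes the percent change between two vectors of the same length:

pct_change <- function(current, previous) {
  (current - previous) / previous * 100
}

pct_change(c(110, 95), c(100, 100))
#> [1] 10 -5
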

        +

        +25.2.4 Summary functions

        +

        Another important family of vector functions is summary functions, functions that return a single value for use in summarize(). Sometimes this can just be a matter of setting a default argument or two:

        +
        +
        commas <- function(x) {
        +  str_flatten(x, collapse = ", ", last = " and ")
        +}
        +
        +commas(c("cat", "dog", "pigeon"))
        +#> [1] "cat, dog and pigeon"
        +
        +

        Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:

        +
        +
        cv <- function(x, na.rm = FALSE) {
        +  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
        +}
        +
        +cv(runif(100, min = 0, max = 50))
        +#> [1] 0.5196276
        +cv(runif(100, min = 0, max = 500))
        +#> [1] 0.5652554
        +
        +

        Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:

        +
        +
        # https://twitter.com/gbganalyst/status/1571619641390252033
        +n_missing <- function(x) {
        +  sum(is.na(x))
        +} 
        +
        +

        You can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:

        +
        +
        # https://twitter.com/neilgcurrie/status/1571607727255834625
        +mape <- function(actual, predicted) {
        +  sum(abs((actual - predicted) / actual)) / length(actual)
        +}
        +
        +
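A quick sanity check (ours, not in the original): if every prediction is off by 10%, mape() should return 0.1:

mape(actual = c(100, 200), predicted = c(110, 180))
#> [1] 0.1
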
        +
        +
        + +
        +
        +RStudio +
        +
        +
        +

        Once you start writing functions, there are two RStudio shortcuts that are super useful:

        +
• To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.

• To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
        +
        +

        +25.2.5 Exercises

        +
          +
1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?

          +
          +
          mean(is.na(x))
          +mean(is.na(y))
          +mean(is.na(z))
          +
          +x / sum(x, na.rm = TRUE)
          +y / sum(y, na.rm = TRUE)
          +z / sum(z, na.rm = TRUE)
          +
          +round(x / sum(x, na.rm = TRUE) * 100, 1)
          +round(y / sum(y, na.rm = TRUE) * 100, 1)
          +round(z / sum(z, na.rm = TRUE) * 100, 1)
          +
          +
2. In the second variant of rescale01(), infinite values are left unchanged. Can you rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1?

3. Given a vector of birthdates, write a function to compute the age in years.

4. Write your own functions to compute the variance and skewness of a numeric vector. You can look up the definitions on Wikipedia or elsewhere.

5. Write both_na(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

6. Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?

          +
          +
          is_directory <- function(x) {
          +  file.info(x)$isdir
          +}
          +is_readable <- function(x) {
          +  file.access(x, 4) == 0
          +}
          +
          +

        +25.3 Data frame functions

        +

        Vector functions are useful for pulling out code that’s repeated within a dplyr verb. But you’ll often also repeat the verbs themselves, particularly within a large pipeline. When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function. Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.

        +

        To let you write a function that uses dplyr verbs, we’ll first introduce you to the challenge of indirection and how you can overcome it with embracing, {{ }}. With this theory under your belt, we’ll then show you a bunch of examples to illustrate what you might do with it.

        +

        +25.3.1 Indirection and tidy evaluation

        +

        When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: grouped_mean(). The goal of this function is to compute the mean of mean_var grouped by group_var:

        +
        +
        grouped_mean <- function(df, group_var, mean_var) {
        +  df |> 
        +    group_by(group_var) |> 
        +    summarize(mean(mean_var))
        +}
        +
        +

        If we try and use it, we get an error:

        +
        +
        diamonds |> grouped_mean(cut, carat)
        +#> Error in `group_by()`:
        +#> ! Must group by variables found in `.data`.
        +#> ✖ Column `group_var` is not found.
        +
        +

To make the problem a bit clearer, we can use a made-up data frame:

        +
        +
        df <- tibble(
        +  mean_var = 1,
        +  group_var = "g",
        +  group = 1,
        +  x = 10,
        +  y = 100
        +)
        +
        +df |> grouped_mean(group, x)
        +#> # A tibble: 1 × 2
        +#>   group_var `mean(mean_var)`
        +#>   <chr>                <dbl>
        +#> 1 g                        1
        +df |> grouped_mean(group, y)
        +#> # A tibble: 1 × 2
        +#>   group_var `mean(mean_var)`
        +#>   <chr>                <dbl>
        +#> 1 g                        1
        +
        +

        Regardless of how we call grouped_mean() it always does df |> group_by(group_var) |> summarize(mean(mean_var)), instead of df |> group_by(group) |> summarize(mean(x)) or df |> group_by(group) |> summarize(mean(y)). This is a problem of indirection, and it arises because dplyr uses tidy evaluation to allow you to refer to the names of variables inside your data frame without any special treatment.

        +

        Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell group_by() and summarize() not to treat group_var and mean_var as the name of the variables, but instead look inside them for the variable we actually want to use.

        +

        Tidy evaluation includes a solution to this problem called embracing 🤗. Embracing a variable means to wrap it in braces so (e.g.) var becomes {{ var }}. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of {{ }} as looking down a tunnel — {{ var }} will make a dplyr function look inside of var rather than looking for a variable called var.

        +

        So to make grouped_mean() work, we need to surround group_var and mean_var with {{ }}:

        +
        +
        grouped_mean <- function(df, group_var, mean_var) {
        +  df |> 
        +    group_by({{ group_var }}) |> 
        +    summarize(mean({{ mean_var }}))
        +}
        +
        +df |> grouped_mean(group, x)
        +#> # A tibble: 1 × 2
        +#>   group `mean(x)`
        +#>   <dbl>     <dbl>
        +#> 1     1        10
        +
        +

        Success!

        +

        +25.3.2 When to embrace?

        +

        So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately, this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:

• Data-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.

• Tidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.

        Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g., x + 1) or select (e.g., a:x).
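For instance (our own contrast, using the familiar flights data): filter() computes with its arguments, so they are data-masking, while select() picks variables, so its arguments use tidy-selection:

# data-masking: dep_delay > 60 is computed within the data frame
flights |> filter(dep_delay > 60)

# tidy-selection: year:day describes a set of columns
flights |> select(year:day)
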

        +

        In the following sections, we’ll explore the sorts of handy functions you might write once you understand embracing.

        +

        +25.3.3 Common use cases

        +

        If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:

        +
        +
        summary6 <- function(data, var) {
        +  data |> summarize(
        +    min = min({{ var }}, na.rm = TRUE),
        +    mean = mean({{ var }}, na.rm = TRUE),
        +    median = median({{ var }}, na.rm = TRUE),
        +    max = max({{ var }}, na.rm = TRUE),
        +    n = n(),
        +    n_miss = sum(is.na({{ var }})),
        +    .groups = "drop"
        +  )
        +}
        +
        +diamonds |> summary6(carat)
        +#> # A tibble: 1 × 6
        +#>     min  mean median   max     n n_miss
        +#>   <dbl> <dbl>  <dbl> <dbl> <int>  <int>
        +#> 1   0.2 0.798    0.7  5.01 53940      0
        +
        +

        (Whenever you wrap summarize() in a helper, we think it’s good practice to set .groups = "drop" to both avoid the message and leave the data in an ungrouped state.)

        +

        The nice thing about this function is, because it wraps summarize(), you can use it on grouped data:

        +
        +
        diamonds |> 
        +  group_by(cut) |> 
        +  summary6(carat)
        +#> # A tibble: 5 × 7
        +#>   cut         min  mean median   max     n n_miss
        +#>   <ord>     <dbl> <dbl>  <dbl> <dbl> <int>  <int>
        +#> 1 Fair       0.22 1.05    1     5.01  1610      0
        +#> 2 Good       0.23 0.849   0.82  3.01  4906      0
        +#> 3 Very Good  0.2  0.806   0.71  4    12082      0
        +#> 4 Premium    0.2  0.892   0.86  4.01 13791      0
        +#> 5 Ideal      0.2  0.703   0.54  3.5  21551      0
        +
        +

Furthermore, since the arguments to summarize() are data-masking, the var argument to summary6() is also data-masking. That means you can also summarize computed variables:

        +
        +
        diamonds |> 
        +  group_by(cut) |> 
        +  summary6(log10(carat))
        +#> # A tibble: 5 × 7
        +#>   cut          min    mean  median   max     n n_miss
        +#>   <ord>      <dbl>   <dbl>   <dbl> <dbl> <int>  <int>
        +#> 1 Fair      -0.658 -0.0273  0      0.700  1610      0
        +#> 2 Good      -0.638 -0.133  -0.0862 0.479  4906      0
        +#> 3 Very Good -0.699 -0.164  -0.149  0.602 12082      0
        +#> 4 Premium   -0.699 -0.125  -0.0655 0.603 13791      0
        +#> 5 Ideal     -0.699 -0.225  -0.268  0.544 21551      0
        +
        +

        To summarize multiple variables, you’ll need to wait until Seção 26.2, where you’ll learn how to use across().

        +

        Another popular summarize() helper function is a version of count() that also computes proportions:

        +
        +
        # https://twitter.com/Diabb6/status/1571635146658402309
        +count_prop <- function(df, var, sort = FALSE) {
        +  df |>
        +    count({{ var }}, sort = sort) |>
        +    mutate(prop = n / sum(n))
        +}
        +
        +diamonds |> count_prop(clarity)
        +#> # A tibble: 8 × 3
        +#>   clarity     n   prop
        +#>   <ord>   <int>  <dbl>
        +#> 1 I1        741 0.0137
        +#> 2 SI2      9194 0.170 
        +#> 3 SI1     13065 0.242 
        +#> 4 VS2     12258 0.227 
        +#> 5 VS1      8171 0.151 
        +#> 6 VVS2     5066 0.0939
        +#> # ℹ 2 more rows
        +
        +

        This function has three arguments: df, var, and sort, and only var needs to be embraced because it’s passed to count() which uses data-masking for all variables. Note that we use a default value for sort so that if the user doesn’t supply their own value it will default to FALSE.
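So (our own illustration; output omitted) a caller who wants the most common values first just flips that default:

diamonds |> count_prop(clarity, sort = TRUE)
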

        +

        Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we’ll allow the user to supply a condition:

        +
        +
        unique_where <- function(df, condition, var) {
        +  df |> 
        +    filter({{ condition }}) |> 
        +    distinct({{ var }}) |> 
        +    arrange({{ var }})
        +}
        +
        +# Find all the destinations in December
        +flights |> unique_where(month == 12, dest)
        +#> # A tibble: 96 × 1
        +#>   dest 
        +#>   <chr>
        +#> 1 ABQ  
        +#> 2 ALB  
        +#> 3 ATL  
        +#> 4 AUS  
        +#> 5 AVL  
        +#> 6 BDL  
        +#> # ℹ 90 more rows
        +
        +

        Here we embrace condition because it’s passed to filter() and var because it’s passed to distinct() and arrange().

        +

We’ve made all these examples take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects time_hour, carrier, and flight since they form the compound primary key that allows you to identify a row.

        +
        +
        subset_flights <- function(rows, cols) {
        +  flights |> 
        +    filter({{ rows }}) |> 
        +    select(time_hour, carrier, flight, {{ cols }})
        +}
        +
        +
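An illustrative call might look like this (our own example; output omitted). Note that rows is data-masked by filter() while cols is passed to select(), so it accepts tidy-selection helpers like contains():

subset_flights(dest == "IAH", contains("delay"))
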

        +25.3.4 Data-masking vs. tidy-selection

        +

        Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a count_missing() that counts the number of missing observations in rows. You might try writing something like:

        +
        +
        count_missing <- function(df, group_vars, x_var) {
        +  df |> 
        +    group_by({{ group_vars }}) |> 
        +    summarize(
        +      n_miss = sum(is.na({{ x_var }})),
        +      .groups = "drop"
        +    )
        +}
        +
        +flights |> 
        +  count_missing(c(year, month, day), dep_time)
        +#> Error in `group_by()`:
        +#> ℹ In argument: `c(year, month, day)`.
        +#> Caused by error:
        +#> ! `c(year, month, day)` must be size 336776 or 1, not 1010328.
        +
        +

        This doesn’t work because group_by() uses data-masking, not tidy-selection. We can work around that problem by using the handy pick() function, which allows you to use tidy-selection inside data-masking functions:

        +
        +
        count_missing <- function(df, group_vars, x_var) {
        +  df |> 
        +    group_by(pick({{ group_vars }})) |> 
        +    summarize(
        +      n_miss = sum(is.na({{ x_var }})),
        +      .groups = "drop"
        +  )
        +}
        +
        +flights |> 
        +  count_missing(c(year, month, day), dep_time)
        +#> # A tibble: 365 × 4
        +#>    year month   day n_miss
        +#>   <int> <int> <int>  <int>
        +#> 1  2013     1     1      4
        +#> 2  2013     1     2      8
        +#> 3  2013     1     3     10
        +#> 4  2013     1     4      6
        +#> 5  2013     1     5      3
        +#> 6  2013     1     6      1
        +#> # ℹ 359 more rows
        +
        +

        Another convenient use of pick() is to make a 2d table of counts. Here we count using all the variables in the rows and columns, then use pivot_wider() to rearrange the counts into a grid:

        +
        +
        # https://twitter.com/pollicipes/status/1571606508944719876
        +count_wide <- function(data, rows, cols) {
        +  data |> 
        +    count(pick(c({{ rows }}, {{ cols }}))) |> 
        +    pivot_wider(
        +      names_from = {{ cols }}, 
        +      values_from = n,
        +      names_sort = TRUE,
        +      values_fill = 0
        +    )
        +}
        +
        +diamonds |> count_wide(c(clarity, color), cut)
        +#> # A tibble: 56 × 7
        +#>   clarity color  Fair  Good `Very Good` Premium Ideal
        +#>   <ord>   <ord> <int> <int>       <int>   <int> <int>
        +#> 1 I1      D         4     8           5      12    13
        +#> 2 I1      E         9    23          22      30    18
        +#> 3 I1      F        35    19          13      34    42
        +#> 4 I1      G        53    19          16      46    16
        +#> 5 I1      H        52    14          12      46    38
        +#> 6 I1      I        34     9           8      24    17
        +#> # ℹ 50 more rows
        +
        +

        While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the pivot_wider() docs you can see that names_from uses tidy-selection.

        +

        +25.3.5 Exercises

        +
          +
1. Using the datasets from nycflights13, write a function that:

          +
            +
  1. Finds all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour.

            +
            +
            flights |> filter_severe()
            +
            +
  2. Counts the number of cancelled flights and the number of flights delayed by more than an hour.

            +
            +
            flights |> group_by(dest) |> summarize_severe()
            +
            +
  3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours:

            +
            +
            flights |> filter_severe(hours = 2)
            +
            +
  4. Summarizes the weather to compute the minimum, mean, and maximum of a user supplied variable:

            +
            +
            weather |> summarize_weather(temp)
            +
            +
  5. Converts the user supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e. hours + (minutes / 60)).

            +
            +
            flights |> standardize_time(sched_dep_time)
            +
            +
2. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: distinct(), count(), group_by(), rename_with(), slice_min(), slice_sample().

3. Generalize the following function so that you can supply any number of variables to count.

          +
          +
          count_prop <- function(df, var, sort = FALSE) {
          +  df |>
          +    count({{ var }}, sort = sort) |>
          +    mutate(prop = n / sum(n))
          +}
          +
          +

        +25.4 Plot functions

        +

        Instead of returning a data frame, you might want to return a plot. Fortunately, you can use the same techniques with ggplot2, because aes() is a data-masking function. For example, imagine that you’re making a lot of histograms:

        +
        +
        diamonds |> 
        +  ggplot(aes(x = carat)) +
        +  geom_histogram(binwidth = 0.1)
        +
        +diamonds |> 
        +  ggplot(aes(x = carat)) +
        +  geom_histogram(binwidth = 0.05)
        +
        +

        Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy as pie once you know that aes() is a data-masking function and you need to embrace:

        +
        +
        histogram <- function(df, var, binwidth = NULL) {
        +  df |> 
        +    ggplot(aes(x = {{ var }})) + 
        +    geom_histogram(binwidth = binwidth)
        +}
        +
        +diamonds |> histogram(carat, 0.1)
        +
        +

        A histogram of carats of diamonds, ranging from 0 to 5, showing a unimodal, right-skewed distribution with a peak between 0 to 1 carats.

        +
        +
        +

        Note that histogram() returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from |> to +:

        +
        +
        diamonds |> 
        +  histogram(carat, 0.1) +
        +  labs(x = "Size (in carats)", y = "Number of diamonds")
        +
        +

        +25.4.1 More variables

        +

        It’s straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:

        +
        +
        # https://twitter.com/tyler_js_smith/status/1574377116988104704
        +linearity_check <- function(df, x, y) {
        +  df |>
        +    ggplot(aes(x = {{ x }}, y = {{ y }})) +
        +    geom_point() +
        +    geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
        +    geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE) 
        +}
        +
        +starwars |> 
        +  filter(mass < 1000) |> 
        +  linearity_check(mass, height)
        +
        +

Scatterplot of height vs. mass of Star Wars characters showing a positive relationship. A smooth curve of the relationship is plotted in red, and the best fit line is plotted in blue.

        +
        +
        +

        Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:

        +
        +
        # https://twitter.com/ppaxisa/status/1574398423175921665
        +hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
        +  df |> 
        +    ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) + 
        +    stat_summary_hex(
        +      aes(color = after_scale(fill)), # make border same color as fill
        +      bins = bins, 
        +      fun = fun,
        +    )
        +}
        +
        +diamonds |> hex_plot(carat, price, depth)
        +
        +

        Hex plot of price vs. carat of diamonds showing a positive relationship. There are more diamonds that are less than 2 carats than more than 2 carats.

        +
        +
        +

        +25.4.2 Combining with other tidyverse

        +

Some of the most useful helpers combine a dash of data manipulation with ggplot2. For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using fct_infreq(). Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:

        +
        +
        sorted_bars <- function(df, var) {
        +  df |> 
        +    mutate({{ var }} := fct_rev(fct_infreq({{ var }})))  |>
        +    ggplot(aes(y = {{ var }})) +
        +    geom_bar()
        +}
        +
        +diamonds |> sorted_bars(clarity)
        +
        +

Bar plot of clarity of diamonds, where clarity is on the y-axis and counts are on the x-axis, and the bars are ordered in order of frequency: SI1, VS2, SI2, VS1, VVS2, VVS1, IF, I1.

        +
        +
        +

        We have to use a new operator here, := (commonly referred to as the “walrus operator”), because we are generating the variable name based on user-supplied data. Variable names go on the left hand side of =, but R’s syntax doesn’t allow anything to the left of = except for a single literal name. To work around this problem, we use the special operator := which tidy evaluation treats in exactly the same way as =.

        +
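To see := on its own (a minimal sketch of ours, with the made-up name standardize), here’s a function that overwrites a user-named column in place:

standardize <- function(df, var) {
  df |> mutate({{ var }} := ({{ var }} - mean({{ var }})) / sd({{ var }}))
}

tibble(x = c(1, 2, 3)) |> standardize(x)
#> # A tibble: 3 × 1
#>       x
#>   <dbl>
#> 1    -1
#> 2     0
#> 3     1
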

        Or maybe you want to make it easy to draw a bar plot just for a subset of the data:

        +
        +
        conditional_bars <- function(df, condition, var) {
        +  df |> 
        +    filter({{ condition }}) |> 
        +    ggplot(aes(x = {{ var }})) + 
        +    geom_bar()
        +}
        +
        +diamonds |> conditional_bars(cut == "Good", clarity)
        +
        +

        Bar plot of clarity of diamonds. The most common is SI1, then SI2, then VS2, then VS1, then VVS2, then VVS1, then I1, then lastly IF.

        +
        +
        +

        You can also get creative and display data summaries in other ways. You can find a cool application at https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b; it uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.

        +

        We’ll finish with a more complicated case: labelling the plots you create.

        +

        +25.4.3 Labeling

        +

        Remember the histogram function we showed you earlier?

        +
        +
        histogram <- function(df, var, binwidth = NULL) {
        +  df |> 
        +    ggplot(aes(x = {{ var }})) + 
        +    geom_histogram(binwidth = binwidth)
        +}
        +
        +

        Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from the package we haven’t talked about yet: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).

        +

        To solve the labeling problem we can use rlang::englue(). This works similarly to str_glue(), so any value wrapped in { } will be inserted into the string. But it also understands {{ }}, which automatically inserts the appropriate variable name:

        +
        +
        histogram <- function(df, var, binwidth) {
        +  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
        +  
        +  df |> 
        +    ggplot(aes(x = {{ var }})) + 
        +    geom_histogram(binwidth = binwidth) + 
        +    labs(title = label)
        +}
        +
        +diamonds |> histogram(carat, 0.1)
        +
        +

        Histogram of carats of diamonds, ranging from 0 to 5. The distribution is unimodal and right skewed with a peak between 0 to 1 carats.

        +
        +
        +

        You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.

        +
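For example (our own sketch; the name scatter is made up), the same trick can title a scatterplot:

scatter <- function(df, x, y) {
  label <- rlang::englue("{{x}} versus {{y}}")
  
  df |> 
    ggplot(aes(x = {{ x }}, y = {{ y }})) + 
    geom_point() + 
    labs(title = label)
}

diamonds |> scatter(carat, price)
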

        +25.4.4 Exercises

        +

        Build up a rich plotting function by incrementally implementing each of the steps below:

        +
          +
1. Draw a scatterplot given a dataset and x and y variables.

2. Add a line of best fit (i.e. a linear model with no standard errors).

3. Add a title.


        +25.5 Style

        +

        R doesn’t care what your function or arguments are called but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.

        +

        Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. mean() is better than compute_mean()), or accessing some property of an object (i.e. coef() is better than get_coefficients()). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.

        +
        +
        # Too short
        +f()
        +
        +# Not a verb, or descriptive
        +my_awesome_function()
        +
        +# Long, but clear
        +impute_missing()
        +collapse_years()
        +
        +

        R also doesn’t care about how you use white space in your functions but future readers will. Continue to follow the rules from Capítulo 4. Additionally, function() should always be followed by squiggly brackets ({}), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.

        +
        +
        # Missing extra two spaces
        +density <- function(color, facets, binwidth = 0.1) {
        +diamonds |> 
        +  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
        +  geom_freqpoly(binwidth = binwidth) +
        +  facet_wrap(vars({{ facets }}))
        +}
        +
        +# Pipe indented incorrectly
        +density <- function(color, facets, binwidth = 0.1) {
        +  diamonds |> 
        +  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
        +  geom_freqpoly(binwidth = binwidth) +
        +  facet_wrap(vars({{ facets }}))
        +}
        +
        +

        As you can see we recommend putting extra spaces inside of {{ }}. This makes it very obvious that something unusual is happening.

        +
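For reference, here’s the same function indented the way the rules above recommend (our rendering, not from the original):

# Correctly indented
density <- function(color, facets, binwidth = 0.1) {
  diamonds |> 
    ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}
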

        +25.5.1 Exercises

        +
          +
1. Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.

          +
          +
          f1 <- function(string, prefix) {
          +  str_sub(string, 1, str_length(prefix)) == prefix
          +}
          +
          +f3 <- function(x, y) {
          +  rep(y, length.out = length(x))
          +}
          +
          +
2. Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.

3. Make a case for why norm_r(), norm_d() etc. would be better than rnorm(), dnorm(). Make a case for the opposite. How could you make the names even clearer?


        +25.6 Summary

        +

        In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.

        +

        We have only shown you the bare minimum to get started with functions and there’s much more to learn. A few places to learn more are:

• To learn more about programming with tidy evaluation, see the vignettes Programming with dplyr (https://dplyr.tidyverse.org/articles/programming.html) and Programming with tidyr (https://tidyr.tidyverse.org/articles/programming.html).

• To learn more about reducing duplication in your ggplot2 code, read the Programming with ggplot2 chapter of the ggplot2 book (https://ggplot2-book.org).

• For more advice on function style, see the tidyverse style guide (https://style.tidyverse.org/functions.html).

        In the next chapter, we’ll dive into iteration which gives you further tools for reducing code duplication.

        + + +
        +
        +
        +
        + + + \ No newline at end of file diff --git a/functions_files/figure-html/unnamed-chunk-43-1.png b/functions_files/figure-html/unnamed-chunk-43-1.png new file mode 100644 index 000000000..ccf6c2308 Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-43-1.png differ diff --git a/functions_files/figure-html/unnamed-chunk-43-2.png b/functions_files/figure-html/unnamed-chunk-43-2.png new file mode 100644 index 000000000..588582388 Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-43-2.png differ diff --git a/functions_files/figure-html/unnamed-chunk-44-1.png b/functions_files/figure-html/unnamed-chunk-44-1.png new file mode 100644 index 000000000..ccf6c2308 Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-44-1.png differ diff --git a/functions_files/figure-html/unnamed-chunk-45-1.png b/functions_files/figure-html/unnamed-chunk-45-1.png new file mode 100644 index 000000000..87c9bdefb Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-45-1.png differ diff --git a/functions_files/figure-html/unnamed-chunk-46-1.png b/functions_files/figure-html/unnamed-chunk-46-1.png new file mode 100644 index 000000000..4dc6a36af Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-46-1.png differ diff --git a/functions_files/figure-html/unnamed-chunk-47-1.png b/functions_files/figure-html/unnamed-chunk-47-1.png new file mode 100644 index 000000000..412e05061 Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-47-1.png differ diff --git a/functions_files/figure-html/unnamed-chunk-48-1.png b/functions_files/figure-html/unnamed-chunk-48-1.png new file mode 100644 index 000000000..080628b2b Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-48-1.png differ diff --git a/functions_files/figure-html/unnamed-chunk-49-1.png b/functions_files/figure-html/unnamed-chunk-49-1.png new file mode 100644 index 000000000..80a673a95 Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-49-1.png differ diff --git a/functions_files/figure-html/unnamed-chunk-51-1.png b/functions_files/figure-html/unnamed-chunk-51-1.png new file mode 100644 index 000000000..fefcaf0c7 Binary files /dev/null and b/functions_files/figure-html/unnamed-chunk-51-1.png differ diff --git a/images/quarto-flow.png b/images/quarto-flow.png new file mode 100644 index 000000000..1c2900c7b Binary files /dev/null and b/images/quarto-flow.png differ diff --git a/images/tidy-1.png b/images/tidy-1.png new file mode 100644 index 000000000..4287d74c6 Binary files /dev/null and b/images/tidy-1.png differ diff --git a/images/visualization-grammar.png b/images/visualization-grammar.png new file mode 100644 index 000000000..f4e11c639 Binary files /dev/null and b/images/visualization-grammar.png differ diff --git a/images/visualization-stat-bar.png b/images/visualization-stat-bar.png new file mode 100644 index 000000000..2488b235d Binary files /dev/null and b/images/visualization-stat-bar.png differ diff --git a/images/visualization-themes.png b/images/visualization-themes.png new file mode 100644 index 000000000..816f2a95f Binary files /dev/null and b/images/visualization-themes.png differ diff --git a/import.html b/import.html index 37987dc2a..b97563016 100644 --- a/import.html +++ b/import.html @@ -27,8 +27,8 @@ - - + + @@ -133,29 +133,226 @@ 2  Fluxo de Trabalho: básico + + + + + + + + + + + + + + + + + + + + + +
        @@ -189,11 +386,11 @@

        Import

        In this part of the book you’ll learn how to access data stored in the following ways:

          -
        • In ?sec-import-spreadsheets, you’ll learn how to import data from Excel spreadsheets and Google Sheets.

        • -
        • In ?sec-import-databases, you’ll learn about getting data out of a database and into R (and you’ll also learn a little about how to get data out of R and into a database).

        • -
        • In ?sec-arrow, you’ll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it’s stored in the parquet format.

        • -
        • In ?sec-rectangling, you’ll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.

        • -
        • In ?sec-scraping, you’ll learn web “scraping”, the art and science of extracting data from web pages.

        • +
        • In Capítulo 20, you’ll learn how to import data from Excel spreadsheets and Google Sheets.

        • +
        • In Capítulo 21, you’ll learn about getting data out of a database and into R (and you’ll also learn a little about how to get data out of R and into a database).

        • +
        • In Capítulo 22, you’ll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it’s stored in the parquet format.

        • +
        • In Capítulo 23, you’ll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.

        • +
        • In Capítulo 24, you’ll learn web “scraping”, the art and science of extracting data from web pages.

        There are two important tidyverse packages that we don’t discuss here: haven and xml2. If you’re working with data from SPSS, Stata, and SAS files, check out the haven package, https://haven.tidyverse.org. If you’re working with XML data, check out the xml2 package, https://xml2.r-lib.org. Otherwise, you’ll need to do some research to figure which package you’ll need to use; google is your friend here 😃.

        @@ -432,13 +629,13 @@

        Import } }); diff --git a/index.html b/index.html index c2eb37f8b..e6d1a854f 100644 --- a/index.html +++ b/index.html @@ -147,28 +147,225 @@ 2  Fluxo de Trabalho: básico + + + + + + + @@ -217,7 +414,6 @@

Welcome

This website contains the in-progress Portuguese translation of the 2nd edition of the book “R for Data Science”.

        -

Only the chapters with translated versions will appear on this site.

If you would like to help translate this book, read the Guia de contribuição com a tradução do livro (the book’s translation contribution guide).

        diff --git a/intro.html b/intro.html index d994ed45f..7658a245e 100644 --- a/intro.html +++ b/intro.html @@ -167,29 +167,226 @@ 2  Fluxo de Trabalho: básico + + + + + + + + + + + + + + + + + + + + + +
        @@ -352,7 +549,7 @@

        Introdu


        -
1. Translation note: tidy is an English verb that means “to tidy up/organize”. Tidy data is a way of organizing data, which will be covered in the chapter ?sec-data-tidy.↩︎

      2. +
3. Translation note: tidy is an English verb that means “to tidy up/organize”. Tidy data is a way of organizing data, which will be covered in Capítulo 5.↩︎

4. Translation note: data manipulation is called data wrangling in English, because getting your data into a natural form to work with often feels like a fight (a wrangle)!↩︎

5. Translation note: “fitting in memory” refers to the computer’s RAM (random access memory), whose job is to temporarily store all the information the computer needs (for example, imported datasets).↩︎

6. If you want a comprehensive overview of all of RStudio’s features, see the RStudio User Guide at https://docs.posit.co/ide/user.↩︎

      7. diff --git a/iteration.html b/iteration.html new file mode 100644 index 000000000..726958cf5 --- /dev/null +++ b/iteration.html @@ -0,0 +1,1679 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 26  Iteration + + + + + + + + + + + + + + + + + + + + + + + + +
        +
        + +
        + + + +
        +

        26  Iteration

        +
        + + + +
        + + + + +
        + + +

        +26.1 Introduction

        +

        In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector x in R, you can just write 2 * x. In most other languages, you’d need to explicitly double each element of x using some sort of for loop.

        +

        This book has already given you a small but powerful number of tools that perform the same action for multiple “things”:

• facet_wrap() and facet_grid() draw a plot for each subset.

• group_by() plus summarize() compute summary statistics for each subset.

• unnest_wider() and unnest_longer() create new rows and columns for each element of a list-column.

        Now it’s time to learn some more general tools, often called functional programming tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we’ll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.

        +

        +26.1.1 Prerequisites

        +

        In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but purrr is new. We’re just going to use a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.

library(tidyverse)

        +26.2 Modifying multiple columns

        +

        Imagine you have this simple tibble and you want to count the number of observations and compute the median of every column.

        +
        +
        df <- tibble(
        +  a = rnorm(10),
        +  b = rnorm(10),
        +  c = rnorm(10),
        +  d = rnorm(10)
        +)
        +
        +

        You could do it with copy-and-paste:

        +
        +
        df |> summarize(
        +  n = n(),
        +  a = median(a),
        +  b = median(b),
        +  c = median(c),
        +  d = median(d),
        +)
        +#> # A tibble: 1 × 5
        +#>       n      a      b       c     d
        +#>   <int>  <dbl>  <dbl>   <dbl> <dbl>
        +#> 1    10 -0.246 -0.287 -0.0567 0.144
        +
        +

        That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead, you can use across():

        +
        +
        df |> summarize(
        +  n = n(),
        +  across(a:d, median),
        +)
        +#> # A tibble: 1 × 5
        +#>       n      a      b       c     d
        +#>   <int>  <dbl>  <dbl>   <dbl> <dbl>
        +#> 1    10 -0.246 -0.287 -0.0567 0.144
        +
        +

        across() has three particularly important arguments, which we’ll discuss in detail in the following sections. You’ll use the first two every time you use across(): the first argument, .cols, specifies which columns you want to iterate over, and the second argument, .fns, specifies what to do with each column. You can use the .names argument when you need additional control over the names of output columns, which is particularly important when you use across() with mutate(). We’ll also discuss two important variations, if_any() and if_all(), which work with filter().

        +

        +26.2.1 Selecting columns with .cols +

        +

        The first argument to across(), .cols, selects the columns to transform. This uses the same specifications as select(), Seção 3.3.2, so you can use functions like starts_with() and ends_with() to select columns based on their name.

        +

        There are two additional selection techniques that are particularly useful for across(): everything() and where(). everything() is straightforward: it selects every (non-grouping) column:

        +
        +
        df <- tibble(
        +  grp = sample(2, 10, replace = TRUE),
        +  a = rnorm(10),
        +  b = rnorm(10),
        +  c = rnorm(10),
        +  d = rnorm(10)
        +)
        +
        +df |> 
        +  group_by(grp) |> 
        +  summarize(across(everything(), median))
        +#> # A tibble: 2 × 5
        +#>     grp       a       b     c     d
        +#>   <int>   <dbl>   <dbl> <dbl> <dbl>
        +#> 1     1 -0.0935 -0.0163 0.363 0.364
        +#> 2     2  0.312  -0.0576 0.208 0.565
        +
        +

        Note grouping columns (grp here) are not included in across(), because they’re automatically preserved by summarize().

        +

        where() allows you to select columns based on their type:

        +
          +
• where(is.numeric) selects all numeric columns.

• where(is.character) selects all string columns.

• where(is.Date) selects all date columns.

• where(is.POSIXct) selects all date-time columns.

• where(is.logical) selects all logical columns.
        +

        Just like other selectors, you can combine these with Boolean algebra. For example, !where(is.numeric) selects all non-numeric columns, and starts_with("a") & where(is.logical) selects all logical columns whose name starts with “a”.

        +
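A small illustration (our own, with a made-up tibble df_types):

df_types <- tibble(
  x = c(1, 2),
  y = c("a", "b"),
  z = c(TRUE, FALSE)
)

# numeric or logical columns, via Boolean algebra on selectors
df_types |> 
  summarize(across(where(is.numeric) | where(is.logical), mean))
#> # A tibble: 1 × 2
#>       x     z
#>   <dbl> <dbl>
#> 1   1.5   0.5
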

        +26.2.2 Calling a single function

        +

        The second argument to across() defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (median, mean, str_flatten, …) to another function (across). This is one of the features that makes R a functional programming language.

        +

        It’s important to note that we’re passing this function to across(), so across() can call it; we’re not calling it ourselves. That means the function name should never be followed by (). If you forget, you’ll get an error:

        +
        +
        df |> 
        +  group_by(grp) |> 
        +  summarize(across(everything(), median()))
        +#> Error in `summarize()`:
        +#> ℹ In argument: `across(everything(), median())`.
        +#> Caused by error in `median.default()`:
        +#> ! argument "x" is missing, with no default
        +
        +

        This error arises because you’re calling the function with no input, e.g.:

        +
        +
        median()
        +#> Error in median.default(): argument "x" is missing, with no default
        +
        +

        +26.2.3 Calling multiple functions

        +

        In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? median() propagates those missing values, giving us a suboptimal output:

        +
        +
        rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
        +  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
        +}
        +
        +df_miss <- tibble(
        +  a = rnorm_na(5, 1),
        +  b = rnorm_na(5, 1),
        +  c = rnorm_na(5, 2),
        +  d = rnorm(5)
        +)
        +df_miss |> 
        +  summarize(
        +    across(a:d, median),
        +    n = n()
        +  )
        +#> # A tibble: 1 × 5
        +#>       a     b     c     d     n
        +#>   <dbl> <dbl> <dbl> <dbl> <int>
        +#> 1    NA    NA    NA  1.15     5
        +
        +

        It would be nice if we could pass along na.rm = TRUE to median() to remove these missing values. To do so, instead of calling median() directly, we need to create a new function that calls median() with the desired arguments:

        +
        +
        df_miss |> 
        +  summarize(
        +    across(a:d, function(x) median(x, na.rm = TRUE)),
        +    n = n()
        +  )
        +#> # A tibble: 1 × 5
        +#>       a     b      c     d     n
        +#>   <dbl> <dbl>  <dbl> <dbl> <int>
        +#> 1 0.139 -1.11 -0.387  1.15     5
        +
        +

This is a little verbose, so R comes with a handy shortcut: for this sort of throw-away, or anonymous, function you can replace function with \:

        +
        +
        df_miss |> 
        +  summarize(
        +    across(a:d, \(x) median(x, na.rm = TRUE)),
        +    n = n()
        +  )
        +
        +

        In either case, across() effectively expands to the following code:

        +
        +
        df_miss |> 
        +  summarize(
        +    a = median(a, na.rm = TRUE),
        +    b = median(b, na.rm = TRUE),
        +    c = median(c, na.rm = TRUE),
        +    d = median(d, na.rm = TRUE),
        +    n = n()
        +  )
        +
        +

        When we remove the missing values from the median(), it would be nice to know just how many values were removed. We can find that out by supplying two functions to across(): one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to .fns:

        +
        +
        df_miss |> 
        +  summarize(
        +    across(a:d, list(
        +      median = \(x) median(x, na.rm = TRUE),
        +      n_miss = \(x) sum(is.na(x))
        +    )),
        +    n = n()
        +  )
        +#> # A tibble: 1 × 9
        +#>   a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss
        +#>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int>
        +#> 1    0.139        1    -1.11        1   -0.387        2     1.15        0
        +#> # ℹ 1 more variable: n <int>
        +
        +

If you look carefully, you might intuit that the columns are named using a glue specification (Seção 14.3.2) like {.col}_{.fn} where .col is the name of the original column and .fn is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the .names argument to supply your own glue spec.

        +

        +26.2.4 Column names

        +

The result of across() is named according to the specification provided in the .names argument. We could specify our own if we wanted the name of the function to come first:

        +
        +
        df_miss |> 
        +  summarize(
        +    across(
        +      a:d,
        +      list(
        +        median = \(x) median(x, na.rm = TRUE),
        +        n_miss = \(x) sum(is.na(x))
        +      ),
        +      .names = "{.fn}_{.col}"
        +    ),
        +    n = n(),
        +  )
        +#> # A tibble: 1 × 9
        +#>   median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d
        +#>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int>
        +#> 1    0.139        1    -1.11        1   -0.387        2     1.15        0
        +#> # ℹ 1 more variable: n <int>
        +
        +

        The .names argument is particularly important when you use across() with mutate(). By default, the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns. For example, here we use coalesce() to replace NAs with 0:

        +
        +
        df_miss |> 
        +  mutate(
        +    across(a:d, \(x) coalesce(x, 0))
        +  )
        +#> # A tibble: 5 × 4
        +#>        a      b      c     d
        +#>    <dbl>  <dbl>  <dbl> <dbl>
        +#> 1  0.434 -1.25   0     1.60 
        +#> 2  0     -1.43  -0.297 0.776
        +#> 3 -0.156 -0.980  0     1.15 
        +#> 4 -2.61  -0.683 -0.785 2.13 
        +#> 5  1.11   0     -0.387 0.704
        +
        +

        If you’d like to instead create new columns, you can use the .names argument to give the output new names:

        +
        +
        df_miss |> 
        +  mutate(
        +    across(a:d, \(x) coalesce(x, 0), .names = "{.col}_na_zero")
        +  )
        +#> # A tibble: 5 × 8
        +#>        a      b      c     d a_na_zero b_na_zero c_na_zero d_na_zero
        +#>    <dbl>  <dbl>  <dbl> <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
        +#> 1  0.434 -1.25  NA     1.60      0.434    -1.25      0         1.60 
        +#> 2 NA     -1.43  -0.297 0.776     0        -1.43     -0.297     0.776
        +#> 3 -0.156 -0.980 NA     1.15     -0.156    -0.980     0         1.15 
        +#> 4 -2.61  -0.683 -0.785 2.13     -2.61     -0.683    -0.785     2.13 
        +#> 5  1.11  NA     -0.387 0.704     1.11      0        -0.387     0.704
        +
        +

        +26.2.5 Filtering

        +

        across() is a great match for summarize() and mutate() but it’s more awkward to use with filter(), because you usually combine multiple conditions with either | or &. It’s clear that across() can help to create multiple logical columns, but then what? So dplyr provides two variants of across() called if_any() and if_all():

        +
        +
        # same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
        +df_miss |> filter(if_any(a:d, is.na))
        +#> # A tibble: 4 × 4
        +#>        a      b      c     d
        +#>    <dbl>  <dbl>  <dbl> <dbl>
        +#> 1  0.434 -1.25  NA     1.60 
        +#> 2 NA     -1.43  -0.297 0.776
        +#> 3 -0.156 -0.980 NA     1.15 
        +#> 4  1.11  NA     -0.387 0.704
        +
        +# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
        +df_miss |> filter(if_all(a:d, is.na))
        +#> # A tibble: 0 × 4
        +#> # ℹ 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>
        +
        +

        +26.2.6 across() in functions

        +

        across() is particularly useful to program with because it allows you to operate on multiple columns. For example, Jacob Scott uses this little helper which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:

        +
        +
        expand_dates <- function(df) {
        +  df |> 
        +    mutate(
        +      across(where(is.Date), list(year = year, month = month, day = mday))
        +    )
        +}
        +
        +df_date <- tibble(
        +  name = c("Amy", "Bob"),
        +  date = ymd(c("2009-08-03", "2010-01-16"))
        +)
        +
        +df_date |> 
        +  expand_dates()
        +#> # A tibble: 2 × 5
        +#>   name  date       date_year date_month date_day
        +#>   <chr> <date>         <dbl>      <dbl>    <int>
        +#> 1 Amy   2009-08-03      2009          8        3
        +#> 2 Bob   2010-01-16      2010          1       16
        +
        +

across() also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in Section 25.3.2. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:

        +
        +
        summarize_means <- function(df, summary_vars = where(is.numeric)) {
        +  df |> 
        +    summarize(
        +      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
        +      n = n(),
        +      .groups = "drop"
        +    )
        +}
        +diamonds |> 
        +  group_by(cut) |> 
        +  summarize_means()
        +#> # A tibble: 5 × 9
        +#>   cut       carat depth table price     x     y     z     n
        +#>   <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
        +#> 1 Fair      1.05   64.0  59.1 4359.  6.25  6.18  3.98  1610
        +#> 2 Good      0.849  62.4  58.7 3929.  5.84  5.85  3.64  4906
        +#> 3 Very Good 0.806  61.8  58.0 3982.  5.74  5.77  3.56 12082
        +#> 4 Premium   0.892  61.3  58.7 4584.  5.97  5.94  3.65 13791
        +#> 5 Ideal     0.703  61.7  56.0 3458.  5.51  5.52  3.40 21551
        +
        +diamonds |> 
        +  group_by(cut) |> 
        +  summarize_means(c(carat, x:z))
        +#> # A tibble: 5 × 6
        +#>   cut       carat     x     y     z     n
        +#>   <ord>     <dbl> <dbl> <dbl> <dbl> <int>
        +#> 1 Fair      1.05   6.25  6.18  3.98  1610
        +#> 2 Good      0.849  5.84  5.85  3.64  4906
        +#> 3 Very Good 0.806  5.74  5.77  3.56 12082
        +#> 4 Premium   0.892  5.97  5.94  3.65 13791
        +#> 5 Ideal     0.703  5.51  5.52  3.40 21551
        +
        +

+26.2.7 Vs. pivot_longer()

        +

Before we go on, it’s worth pointing out an interesting connection between across() and pivot_longer() (Section 5.3). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:

        +
        +
        df |> 
        +  summarize(across(a:d, list(median = median, mean = mean)))
        +#> # A tibble: 1 × 8
        +#>   a_median a_mean b_median b_mean c_median c_mean d_median d_mean
        +#>      <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
        +#> 1   0.0380  0.205  -0.0163 0.0910    0.260 0.0716    0.540  0.508
        +
        +

        We could compute the same values by pivoting longer and then summarizing:

        +
        +
        long <- df |> 
        +  pivot_longer(a:d) |> 
        +  group_by(name) |> 
        +  summarize(
        +    median = median(value),
        +    mean = mean(value)
        +  )
        +long
        +#> # A tibble: 4 × 3
        +#>   name   median   mean
        +#>   <chr>   <dbl>  <dbl>
        +#> 1 a      0.0380 0.205 
        +#> 2 b     -0.0163 0.0910
        +#> 3 c      0.260  0.0716
        +#> 4 d      0.540  0.508
        +
        +

        And if you wanted the same structure as across() you could pivot again:

        +
        +
        long |> 
        +  pivot_wider(
        +    names_from = name,
        +    values_from = c(median, mean),
        +    names_vary = "slowest",
        +    names_glue = "{name}_{.value}"
        +  )
        +#> # A tibble: 1 × 8
        +#>   a_median a_mean b_median b_mean c_median c_mean d_median d_mean
        +#>      <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
        +#> 1   0.0380  0.205  -0.0163 0.0910    0.260 0.0716    0.540  0.508
        +
        +

        This is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with across(): when you have groups of columns that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:

        +
        +
        df_paired <- tibble(
        +  a_val = rnorm(10),
        +  a_wts = runif(10),
        +  b_val = rnorm(10),
        +  b_wts = runif(10),
        +  c_val = rnorm(10),
        +  c_wts = runif(10),
        +  d_val = rnorm(10),
        +  d_wts = runif(10)
        +)
        +
        +

        There’s currently no way to do this with across()4, but it’s relatively straightforward with pivot_longer():

        +
        +
        df_long <- df_paired |> 
        +  pivot_longer(
        +    everything(), 
        +    names_to = c("group", ".value"), 
        +    names_sep = "_"
        +  )
        +df_long
        +#> # A tibble: 40 × 3
        +#>   group    val   wts
        +#>   <chr>  <dbl> <dbl>
        +#> 1 a      0.715 0.518
        +#> 2 b     -0.709 0.691
        +#> 3 c      0.718 0.216
        +#> 4 d     -0.217 0.733
        +#> 5 a     -1.09  0.979
        +#> 6 b     -0.209 0.675
        +#> # ℹ 34 more rows
        +
        +df_long |> 
        +  group_by(group) |> 
        +  summarize(mean = weighted.mean(val, wts))
        +#> # A tibble: 4 × 2
        +#>   group    mean
        +#>   <chr>   <dbl>
        +#> 1 a      0.126 
        +#> 2 b     -0.0704
        +#> 3 c     -0.360 
        +#> 4 d     -0.248
        +
        +

        If needed, you could pivot_wider() this back to the original form.
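As a rough sketch of that round trip (our own addition, not from the original: it assumes dplyr 1.1's .by argument and a helper row column so that pivot_wider() can uniquely identify each value):

df_paired2 <- df_long |> 
  mutate(row = row_number(), .by = group) |>   # row index within each group
  pivot_wider(
    id_cols = row,
    names_from = group,
    values_from = c(val, wts),
    names_vary = "slowest",
    names_glue = "{group}_{.value}"
  ) |> 
  select(-row)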

        +

        +26.2.8 Exercises

        +
1. Practice your across() skills by:

   1. Computing the number of unique values in each column of palmerpenguins::penguins.

   2. Computing the mean of every column in mtcars.

   3. Grouping diamonds by cut, clarity, and color then counting the number of observations and computing the mean of each numeric column.

2. What happens if you use a list of functions in across(), but don't name them? How is the output named?

3. Adjust expand_dates() to automatically remove the date columns after they've been expanded. Do you need to embrace any arguments?

4. Explain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?

          +
          +
          show_missing <- function(df, group_vars, summary_vars = everything()) {
          +  df |> 
          +    group_by(pick({{ group_vars }})) |> 
          +    summarize(
          +      across({{ summary_vars }}, \(x) sum(is.na(x))),
          +      .groups = "drop"
          +    ) |>
          +    select(where(\(x) any(x > 0)))
          +}
          +nycflights13::flights |> show_missing(c(year, month, day))
          +
          +

        +26.3 Reading multiple files

        +

        In the previous section, you learned how to use dplyr::across() to repeat a transformation on multiple columns. In this section, you’ll learn how to use purrr::map() to do something to every file in a directory. Let’s start with a little motivation: imagine you have a directory full of excel spreadsheets5 you want to read. You could do it with copy and paste:

        +
        +
        data2019 <- readxl::read_excel("data/y2019.xlsx")
        +data2020 <- readxl::read_excel("data/y2020.xlsx")
        +data2021 <- readxl::read_excel("data/y2021.xlsx")
        +data2022 <- readxl::read_excel("data/y2022.xlsx")
        +
        +

        And then use dplyr::bind_rows() to combine them all together:

        +
        +
        data <- bind_rows(data2019, data2020, data2021, data2022)
        +
        +

        You can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four. The following sections show you how to automate this sort of task. There are three basic steps: use list.files() to list all the files in a directory, then use purrr::map() to read each of them into a list, then use purrr::list_rbind() to combine them into a single data frame. We’ll then discuss how you can handle situations of increasing heterogeneity, where you can’t do exactly the same thing to every file.

        +

        +26.3.1 Listing files in a directory

        +

        As the name suggests, list.files() lists the files in a directory. You’ll almost always use three arguments:

        +
• The first argument, path, is the directory to look in.

• pattern is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$ or [.]csv$ to find all files with a specified extension.

• full.names determines whether or not the directory name should be included in the output. You almost always want this to be TRUE.

        To make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containing data from the gapminder package. Each file contains one year’s worth of data for 142 countries. We can list them all with the appropriate call to list.files():

        +
        +
        paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
        +paths
        +#>  [1] "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
        +#>  [3] "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
        +#>  [5] "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx"
        +#>  [7] "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx"
        +#>  [9] "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx"
        +#> [11] "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"
        +
        +

        +26.3.2 Lists

        +

        Now that we have these 12 paths, we could call read_excel() 12 times to get 12 data frames:

        +
        +
        gapminder_1952 <- readxl::read_excel("data/gapminder/1952.xlsx")
        +gapminder_1957 <- readxl::read_excel("data/gapminder/1957.xlsx")
        +gapminder_1962 <- readxl::read_excel("data/gapminder/1962.xlsx")
+ ...
        +gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")
        +
        +

        But putting each sheet into its own variable is going to make it hard to work with them a few steps down the road. Instead, they’ll be easier to work with if we put them into a single object. A list is the perfect tool for this job:

        +
        +
        files <- list(
        +  readxl::read_excel("data/gapminder/1952.xlsx"),
        +  readxl::read_excel("data/gapminder/1957.xlsx"),
        +  readxl::read_excel("data/gapminder/1962.xlsx"),
        +  ...,
        +  readxl::read_excel("data/gapminder/2007.xlsx")
        +)
        +
        +

        Now that you have these data frames in a list, how do you get one out? You can use files[[i]] to extract the ith element:

        +
        +
        files[[3]]
        +#> # A tibble: 142 × 5
        +#>   country     continent lifeExp      pop gdpPercap
        +#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
        +#> 1 Afghanistan Asia         32.0 10267083      853.
        +#> 2 Albania     Europe       64.8  1728137     2313.
        +#> 3 Algeria     Africa       48.3 11000948     2551.
        +#> 4 Angola      Africa       34    4826015     4269.
        +#> 5 Argentina   Americas     65.1 21283783     7133.
        +#> 6 Australia   Oceania      70.9 10794968    12217.
        +#> # ℹ 136 more rows
        +
        +

We’ll come back to [[ in more detail in Section 27.3.

        +

+26.3.3 purrr::map() and list_rbind()

        +

The code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one-by-one. Happily, we can use purrr::map() to make even better use of our paths vector. map() is similar to across(), but instead of doing something to each column in a data frame, it does something to each element of a vector. map(x, f) is shorthand for:

        +
        +
        list(
        +  f(x[[1]]),
        +  f(x[[2]]),
        +  ...,
        +  f(x[[n]])
        +)
        +
        +

        So we can use map() to get a list of 12 data frames:

        +
        +
        files <- map(paths, readxl::read_excel)
        +length(files)
        +#> [1] 12
        +
        +files[[1]]
        +#> # A tibble: 142 × 5
        +#>   country     continent lifeExp      pop gdpPercap
        +#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
        +#> 1 Afghanistan Asia         28.8  8425333      779.
        +#> 2 Albania     Europe       55.2  1282697     1601.
        +#> 3 Algeria     Africa       43.1  9279525     2449.
        +#> 4 Angola      Africa       30.0  4232095     3521.
        +#> 5 Argentina   Americas     62.5 17876956     5911.
        +#> 6 Australia   Oceania      69.1  8691212    10040.
        +#> # ℹ 136 more rows
        +
        +

(This is another data structure that doesn’t display particularly compactly with str(), so you might want to load it into RStudio and inspect it with View()).

        +

        Now we can use purrr::list_rbind() to combine that list of data frames into a single data frame:

        +
        +
        list_rbind(files)
        +#> # A tibble: 1,704 × 5
        +#>   country     continent lifeExp      pop gdpPercap
        +#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
        +#> 1 Afghanistan Asia         28.8  8425333      779.
        +#> 2 Albania     Europe       55.2  1282697     1601.
        +#> 3 Algeria     Africa       43.1  9279525     2449.
        +#> 4 Angola      Africa       30.0  4232095     3521.
        +#> 5 Argentina   Americas     62.5 17876956     5911.
        +#> 6 Australia   Oceania      69.1  8691212    10040.
        +#> # ℹ 1,698 more rows
        +
        +

        Or we could do both steps at once in a pipeline:

        +
        +
        paths |> 
        +  map(readxl::read_excel) |> 
        +  list_rbind()
        +
        +

What if we want to pass in extra arguments to read_excel()? We use the same technique that we used with across(). For example, it’s often useful to peek at the first few rows of the data with n_max = 1:

        +
        +
        paths |> 
        +  map(\(path) readxl::read_excel(path, n_max = 1)) |> 
        +  list_rbind()
        +#> # A tibble: 12 × 5
        +#>   country     continent lifeExp      pop gdpPercap
        +#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
        +#> 1 Afghanistan Asia         28.8  8425333      779.
        +#> 2 Afghanistan Asia         30.3  9240934      821.
        +#> 3 Afghanistan Asia         32.0 10267083      853.
        +#> 4 Afghanistan Asia         34.0 11537966      836.
        +#> 5 Afghanistan Asia         36.1 13079460      740.
        +#> 6 Afghanistan Asia         38.4 14880372      786.
        +#> # ℹ 6 more rows
        +
        +

        This makes it clear that something is missing: there’s no year column because that value is recorded in the path, not in the individual files. We’ll tackle that problem next.

        +

        +26.3.4 Data in the path

        +

        Sometimes the name of the file is data itself. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things:

        +

        First, we name the vector of paths. The easiest way to do this is with the set_names() function, which can take a function. Here we use basename() to extract just the file name from the full path:

        +
        +
        paths |> set_names(basename) 
        +#>                  1952.xlsx                  1957.xlsx 
        +#> "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx" 
        +#>                  1962.xlsx                  1967.xlsx 
        +#> "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx" 
        +#>                  1972.xlsx                  1977.xlsx 
        +#> "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx" 
        +#>                  1982.xlsx                  1987.xlsx 
        +#> "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx" 
        +#>                  1992.xlsx                  1997.xlsx 
        +#> "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx" 
        +#>                  2002.xlsx                  2007.xlsx 
        +#> "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"
        +
        +

        Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:

        +
        +
        files <- paths |> 
        +  set_names(basename) |> 
        +  map(readxl::read_excel)
        +
        +

        That makes this call to map() shorthand for:

        +
        +
        files <- list(
        +  "1952.xlsx" = readxl::read_excel("data/gapminder/1952.xlsx"),
        +  "1957.xlsx" = readxl::read_excel("data/gapminder/1957.xlsx"),
        +  "1962.xlsx" = readxl::read_excel("data/gapminder/1962.xlsx"),
        +  ...,
        +  "2007.xlsx" = readxl::read_excel("data/gapminder/2007.xlsx")
        +)
        +
        +

        You can also use [[ to extract elements by name:

        +
        +
        files[["1962.xlsx"]]
        +#> # A tibble: 142 × 5
        +#>   country     continent lifeExp      pop gdpPercap
        +#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
        +#> 1 Afghanistan Asia         32.0 10267083      853.
        +#> 2 Albania     Europe       64.8  1728137     2313.
        +#> 3 Algeria     Africa       48.3 11000948     2551.
        +#> 4 Angola      Africa       34    4826015     4269.
        +#> 5 Argentina   Americas     65.1 21283783     7133.
        +#> 6 Australia   Oceania      70.9 10794968    12217.
        +#> # ℹ 136 more rows
        +
        +

Then we use the names_to argument to list_rbind() to tell it to save the names into a new column called year, and then use readr::parse_number() to extract the number from the string.

        +
        +
        paths |> 
        +  set_names(basename) |> 
        +  map(readxl::read_excel) |> 
        +  list_rbind(names_to = "year") |> 
        +  mutate(year = parse_number(year))
        +#> # A tibble: 1,704 × 6
        +#>    year country     continent lifeExp      pop gdpPercap
        +#>   <dbl> <chr>       <chr>       <dbl>    <dbl>     <dbl>
        +#> 1  1952 Afghanistan Asia         28.8  8425333      779.
        +#> 2  1952 Albania     Europe       55.2  1282697     1601.
        +#> 3  1952 Algeria     Africa       43.1  9279525     2449.
        +#> 4  1952 Angola      Africa       30.0  4232095     3521.
        +#> 5  1952 Argentina   Americas     62.5 17876956     5911.
        +#> 6  1952 Australia   Oceania      69.1  8691212    10040.
        +#> # ℹ 1,698 more rows
        +
        +

        In more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data. In that case, use set_names() (without any arguments) to record the full path, and then use tidyr::separate_wider_delim() and friends to turn them into useful columns.

        +
        +
        paths |> 
        +  set_names() |> 
        +  map(readxl::read_excel) |> 
        +  list_rbind(names_to = "year") |> 
        +  separate_wider_delim(year, delim = "/", names = c(NA, "dir", "file")) |> 
        +  separate_wider_delim(file, delim = ".", names = c("file", "ext"))
        +#> # A tibble: 1,704 × 8
        +#>   dir       file  ext   country     continent lifeExp      pop gdpPercap
        +#>   <chr>     <chr> <chr> <chr>       <chr>       <dbl>    <dbl>     <dbl>
        +#> 1 gapminder 1952  xlsx  Afghanistan Asia         28.8  8425333      779.
        +#> 2 gapminder 1952  xlsx  Albania     Europe       55.2  1282697     1601.
        +#> 3 gapminder 1952  xlsx  Algeria     Africa       43.1  9279525     2449.
        +#> 4 gapminder 1952  xlsx  Angola      Africa       30.0  4232095     3521.
        +#> 5 gapminder 1952  xlsx  Argentina   Americas     62.5 17876956     5911.
        +#> 6 gapminder 1952  xlsx  Australia   Oceania      69.1  8691212    10040.
        +#> # ℹ 1,698 more rows
        +
        +

        +26.3.5 Save your work

        +

        Now that you’ve done all this hard work to get to a nice tidy data frame, it’s a great time to save your work:

        +
        +
        gapminder <- paths |> 
        +  set_names(basename) |> 
        +  map(readxl::read_excel) |> 
        +  list_rbind(names_to = "year") |> 
        +  mutate(year = parse_number(year))
        +
        +write_csv(gapminder, "gapminder.csv")
        +
        +

Now when you come back to this problem in the future, you can read in a single csv file. For larger and richer datasets, using parquet might be a better choice than .csv, as discussed in Section 22.4.
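As a quick sketch of that alternative (assuming you have the arrow package installed), the last line above would become:

arrow::write_parquet(gapminder, "gapminder.parquet")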

        +

        If you’re working in a project, we suggest calling the file that does this sort of data prep work something like 0-cleanup.R. The 0 in the file name suggests that this should be run before anything else.

        +

        If your input data files change over time, you might consider learning a tool like targets to set up your data cleaning code to automatically re-run whenever one of the input files is modified.
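To give a flavor of what that might look like, here is a minimal, hypothetical _targets.R sketch; the target names are our own inventions, and the targets manual covers the real workflow:

# _targets.R
library(targets)
tar_option_set(packages = c("readxl", "purrr", "dplyr", "readr"))

list(
  # format = "file" makes targets watch the spreadsheets themselves, so
  # downstream targets re-run whenever any input file changes
  tar_target(
    paths,
    list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE),
    format = "file"
  ),
  tar_target(
    gapminder,
    paths |> 
      set_names(basename) |> 
      map(read_excel) |> 
      list_rbind(names_to = "year") |> 
      mutate(year = parse_number(year))
  )
)

You would then call tar_make() to rebuild gapminder only when something upstream has changed.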

        +

        +26.3.6 Many simple iterations

        +

Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you’re often better off doing multiple simple iterations.

        +

For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps, and then call map() once:

        +
        +
        process_file <- function(path) {
        +  df <- read_csv(path)
        +  
        +  df |> 
        +    filter(!is.na(id)) |> 
        +    mutate(id = tolower(id)) |> 
        +    pivot_longer(jan:dec, names_to = "month")
        +}
        +
        +paths |> 
        +  map(process_file) |> 
        +  list_rbind()
        +
        +

Alternatively, you could apply each step of process_file() to every file:

        +
        +
        paths |> 
        +  map(read_csv) |> 
        +  map(\(df) df |> filter(!is.na(id))) |> 
        +  map(\(df) df |> mutate(id = tolower(id))) |> 
        +  map(\(df) df |> pivot_longer(jan:dec, names_to = "month")) |> 
        +  list_rbind()
        +
        +

        We recommend this approach because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.

        +

        In this particular example, there’s another optimization you could make, by binding all the data frames together earlier. Then you can rely on regular dplyr behavior:

        +
        +
        paths |> 
        +  map(read_csv) |> 
        +  list_rbind() |> 
        +  filter(!is.na(id)) |> 
        +  mutate(id = tolower(id)) |> 
        +  pivot_longer(jan:dec, names_to = "month")
        +
        +

        +26.3.7 Heterogeneous data

        +

        Unfortunately, sometimes it’s not possible to go from map() straight to list_rbind() because the data frames are so heterogeneous that list_rbind() either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:

        +
        +
        files <- paths |> 
        +  map(readxl::read_excel) 
        +
        +

        Then a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills. One way to do so is with this handy df_types function6 that returns a tibble with one row for each column:

        +
        +
        df_types <- function(df) {
        +  tibble(
        +    col_name = names(df), 
        +    col_type = map_chr(df, vctrs::vec_ptype_full),
        +    n_miss = map_int(df, \(x) sum(is.na(x)))
        +  )
        +}
        +
        +df_types(gapminder)
        +#> # A tibble: 6 × 3
        +#>   col_name  col_type  n_miss
        +#>   <chr>     <chr>      <int>
        +#> 1 year      double         0
        +#> 2 country   character      0
        +#> 3 continent character      0
        +#> 4 lifeExp   double         0
        +#> 5 pop       double         0
        +#> 6 gdpPercap double         0
        +
        +

        You can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:

        +
        +
        files |> 
        +  map(df_types) |> 
        +  list_rbind(names_to = "file_name") |> 
        +  select(-n_miss) |> 
        +  pivot_wider(names_from = col_name, values_from = col_type)
        +#> # A tibble: 12 × 6
        +#>   file_name country   continent lifeExp pop    gdpPercap
        +#>   <chr>     <chr>     <chr>     <chr>   <chr>  <chr>    
        +#> 1 1952.xlsx character character double  double double   
        +#> 2 1957.xlsx character character double  double double   
        +#> 3 1962.xlsx character character double  double double   
        +#> 4 1967.xlsx character character double  double double   
        +#> 5 1972.xlsx character character double  double double   
        +#> 6 1977.xlsx character character double  double double   
        +#> # ℹ 6 more rows
        +
        +

        If the files have heterogeneous formats, you might need to do more processing before you can successfully merge them. Unfortunately, we’re now going to leave you to figure that out on your own, but you might want to read about map_if() and map_at(). map_if() allows you to selectively modify elements of a list based on their values; map_at() allows you to selectively modify elements based on their names.
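For example, here's a hedged map_if() sketch: imagine that a few files read pop in as text. The predicate and the fix are our own illustration, not something the gapminder files actually need:

files_fixed <- files |> 
  map_if(
    \(df) is.character(df$pop),                  # only touch the offending data frames
    \(df) df |> mutate(pop = parse_number(pop))  # coerce the text back to numbers
  )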

        +

        +26.3.8 Handling failures

        +

        Sometimes the structure of your data might be sufficiently wild that you can’t even read all the files with a single command. And then you’ll encounter one of the downsides of map(): it succeeds or fails as a whole. map() will either successfully read all of the files in a directory or fail with an error, reading zero files. This is annoying: why does one failure prevent you from accessing all the other successes?

        +

        Luckily, purrr comes with a helper to tackle this problem: possibly(). possibly() is what’s known as a function operator: it takes a function and returns a function with modified behavior. In particular, possibly() changes a function from erroring to returning a value that you specify:

        +
        +
        files <- paths |> 
        +  map(possibly(\(path) readxl::read_excel(path), NULL))
        +
        +data <- files |> list_rbind()
        +
        +

        This works particularly well here because list_rbind(), like many tidyverse functions, automatically ignores NULLs.
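A quick illustration of that behavior:

list_rbind(list(tibble(x = 1), NULL, tibble(x = 2)))
#> # A tibble: 2 × 1
#>       x
#>   <dbl>
#> 1     1
#> 2     2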

        +

        Now you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:

        +
        +
        failed <- map_vec(files, is.null)
        +paths[failed]
        +#> character(0)
        +
        +

        Then call the import function again for each failure and figure out what went wrong.
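If there are many failures, one way to collect every error message in a single pass is purrr::safely(), another function operator that captures errors instead of throwing them; this is a sketch, and the errors name is our own:

errors <- paths[failed] |> 
  map(safely(readxl::read_excel)) |>  # each element is a list(result, error)
  map("error")                        # pluck the error object for each path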

        +

        +26.4 Saving multiple outputs

        +

In the last section, you learned about map(), which is useful for reading multiple files into a single object. In this section, we’ll explore the opposite problem: how can you take one or more R objects and save them to one or more files? We’ll explore this challenge using three examples:

        +
• Saving multiple data frames into one database.

• Saving multiple data frames into multiple .csv files.

• Saving multiple plots to multiple .png files.

        +26.4.1 Writing to a database

        +

Sometimes, when working with many files, it’s not possible to fit all your data into memory at once, and you can’t do map(files, read_csv). One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.

        +

        If you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s duckdb_read_csv():

        +
        +
        con <- DBI::dbConnect(duckdb::duckdb())
        +duckdb::duckdb_read_csv(con, "gapminder", paths)
        +
        +

This would work well here, but we don’t have csv files; instead we have excel spreadsheets. So we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of csvs and the database that you’re working with doesn’t have one function that will load them all in.

        +

We need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it:

        +
        +
        template <- readxl::read_excel(paths[[1]])
        +template$year <- 1952
        +template
        +#> # A tibble: 142 × 6
        +#>   country     continent lifeExp      pop gdpPercap  year
        +#>   <chr>       <chr>       <dbl>    <dbl>     <dbl> <dbl>
        +#> 1 Afghanistan Asia         28.8  8425333      779.  1952
        +#> 2 Albania     Europe       55.2  1282697     1601.  1952
        +#> 3 Algeria     Africa       43.1  9279525     2449.  1952
        +#> 4 Angola      Africa       30.0  4232095     3521.  1952
        +#> 5 Argentina   Americas     62.5 17876956     5911.  1952
        +#> 6 Australia   Oceania      69.1  8691212    10040.  1952
        +#> # ℹ 136 more rows
        +
        +

        Now we can connect to the database, and use DBI::dbCreateTable() to turn our template into a database table:

        +
        +
        con <- DBI::dbConnect(duckdb::duckdb())
        +DBI::dbCreateTable(con, "gapminder", template)
        +
        +

        dbCreateTable() doesn’t use the data in template, just the variable names and types. So if we inspect the gapminder table now you’ll see that it’s empty but it has the variables we need with the types we expect:

        +
        +
        con |> tbl("gapminder")
        +#> # Source:   table<gapminder> [0 x 6]
        +#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]
        +#> # ℹ 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>,
        +#> #   gdpPercap <dbl>, year <dbl>
        +
        +

        Next, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. We can do that by combining read_excel() with DBI::dbAppendTable():

        +
        +
        append_file <- function(path) {
        +  df <- readxl::read_excel(path)
        +  df$year <- parse_number(basename(path))
        +  
        +  DBI::dbAppendTable(con, "gapminder", df)
        +}
        +
        +

        Now we need to call append_file() once for each element of paths. That’s certainly possible with map():

        +
        +
        paths |> map(append_file)
        +
        +

        But we don’t care about the output of append_file(), so instead of map() it’s slightly nicer to use walk(). walk() does exactly the same thing as map() but throws the output away:

        +
        +
        paths |> walk(append_file)
        +
        +

        Now we can see if we have all the data in our table:

        +
        +
        con |> 
        +  tbl("gapminder") |> 
        +  count(year)
        +#> # Source:   SQL [?? x 2]
        +#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]
        +#>    year     n
        +#>   <dbl> <dbl>
        +#> 1  1967   142
        +#> 2  1977   142
        +#> 3  1987   142
        +#> 4  2007   142
        +#> 5  1952   142
        +#> 6  1957   142
        +#> # ℹ more rows
        +
        +

        +26.4.2 Writing csv files

        +

        The same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds data and save one csv file for each clarity. First we need to make those individual datasets. There are many ways you could do that, but there’s one way we particularly like: group_nest().

        +
        +
        by_clarity <- diamonds |> 
        +  group_nest(clarity)
        +
        +by_clarity
        +#> # A tibble: 8 × 2
        +#>   clarity               data
        +#>   <ord>   <list<tibble[,9]>>
        +#> 1 I1               [741 × 9]
        +#> 2 SI2            [9,194 × 9]
        +#> 3 SI1           [13,065 × 9]
        +#> 4 VS2           [12,258 × 9]
        +#> 5 VS1            [8,171 × 9]
        +#> 6 VVS2           [5,066 × 9]
        +#> # ℹ 2 more rows
        +
        +

        This gives us a new tibble with eight rows and two columns. clarity is our grouping variable and data is a list-column containing one tibble for each unique value of clarity:

        +
        +
        by_clarity$data[[1]]
        +#> # A tibble: 741 × 9
        +#>   carat cut       color depth table price     x     y     z
        +#>   <dbl> <ord>     <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
        +#> 1  0.32 Premium   E      60.9    58   345  4.38  4.42  2.68
        +#> 2  1.17 Very Good J      60.2    61  2774  6.83  6.9   4.13
        +#> 3  1.01 Premium   F      61.8    60  2781  6.39  6.36  3.94
        +#> 4  1.01 Fair      E      64.5    58  2788  6.29  6.21  4.03
        +#> 5  0.96 Ideal     F      60.7    55  2801  6.37  6.41  3.88
        +#> 6  1.04 Premium   G      62.2    58  2801  6.46  6.41  4   
        +#> # ℹ 735 more rows
        +
        +

While we’re here, let’s create a column that gives the name of the output file, using mutate() and str_glue():

        +
        +
        by_clarity <- by_clarity |> 
        +  mutate(path = str_glue("diamonds-{clarity}.csv"))
        +
        +by_clarity
        +#> # A tibble: 8 × 3
        +#>   clarity               data path             
        +#>   <ord>   <list<tibble[,9]>> <glue>           
        +#> 1 I1               [741 × 9] diamonds-I1.csv  
        +#> 2 SI2            [9,194 × 9] diamonds-SI2.csv 
        +#> 3 SI1           [13,065 × 9] diamonds-SI1.csv 
        +#> 4 VS2           [12,258 × 9] diamonds-VS2.csv 
        +#> 5 VS1            [8,171 × 9] diamonds-VS1.csv 
        +#> 6 VVS2           [5,066 × 9] diamonds-VVS2.csv
        +#> # ℹ 2 more rows
        +
        +

        So if we were going to save these data frames by hand, we might write something like:

        +
        +
        write_csv(by_clarity$data[[1]], by_clarity$path[[1]])
        +write_csv(by_clarity$data[[2]], by_clarity$path[[2]])
        +write_csv(by_clarity$data[[3]], by_clarity$path[[3]])
        +...
+write_csv(by_clarity$data[[8]], by_clarity$path[[8]])
        +
        +

        This is a little different to our previous uses of map() because there are two arguments that are changing, not just one. That means we need a new function: map2(), which varies both the first and second arguments. And because we again don’t care about the output, we want walk2() rather than map2(). That gives us:

        +
        +
        walk2(by_clarity$data, by_clarity$path, write_csv)
        +
        +

        +26.4.3 Saving plots

        +

        We can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:

        +
        +
        carat_histogram <- function(df) {
        +  ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1)  
        +}
        +
        +carat_histogram(by_clarity$data[[1]])
        +
        +

[Figure: a histogram of carats of diamonds from the by_clarity dataset, ranging from 0 to 5 carats. The distribution is unimodal and right skewed with a peak around 1 carat.]

        +
        +
        +

        Now we can use map() to create a list of many plots7 and their eventual file paths:

        +
        +
        by_clarity <- by_clarity |> 
        +  mutate(
        +    plot = map(data, carat_histogram),
        +    path = str_glue("clarity-{clarity}.png")
        +  )
        +
        +

        Then use walk2() with ggsave() to save each plot:

        +
        +
        walk2(
        +  by_clarity$path,
        +  by_clarity$plot,
        +  \(path, plot) ggsave(path, plot, width = 6, height = 6)
        +)
        +
        +

        This is shorthand for:

        +
        +
        ggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)
        +ggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)
        +ggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)
        +...
        +ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)
        +

        +26.5 Summary

        +

In this chapter, you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a superpower: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the Functionals chapter of Advanced R and consulting the purrr website.

        +

If you know much about iteration in other languages, you might be surprised that we didn’t discuss the for loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group. And when you can’t, you can often use a functional programming tool like map() that does something to each element of a list. However, you will see for loops in wild-caught code, so you’ll learn about them in the next chapter where we’ll discuss some important base R tools.


        +
1. Anonymous, because we never explicitly gave it a name with <-. Another term programmers use for this is “lambda function”.↩︎

2. In older code you might see syntax that looks like ~ .x + 1. This is another way to write anonymous functions but it only works inside tidyverse functions and always uses the variable name .x. We now recommend the base syntax, \(x) x + 1.↩︎

3. You can’t currently change the order of the columns, but you could reorder them after the fact using relocate() or similar.↩︎

4. Maybe there will be one day, but currently we don’t see how.↩︎

5. If you instead had a directory of csv files with the same format, you can use the technique from Section 7.4.↩︎

6. We’re not going to explain how it works, but if you look at the docs for the functions used, you should be able to puzzle it out.↩︎

7. You can print by_clarity$plot to get a crude animation — you’ll get one plot for each element of plots.↩︎
\ No newline at end of file diff --git a/iteration_files/figure-html/unnamed-chunk-69-1.png b/iteration_files/figure-html/unnamed-chunk-69-1.png new file mode 100644 index 000000000..76380bb90 Binary files /dev/null and b/iteration_files/figure-html/unnamed-chunk-69-1.png differ diff --git a/joins.html b/joins.html new file mode 100644 index 000000000..c0650fecc --- /dev/null +++ b/joins.html @@ -0,0 +1,1559 @@
R para Ciência de Dados (2ª edição) - 19  Joins

        19  Joins


        +19.1 Introduction

        +

        It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:

        +
• Mutating joins, which add new variables to one data frame from matching observations in another.

• Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.

        We’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the datasets from the nycflights13 package, then use that knowledge to start joining data frames together. Next we’ll discuss how joins work, focusing on their action on the rows. We’ll finish up with a discussion of non-equi joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.

        +

        +19.1.1 Prerequisites

        +

        In this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.

+

        +19.2 Keys

        +

To understand joins, you need to first understand how two tables can be connected through a pair of keys, one in each table. In this section, you’ll learn about the two types of keys and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.

        +

        +19.2.1 Primary and foreign keys

        +

        Every join involves a pair of keys: a primary key and a foreign key. A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key. For example, in nycflights13:

        +
• airlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making carrier the primary key.

          +
          +
          airlines
          +#> # A tibble: 16 × 2
          +#>   carrier name                    
          +#>   <chr>   <chr>                   
          +#> 1 9E      Endeavor Air Inc.       
          +#> 2 AA      American Airlines Inc.  
          +#> 3 AS      Alaska Airlines Inc.    
          +#> 4 B6      JetBlue Airways         
          +#> 5 DL      Delta Air Lines Inc.    
          +#> 6 EV      ExpressJet Airlines Inc.
          +#> # ℹ 10 more rows
          +
          +
• airports records data about each airport. You can identify each airport by its three letter airport code, making faa the primary key.

          +
          +
          airports
          +#> # A tibble: 1,458 × 8
          +#>   faa   name                            lat   lon   alt    tz dst  
          +#>   <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr>
          +#> 1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A    
          +#> 2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A    
          +#> 3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A    
          +#> 4 06N   Randall Airport                41.4 -74.4   523    -5 A    
          +#> 5 09J   Jekyll Island Airport          31.1 -81.4    11    -5 A    
          +#> 6 0A9   Elizabethton Municipal Airpo…  36.4 -82.2  1593    -5 A    
          +#> # ℹ 1,452 more rows
          +#> # ℹ 1 more variable: tzone <chr>
          +
          +
• planes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.

          +
          +
          planes
          +#> # A tibble: 3,322 × 9
          +#>   tailnum  year type              manufacturer    model     engines
          +#>   <chr>   <int> <chr>             <chr>           <chr>       <int>
          +#> 1 N10156   2004 Fixed wing multi… EMBRAER         EMB-145XR       2
          +#> 2 N102UW   1998 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
          +#> 3 N103US   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
          +#> 4 N104UW   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
          +#> 5 N10575   2002 Fixed wing multi… EMBRAER         EMB-145LR       2
          +#> 6 N105UW   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
          +#> # ℹ 3,316 more rows
          +#> # ℹ 3 more variables: seats <int>, speed <int>, engine <chr>
          +
          +
• weather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.

          +
          +
          weather
          +#> # A tibble: 26,115 × 15
          +#>   origin  year month   day  hour  temp  dewp humid wind_dir
          +#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
          +#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270
          +#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250
          +#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240
          +#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250
          +#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260
          +#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240
          +#> # ℹ 26,109 more rows
          +#> # ℹ 6 more variables: wind_speed <dbl>, wind_gust <dbl>, …
          +
          +

        A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:

        +
• flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.

• flights$carrier is a foreign key that corresponds to the primary key airlines$carrier.

• flights$origin is a foreign key that corresponds to the primary key airports$faa.

• flights$dest is a foreign key that corresponds to the primary key airports$faa.

• flights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.

These relationships are summarized visually in Figure 19.1.

        +
        +
        +
        +

[Figure: the relationships between the airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa connects to flights$origin and flights$dest. planes$tailnum connects to flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier connects to flights$carrier. There are no direct connections between the airports, planes, airlines, and weather data frames.]

Figure 19.1: Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.
        +
        +
        +
        +

You’ll notice a nice feature in the design of these keys: the primary and foreign keys almost always have the same names, which, as you’ll see shortly, will make your joining life much easier. It’s also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. There’s only one exception: year means year of departure in flights and year of manufacture in planes. This will become important when we start actually joining tables together.
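As a hedged preview of why that matters, when both tables bring a column called year into a join, dplyr keeps the two apart with .x and .y suffixes; the select() calls here just keep the sketch small:

flights |> 
  select(tailnum, year) |> 
  left_join(planes |> select(tailnum, year), by = "tailnum")
#> the result has year.x (year of departure) and year.y (year of manufacture)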

        +

        +19.2.2 Checking primary keys

        +

Now that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one. This reveals that planes and weather both look good:

        +
        +
        planes |> 
        +  count(tailnum) |> 
        +  filter(n > 1)
        +#> # A tibble: 0 × 2
        +#> # ℹ 2 variables: tailnum <chr>, n <int>
        +
        +weather |> 
        +  count(time_hour, origin) |> 
        +  filter(n > 1)
        +#> # A tibble: 0 × 3
        +#> # ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>
        +
        +

        You should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation!

        +
        +
        planes |> 
        +  filter(is.na(tailnum))
        +#> # A tibble: 0 × 9
        +#> # ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
        +#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
        +
        +weather |> 
        +  filter(is.na(time_hour) | is.na(origin))
        +#> # A tibble: 0 × 15
        +#> # ℹ 15 variables: origin <chr>, year <int>, month <int>, day <int>,
        +#> #   hour <int>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, …
        +
        +

        +19.2.3 Surrogate keys

        +

        So far we haven’t talked about the primary key for flights. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.

        +

        After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:

        +
        +
        flights |> 
        +  count(time_hour, carrier, flight) |> 
        +  filter(n > 1)
        +#> # A tibble: 0 × 4
        +#> # ℹ 4 variables: time_hour <dttm>, carrier <chr>, flight <int>, n <int>
        +
        +

        Does the absence of duplicates automatically make time_hour-carrier-flight a primary key? It’s certainly a good start, but it doesn’t guarantee it. For example, are altitude and latitude a good primary key for airports?

        +
        +
        airports |>
        +  count(alt, lat) |> 
        +  filter(n > 1)
        +#> # A tibble: 1 × 3
        +#>     alt   lat     n
        +#>   <dbl> <dbl> <int>
        +#> 1    13  40.6     2
        +
        +

Identifying an airport by its altitude and latitude is clearly a bad idea, and in general it’s not possible to know from the data alone whether or not a combination of variables makes a good primary key. But for flights, the combination of time_hour, carrier, and flight seems reasonable because it would be really confusing for an airline and its customers if there were multiple flights with the same flight number in the air at the same time.

        +

        That said, we might be better off introducing a simple numeric surrogate key using the row number:

        +
        +
        flights2 <- flights |> 
        +  mutate(id = row_number(), .before = 1)
        +flights2
        +#> # A tibble: 336,776 × 20
        +#>      id  year month   day dep_time sched_dep_time dep_delay arr_time
        +#>   <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
        +#> 1     1  2013     1     1      517            515         2      830
        +#> 2     2  2013     1     1      533            529         4      850
        +#> 3     3  2013     1     1      542            540         2      923
        +#> 4     4  2013     1     1      544            545        -1     1004
        +#> 5     5  2013     1     1      554            600        -6      812
        +#> 6     6  2013     1     1      554            558        -4      740
        +#> # ℹ 336,770 more rows
        +#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …
        +
        +

        Surrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed 9am 2013-01-03.

        +

        +19.2.4 Exercises

        +
1. We forgot to draw the relationship between weather and airports in Figure 19.1. What is the relationship and how should it appear in the diagram?

2. weather only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to flights?

3. The year, month, day, hour, and origin variables almost form a compound key for weather, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?

4. We know that some days of the year are special and fewer people than usual fly on them (e.g., Christmas eve and Christmas day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?

5. Draw a diagram illustrating the connections between the Batting, People, and Salaries data frames in the Lahman package. Draw another diagram that shows the relationship between People, Managers, AwardsManagers. How would you characterize the relationship between the Batting, Pitching, and Fielding data frames?

        +19.3 Basic joins

        +

        Now that you understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions: left_join(), inner_join(), right_join(), full_join(), semi_join(), and anti_join(). They all have the same interface: they take a pair of data frames (x and y) and return a data frame. The order of the rows and columns in the output is primarily determined by x.

        +

        In this section, you’ll learn how to use one mutating join, left_join(), and two filtering joins, semi_join() and anti_join(). In the next section, you’ll learn exactly how these functions work, and about the remaining inner_join(), right_join() and full_join().

        +

        +19.3.1 Mutating joins

        +

        A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables1:

        +
        +
        flights2 <- flights |> 
        +  select(year, time_hour, origin, dest, tailnum, carrier)
        +flights2
        +#> # A tibble: 336,776 × 6
        +#>    year time_hour           origin dest  tailnum carrier
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>  
        +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA     
        +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA     
        +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA     
        +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6     
        +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL     
        +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA     
        +#> # ℹ 336,770 more rows
        +
        +

        There are four types of mutating join, but there’s one that you’ll use almost all of the time: left_join(). It’s special because the output will always have the same rows as x, the data frame you’re joining to2. The primary use of left_join() is to add in additional metadata. For example, we can use left_join() to add the full airline name to the flights2 data:

        +
        +
        flights2 |>
        +  left_join(airlines)
        +#> Joining with `by = join_by(carrier)`
        +#> # A tibble: 336,776 × 7
        +#>    year time_hour           origin dest  tailnum carrier name                
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>               
        +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines In…
        +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines In…
        +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines I…
        +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways     
        +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.
        +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines In…
        +#> # ℹ 336,770 more rows
        +
        +

        Or we could find out the temperature and wind speed when each plane departed:

        +
        +
        flights2 |> 
        +  left_join(weather |> select(origin, time_hour, temp, wind_speed))
        +#> Joining with `by = join_by(time_hour, origin)`
        +#> # A tibble: 336,776 × 8
        +#>    year time_hour           origin dest  tailnum carrier  temp wind_speed
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <dbl>      <dbl>
        +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA       39.0       12.7
        +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA       39.9       15.0
        +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA       39.0       15.0
        +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6       39.0       15.0
        +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL       39.9       16.1
        +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA       39.0       12.7
        +#> # ℹ 336,770 more rows
        +
        +

        Or what size of plane was flying:

        +
        +
        flights2 |> 
        +  left_join(planes |> select(tailnum, type, engines, seats))
        +#> Joining with `by = join_by(tailnum)`
        +#> # A tibble: 336,776 × 9
        +#>    year time_hour           origin dest  tailnum carrier type                
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>               
        +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed wing multi en…
        +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed wing multi en…
        +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed wing multi en…
        +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed wing multi en…
        +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed wing multi en…
        +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed wing multi en…
        +#> # ℹ 336,770 more rows
        +#> # ℹ 2 more variables: engines <int>, seats <int>
        +
        +

        When left_join() fails to find a match for a row in x, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number N3ALAA so the type, engines, and seats will be missing:

        +
        +
        flights2 |> 
        +  filter(tailnum == "N3ALAA") |> 
        +  left_join(planes |> select(tailnum, type, engines, seats))
        +#> Joining with `by = join_by(tailnum)`
        +#> # A tibble: 63 × 9
        +#>    year time_hour           origin dest  tailnum carrier type  engines seats
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>   <int> <int>
        +#> 1  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
        +#> 2  2013 2013-01-02 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
        +#> 3  2013 2013-01-03 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
        +#> 4  2013 2013-01-07 19:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
        +#> 5  2013 2013-01-08 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
        +#> 6  2013 2013-01-16 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
        +#> # ℹ 57 more rows
        +
        +

        We’ll come back to this problem a few times in the rest of the chapter.

        +

19.3.2 Specifying join keys

        +

By default, left_join() will use all variables that appear in both data frames as the join key, the so-called natural join. This is a useful heuristic, but it doesn’t always work. For example, what happens if we try to join flights2 with the complete planes dataset?

        +
        +
        flights2 |> 
        +  left_join(planes)
        +#> Joining with `by = join_by(year, tailnum)`
        +#> # A tibble: 336,776 × 13
        +#>    year time_hour           origin dest  tailnum carrier type  manufacturer
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
        +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
        +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
        +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
        +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
        +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        
        +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        
        +#> # ℹ 336,770 more rows
        +#> # ℹ 5 more variables: model <chr>, engines <int>, seats <int>, …
        +
        +

        We get a lot of missing matches because our join is trying to use tailnum and year as a compound key. Both flights and planes have a year column but they mean different things: flights$year is the year the flight occurred and planes$year is the year the plane was built. We only want to join on tailnum so we need to provide an explicit specification with join_by():

        +
        +
        flights2 |> 
        +  left_join(planes, join_by(tailnum))
        +#> # A tibble: 336,776 × 14
        +#>   year.x time_hour           origin dest  tailnum carrier year.y
        +#>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int>
        +#> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999
        +#> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998
        +#> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990
        +#> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012
        +#> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991
        +#> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012
        +#> # ℹ 336,770 more rows
        +#> # ℹ 7 more variables: type <chr>, manufacturer <chr>, model <chr>, …
        +
        +

        Note that the year variables are disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x or y argument. You can override the default suffixes with the suffix argument.

        +

        join_by(tailnum) is short for join_by(tailnum == tailnum). It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an equi join. You’ll learn about non-equi joins in Seção 19.5.

        +

Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the flights2 and airports tables: either by dest or origin:

        +
        +
        flights2 |> 
        +  left_join(airports, join_by(dest == faa))
        +#> # A tibble: 336,776 × 13
        +#>    year time_hour           origin dest  tailnum carrier name                
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>               
        +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bush Interco…
        +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bush Interco…
        +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl          
        +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>                
        +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfield Jackson …
        +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago Ohare Intl  
        +#> # ℹ 336,770 more rows
        +#> # ℹ 6 more variables: lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, …
        +
        +flights2 |> 
        +  left_join(airports, join_by(origin == faa))
        +#> # A tibble: 336,776 × 13
        +#>    year time_hour           origin dest  tailnum carrier name               
        +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>              
        +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newark Liberty Intl
        +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La Guardia         
        +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John F Kennedy Intl
        +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John F Kennedy Intl
        +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La Guardia         
        +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newark Liberty Intl
        +#> # ℹ 336,770 more rows
        +#> # ℹ 6 more variables: lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, …
        +
        +

        In older code you might see a different way of specifying the join keys, using a character vector:

        +
• by = "x" corresponds to join_by(x).
• by = c("a" = "x") corresponds to join_by(a == x).

        Now that it exists, we prefer join_by() since it provides a clearer and more flexible specification.
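For example, these two calls specify the same join; the first uses the older character-vector syntax, the second the modern equivalent:

flights2 |> left_join(airports, by = c("dest" = "faa"))
flights2 |> left_join(airports, join_by(dest == faa))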

        +

inner_join(), right_join(), and full_join() have the same interface as left_join(). The difference is which rows they keep: the left join keeps all rows in x, the right join keeps all rows in y, the full join keeps all rows in either x or y, and the inner join only keeps rows that occur in both x and y. We’ll come back to these in more detail later.
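To get a feel for the difference, you can compare row counts; a minimal sketch:

# left_join() keeps every row of flights2, even when a tailnum has no match;
# inner_join() drops those rows, so it returns fewer rows
flights2 |> left_join(planes, join_by(tailnum)) |> nrow()
flights2 |> inner_join(planes, join_by(tailnum)) |> nrow()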

        +

19.3.3 Filtering joins

        +

As you might guess, the primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins. Semi-joins keep all rows in x that have a match in y. For example, we could use a semi-join to filter the airports dataset to show just the origin airports:

        +
        +
        airports |> 
        +  semi_join(flights2, join_by(faa == origin))
        +#> # A tibble: 3 × 8
        +#>   faa   name                  lat   lon   alt    tz dst   tzone           
        +#>   <chr> <chr>               <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
        +#> 1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
        +#> 2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York
        +#> 3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York
        +
        +

        Or just the destinations:

        +
        +
        airports |> 
        +  semi_join(flights2, join_by(faa == dest))
        +#> # A tibble: 101 × 8
        +#>   faa   name                     lat    lon   alt    tz dst   tzone          
        +#>   <chr> <chr>                  <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
        +#> 1 ABQ   Albuquerque Internati…  35.0 -107.   5355    -7 A     America/Denver 
        +#> 2 ACK   Nantucket Mem           41.3  -70.1    48    -5 A     America/New_Yo…
        +#> 3 ALB   Albany Intl             42.7  -73.8   285    -5 A     America/New_Yo…
        +#> 4 ANC   Ted Stevens Anchorage…  61.2 -150.    152    -9 A     America/Anchor…
        +#> 5 ATL   Hartsfield Jackson At…  33.6  -84.4  1026    -5 A     America/New_Yo…
        +#> 6 AUS   Austin Bergstrom Intl   30.2  -97.7   542    -6 A     America/Chicago
        +#> # ℹ 95 more rows
        +
        +

        Anti-joins are the opposite: they return all rows in x that don’t have a match in y. They’re useful for finding missing values that are implicit in the data, the topic of Seção 18.3. Implicitly missing values don’t show up as NAs but instead only exist as an absence. For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:

        +
        +
        flights2 |> 
        +  anti_join(airports, join_by(dest == faa)) |> 
        +  distinct(dest)
        +#> # A tibble: 4 × 1
        +#>   dest 
        +#>   <chr>
        +#> 1 BQN  
        +#> 2 SJU  
        +#> 3 STT  
        +#> 4 PSE
        +
        +

        Or we can find which tailnums are missing from planes:

        +
        +
        flights2 |>
        +  anti_join(planes, join_by(tailnum)) |> 
        +  distinct(tailnum)
        +#> # A tibble: 722 × 1
        +#>   tailnum
        +#>   <chr>  
        +#> 1 N3ALAA 
        +#> 2 N3DUAA 
        +#> 3 N542MQ 
        +#> 4 N730MQ 
        +#> 5 N9EAMQ 
        +#> 6 N532UA 
        +#> # ℹ 716 more rows
        +
        +

19.3.4 Exercises

1. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

2. Imagine you’ve found the top 10 most popular destinations using this code:

   top_dest <- flights2 |>
     count(dest, sort = TRUE) |>
     head(10)

   How can you find all flights to those destinations?

3. Does every departing flight have corresponding weather data for that hour?

4. What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

5. Add a column to planes that lists every carrier that has flown that plane. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned in previous chapters.

6. Add the latitude and the longitude of the origin and destination airport to flights. Is it easier to rename the columns before or after the join?

7. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

   airports |>
     semi_join(flights, join_by(faa == dest)) |>
     ggplot(aes(x = lon, y = lat)) +
       borders("state") +
       geom_point() +
       coord_quickmap()

   You might want to use the size or color of the points to display the average delay for each airport.

8. What happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.

19.4 How do joins work?

        +

        Now that you’ve used joins a few times it’s time to learn more about how they work, focusing on how each row in x matches rows in y. We’ll begin by introducing a visual representation of joins, using the simple tibbles defined below and shown in Figura 19.2. In these examples we’ll use a single key called key and a single value column (val_x and val_y), but the ideas all generalize to multiple keys and multiple values.

        +
        +
        x <- tribble(
        +  ~key, ~val_x,
        +     1, "x1",
        +     2, "x2",
        +     3, "x3"
        +)
        +y <- tribble(
        +  ~key, ~val_y,
        +     1, "y1",
        +     2, "y2",
        +     4, "y3"
        +)
        +
        +
        +
        +
        +

        x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are colored: 1 is green, 2 is purple, 3 is orange, and 4 is yellow.

        +
        Figura 19.2: Graphical representation of two simple tables. The colored key columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.
        +
        +
        +
        +

        Figura 19.3 introduces the foundation for our visual representation. It shows all potential matches between x and y as the intersection between lines drawn from each row of x and each row of y. The rows and columns in the output are primarily determined by x, so the x table is horizontal and lines up with the output.

        +
        +
        +
        +

x and y are placed at right-angles, with horizontal lines extending from x and vertical lines extending from y. There are 3 rows in x and 3 rows in y, which leads to nine intersections representing nine potential matches.

        +
        Figura 19.3: To understand how joins work, it’s useful to think of every possible match. Here we show that with a grid of connecting lines.
        +
        +
        +
        +

        To describe a specific type of join, we indicate matches with dots. The matches determine the rows in the output, a new data frame that contains the key, the x values, and the y values. For example, Figura 19.4 shows an inner join, where rows are retained if and only if the keys are equal.

        +
        +
        +
        +

        x and y are placed at right-angles with lines forming a grid of potential matches. Keys 1 and 2 appear in both x and y, so we get a match, indicated by a dot. Each dot corresponds to a row in the output, so the resulting joined data frame has two rows.

        +
        Figura 19.4: An inner join matches each row in x to the row in y that has the same value of key. Each match becomes a row in the output.
        +
        +
        +
        +

        We can apply the same principles to explain the outer joins, which keep observations that appear in at least one of the data frames. These joins work by adding an additional “virtual” observation to each data frame. This observation has a key that matches if no other key matches, and values filled with NA. There are three types of outer joins:

        +
• A left join keeps all observations in x, Figura 19.5. Every row of x is preserved in the output because it can fall back to matching a row of NAs in y.

          +
          +
          +
          +

Compared to the previous diagram showing an inner join, the y table gets a new virtual row containing NA that will match any row in x that didn't otherwise match. This means that the output now has three rows. For key = 3, which matches this virtual row, val_y takes value NA.

          +
          Figura 19.5: A visual representation of the left join where every row in x appears in the output.
          +
          +
          +
          +
• A right join keeps all observations in y, Figura 19.6. Every row of y is preserved in the output because it can fall back to matching a row of NAs in x. The output still matches x as much as possible; any extra rows from y are added to the end.

          +
          +
          +
          +

Compared to the previous diagram showing a left join, the x table now gains a virtual row so that every row in y gets a match in x. val_x contains NA for the row in y that didn't match x.

          +
          Figura 19.6: A visual representation of the right join where every row of y appears in the output.
          +
          +
          +
          +
• A full join keeps all observations that appear in x or y, Figura 19.7. Every row of x and y is included in the output because both x and y have a fall-back row of NAs. Again, the output starts with all rows from x, followed by the remaining unmatched y rows.

          +
          +
          +
          +

Now both x and y have a virtual row that always matches. The result has 4 rows: keys 1, 2, 3, and 4 with all values from val_x and val_y; however, key 3's val_y and key 4's val_x are NA since those keys don't have a match in the other data frame.

          +
          Figura 19.7: A visual representation of the full join where every row in x and y appears in the output.
          +
          +
          +
          +

        Another way to show how the types of outer join differ is with a Venn diagram, as in Figura 19.8. However, this is not a great representation because while it might jog your memory about which rows are preserved, it fails to illustrate what’s happening with the columns.

        +
        +
        +
        +

        Venn diagrams for inner, full, left, and right joins. Each join represented with two intersecting circles representing data frames x and y, with x on the right and y on the left. Shading indicates the result of the join.

        +
        Figura 19.8: Venn diagrams showing the difference between inner, left, right, and full joins.
        +
        +
        +
        +

        The joins shown here are the so-called equi joins, where rows match if the keys are equal. Equi joins are the most common type of join, so we’ll typically omit the equi prefix, and just say “inner join” rather than “equi inner join”. We’ll come back to non-equi joins in Seção 19.5.

        +

19.4.1 Row matching

        +

So far we’ve explored what happens if a row in x matches zero or one row in y. What happens if it matches more than one row? To understand what’s going on, let’s first narrow our focus to the inner_join() and then draw a picture, Figura 19.9.

        +
        +
        +
        +

        A join diagram where x has key values 1, 2, and 3, and y has key values 1, 2, 2. The output has three rows because key 1 matches one row, key 2 matches two rows, and key 3 matches zero rows.

        +
        Figura 19.9: The three ways a row in x can match. x1 matches one row in y, x2 matches two rows in y, x3 matches zero rows in y. Note that while there are three rows in x and three rows in the output, there isn’t a direct correspondence between the rows.
        +
        +
        +
        +

        There are three possible outcomes for a row in x:

        +
• If it doesn’t match anything, it’s dropped.
• If it matches 1 row in y, it’s preserved.
• If it matches more than 1 row in y, it’s duplicated once for each match.

        In principle, this means that there’s no guaranteed correspondence between the rows in the output and the rows in x, but in practice, this rarely causes problems. There is, however, one particularly dangerous case which can cause a combinatorial explosion of rows. Imagine joining the following two tables:

        +
        +
        df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
        +df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
        +
        +

        While the first row in df1 only matches one row in df2, the second and third rows both match two rows. This is sometimes called a many-to-many join, and will cause dplyr to emit a warning:

        +
        +
        df1 |> 
        +  inner_join(df2, join_by(key))
        +#> Warning in inner_join(df1, df2, join_by(key)): Detected an unexpected many-to-many relationship between `x` and `y`.
        +#> ℹ Row 2 of `x` matches multiple rows in `y`.
        +#> ℹ Row 2 of `y` matches multiple rows in `x`.
        +#> ℹ If a many-to-many relationship is expected, set `relationship =
        +#>   "many-to-many"` to silence this warning.
        +#> # A tibble: 5 × 3
        +#>     key val_x val_y
        +#>   <dbl> <chr> <chr>
        +#> 1     1 x1    y1   
        +#> 2     2 x2    y2   
        +#> 3     2 x2    y3   
        +#> 4     2 x3    y2   
        +#> 5     2 x3    y3
        +
        +

        If you are doing this deliberately, you can set relationship = "many-to-many", as the warning suggests.
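If the many-to-many relationship really is intended, spelling it out silences the warning:

df1 |> 
  inner_join(df2, join_by(key), relationship = "many-to-many")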

        +

19.4.2 Filtering joins

        +

        The number of matches also determines the behavior of the filtering joins. The semi-join keeps rows in x that have one or more matches in y, as in Figura 19.10. The anti-join keeps rows in x that match zero rows in y, as in Figura 19.11. In both cases, only the existence of a match is important; it doesn’t matter how many times it matches. This means that filtering joins never duplicate rows like mutating joins do.
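You can verify this with df1 and df2 from above: even though key 2 in df1 matches two rows of df2, semi_join() still returns each row of df1 exactly once:

df1 |> 
  semi_join(df2, join_by(key))  # three rows, no duplication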

        +
        +
        +
        +

        A join diagram with old friends x and y. In a semi join, only the presence of a match matters so the output contains the same columns as x.

        +
        Figura 19.10: In a semi-join it only matters that there is a match; otherwise values in y don’t affect the output.
        +
        +
        +
        +
        +
        +
        +

        An anti-join is the inverse of a semi-join so matches are drawn with red lines indicating that they will be dropped from the output.

        +
        Figura 19.11: An anti-join is the inverse of a semi-join, dropping rows from x that have a match in y.
        +
        +
        +
        +

19.5 Non-equi joins

        +

        So far you’ve only seen equi joins, joins where the rows match if the x key equals the y key. Now we’re going to relax that restriction and discuss other ways of determining if a pair of rows match.

        +

But before we can do that, we need to revisit a simplification we made above. In equi joins the x and y keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with keep = TRUE, leading to the code below and the re-drawn inner_join() in Figura 19.12.

        +
        +
        x |> inner_join(y, join_by(key == key), keep = TRUE)
        +#> # A tibble: 2 × 4
        +#>   key.x val_x key.y val_y
        +#>   <dbl> <chr> <dbl> <chr>
        +#> 1     1 x1        1 y1   
        +#> 2     2 x2        2 y2
        +
        +
        +
        +
        +

A join diagram showing an inner join between x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. The values of key.x and key.y are identical, which is why we usually only show one.

        +
        Figura 19.12: An inner join showing both x and y keys in the output.
        +
        +
        +
        +

When we move away from equi joins we’ll always show the keys, because the key values will often be different. For example, instead of matching only when the x$key and y$key are equal, we could match whenever the x$key is greater than or equal to the y$key, leading to Figura 19.13. dplyr’s join functions understand the distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.
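The join illustrated in Figura 19.13 can be written with an inequality inside join_by(); a sketch using the x and y tables from above:

x |> inner_join(y, join_by(key >= key))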

        +
        +
        +
        +

        A join diagram illustrating join_by(key >= key). The first row of x matches one row of y and the second and thirds rows each match two rows. This means the output has five rows containing each of the following (key.x, key.y) pairs: (1, 1), (2, 1), (2, 2), (3, 1), (3, 2).

        +
        Figura 19.13: A non-equi join where the x key must be greater than or equal to the y key. Many rows generate multiple matches.
        +
        +
        +
        +

        Non-equi join isn’t a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi join:

        +
• Cross joins match every pair of rows.
• Inequality joins use <, <=, >, and >= instead of ==.
• Rolling joins are similar to inequality joins but only find the closest match.
• Overlap joins are a special type of inequality join designed to work with ranges.

        Each of these is described in more detail in the following sections.

        +

19.5.1 Cross joins

        +

        A cross join matches everything, as in Figura 19.14, generating the Cartesian product of rows. This means the output will have nrow(x) * nrow(y) rows.

        +
        +
        +
        +

        A join diagram showing a dot for every combination of x and y.

        +
        Figura 19.14: A cross join matches each row in x with every row in y.
        +
        +
        +
        +

        Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining df to itself, this is sometimes called a self-join. Cross joins use a different join function because there’s no distinction between inner/left/right/full when you’re matching every row.

        +
        +
        df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
        +df |> cross_join(df)
        +#> # A tibble: 16 × 2
        +#>   name.x name.y
        +#>   <chr>  <chr> 
        +#> 1 John   John  
        +#> 2 John   Simon 
        +#> 3 John   Tracy 
        +#> 4 John   Max   
        +#> 5 Simon  John  
        +#> 6 Simon  Simon 
        +#> # ℹ 10 more rows
        +
        +

19.5.2 Inequality joins

        +

        Inequality joins use <, <=, >=, or > to restrict the set of possible matches, as in Figura 19.13 and Figura 19.15.

        +
        +
        +
        +

        A diagram depicting an inequality join where a data frame x is joined by a data frame y where the key of x is less than the key of y, resulting in a triangular shape in the top-left corner.

        +
        Figura 19.15: An inequality join where x is joined to y on rows where the key of x is less than the key of y. This makes a triangular shape in the top-left corner.
        +
        +
        +
        +

        Inequality joins are extremely general, so general that it’s hard to come up with meaningful specific use cases. One small useful technique is to use them to restrict the cross join so that instead of generating all permutations, we generate all combinations:

        +
        +
        df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
        +
        +df |> inner_join(df, join_by(id < id))
        +#> # A tibble: 6 × 4
        +#>    id.x name.x  id.y name.y
        +#>   <int> <chr>  <int> <chr> 
        +#> 1     1 John       2 Simon 
        +#> 2     1 John       3 Tracy 
        +#> 3     1 John       4 Max   
        +#> 4     2 Simon      3 Tracy 
        +#> 5     2 Simon      4 Max   
        +#> 6     3 Tracy      4 Max
        +
        +

19.5.3 Rolling joins

        +

        Rolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, you get just the closest row, as in Figura 19.16. You can turn any inequality join into a rolling join by adding closest(). For example join_by(closest(x <= y)) matches the smallest y that’s greater than or equal to x, and join_by(closest(x > y)) matches the biggest y that’s less than x.
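As a minimal sketch with the x and y tables from earlier, closest() keeps only the nearest qualifying match:

# For each row of x, match the largest y key that is still <= x's key
x |> left_join(y, join_by(closest(key >= key)))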

        +
        +
        +
        +

        A rolling join is a subset of an inequality join so some matches are grayed out indicating that they're not used because they're not the "closest".

        +
        Figura 19.16: A rolling join is similar to a greater-than-or-equal inequality join but only matches the first value.
        +
        +
        +
        +

        Rolling joins are particularly useful when you have two tables of dates that don’t perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.

        +

        For example, imagine that you’re in charge of the party planning commission for your office. Your company is rather cheap so instead of having individual parties, you only have a party once each quarter. The rules for determining when a party will be held are a little complex: parties are always on a Monday, you skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that has to be pushed back a week. That leads to the following party days:

        +
        +
        parties <- tibble(
        +  q = 1:4,
        +  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
        +)
        +
        +

        Now imagine that you have a table of employee birthdays:

        +
        +
        set.seed(123)
        +employees <- tibble(
        +  name = sample(babynames::babynames$name, 100),
        +  birthday = ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
        +)
        +employees
        +#> # A tibble: 100 × 2
        +#>   name     birthday  
        +#>   <chr>    <date>    
        +#> 1 Kemba    2022-01-22
        +#> 2 Orean    2022-06-26
        +#> 3 Kirstyn  2022-02-11
        +#> 4 Amparo   2022-11-11
        +#> 5 Belen    2022-03-25
        +#> 6 Rayshaun 2022-01-11
        +#> # ℹ 94 more rows
        +
        +

        And for each employee we want to find the first party date that comes after (or on) their birthday. We can express that with a rolling join:

        +
        +
        employees |> 
        +  left_join(parties, join_by(closest(birthday >= party)))
        +#> # A tibble: 100 × 4
        +#>   name     birthday       q party     
        +#>   <chr>    <date>     <int> <date>    
        +#> 1 Kemba    2022-01-22     1 2022-01-10
        +#> 2 Orean    2022-06-26     2 2022-04-04
        +#> 3 Kirstyn  2022-02-11     1 2022-01-10
        +#> 4 Amparo   2022-11-11     4 2022-10-03
        +#> 5 Belen    2022-03-25     1 2022-01-10
        +#> 6 Rayshaun 2022-01-11     1 2022-01-10
        +#> # ℹ 94 more rows
        +
        +

        There is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:

        +
        +
        employees |> 
        +  anti_join(parties, join_by(closest(birthday >= party)))
        +#> # A tibble: 2 × 2
        +#>   name   birthday  
        +#>   <chr>  <date>    
        +#> 1 Maks   2022-01-07
        +#> 2 Nalani 2022-01-04
        +
        +

        To resolve that issue we’ll need to tackle the problem a different way, with overlap joins.

        +

19.5.4 Overlap joins

        +

        Overlap joins provide three helpers that use inequality joins to make it easier to work with intervals:

        +
• between(x, y_lower, y_upper) is short for x >= y_lower, x <= y_upper.
• within(x_lower, x_upper, y_lower, y_upper) is short for x_lower >= y_lower, x_upper <= y_upper.
• overlaps(x_lower, x_upper, y_lower, y_upper) is short for x_lower <= y_upper, x_upper >= y_lower.

        Let’s continue the birthday example to see how you might use them. There’s one problem with the strategy we used above: there’s no party preceding the birthdays Jan 1-9. So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:

        +
        +
        parties <- tibble(
        +  q = 1:4,
        +  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
        +  start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
        +  end = ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
        +)
        +parties
        +#> # A tibble: 4 × 4
        +#>       q party      start      end       
        +#>   <int> <date>     <date>     <date>    
        +#> 1     1 2022-01-10 2022-01-01 2022-04-03
        +#> 2     2 2022-04-04 2022-04-04 2022-07-11
        +#> 3     3 2022-07-11 2022-07-11 2022-10-02
        +#> 4     4 2022-10-03 2022-10-03 2022-12-31
        +
        +

Hadley is hopelessly bad at data entry so he also wanted to check that the party periods don’t overlap. One way to do this is by using a self-join to check if any start-end interval overlaps with another:

        +
        +
        parties |> 
        +  inner_join(parties, join_by(overlaps(start, end, start, end), q < q)) |> 
        +  select(start.x, end.x, start.y, end.y)
        +#> # A tibble: 1 × 4
        +#>   start.x    end.x      start.y    end.y     
        +#>   <date>     <date>     <date>     <date>    
        +#> 1 2022-04-04 2022-07-11 2022-07-11 2022-10-02
        +
        +

        Ooops, there is an overlap, so let’s fix that problem and continue:

        +
        +
        parties <- tibble(
        +  q = 1:4,
        +  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
        +  start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
        +  end = ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
        +)
        +
        +

        Now we can match each employee to their party. This is a good place to use unmatched = "error" because we want to quickly find out if any employees didn’t get assigned a party.

        +
        +
        employees |> 
        +  inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
        +#> # A tibble: 100 × 6
        +#>   name     birthday       q party      start      end       
        +#>   <chr>    <date>     <int> <date>     <date>     <date>    
        +#> 1 Kemba    2022-01-22     1 2022-01-10 2022-01-01 2022-04-03
        +#> 2 Orean    2022-06-26     2 2022-04-04 2022-04-04 2022-07-10
        +#> 3 Kirstyn  2022-02-11     1 2022-01-10 2022-01-01 2022-04-03
        +#> 4 Amparo   2022-11-11     4 2022-10-03 2022-10-03 2022-12-31
        +#> 5 Belen    2022-03-25     1 2022-01-10 2022-01-01 2022-04-03
        +#> 6 Rayshaun 2022-01-11     1 2022-01-10 2022-01-01 2022-04-03
        +#> # ℹ 94 more rows
        +
        +
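Since between() is just shorthand for a pair of inequalities, the same join could equivalently be written as:

employees |> 
  inner_join(parties, join_by(birthday >= start, birthday <= end), unmatched = "error")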

19.5.5 Exercises

1. Can you explain what’s happening with the keys in this equi join? Why are they different?

   x |> full_join(y, join_by(key == key))
   #> # A tibble: 4 × 3
   #>     key val_x val_y
   #>   <dbl> <chr> <chr>
   #> 1     1 x1    y1   
   #> 2     2 x2    y2   
   #> 3     3 x3    <NA> 
   #> 4     4 <NA>  y3

   x |> full_join(y, join_by(key == key), keep = TRUE)
   #> # A tibble: 4 × 4
   #>   key.x val_x key.y val_y
   #>   <dbl> <chr> <dbl> <chr>
   #> 1     1 x1        1 y1   
   #> 2     2 x2        2 y2   
   #> 3     3 x3       NA <NA> 
   #> 4    NA <NA>      4 y3

2. When finding if any party period overlapped with another party period, we used q < q in the join_by(). Why? What happens if you remove this inequality?

19.6 Summary

        +

        In this chapter, you’ve learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you’ve gained a glimpse into the power of non-equi joins and seen a few interesting use cases.

        +

        This chapter concludes the “Transform” part of the book where the focus was on the tools you could use with individual columns and tibbles. You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working with strings, lubridate functions for working with date-times, and forcats functions for working with factors.

        +

        In the next part of the book, you’ll learn more about getting various types of data into R in a tidy form.


1. Remember that in RStudio you can also use View() to avoid this problem.↩︎

2. That’s not 100% true, but you’ll get a warning whenever it isn’t.↩︎
\ No newline at end of file
diff --git a/layers.html b/layers.html
new file mode 100644
index 000000000..4b8c65343
--- /dev/null
+++ b/layers.html
@@ -0,0 +1,1419 @@

9 Layers

9.1 Introduction

        +

        In Capítulo 1, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2.

        +

        In this chapter, you’ll expand on that foundation as you learn about the layered grammar of graphics. We’ll start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we’ll briefly introduce coordinate systems.

        +

        We will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.

        +

9.1.1 Prerequisites

        +

        This chapter focuses on ggplot2. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:

library(tidyverse)

9.2 Aesthetic mappings

        +
        +

        “The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

        +
        +

        Remember that the mpg data frame bundled with the ggplot2 package contains 234 observations on 38 car models.

        +
        +
        mpg
        +#> # A tibble: 234 × 11
        +#>   manufacturer model displ  year   cyl trans      drv     cty   hwy fl   
        +#>   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
        +#> 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p    
        +#> 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p    
        +#> 3 audi         a4      2    2008     4 manual(m6) f        20    31 p    
        +#> 4 audi         a4      2    2008     4 auto(av)   f        21    30 p    
        +#> 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p    
        +#> 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p    
        +#> # ℹ 228 more rows
        +#> # ℹ 1 more variable: class <chr>
        +
        +

        Among the variables in mpg are:

1. displ: A car’s engine size, in liters. A numerical variable.

2. hwy: A car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. A numerical variable.

3. class: Type of car. A categorical variable.

        Let’s start by visualizing the relationship between displ and hwy for various classes of cars. We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.

        +
        +
        # Left
        +ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
        +  geom_point()
        +
        +# Right
        +ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
        +  geom_point()
        +#> Warning: The shape palette can deal with a maximum of 6 discrete values
        +#> because more than 6 becomes difficult to discriminate; you have 7.
        +#> Consider specifying shapes manually if you must have them.
        +#> Warning: Removed 62 rows containing missing values (`geom_point()`).
        +
        +
        +
        +

Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left, class is mapped to the color aesthetic, resulting in different colors for each class. In the plot on the right, class is mapped to the shape aesthetic, resulting in different plotting character shapes for each class, except for suv. Each plot comes with a legend that shows the mapping between color or shape and levels of the class variable.

        When class is mapped to shape, we get two warnings:

        +
        +

        1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.

        +

        2: Removed 62 rows containing missing values (geom_point()).

        +
        +

        Since ggplot2 will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. The second warning is related – there are 62 SUVs in the dataset and they’re not plotted.

        +

Similarly, we can map class to the size or alpha aesthetics as well, which control the size and the transparency of the points, respectively.

        +
        +
        # Left
        +ggplot(mpg, aes(x = displ, y = hwy, size = class)) +
        +  geom_point()
        +#> Warning: Using size for a discrete variable is not advised.
        +
        +# Right
        +ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +
        +  geom_point()
        +#> Warning: Using alpha for a discrete variable is not advised.
        +
        +
        +
        +

Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left, class is mapped to the size aesthetic, resulting in different sizes for each class. In the plot on the right, class is mapped to the alpha aesthetic, resulting in different alpha (transparency) levels for each class. Each plot comes with a legend that shows the mapping between size or alpha level and levels of the class variable.
        +

        Both of these produce warnings as well:

        +
        +

        Using alpha for a discrete variable is not advised.

        +
        +

        Mapping an unordered discrete (categorical) variable (class) to an ordered aesthetic (size or alpha) is generally not a good idea because it implies a ranking that does not in fact exist.

        +

        Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line provides the same information as a legend; it explains the mapping between locations and values.

        +

        You can also set the visual properties of your geom manually as an argument of your geom function (outside of aes()) instead of relying on a variable mapping to determine the appearance. For example, we can make all of the points in our plot blue:

        +
        +
        ggplot(mpg, aes(x = displ, y = hwy)) + 
        +  geom_point(color = "blue")
        +
        +

        Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue.

        +
        +
        +

Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. You’ll need to pick a value that makes sense for that aesthetic (see the sketch after this list):

• The name of a color as a character string, e.g., color = "blue".
• The size of a point in mm, e.g., size = 1.
• The shape of a point as a number, e.g., shape = 1, as shown in Figura 9.1.
        +
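For instance, a quick sketch that sets all three properties at once (the specific values are arbitrary):

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(color = "red", size = 2, shape = 5)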
        +
        +
        +

        Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue.

        +
        Figura 9.1: R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–24) have a border of color and are filled with fill. Shapes are arranged to keep similar shapes next to each other.
        +
        +
        +
        +

        So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at https://ggplot2.tidyverse.org/articles/ggplot2-specs.html.

        +

        The specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.

        +

9.2.1 Exercises

1. Create a scatterplot of hwy vs. displ where the points are pink filled in triangles.

2. Why did the following code not result in a plot with blue points?

   ggplot(mpg) + 
     geom_point(aes(x = displ, y = hwy, color = "blue"))

3. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

4. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note, you’ll also need to specify x and y.

9.3 Geometric objects

        +

        How are these two plots similar?

        +
        +
        +
        +

There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed.
        +

        Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, geom, to represent the data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.

        +

        To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use the following code:

        +
        +
        # Left
        +ggplot(mpg, aes(x = displ, y = hwy)) + 
        +  geom_point()
        +
        +# Right
        +ggplot(mpg, aes(x = displ, y = hwy)) + 
        +  geom_smooth()
        +#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
        +
        +

        Every geom function in ggplot2 takes a mapping argument, either defined locally in the geom layer or globally in the ggplot() layer. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. If you try, ggplot2 will silently ignore that aesthetic mapping. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.

        +
        +
        # Left
        +ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + 
        +  geom_smooth()
        +
        +# Right
        +ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + 
        +  geom_smooth()
        +
        +
        +
        +

Two plots of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves. On the left, three smooth curves, all with the same linetype. On the right, three smooth curves with different line types (solid, dashed, or long dashed) for each type of drive train. In both plots, confidence intervals around the smooth curves are also displayed.
        +
        +

        Here, geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drive train. One line describes all of the points that have a 4 value, one line describes all of the points that have an f value, and one line describes all of the points that have an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.


        If this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then coloring everything according to drv.

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  geom_smooth(aes(linetype = drv))

        A plot of highway fuel efficiency versus engine size of cars. The data are represented with points (colored by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed.


        Notice that this plot contains two geoms in the same graph.


        Many geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.

# Left
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

# Middle
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(group = drv))

# Right
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(color = drv), show.legend = FALSE)

        Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, with a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed.


        If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) + 
  geom_smooth()

        Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it.


        You can use the same idea to specify different data for each layer. Here, we use red points as well as open circles to highlight two-seater cars. The local data argument in geom_point() overrides the global data argument in ggplot() for that layer only.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_point(
    data = mpg |> filter(class == "2seater"), 
    color = "red"
  ) +
  geom_point(
    data = mpg |> filter(class == "2seater"), 
    shape = "circle open", size = 3, color = "red"
  )

Scatterplot of highway fuel efficiency versus engine size of cars, where two-seater cars are highlighted with red points and open red circles drawn around them.


        Geoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed while the boxplot reveals two potential outliers.

# Left
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Middle
ggplot(mpg, aes(x = hwy)) +
  geom_density()

# Right
ggplot(mpg, aes(x = hwy)) +
  geom_boxplot()

        Three plots: histogram, density plot, and box plot of highway mileage.


        ggplot2 provides more than 40 geoms but these don’t cover all possible plots one could make. If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). For example, the ggridges package (https://wilkelab.org/ggridges) is useful for making ridgeline plots, which can be useful for visualizing the density of a numerical variable for different levels of a categorical variable. In the following plot not only did we use a new geom (geom_density_ridges()), but we have also mapped the same variable to multiple aesthetics (drv to y, fill, and color) as well as set an aesthetic (alpha = 0.5) to make the density curves transparent.

library(ggridges)

ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)
#> Picking joint bandwidth of 1.28

Density curves for highway mileage for cars with rear wheel, front wheel, and 4-wheel drives plotted separately. The distribution is bimodal and roughly symmetric for rear and 4 wheel drive cars and unimodal and right skewed for front wheel drive cars.


        The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: https://ggplot2.tidyverse.org/reference. To learn more about any single geom, use the help (e.g., ?geom_smooth).


9.3.1 Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

2. Earlier in this chapter we used show.legend without explaining it:

   ggplot(mpg, aes(x = displ, y = hwy)) +
     geom_smooth(aes(color = drv), show.legend = FALSE)

   What does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?

3. What does the se argument to geom_smooth() do?

4. Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s drv.

   There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars is on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with a different line type is fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border.

9.4 Facets


In Chapter 1 you learned about faceting with facet_wrap(), which splits a plot into subplots that each display one subset of the data based on a categorical variable.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~cyl)

Scatterplot of highway fuel efficiency versus engine size of cars, faceted by number of cylinders, with facets spanning two rows.


        To facet your plot with the combination of two variables, switch from facet_wrap() to facet_grid(). The first argument of facet_grid() is also a formula, but now it’s a double sided formula: rows ~ cols.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ cyl)

Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive.


By default each of the facets share the same scale and range for x and y axes. This is useful when you want to compare data across facets but it can be limiting when you want to visualize the relationship within each facet better. Setting the scales argument in a faceting function to "free" will allow for different axis scales across both rows and columns, "free_x" will allow for different x scales across columns, and "free_y" will allow for different y scales across rows.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ cyl, scales = "free_y")

Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive. Facets within a row share the same y-scale and facets within a column share the same x-scale.
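For comparison, a short sketch (not in the original code) of the remaining option, scales = "free", which frees both axes at once:

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ cyl, scales = "free")  # free x scales across columns and free y scales across rows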


9.4.1 Exercises

1. What happens if you facet on a continuous variable?

2. What do the empty cells in the plot above with facet_grid(drv ~ cyl) mean? Run the following code. How do they relate to the resulting plot?

   ggplot(mpg) + 
     geom_point(aes(x = drv, y = cyl))

3. What plots does the following code make? What does . do?

   ggplot(mpg) + 
     geom_point(aes(x = displ, y = hwy)) +
     facet_grid(drv ~ .)

   ggplot(mpg) + 
     geom_point(aes(x = displ, y = hwy)) +
     facet_grid(. ~ cyl)

4. Take the first faceted plot in this section:

   ggplot(mpg) + 
     geom_point(aes(x = displ, y = hwy)) + 
     facet_wrap(~ class, nrow = 2)

   What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

6. Which of the following plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?

   ggplot(mpg, aes(x = displ)) + 
     geom_histogram() + 
     facet_grid(drv ~ .)

   ggplot(mpg, aes(x = displ)) + 
     geom_histogram() +
     facet_grid(. ~ drv)

7. Recreate the following plot using facet_wrap() instead of facet_grid(). How do the positions of the facet labels change?

   ggplot(mpg) + 
     geom_point(aes(x = displ, y = hwy)) +
     facet_grid(drv ~ .)

9.5 Statistical transformations


        Consider a basic bar chart, drawn with geom_bar() or geom_col(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.

ggplot(diamonds, aes(x = cut)) + 
  geom_bar()

        Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds.


        On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

• Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.

• Smoothers fit a model to your data and then plot predictions from the model.

• Boxplots compute the five-number summary of the distribution and then display that summary as a specially formatted box.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. Figure 9.2 shows how this process works with geom_bar().


        A figure demonstrating three steps of creating a bar chart. Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar() transforms the data with the count stat, which returns a data set of cut values and counts. Step 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis.

Figure 9.2: When creating a bar chart we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.

        You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(). If you scroll down, the section called “Computed variables” explains that it computes two new variables: count and prop.
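If you want to see a stat’s output directly, one option (an aside, not part of the original text) is layer_data(), the ggplot2 helper that returns the data frame computed for a given layer:

# Build the bar chart, then inspect what stat_count() produced for layer 1
p <- ggplot(diamonds, aes(x = cut)) + geom_bar()
layer_data(p, 1)  # includes the computed count and prop columns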


        Every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:

1. You might want to override the default stat. In the code below, we change the stat of geom_bar() from count (the default) to identity. This lets us map the height of the bars to the raw values of a y variable.

   diamonds |>
     count(cut) |>
     ggplot(aes(x = cut, y = n)) +
     geom_bar(stat = "identity")

   Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds.

2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:

   ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) + 
     geom_bar()

   Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and Ideal 0.40.

   To find the possible variables that can be computed by the stat, look for the section titled “Computed variables” in the help for geom_bar().

3. You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:

   ggplot(diamonds) + 
     stat_summary(
       aes(x = cut, y = depth),
       fun.min = min,
       fun.max = max,
       fun = median
     )

   A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point.

        ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g., ?stat_bin.


9.5.1 Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

2. What does geom_col() do? How is it different from geom_bar()?

3. Most geoms and stats come in pairs that are almost always used in concert. Make a list of all the pairs. What do they have in common? (Hint: Read through the documentation.)

4. What variables does stat_smooth() compute? What arguments control its behavior?

5. In our proportion bar chart, we needed to set group = 1. Why? In other words, what is the problem with these two graphs?

   ggplot(diamonds, aes(x = cut, y = after_stat(prop))) + 
     geom_bar()
   ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + 
     geom_bar()

9.6 Position adjustments


        There’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic, or, more usefully, the fill aesthetic:

# Left
ggplot(mpg, aes(x = drv, color = drv)) + 
  geom_bar()

# Right
ggplot(mpg, aes(x = drv, fill = drv)) + 
  geom_bar()

Two bar charts of drive types of cars. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of cars in each drive type category.


        Note what happens if you map the fill aesthetic to another variable, like class: the bars are automatically stacked. Each colored rectangle represents a combination of drv and class.

ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar()

        Segmented bar chart of drive types of cars, where each bar is filled with colors for the classes of cars. Heights of the bars correspond to the number of cars in each drive category, and heights of the colored segments are proportional to the number of cars with a given class level within a given drive type level.


        The stacking is performed automatically using the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".

• position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.

  # Left
  ggplot(mpg, aes(x = drv, fill = class)) + 
    geom_bar(alpha = 1/5, position = "identity")

  # Right
  ggplot(mpg, aes(x = drv, color = class)) + 
    geom_bar(fill = NA, position = "identity")

  Segmented bar chart of drive types of cars, where each bar is filled with colors for the classes of cars. Heights of the bars correspond to the number of cars in each drive category, and heights of the colored segments are proportional to the number of cars with a given class level within a given drive type level. However the segments overlap. In the first plot the bars are filled with transparent colors and in the second plot they are only outlined with color.

  The identity position adjustment is more useful for 2d geoms, like points, where it is the default.

• position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

• position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

  # Left
  ggplot(mpg, aes(x = drv, fill = class)) + 
    geom_bar(position = "fill")

  # Right
  ggplot(mpg, aes(x = drv, fill = class)) + 
    geom_bar(position = "dodge")

  On the left, segmented bar chart of drive types of cars, where each bar is filled with colors for the levels of class. Height of each bar is 1 and heights of the colored segments represent the proportions of cars with a given class level within a given drive type. On the right, dodged bar chart of drive types of cars. Dodged bars are grouped by levels of drive type. Within each group bars represent each level of class. Some classes are represented within some drive types and not represented in others, resulting in unequal number of bars within each group. Heights of these bars represent the number of cars with a given level of drive type and class.

        There’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?
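The plot in question is the basic scatterplot from the start of the chapter:

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()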


        Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association.


        The underlying values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?


        You can avoid this gridding by setting the position adjustment to “jitter”. position = "jitter" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(position = "jitter")

        Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association.


        Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().
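For example, the previous plot can be written more compactly as:

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_jitter()  # shorthand for geom_point(position = "jitter")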


        To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.


9.6.1 Exercises

1. What is the problem with the following plot? How could you improve it?

   ggplot(mpg, aes(x = cty, y = hwy)) + 
     geom_point()

2. What, if anything, is the difference between the two plots? Why?

   ggplot(mpg, aes(x = displ, y = hwy)) +
     geom_point()
   ggplot(mpg, aes(x = displ, y = hwy)) +
     geom_point(position = "identity")

3. What parameters to geom_jitter() control the amount of jittering?

4. Compare and contrast geom_jitter() with geom_count().

5. What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

9.7 Coordinate systems


        Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are two other coordinate systems that are occasionally helpful.

• coord_quickmap() sets the aspect ratio correctly for geographic maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the Maps chapter of ggplot2: Elegant graphics for data analysis.

  nz <- map_data("nz")  # map_data() requires the maps package to be installed

  ggplot(nz, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "white", color = "black")

  ggplot(nz, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "white", color = "black") +
    coord_quickmap()

  Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct.

• coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

  bar <- ggplot(data = diamonds) + 
    geom_bar(
      mapping = aes(x = clarity, fill = clarity), 
      show.legend = FALSE,
      width = 1
    ) + 
    theme(aspect.ratio = 1)

  bar + coord_flip()
  bar + coord_polar()

  There are two plots. On the left is a bar chart of clarity of diamonds, on the right is a Coxcomb chart of the same data.

9.7.1 Exercises

1. Turn a stacked bar chart into a pie chart using coord_polar().

2. What’s the difference between coord_quickmap() and coord_map()?

3. What does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

   ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
     geom_point() + 
     geom_abline() +
     coord_fixed()

9.8 The layered grammar of graphics


We can expand on the graphing template you learned in Section 1.3 by adding position adjustments, stats, coordinate systems, and faceting:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

        Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.
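As an illustration (with the defaults, which you would normally omit, spelled out explicitly), here is the diamonds bar chart from earlier with every slot of the template filled in:

ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut),
    stat = "count",       # the default stat for geom_bar()
    position = "stack"    # the default position adjustment
  ) +
  coord_cartesian() +     # the default coordinate system
  facet_null()            # the default (absent) faceting scheme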


        The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, a faceting scheme, and a theme.


To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic. These steps are illustrated in Figure 9.3. You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.


        A figure demonstrating the steps for going from raw data to table of frequencies where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Then, these values are mapped to heights of bars.

Figure 9.3: Steps for going from raw data to a table of frequencies to a bar plot where the heights of the bars represent the frequencies.

        At this point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.


        You could use this method to build any plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.


        If you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “The Layered Grammar of Graphics”, the scientific paper that describes the theory of ggplot2 in detail.


9.9 Summary


In this chapter you learned about the layered grammar of graphics starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems which allow you to fundamentally change what x and y mean. One layer we have not yet touched on is theme, which we will introduce in Section 11.5.


        Two very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at https://posit.co/resources/cheatsheets) and the ggplot2 package website (https://ggplot2.tidyverse.org).


        An important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it’s always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom.

        + + +
        +
        +
        +
        + + + \ No newline at end of file diff --git a/layers_files/figure-html/fig-shapes-1.png b/layers_files/figure-html/fig-shapes-1.png new file mode 100644 index 000000000..e0f7efa80 Binary files /dev/null and b/layers_files/figure-html/fig-shapes-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-10-1.png b/layers_files/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 000000000..dff599615 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-10-2.png b/layers_files/figure-html/unnamed-chunk-10-2.png new file mode 100644 index 000000000..07cbde110 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-10-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-11-1.png b/layers_files/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 000000000..e16007e54 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-11-2.png b/layers_files/figure-html/unnamed-chunk-11-2.png new file mode 100644 index 000000000..000b8f069 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-11-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-12-1.png b/layers_files/figure-html/unnamed-chunk-12-1.png new file mode 100644 index 000000000..b360852ad Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-13-1.png b/layers_files/figure-html/unnamed-chunk-13-1.png new file mode 100644 index 000000000..e1333ae58 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-13-2.png b/layers_files/figure-html/unnamed-chunk-13-2.png new file mode 100644 index 000000000..45519a9f4 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-13-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-13-3.png b/layers_files/figure-html/unnamed-chunk-13-3.png new file mode 100644 index 000000000..e55a0fc1b Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-13-3.png differ diff --git a/layers_files/figure-html/unnamed-chunk-14-1.png b/layers_files/figure-html/unnamed-chunk-14-1.png new file mode 100644 index 000000000..1c1658533 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-15-1.png b/layers_files/figure-html/unnamed-chunk-15-1.png new file mode 100644 index 000000000..ca2a21078 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-15-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-16-1.png b/layers_files/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 000000000..4bd550b62 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-16-2.png b/layers_files/figure-html/unnamed-chunk-16-2.png new file mode 100644 index 000000000..738b284be Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-16-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-16-3.png b/layers_files/figure-html/unnamed-chunk-16-3.png new file mode 100644 index 000000000..30ca1fde0 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-16-3.png differ diff --git a/layers_files/figure-html/unnamed-chunk-17-1.png b/layers_files/figure-html/unnamed-chunk-17-1.png new file mode 
100644 index 000000000..f344b7c5b Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-18-1.png b/layers_files/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 000000000..9924dc04a Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-19-1.png b/layers_files/figure-html/unnamed-chunk-19-1.png new file mode 100644 index 000000000..b831cacd8 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-19-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-19-2.png b/layers_files/figure-html/unnamed-chunk-19-2.png new file mode 100644 index 000000000..3d00986a9 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-19-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-19-3.png b/layers_files/figure-html/unnamed-chunk-19-3.png new file mode 100644 index 000000000..4e93585f4 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-19-3.png differ diff --git a/layers_files/figure-html/unnamed-chunk-19-4.png b/layers_files/figure-html/unnamed-chunk-19-4.png new file mode 100644 index 000000000..545e91b11 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-19-4.png differ diff --git a/layers_files/figure-html/unnamed-chunk-19-5.png b/layers_files/figure-html/unnamed-chunk-19-5.png new file mode 100644 index 000000000..0efd57291 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-19-5.png differ diff --git a/layers_files/figure-html/unnamed-chunk-19-6.png b/layers_files/figure-html/unnamed-chunk-19-6.png new file mode 100644 index 000000000..fc2bd29f4 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-19-6.png differ diff --git a/layers_files/figure-html/unnamed-chunk-20-1.png b/layers_files/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 000000000..4d4c6f2ba Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-21-1.png b/layers_files/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 000000000..df666dabc Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-21-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-22-1.png b/layers_files/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 000000000..08e937b02 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-23-1.png b/layers_files/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 000000000..722699df6 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-23-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-24-1.png b/layers_files/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 000000000..e2e27114c Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-24-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-24-2.png b/layers_files/figure-html/unnamed-chunk-24-2.png new file mode 100644 index 000000000..6d5a05cbb Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-24-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-25-1.png b/layers_files/figure-html/unnamed-chunk-25-1.png new file mode 100644 index 000000000..5b1320807 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-25-1.png differ diff --git 
a/layers_files/figure-html/unnamed-chunk-26-1.png b/layers_files/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 000000000..cf74a5f57 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-26-2.png b/layers_files/figure-html/unnamed-chunk-26-2.png new file mode 100644 index 000000000..b9a5d926a Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-26-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-27-1.png b/layers_files/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 000000000..e2e27114c Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-27-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-28-1.png b/layers_files/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 000000000..835bc7bf7 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-28-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-30-1.png b/layers_files/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 000000000..af72eaf13 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-30-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-31-1.png b/layers_files/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 000000000..413e5e346 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-31-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-32-1.png b/layers_files/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 000000000..c94bc0f39 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-32-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-33-1.png b/layers_files/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 000000000..a0b86d5cd Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-33-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-33-2.png b/layers_files/figure-html/unnamed-chunk-33-2.png new file mode 100644 index 000000000..898587eeb Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-33-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-34-1.png b/layers_files/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 000000000..ef9ae0712 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-34-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-34-2.png b/layers_files/figure-html/unnamed-chunk-34-2.png new file mode 100644 index 000000000..b6a01b18b Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-34-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-35-1.png b/layers_files/figure-html/unnamed-chunk-35-1.png new file mode 100644 index 000000000..967945e5d Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-35-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-36-1.png b/layers_files/figure-html/unnamed-chunk-36-1.png new file mode 100644 index 000000000..5ba3528d3 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-36-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-36-2.png b/layers_files/figure-html/unnamed-chunk-36-2.png new file mode 100644 index 000000000..19c69bbe6 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-36-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-37-1.png b/layers_files/figure-html/unnamed-chunk-37-1.png new file mode 100644 index 000000000..bd75743e0 
Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-37-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-37-2.png b/layers_files/figure-html/unnamed-chunk-37-2.png new file mode 100644 index 000000000..e808335d4 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-37-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-38-1.png b/layers_files/figure-html/unnamed-chunk-38-1.png new file mode 100644 index 000000000..dff599615 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-38-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-39-1.png b/layers_files/figure-html/unnamed-chunk-39-1.png new file mode 100644 index 000000000..beb6186cb Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-39-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-4-1.png b/layers_files/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 000000000..e96e30c2e Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-4-2.png b/layers_files/figure-html/unnamed-chunk-4-2.png new file mode 100644 index 000000000..d2b48eb23 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-4-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-40-1.png b/layers_files/figure-html/unnamed-chunk-40-1.png new file mode 100644 index 000000000..3d75eec05 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-40-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-41-1.png b/layers_files/figure-html/unnamed-chunk-41-1.png new file mode 100644 index 000000000..dff599615 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-41-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-41-2.png b/layers_files/figure-html/unnamed-chunk-41-2.png new file mode 100644 index 000000000..dff599615 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-41-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-42-1.png b/layers_files/figure-html/unnamed-chunk-42-1.png new file mode 100644 index 000000000..353097344 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-42-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-42-2.png b/layers_files/figure-html/unnamed-chunk-42-2.png new file mode 100644 index 000000000..5667d6ac7 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-42-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-43-1.png b/layers_files/figure-html/unnamed-chunk-43-1.png new file mode 100644 index 000000000..22c6b20dc Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-43-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-43-2.png b/layers_files/figure-html/unnamed-chunk-43-2.png new file mode 100644 index 000000000..e4bfb8712 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-43-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-44-1.png b/layers_files/figure-html/unnamed-chunk-44-1.png new file mode 100644 index 000000000..a073573b4 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-44-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-5-1.png b/layers_files/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 000000000..4a72eeba5 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-5-2.png 
b/layers_files/figure-html/unnamed-chunk-5-2.png new file mode 100644 index 000000000..cd4ebe2c4 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-5-2.png differ diff --git a/layers_files/figure-html/unnamed-chunk-6-1.png b/layers_files/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 000000000..4cd8210df Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-8-1.png b/layers_files/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 000000000..61546fa67 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-9-1.png b/layers_files/figure-html/unnamed-chunk-9-1.png new file mode 100644 index 000000000..161394c9e Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/layers_files/figure-html/unnamed-chunk-9-2.png b/layers_files/figure-html/unnamed-chunk-9-2.png new file mode 100644 index 000000000..ce20d4e28 Binary files /dev/null and b/layers_files/figure-html/unnamed-chunk-9-2.png differ diff --git a/logicals.html b/logicals.html new file mode 100644 index 000000000..4b3ff739e --- /dev/null +++ b/logicals.html @@ -0,0 +1,1283 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 12  Logical vectors + + + + + + + + + + + + + + + + + + + + + + + + +
12  Logical vectors

12.1 Introduction

        In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: TRUE, FALSE, and NA. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.


        We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with if_else() and case_when(), two useful functions for making conditional changes powered by logical vectors.

12.1.1 Prerequisites

        Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use mutate(), filter(), and friends to work with data frames. We’ll also continue to draw examples from the nycflights13::flights dataset.
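The code chunk itself was lost in rendering; based on the description above, it presumably just loads these packages:

library(tidyverse)
library(nycflights13)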


        However, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with c():

x <- c(1, 2, 3, 5, 7, 11, 13)
x * 2
#> [1]  2  4  6 10 14 22 26

        This makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with mutate() and friends.

df <- tibble(x)
df |> 
  mutate(y = x * 2)
#> # A tibble: 7 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     4
#> 3     3     6
#> 4     5    10
#> 5     7    14
#> 6    11    22
#> # ℹ 1 more row

12.2 Comparisons

        A very common way to create a logical vector is via a numeric comparison with <, <=, >, >=, !=, and ==. So far, we’ve mostly created logical variables transiently within filter() — they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that arrive roughly on time:

flights |> 
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
#> # A tibble: 172,286 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      601            600         1      844            850
#> 2  2013     1     1      602            610        -8      812            820
#> 3  2013     1     1      602            605        -3      821            805
#> 4  2013     1     1      606            610        -4      858            910
#> 5  2013     1     1      606            610        -4      837            845
#> 6  2013     1     1      607            607         0      858            915
#> # ℹ 172,280 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

        It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with mutate():

flights |> 
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
    .keep = "used"
  )
#> # A tibble: 336,776 × 4
#>   dep_time arr_delay daytime approx_ontime
#>      <int>     <dbl> <lgl>   <lgl>        
#> 1      517        11 FALSE   TRUE         
#> 2      533        20 FALSE   FALSE        
#> 3      542        33 FALSE   FALSE        
#> 4      544       -18 FALSE   TRUE         
#> 5      554       -25 FALSE   FALSE        
#> 6      554        12 FALSE   TRUE         
#> # ℹ 336,770 more rows

        This is particularly useful for more complicated logic because naming the intermediate steps makes it easier to both read your code and check that each step has been computed correctly.


        All up, the initial filter is equivalent to:

flights |> 
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |> 
  filter(daytime & approx_ontime)

12.2.1 Floating point comparison

        Beware of using == with numbers. For example, it looks like this vector contains the numbers 1 and 2:

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x
#> [1] 1 2

        But if you test them for equality, you get FALSE:

x == c(1, 2)
#> [1] FALSE FALSE

What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or sqrt(2), and subsequent computations will be very slightly off. We can see the exact values by calling print() with the digits argument¹:

print(x, digits = 16)
#> [1] 0.9999999999999999 2.0000000000000004

        You can see why R defaults to rounding these numbers; they really are very close to what you expect.


        Now that you’ve seen why == is failing, what can you do about it? One option is to use dplyr::near() which ignores small differences:

near(x, c(1, 2))
#> [1] TRUE TRUE

12.2.2 Missing values

        Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown:

NA > 5
#> [1] NA
10 == NA
#> [1] NA

        The most confusing result is this one:

NA == NA
#> [1] NA

        It’s easiest to understand why this is true if we artificially supply a little more context:

# We don't know how old Mary is
age_mary <- NA

# We don't know how old John is
age_john <- NA

# Are Mary and John the same age?
age_mary == age_john
#> [1] NA
# We don't know!

        So if you want to find all flights where dep_time is missing, the following code doesn’t work because dep_time == NA will yield NA for every single row, and filter() automatically drops missing values:

flights |> 
  filter(dep_time == NA)
#> # A tibble: 0 × 19
#> # ℹ 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
#> #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>, …

        Instead we’ll need a new tool: is.na().

12.2.3 is.na()

        is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:

is.na(c(TRUE, NA, FALSE))
#> [1] FALSE  TRUE FALSE
is.na(c(1, NA, 3))
#> [1] FALSE  TRUE FALSE
is.na(c("a", NA, "b"))
#> [1] FALSE  TRUE FALSE

        We can use is.na() to find all the rows with a missing dep_time:

flights |> 
  filter(is.na(dep_time))
#> # A tibble: 8,255 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1       NA           1630        NA       NA           1815
#> 2  2013     1     1       NA           1935        NA       NA           2240
#> 3  2013     1     1       NA           1500        NA       NA           1825
#> 4  2013     1     1       NA            600        NA       NA            901
#> 5  2013     1     2       NA           1540        NA       NA           1747
#> 6  2013     1     2       NA           1620        NA       NA           1746
#> # ℹ 8,249 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

        is.na() can also be useful in arrange(). arrange() usually puts all the missing values at the end but you can override this default by first sorting by is.na():

flights |> 
  filter(month == 1, day == 1) |> 
  arrange(dep_time)
#> # A tibble: 842 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 836 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

flights |> 
  filter(month == 1, day == 1) |> 
  arrange(desc(is.na(dep_time)), dep_time)
#> # A tibble: 842 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1       NA           1630        NA       NA           1815
#> 2  2013     1     1       NA           1935        NA       NA           2240
#> 3  2013     1     1       NA           1500        NA       NA           1825
#> 4  2013     1     1       NA            600        NA       NA            901
#> 5  2013     1     1      517            515         2      830            819
#> 6  2013     1     1      533            529         4      850            830
#> # ℹ 836 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

We’ll come back to cover missing values in more depth in Chapter 18.

12.2.4 Exercises

1. How does dplyr::near() work? Type near to see the source code. Is sqrt(2)^2 near 2?
2. Use mutate(), is.na(), and count() together to describe how the missing values in dep_time, sched_dep_time and dep_delay are connected.

12.3 Boolean algebra

Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, & is “and”, | is “or”, ! is “not”, and xor() is exclusive or². For example, df |> filter(!is.na(x)) finds all rows where x is not missing and df |> filter(x < -10 | x > 0) finds all rows where x is smaller than -10 or bigger than 0. Figure 12.1 shows the complete set of Boolean operations and how they work.

Figure 12.1: The complete set of Boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects. (Six Venn diagrams, one per operator: y & !x, x & y, x & !y, x, xor(x, y), y, and x | y.)

        As well as & and |, R also has && and ||. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE or FALSE. They’re important for programming, not data science.
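A quick sketch of the difference (behavior as in recent versions of R):

# & is vectorized: one result per element
c(TRUE, FALSE) & c(TRUE, TRUE)
#> [1]  TRUE FALSE
# && only ever returns a single value, and (in recent versions of R)
# errors if given a longer vector:
# c(TRUE, FALSE) && c(TRUE, TRUE)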

12.3.1 Missing values

        The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:

df <- tibble(x = c(TRUE, FALSE, NA))

df |> 
  mutate(
    and = x & NA,
    or = x | NA
  )
#> # A tibble: 3 × 3
#>   x     and   or   
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE 
#> 2 FALSE FALSE NA   
#> 3 NA    NA    NA

        To understand what’s going on, think about NA | TRUE (NA or TRUE). A missing value in a logical vector means that the value could either be TRUE or FALSE. TRUE | TRUE and FALSE | TRUE are both TRUE because at least one of them is TRUE. NA | TRUE must also be TRUE because NA can either be TRUE or FALSE. However, NA | FALSE is NA because we don’t know if NA is TRUE or FALSE. Similar reasoning applies with NA & FALSE.
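Spelling out the individual cases:

NA | TRUE
#> [1] TRUE
NA | FALSE
#> [1] NA
NA & TRUE
#> [1] NA
NA & FALSE
#> [1] FALSE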

12.3.2 Order of operations

        Note that the order of operations doesn’t work like English. Take the following code that finds all flights that departed in November or December:

flights |> 
   filter(month == 11 | month == 12)

        You might be tempted to write it like you’d say in English: “Find all flights that departed in November or December.”:

flights |> 
   filter(month == 11 | 12)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

This code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates month == 11, creating a logical vector, which we call nov. Then it computes nov | 12. When you use a number with a logical operator, R converts everything apart from 0 to TRUE, so this is equivalent to nov | TRUE, which will always be TRUE, so every row will be selected:

flights |> 
  mutate(
    nov = month == 11,
    final = nov | 12,
    .keep = "used"
  )
#> # A tibble: 336,776 × 3
#>   month nov   final
#>   <int> <lgl> <lgl>
#> 1     1 FALSE TRUE 
#> 2     1 FALSE TRUE 
#> 3     1 FALSE TRUE 
#> 4     1 FALSE TRUE 
#> 5     1 FALSE TRUE 
#> 6     1 FALSE TRUE 
#> # ℹ 336,770 more rows

12.3.3 %in%

An easy way to avoid the problem of getting your ==s and |s in the right order is to use %in%. x %in% y returns a logical vector the same length as x that is TRUE whenever a value in x is anywhere in y.

1:12 %in% c(1, 5, 11)
#>  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
letters[1:10] %in% c("a", "e", "i", "o", "u")
#>  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

        So to find all flights in November and December we could write:

flights |> 
  filter(month %in% c(11, 12))

Note that %in% obeys different rules for NA than == does, as NA %in% NA is TRUE.

c(1, 2, NA) == NA
#> [1] NA NA NA
c(1, 2, NA) %in% NA
#> [1] FALSE FALSE  TRUE

        This can make for a useful shortcut:

flights |> 
  filter(dep_time %in% c(NA, 0800))
#> # A tibble: 8,803 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      800            800         0     1022           1014
#> 2  2013     1     1      800            810       -10      949            955
#> 3  2013     1     1       NA           1630        NA       NA           1815
#> 4  2013     1     1       NA           1935        NA       NA           2240
#> 5  2013     1     1       NA           1500        NA       NA           1825
#> 6  2013     1     1       NA            600        NA       NA            901
#> # ℹ 8,797 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

12.3.4 Exercises

1. Find all flights where arr_delay is missing but dep_delay is not. Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.
2. How many flights have a missing dep_time? What other variables are missing in these rows? What might these rows represent?
3. Assuming that a missing dep_time implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?

12.4 Summaries

        The following sections describe some useful techniques for summarizing logical vectors. As well as functions that only work specifically with logical vectors, you can also use functions that work with numeric vectors.

12.4.1 Logical summaries

There are two main logical summaries: any() and all(). any(x) is the equivalent of |; it’ll return TRUE if there are any TRUEs in x. all(x) is the equivalent of &; it’ll return TRUE only if all values of x are TRUE. Like all summary functions, they’ll return NA if there are any missing values present, and as usual you can make the missing values go away with na.rm = TRUE.
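For example, on a simple vector:

x <- c(TRUE, TRUE, NA)
any(x)
#> [1] TRUE
all(x)
#> [1] NA
all(x, na.rm = TRUE)
#> [1] TRUE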


        For example, we could use all() and any() to find out if every flight was delayed on departure by at most an hour or if any flights were delayed on arrival by five hours or more. And using group_by() allows us to do that by day:

flights |> 
  group_by(year, month, day) |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
#> # A tibble: 365 × 5
#>    year month   day all_delayed any_long_delay
#>   <int> <int> <int> <lgl>       <lgl>         
#> 1  2013     1     1 FALSE       TRUE          
#> 2  2013     1     2 FALSE       TRUE          
#> 3  2013     1     3 FALSE       FALSE         
#> 4  2013     1     4 FALSE       FALSE         
#> 5  2013     1     5 FALSE       TRUE          
#> 6  2013     1     6 FALSE       FALSE         
#> # ℹ 359 more rows

        In most cases, however, any() and all() are a little too crude, and it would be nice to be able to get a little more detail about how many values are TRUE or FALSE. That leads us to the numeric summaries.

12.4.2 Numeric summaries of logical vectors

When you use a logical vector in a numeric context, TRUE becomes 1 and FALSE becomes 0. This makes sum() and mean() very useful with logical vectors because sum(x) gives the number of TRUEs and mean(x) gives the proportion of TRUEs (because mean() is just sum() divided by length()).
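For example:

x <- c(TRUE, TRUE, FALSE, NA)
sum(x, na.rm = TRUE)
#> [1] 2
mean(x, na.rm = TRUE)
#> [1] 0.6666667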


        That, for example, allows us to see the proportion of flights that were delayed on departure by at most an hour and the number of flights that were delayed on arrival by five hours or more:

flights |> 
  group_by(year, month, day) |> 
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
#> # A tibble: 365 × 5
#>    year month   day all_delayed any_long_delay
#>   <int> <int> <int>       <dbl>          <int>
#> 1  2013     1     1       0.939              3
#> 2  2013     1     2       0.914              3
#> 3  2013     1     3       0.941              0
#> 4  2013     1     4       0.953              0
#> 5  2013     1     5       0.964              1
#> 6  2013     1     6       0.959              0
#> # ℹ 359 more rows

12.4.3 Logical subsetting

There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base [ (pronounced subset) operator, which you’ll learn more about in Section 27.2.


        Imagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights and then calculate the average delay:

flights |> 
  filter(arr_delay > 0) |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay),
    n = n(),
    .groups = "drop"
  )
#> # A tibble: 365 × 5
#>    year month   day behind     n
#>   <int> <int> <int>  <dbl> <int>
#> 1  2013     1     1   32.5   461
#> 2  2013     1     2   32.0   535
#> 3  2013     1     3   27.7   460
#> 4  2013     1     4   28.3   297
#> 5  2013     1     5   22.6   238
#> 6  2013     1     6   24.4   381
#> # ℹ 359 more rows

This works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together³. Instead you could use [ to perform an inline filtering: arr_delay[arr_delay > 0] will yield only the positive arrival delays.
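On a plain vector the same idea looks like this:

x <- c(-1, 4, -2, 7)
x[x > 0]
#> [1] 4 7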


        This leads to:

flights |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
#> # A tibble: 365 × 6
#>    year month   day behind ahead     n
#>   <int> <int> <int>  <dbl> <dbl> <int>
#> 1  2013     1     1   32.5 -12.5   842
#> 2  2013     1     2   32.0 -14.3   943
#> 3  2013     1     3   27.7 -18.2   914
#> 4  2013     1     4   28.3 -17.0   915
#> 5  2013     1     5   22.6 -14.0   720
#> 6  2013     1     6   24.4 -13.6   832
#> # ℹ 359 more rows

        Also note the difference in the group size: in the first chunk n() gives the number of delayed flights per day; in the second, n() gives the total number of flights.

12.4.4 Exercises

1. What will sum(is.na(x)) tell you? How about mean(is.na(x))?
2. What does prod() return when applied to a logical vector? What logical summary function is it equivalent to? What does min() return when applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.

12.5 Conditional transformations

One of the most powerful features of logical vectors is their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: if_else() and case_when().

12.5.1 if_else()

If you want to use one value when a condition is TRUE and another value when it’s FALSE, you can use dplyr::if_else()⁴. You’ll always use the first three arguments of if_else(). The first argument, condition, is a logical vector, the second, true, gives the output when the condition is true, and the third, false, gives the output if the condition is false.


        Let’s begin with a simple example of labeling a numeric vector as either “+ve” (positive) or “-ve” (negative):

x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve")
#> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA

There’s an optional fourth argument, missing, which will be used if the input is NA:

if_else(x > 0, "+ve", "-ve", "???")
#> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"

You can also use vectors for the true and false arguments. For example, this allows us to create a minimal implementation of abs():

if_else(x < 0, -x, x)
#> [1]  3  2  1  0  1  2  3 NA

        So far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of coalesce() like this:

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
#> [1] 3 1 2 6

        You might have noticed a small infelicity in our labeling example above: zero is neither positive nor negative. We could resolve this by adding an additional if_else():

if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
#> [1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

        This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to dplyr::case_when().

12.5.2 case_when()

        dplyr’s case_when() is inspired by SQL’s CASE statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like condition ~ output. condition must be a logical vector; when it’s TRUE, output will be used.


        This means we could recreate our previous nested if_else() as follows:

x <- c(-3:3, NA)
case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)
#> [1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

        This is more code, but it’s also more explicit.


        To explain how case_when() works, let’s explore some simpler cases. If none of the cases match, the output gets an NA:

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)
#> [1] "-ve" "-ve" "-ve" NA    "+ve" "+ve" "+ve" NA

Use .default if you want to create a “default”/catch-all value:

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)
#> [1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???"

        And note that if multiple conditions match, only the first will be used:

case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
#> [1] NA    NA    NA    NA    "+ve" "+ve" "+ve" NA

        Just like with if_else() you can use variables on both sides of the ~ and you can mix and match variables as needed for your problem. For example, we could use case_when() to provide some human readable labels for the arrival delay:

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used"
  )
#> # A tibble: 336,776 × 2
#>   arr_delay status 
#>       <dbl> <chr>  
#> 1        11 on time
#> 2        20 late   
#> 3        33 late   
#> 4       -18 early  
#> 5       -25 early  
#> 6        12 on time
#> # ℹ 336,770 more rows

        Be wary when writing this sort of complex case_when() statement; my first two attempts used a mix of < and > and I kept accidentally creating overlapping conditions.

12.5.3 Compatible types

        Note that both if_else() and case_when() require compatible types in the output. If they’re not compatible, you’ll see errors like this:

if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,  
  x > 0  ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>.

        Overall, relatively few types are compatible, because automatically converting one type of vector to another is a common source of errors. Here are the most important cases that are compatible:

• Numeric and logical vectors are compatible, as we discussed in Section 12.4.2.
• Strings and factors (Chapter 16) are compatible, because you can think of a factor as a string with a restricted set of values.
• Dates and date-times, which we’ll discuss in Chapter 17, are compatible because you can think of a date as a special case of date-time.
• NA, which is technically a logical vector, is compatible with everything because every vector has some way of representing a missing value.

        We don’t expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.
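For example, mixing numeric and logical outputs works because FALSE converts to 0 (a small sketch):

if_else(c(TRUE, FALSE), 5, FALSE)
#> [1] 5 0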

12.5.4 Exercises

1. A number is even if it’s divisible by two, which in R you can find out with x %% 2 == 0. Use this fact and if_else() to determine whether each number between 0 and 20 is even or odd.
2. Given a vector of days like x <- c("Monday", "Saturday", "Wednesday"), use an ifelse() statement to label them as weekends or weekdays.
3. Use ifelse() to compute the absolute value of a numeric vector called x.
4. Write a case_when() statement that uses the month and day columns from flights to label a selection of important US holidays (e.g., New Years Day, 4th of July, Thanksgiving, and Christmas). First create a logical column that is either TRUE or FALSE, and then create a character column that either gives the name of the holiday or is NA.

12.6 Summary

        The definition of a logical vector is simple because each value must be either TRUE, FALSE, or NA. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with >, <, <=, >=, ==, !=, and is.na(), how to combine them with !, &, and |, and how to summarize them with any(), all(), sum(), and mean(). You also learned the powerful if_else() and case_when() functions that allow you to return values depending on the value of a logical vector.


We’ll see logical vectors again and again in the following chapters. For example in Chapter 14 you’ll learn about str_detect(x, pattern), which returns a logical vector that’s TRUE for the elements of x that match the pattern, and in Chapter 17 you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors.

1. R normally calls print for you (i.e. x is a shortcut for print(x)), but calling it explicitly is useful if you want to provide other arguments.
2. That is, xor(x, y) is true if x is true, or y is true, but not both. This is how we usually use “or” in English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.
3. We’ll cover this in Chapter 19.
4. dplyr’s if_else() is very similar to base R’s ifelse(). There are two main advantages of if_else() over ifelse(): you can choose what should happen to missing values, and if_else() is much more likely to give you a meaningful error if your variables have incompatible types.
diff --git a/missing-values.html b/missing-values.html
new file mode 100644
18  Missing values

18.1 Introduction

You’ve already learned the basics of missing values earlier in the book. You first saw them in Chapter 1 where they resulted in a warning when making a plot, as well as in Section 3.5.2 where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section 12.2.2. Now we’ll come back to them in more depth, so you can learn more of the details.


        We’ll start by discussing some general tools for working with missing values recorded as NAs. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.

18.1.1 Prerequisites

        The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.
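The corresponding code chunk was lost in rendering; presumably it simply reads:

library(tidyverse)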


18.2 Explicit missing values

To begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an NA.

18.2.1 Last observation carried forward

        A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):

treatment <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

        You can fill in these missing values with tidyr::fill(). It works like select(), taking a set of columns:

treatment |>
  fill(everything())
#> # A tibble: 4 × 3
#>   person           treatment response
#>   <chr>                <dbl>    <dbl>
#> 1 Derrick Whitmore         1        7
#> 2 Derrick Whitmore         2       10
#> 3 Derrick Whitmore         3       10
#> 4 Katherine Burke          1        4

        This treatment is sometimes called “last observation carried forward”, or locf for short. You can use the .direction argument to fill in missing values that have been generated in more exotic ways.

18.2.2 Fixed values

Sometimes missing values represent some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them:

x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#> [1] 1 4 5 7 0

        Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.


        If possible, handle this when reading in the data, for example, by using the na argument to readr::read_csv(), e.g., read_csv(path, na = "99"). If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if():

x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
#> [1]  1  4  5  7 NA

18.2.3 NaN

        Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a NaN (pronounced “nan”), or not a number. It’s not that important to know about because it generally behaves just like NA:

x <- c(NA, NaN)
x * 10
#> [1]  NA NaN
x == 1
#> [1] NA NA
is.na(x)
#> [1] TRUE TRUE

        In the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).


        You’ll generally encounter a NaN when you perform a mathematical operation that has an indeterminate result:

0 / 0 
#> [1] NaN
0 * Inf
#> [1] NaN
Inf - Inf
#> [1] NaN
sqrt(-1)
#> Warning in sqrt(-1): NaNs produced
#> [1] NaN

18.3 Implicit missing values

        So far we’ve talked about missing values that are explicitly missing, i.e. you can see an NA in your data. But missing values can also be implicitly missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple dataset that records the price of some stock each quarter:

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

        This dataset has two missing observations:

• The price in the fourth quarter of 2020 is explicitly missing, because its value is NA.
• The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.

        One way to think about the difference is with this Zen-like koan:

An explicit missing value is the presence of an absence.

An implicit missing value is the absence of a presence.

        Sometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.

18.3.1 Pivoting

        You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot stocks to put the quarter in the columns, both missing values become explicit:

stocks |>
  pivot_wider(
    names_from = qtr, 
    values_from = price
  )
#> # A tibble: 2 × 5
#>    year   `1`   `2`   `3`   `4`
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  2020  1.88  0.59  0.35 NA   
#> 2  2021 NA     0.92  0.17  2.66

By default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting values_drop_na = TRUE. See the examples in Section 5.2 for more details.

18.3.2 Complete

        tidyr::complete() allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of year and qtr should exist in the stocks data:

stocks |>
  complete(year, qtr)
#> # A tibble: 8 × 3
#>    year   qtr price
#>   <dbl> <dbl> <dbl>
#> 1  2020     1  1.88
#> 2  2020     2  0.59
#> 3  2020     3  0.35
#> 4  2020     4 NA   
#> 5  2021     1 NA   
#> 6  2021     2  0.92
#> # ℹ 2 more rows

        Typically, you’ll call complete() with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the stocks dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for year:

stocks |>
  complete(year = 2019:2021, qtr)
#> # A tibble: 12 × 3
#>    year   qtr price
#>   <dbl> <dbl> <dbl>
#> 1  2019     1 NA   
#> 2  2019     2 NA   
#> 3  2019     3 NA   
#> 4  2019     4 NA   
#> 5  2020     1  1.88
#> 6  2020     2  0.59
#> # ℹ 6 more rows

        If the range of a variable is correct, but not all values are present, you could use full_seq(x, 1) to generate all values from min(x) to max(x) spaced out by 1.
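For example (the vector here is made up for illustration):

x <- c(2, 3, 5, 8)
full_seq(x, 1)
#> [1] 2 3 4 5 6 7 8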


        In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what complete() does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with dplyr::full_join().
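A sketch of that manual approach, using the stocks data from above (tidyr::expand_grid() is just one way to build the scaffold; the right technique will depend on your data):

# all the rows that should exist
all_rows <- expand_grid(year = 2019:2021, qtr = 1:4)
all_rows |>
  full_join(stocks, by = c("year", "qtr"))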

18.3.3 Joins

This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in Chapter 19, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.


        dplyr::anti_join(x, y) is a particularly useful tool here because it selects only the rows in x that don’t have a match in y. For example, we can use two anti_join()s to reveal that we’re missing information for four airports and 722 planes mentioned in flights:

library(nycflights13)

flights |> 
  distinct(faa = dest) |> 
  anti_join(airports)
#> Joining with `by = join_by(faa)`
#> # A tibble: 4 × 1
#>   faa  
#>   <chr>
#> 1 BQN  
#> 2 SJU  
#> 3 STT  
#> 4 PSE

flights |> 
  distinct(tailnum) |> 
  anti_join(planes)
#> Joining with `by = join_by(tailnum)`
#> # A tibble: 722 × 1
#>   tailnum
#>   <chr>  
#> 1 N3ALAA 
#> 2 N3DUAA 
#> 3 N542MQ 
#> 4 N730MQ 
#> 5 N9EAMQ 
#> 6 N532UA 
#> # ℹ 716 more rows

18.3.4 Exercises

1. Can you find any relationship between the carrier and the rows that appear to be missing from planes?

18.4 Factors and empty groups

        A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)

        And we want to count the number of smokers with dplyr::count():

health |> count(smoker)
#> # A tibble: 1 × 2
#>   smoker     n
#>   <fct>  <int>
#> 1 no         5

This dataset only contains non-smokers, but we know that smokers exist; the group of non-smokers is empty. We can ask count() to keep all the groups, even those not seen in the data, by using .drop = FALSE:

health |> count(smoker, .drop = FALSE)
#> # A tibble: 2 × 2
#>   smoker     n
#>   <fct>  <int>
#> 1 yes        0
#> 2 no         5

        The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying drop = FALSE to the appropriate discrete axis:

ggplot(health, aes(x = smoker)) +
  geom_bar() +
  scale_x_discrete()

ggplot(health, aes(x = smoker)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

[Left: a bar chart with a single value on the x-axis, “no”. Right: the same bar chart, but now with two values on the x-axis, “yes” and “no”; there is no bar for the “yes” category.]

        The same problem comes up more generally with dplyr::group_by(). And again you can use .drop = FALSE to preserve all factor levels:

health |> 
  group_by(smoker, .drop = FALSE) |> 
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
#> # A tibble: 2 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 yes        0      NaN     Inf    -Inf   NA  
#> 2 no         5       60      34      88   21.6

        We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.

# A vector containing two missing values
x1 <- c(NA, NA)
length(x1)
#> [1] 2

# A vector containing nothing
x2 <- numeric()
length(x2)
#> [1] 0

All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see mean(age) returning NaN because mean(age) = sum(age)/length(age), which here is 0/0. max() and min() return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute you’ll get the minimum or maximum of the new data¹.
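You can see this directly on an empty vector:

x <- numeric()
mean(x)
#> [1] NaN
min(x)
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> [1] Inf
max(x)
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> [1] -Inf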


        Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with complete().

health |> 
  group_by(smoker) |> 
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  ) |> 
  complete(smoker)
#> # A tibble: 2 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 yes       NA       NA      NA      NA   NA  
#> 2 no         5       60      34      88   21.6

        The main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.

18.5 Summary

        Missing values are weird! Sometimes they’re recorded as an explicit NA but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit can become explicit and vice versa.


        In the next chapter, we tackle the final chapter in this part of the book: joins. This is a bit of a change from the chapters so far because we’re going to discuss tools that work with data frames as a whole, not something that you put inside a data frame.

1. In other words, min(c(x, y)) is always equal to min(min(x), min(y)).
diff --git a/numbers.html b/numbers.html
new file mode 100644
13  Numbers

13.1 Introduction

        Numeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.


        We’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of count(). Then we’ll dive into various numeric transformations that pair well with mutate(), including more general transformations that can be applied to other types of vectors, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with summarize() and show you how they can also be used with mutate().

13.1.1 Prerequisites

        This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like mutate() and filter(). Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with c() and tribble().
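As before, the chunk itself was lost in rendering; presumably it reads:

library(tidyverse)
library(nycflights13)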


13.2 Making numbers

        In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.


        readr provides two useful functions for parsing strings into numbers: parse_double() and parse_number(). Use parse_double() when you have numbers that have been written as strings:

x <- c("1.2", "5.6", "1e3")
parse_double(x)
#> [1]    1.2    5.6 1000.0

        Use parse_number() when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:

x <- c("$1,234", "USD 3,513", "59%")
parse_number(x)
#> [1] 1234 3513   59

13.3 Counts

        It’s surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with count(). This function is great for quick exploration and checks during analysis:

flights |> count(dest)
#> # A tibble: 105 × 2
#>   dest      n
#>   <chr> <int>
#> 1 ABQ     254
#> 2 ACK     265
#> 3 ALB     439
#> 4 ANC       8
#> 5 ATL   17215
#> 6 AUS    2439
#> # ℹ 99 more rows

(Despite the advice in Chapter 4, we usually put count() on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)


        If you want to see the most common values, add sort = TRUE:

flights |> count(dest, sort = TRUE)
#> # A tibble: 105 × 2
#>   dest      n
#>   <chr> <int>
#> 1 ORD   17283
#> 2 ATL   17215
#> 3 LAX   16174
#> 4 BOS   15508
#> 5 MCO   14082
#> 6 CLT   14064
#> # ℹ 99 more rows

        And remember that if you want to see all the values, you can use |> View() or |> print(n = Inf).


        You can perform the same computation “by hand” with group_by(), summarize() and n(). This is useful because it allows you to compute other summaries at the same time:

flights |> 
  group_by(dest) |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  )
#> # A tibble: 105 × 3
#>   dest      n delay
#>   <chr> <int> <dbl>
#> 1 ABQ     254  4.38
#> 2 ACK     265  4.85
#> 3 ALB     439 14.4 
#> 4 ANC       8 -2.5 
#> 5 ATL   17215 11.3 
#> 6 AUS    2439  6.02
#> # ℹ 99 more rows

        n() is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:

n()
#> Error in `n()`:
#> ! Must only be used inside data-masking verbs like `mutate()`,
#>   `filter()`, and `group_by()`.

        There are a couple of variants of n() and count() that you might find useful:

• n_distinct(x) counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:

  flights |> 
    group_by(dest) |> 
    summarize(carriers = n_distinct(carrier)) |> 
    arrange(desc(carriers))
  #> # A tibble: 105 × 2
  #>   dest  carriers
  #>   <chr>    <int>
  #> 1 ATL          7
  #> 2 BOS          7
  #> 3 CLT          7
  #> 4 ORD          7
  #> 5 TPA          7
  #> 6 AUS          6
  #> # ℹ 99 more rows
• A weighted count is a sum. For example, you could “count” the number of miles each plane flew:

  flights |> 
    group_by(tailnum) |> 
    summarize(miles = sum(distance))
  #> # A tibble: 4,044 × 2
  #>   tailnum  miles
  #>   <chr>    <dbl>
  #> 1 D942DN    3418
  #> 2 N0EGMQ  250866
  #> 3 N10156  115966
  #> 4 N102UW   25722
  #> 5 N103US   24619
  #> 6 N104UW   25157
  #> # ℹ 4,038 more rows

  Weighted counts are a common problem, so count() has a wt argument that does the same thing:

  flights |> count(tailnum, wt = distance)
• You can count missing values by combining sum() and is.na(). In the flights dataset this represents flights that are cancelled:

  flights |> 
    group_by(dest) |> 
    summarize(n_cancelled = sum(is.na(dep_time)))
  #> # A tibble: 105 × 2
  #>   dest  n_cancelled
  #>   <chr>       <int>
  #> 1 ABQ             0
  #> 2 ACK             0
  #> 3 ALB            20
  #> 4 ANC             0
  #> 5 ATL           317
  #> 6 AUS            21
  #> # ℹ 99 more rows

13.3.1 Exercises

1. How can you use count() to count the number of rows with a missing value for a given variable?

2. Expand the following calls to count() to instead use group_by(), summarize(), and arrange():

   1. flights |> count(dest, sort = TRUE)
   2. flights |> count(tailnum, wt = distance)

13.4 Numeric transformations

        Transformation functions work well with mutate() because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It’s impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don’t list them here because they’re rarely needed for data science.


13.4.1 Arithmetic and recycling rules

We introduced the basics of arithmetic (+, -, *, /, ^) in Chapter 2 and have used them a bunch since. These functions don’t need a huge amount of explanation because they do what you learned in grade school. But we need to briefly talk about the recycling rules, which determine what happens when the left and right hand sides have different lengths. This is important for operations like flights |> mutate(air_time = air_time / 60) because there are 336,776 numbers on the left of / but only one on the right.


        R handles mismatched lengths by recycling, or repeating, the short vector. We can see this in operation more easily if we create some vectors outside of a data frame:

x <- c(1, 2, 10, 20)
x / 5
#> [1] 0.2 0.4 2.0 4.0
# is shorthand for
x / c(5, 5, 5, 5)
#> [1] 0.2 0.4 2.0 4.0

        Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector. It usually (but not always) gives you a warning if the longer vector isn’t a multiple of the shorter:

x * c(1, 2)
#> [1]  1  4 10 40
x * c(1, 2, 3)
#> Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter
#> object length
#> [1]  1  4 30 20

        These recycling rules are also applied to logical comparisons (==, <, <=, >, >=, !=) and can lead to a surprising result if you accidentally use == instead of %in% and the data frame has an unfortunate number of rows. For example, take this code which attempts to find all flights in January and February:

flights |> 
  filter(month == c(1, 2))
#> # A tibble: 25,977 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      542            540         2      923            850
#> 3  2013     1     1      554            600        -6      812            837
#> 4  2013     1     1      555            600        -5      913            854
#> 5  2013     1     1      557            600        -3      838            846
#> 6  2013     1     1      558            600        -2      849            851
#> # ℹ 25,971 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

        The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because flights has an even number of rows.


        To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function ==, not filter().
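The robust fix is to test membership with %in% instead of relying on recycling; a minimal sketch (output omitted):

# Finds all flights that departed in January or February,
# regardless of how many rows `flights` has
flights |> 
  filter(month %in% c(1, 2))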


13.4.2 Minimum and maximum

        The arithmetic functions work with pairs of variables. Two closely related functions are pmin() and pmax(), which when given two or more variables will return the smallest or largest value in each row:

df <- tribble(
  ~x, ~y,
  1,  3,
  5,  2,
  7, NA,
)

df |> 
  mutate(
    min = pmin(x, y, na.rm = TRUE),
    max = pmax(x, y, na.rm = TRUE)
  )
#> # A tibble: 3 × 4
#>       x     y   min   max
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     3     1     3
#> 2     5     2     2     5
#> 3     7    NA     7     7

        Note that these are different to the summary functions min() and max() which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value:

df |> 
  mutate(
    min = min(x, y, na.rm = TRUE),
    max = max(x, y, na.rm = TRUE)
  )
#> # A tibble: 3 × 4
#>       x     y   min   max
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     3     1     7
#> 2     5     2     1     7
#> 3     7    NA     1     7

13.4.3 Modular arithmetic

        Modular arithmetic is the technical name for the type of math you did before you learned about decimal places, i.e. division that yields a whole number and a remainder. In R, %/% does integer division and %% computes the remainder:

1:10 %/% 3
#>  [1] 0 0 1 1 1 2 2 2 3 3
1:10 %% 3
#>  [1] 1 2 0 1 2 0 1 2 0 1

        Modular arithmetic is handy for the flights dataset, because we can use it to unpack the sched_dep_time variable into hour and minute:

flights |> 
  mutate(
    hour = sched_dep_time %/% 100,
    minute = sched_dep_time %% 100,
    .keep = "used"
  )
#> # A tibble: 336,776 × 3
#>   sched_dep_time  hour minute
#>            <int> <dbl>  <dbl>
#> 1            515     5     15
#> 2            529     5     29
#> 3            540     5     40
#> 4            545     5     45
#> 5            600     6      0
#> 6            558     5     58
#> # ℹ 336,770 more rows

We can combine that with the mean(is.na(x)) trick from Section 12.4 to see how the proportion of cancelled flights varies over the course of the day. The results are shown in Figure 13.1.

flights |> 
  group_by(hour = sched_dep_time %/% 100) |> 
  summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |> 
  filter(hour > 1) |> 
  ggplot(aes(x = hour, y = prop_cancelled)) +
  geom_line(color = "grey50") + 
  geom_point(aes(size = n))

        A line plot showing how proportion of cancelled flights changes over the course of the day. The proportion starts low at around 0.5% at 6am, then steadily increases over the course of the day until peaking at 4% at 7pm. The proportion of cancelled flights then drops rapidly getting down to around 1% by midnight.

Figure 13.1: A line plot with scheduled departure hour on the x-axis and proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.

13.4.4 Logarithms

        Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude and converting exponential growth to linear growth. In R, you have a choice of three logarithms: log() (the natural log, base e), log2() (base 2), and log10() (base 10). We recommend using log2() or log10(). log2() is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas log10() is easy to back-transform because (e.g.) 3 is 10^3 = 1000. The inverse of log() is exp(); to compute the inverse of log2() or log10() you’ll need to use 2^ or 10^.
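A quick illustration of these relationships (a minimal sketch with values of our own choosing):

log2(c(1, 2, 4, 8))
#> [1] 0 1 2 3
log10(c(1, 10, 100, 1000))
#> [1] 0 1 2 3
2^log2(8)       # invert log2() with 2^
#> [1] 8
10^log10(1000)  # invert log10() with 10^
#> [1] 1000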


13.4.5 Rounding

        Use round(x) to round a number to the nearest integer:

round(123.456)
#> [1] 123

You can control the precision of the rounding with the second argument, digits. round(x, digits) rounds to the nearest 10^-digits, so digits = 2 will round to the nearest 0.01. This definition is useful because it implies round(x, -3) will round to the nearest thousand, which indeed it does:

round(123.456, 2)  # two digits
#> [1] 123.46
round(123.456, 1)  # one digit
#> [1] 123.5
round(123.456, -1) # round to nearest ten
#> [1] 120
round(123.456, -2) # round to nearest hundred
#> [1] 100

        There’s one weirdness with round() that seems surprising at first glance:

round(c(1.5, 2.5))
#> [1] 2 2

        round() uses what’s known as “round half to even” or Banker’s rounding: if a number is half way between two integers, it will be rounded to the even integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.


        round() is paired with floor() which always rounds down and ceiling() which always rounds up:

x <- 123.456

floor(x)
#> [1] 123
ceiling(x)
#> [1] 124

        These functions don’t have a digits argument, so you can instead scale down, round, and then scale back up:

# Round down to nearest two digits
floor(x / 0.01) * 0.01
#> [1] 123.45
# Round up to nearest two digits
ceiling(x / 0.01) * 0.01
#> [1] 123.46

        You can use the same technique if you want to round() to a multiple of some other number:

# Round to nearest multiple of 4
round(x / 4) * 4
#> [1] 124

# Round to nearest 0.25
round(x / 0.25) * 0.25
#> [1] 123.5

13.4.6 Cutting numbers into ranges

Use cut()¹ to break up (aka bin) a numeric vector into discrete buckets:

x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20))
#> [1] (0,5]   (0,5]   (0,5]   (5,10]  (10,15] (15,20]
#> Levels: (0,5] (5,10] (10,15] (15,20]

        The breaks don’t need to be evenly spaced:

cut(x, breaks = c(0, 5, 10, 100))
#> [1] (0,5]    (0,5]    (0,5]    (5,10]   (10,100] (10,100]
#> Levels: (0,5] (5,10] (10,100]

You can optionally supply your own labels. Note that there should be one fewer label than there are breaks.

cut(x, 
  breaks = c(0, 5, 10, 15, 20), 
  labels = c("sm", "md", "lg", "xl")
)
#> [1] sm sm sm md lg xl
#> Levels: sm md lg xl

        Any values outside of the range of the breaks will become NA:

y <- c(NA, -10, 5, 10, 30)
cut(y, breaks = c(0, 5, 10, 15, 20))
#> [1] <NA>   <NA>   (0,5]  (5,10] <NA>  
#> Levels: (0,5] (5,10] (10,15] (15,20]

See the documentation for other useful arguments like right and include.lowest, which control whether the intervals are [a, b) or (a, b] and whether the lowest interval should be [a, b].
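For instance, right = FALSE flips the intervals to be closed on the left (a small sketch reusing the x from above; note how 5 changes bucket and 20 now falls outside the breaks):

cut(x, breaks = c(0, 5, 10, 15, 20), right = FALSE)
#> [1] [0,5)   [0,5)   [5,10)  [10,15) [15,20) <NA>   
#> Levels: [0,5) [5,10) [10,15) [15,20)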


13.4.7 Cumulative and rolling aggregates

        Base R provides cumsum(), cumprod(), cummin(), cummax() for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean() for cumulative means. Cumulative sums tend to come up the most in practice:

x <- 1:10
cumsum(x)
#>  [1]  1  3  6 10 15 21 28 36 45 55

        If you need more complex rolling or sliding aggregates, try the slider package.
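For a taste, here’s a minimal sketch of a rolling mean with slider (assuming the package is installed; slide_dbl() and its .before argument are part of slider’s documented API):

library(slider)

# Mean over the current value and the two values before it
slide_dbl(x, mean, .before = 2)
#>  [1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0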


13.4.8 Exercises

1. Explain in words what each line of the code used to generate Figure 13.1 does.

2. What trigonometric functions does R provide? Guess some names and look up the documentation. Do they use degrees or radians?

3. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem by running the code below: there’s a gap between each hour.

   flights |> 
     filter(month == 1, day == 1) |> 
     ggplot(aes(x = sched_dep_time, y = dep_delay)) +
     geom_point()

          Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).

4. Round dep_time and arr_time to the nearest five minutes.

13.5 General transformations

        The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.


13.5.1 Ranks

        dplyr provides a number of ranking functions inspired by SQL, but you should always start with dplyr::min_rank(). It uses the typical method for dealing with ties, e.g., 1st, 2nd, 2nd, 4th.

x <- c(1, 2, 2, 3, 4, NA)
min_rank(x)
#> [1]  1  2  2  4  5 NA

        Note that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks:

min_rank(desc(x))
#> [1]  5  3  3  2  1 NA

        If min_rank() doesn’t do what you need, look at the variants dplyr::row_number(), dplyr::dense_rank(), dplyr::percent_rank(), and dplyr::cume_dist(). See the documentation for details.

df <- tibble(x = x)
df |> 
  mutate(
    row_number = row_number(x),
    dense_rank = dense_rank(x),
    percent_rank = percent_rank(x),
    cume_dist = cume_dist(x)
  )
#> # A tibble: 6 × 5
#>       x row_number dense_rank percent_rank cume_dist
#>   <dbl>      <int>      <int>        <dbl>     <dbl>
#> 1     1          1          1         0          0.2
#> 2     2          2          2         0.25       0.6
#> 3     2          3          2         0.25       0.6
#> 4     3          4          3         0.75       0.8
#> 5     4          5          4         1          1  
#> 6    NA         NA         NA        NA         NA

        You can achieve many of the same results by picking the appropriate ties.method argument to base R’s rank(); you’ll probably also want to set na.last = "keep" to keep NAs as NA.
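For example, this base R call reproduces min_rank() on the x from above (a sketch; ties.method = "min" is what produces the 1st, 2nd, 2nd, 4th pattern):

rank(x, ties.method = "min", na.last = "keep")
#> [1]  1  2  2  4  5 NA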


        row_number() can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row. When combined with %% or %/% this can be a useful tool for dividing data into similarly sized groups:

df <- tibble(id = 1:10)

df |> 
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3,
    three_in_each_group = row0 %/% 3
  )
#> # A tibble: 10 × 4
#>      id  row0 three_groups three_in_each_group
#>   <int> <dbl>        <dbl>               <dbl>
#> 1     1     0            0                   0
#> 2     2     1            1                   0
#> 3     3     2            2                   0
#> 4     4     3            0                   1
#> 5     5     4            1                   1
#> 6     6     5            2                   1
#> # ℹ 4 more rows

13.5.2 Offsets

dplyr::lead() and dplyr::lag() allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with NAs at the start or end:

x <- c(2, 5, 11, 11, 19, 35)
lag(x)
#> [1] NA  2  5 11 11 19
lead(x)
#> [1]  5 11 11 19 35 NA
• x - lag(x) gives you the difference between the current and previous value.

  x - lag(x)
  #> [1] NA  3  6  0  8 16
• x == lag(x) tells you when the current value changes.

  x == lag(x)
  #> [1]    NA FALSE FALSE  TRUE FALSE FALSE

        You can lead or lag by more than one position by using the second argument, n.
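For example, with the same x as above:

lag(x, n = 2)
#> [1] NA NA  2  5 11 11
lead(x, n = 2)
#> [1] 11 11 19 35 NA NA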


13.5.3 Consecutive identifiers

Sometimes you want to start a new group every time some event occurs. For example, when you’re looking at website data, it’s common to want to break up events into sessions, where you begin a new session after a gap of more than x minutes since the last activity. For example, imagine you have the times when someone visited a website:

events <- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)

        And you’ve computed the time between each event, and figured out if there’s a gap that’s big enough to qualify:

events <- events |> 
  mutate(
    diff = time - lag(time, default = first(time)),
    has_gap = diff >= 5
  )
events
#> # A tibble: 14 × 3
#>    time  diff has_gap
#>   <dbl> <dbl> <lgl>  
#> 1     0     0 FALSE  
#> 2     1     1 FALSE  
#> 3     2     1 FALSE  
#> 4     3     1 FALSE  
#> 5     5     2 FALSE  
#> 6    10     5 TRUE   
#> # ℹ 8 more rows

But how do we go from that logical vector to something that we can group_by()? cumsum(), from Section 13.4.7, comes to the rescue: each gap (i.e., each row where has_gap is TRUE) increments group by one (Section 12.4.2):

events |> mutate(
  group = cumsum(has_gap)
)
#> # A tibble: 14 × 4
#>    time  diff has_gap group
#>   <dbl> <dbl> <lgl>   <int>
#> 1     0     0 FALSE       0
#> 2     1     1 FALSE       0
#> 3     2     1 FALSE       0
#> 4     3     1 FALSE       0
#> 5     5     2 FALSE       0
#> 6    10     5 TRUE        1
#> # ℹ 8 more rows

        Another approach for creating grouping variables is consecutive_id(), which starts a new group every time one of its arguments changes. For example, inspired by this stackoverflow question, imagine you have a data frame with a bunch of repeated values:

df <- tibble(
  x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)

        If you want to keep the first row from each repeated x, you could use group_by(), consecutive_id(), and slice_head():

df |> 
  group_by(id = consecutive_id(x)) |> 
  slice_head(n = 1)
#> # A tibble: 7 × 3
#> # Groups:   id [7]
#>   x         y    id
#>   <chr> <dbl> <int>
#> 1 a         1     1
#> 2 b         2     2
#> 3 c         4     3
#> 4 d         3     4
#> 5 e         9     5
#> 6 a         4     6
#> # ℹ 1 more row

13.5.4 Exercises

        1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

2. Which plane (tailnum) has the worst on-time record?

3. What time of day should you fly if you want to avoid delays as much as possible?

4. What does flights |> group_by(dest) |> filter(row_number() < 4) do? What does flights |> group_by(dest) |> filter(row_number(dep_delay) < 4) do?

5. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

6. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the average flight delay for an hour is related to the average delay for the previous hour.

   flights |> 
     mutate(hour = dep_time %/% 100) |> 
     group_by(year, month, day, hour) |> 
     summarize(
       dep_delay = mean(dep_delay, na.rm = TRUE),
       n = n(),
       .groups = "drop"
     ) |> 
     filter(n > 5)
7. Look at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

8. Find all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.

13.6 Numeric summaries

        Just using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.


13.6.1 Center

So far, we’ve mostly used mean() to summarize the center of a vector of values. As we’ve seen in Section 3.6, because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the median(), which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.


Figure 13.2 compares the mean vs. the median departure delay (in minutes) for each day. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.

flights |>
  group_by(year, month, day) |>
  summarize(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |> 
  ggplot(aes(x = mean, y = median)) + 
  geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) +
  geom_point()

        All points fall below a 45° line, meaning that the median delay is always less than the mean delay. Most points are clustered in a dense region of mean [0, 20] and median [0, 5]. As the mean delay increases, the spread of the median also increases. There are two outlying points with mean ~60, median ~50, and mean ~85, median ~55.

Figure 13.2: A scatterplot showing the differences of summarizing daily departure delay with the median instead of the mean.

You might also wonder about the mode, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn’t work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there’s no mode function included in base R².


13.6.2 Minimum, maximum, and quantiles

What if you’re interested in locations other than the center? min() and max() will give you the smallest and largest values. Another powerful tool is quantile(), which is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values, quantile(x, 0.5) is equivalent to the median, and quantile(x, 0.95) will find the value that’s greater than 95% of the values.


        For the flights data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.

flights |>
  group_by(year, month, day) |>
  summarize(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .groups = "drop"
  )
#> # A tibble: 365 × 5
#>    year month   day   max   q95
#>   <int> <int> <int> <dbl> <dbl>
#> 1  2013     1     1   853  70.1
#> 2  2013     1     2   379  85  
#> 3  2013     1     3   291  68  
#> 4  2013     1     4   288  60  
#> 5  2013     1     5   327  41  
#> 6  2013     1     6   202  51  
#> # ℹ 359 more rows

13.6.3 Spread

        Sometimes you’re not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, sd(x), and the inter-quartile range, IQR(). We won’t explain sd() here since you’re probably already familiar with it, but IQR() might be new — it’s quantile(x, 0.75) - quantile(x, 0.25) and gives you the range that contains the middle 50% of the data.


        We can use this to reveal a small oddity in the flights data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below reveals a data oddity for airport EGE:

flights |> 
  group_by(origin, dest) |> 
  summarize(
    distance_sd = IQR(distance), 
    n = n(),
    .groups = "drop"
  ) |> 
  filter(distance_sd > 0)
#> # A tibble: 2 × 4
#>   origin dest  distance_sd     n
#>   <chr>  <chr>       <dbl> <int>
#> 1 EWR    EGE             1   110
#> 2 JFK    EGE             1   103

13.6.4 Distributions

        It’s worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they’re fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That’s why it’s always a good idea to visualize the distribution before committing to your summary statistics.


Figure 13.3 shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.


        Two histograms of `dep_delay`. On the left, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the right, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that.

Figure 13.3: (Left) The histogram of the full data is extremely skewed, making it hard to get any details. (Right) Zooming into delays of less than two hours makes it possible to see what’s happening with the bulk of the observations.

        It’s also a good idea to check that distributions for subgroups resemble the whole. In the following plot 365 frequency polygons of dep_delay, one for each day, are overlaid. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.

flights |>
  filter(dep_delay < 120) |> 
  ggplot(aes(x = dep_delay, group = interaction(day, month))) + 
  geom_freqpoly(binwidth = 5, alpha = 1/5)

The distribution of `dep_delay` is highly right-skewed, with a strong peak slightly less than 0. The 365 frequency polygons are mostly overlapping, forming a thick black band.


Don’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs. the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don’t forget what you learned in Section 3.6: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.


13.6.5 Positions

        There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position: first(x), last(x), and nth(x, n).


        For example, we can find the first and last departure for each day:

flights |> 
  group_by(year, month, day) |> 
  summarize(
    first_dep = first(dep_time, na_rm = TRUE), 
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE)
  )
#> `summarise()` has grouped output by 'year', 'month'. You can override using
#> the `.groups` argument.
#> # A tibble: 365 × 6
#> # Groups:   year, month [12]
#>    year month   day first_dep fifth_dep last_dep
#>   <int> <int> <int>     <int>     <int>    <int>
#> 1  2013     1     1       517       554     2356
#> 2  2013     1     2        42       535     2354
#> 3  2013     1     3        32       520     2349
#> 4  2013     1     4        25       531     2358
#> 5  2013     1     5        14       534     2357
#> 6  2013     1     6        16       555     2355
#> # ℹ 359 more rows

(NB: Because dplyr functions use _ to separate components of function and argument names, these functions use na_rm instead of na.rm.)


If you’re familiar with [, which we’ll come back to in Section 27.2, you might wonder if you ever need these functions. There are three reasons: the default argument allows you to provide a default if the specified position doesn’t exist, the order_by argument allows you to locally override the order of the rows, and the na_rm argument allows you to drop missing values.
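A small sketch of those arguments on a toy vector (vals is our own example object):

vals <- c(NA, 2, 3)
first(vals)
#> [1] NA
first(vals, na_rm = TRUE)
#> [1] 2
nth(vals, 5, default = -99)  # position 5 doesn't exist
#> [1] -99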


        Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:

flights |> 
  group_by(year, month, day) |> 
  mutate(r = min_rank(sched_dep_time)) |> 
  filter(r %in% c(1, max(r)))
#> # A tibble: 1,195 × 20
#> # Groups:   year, month, day [365]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1     2353           2359        -6      425            445
#> 3  2013     1     1     2353           2359        -6      418            442
#> 4  2013     1     1     2356           2359        -3      425            437
#> 5  2013     1     2       42           2359        43      518            442
#> 6  2013     1     2      458            500        -2      703            650
#> # ℹ 1,189 more rows
#> # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …

13.6.6 With mutate()

As the names suggest, the summary functions are typically paired with summarize(). However, because of the recycling rules we discussed in Section 13.4.1 they can also be usefully paired with mutate(), particularly when you want to do some sort of group standardization. For example (a code sketch follows the list):

• x / sum(x) calculates the proportion of a total.
• (x - mean(x)) / sd(x) computes a Z-score (standardized to mean 0 and sd 1).
• (x - min(x)) / (max(x) - min(x)) standardizes to range [0, 1].
• x / first(x) computes an index based on the first observation.
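Here’s a minimal sketch of the Z-score recipe applied within groups (our own example; the column name delay_z is arbitrary):

flights |> 
  group_by(dest) |> 
  mutate(
    # standardize each flight's delay relative to its destination
    delay_z = (arr_delay - mean(arr_delay, na.rm = TRUE)) / 
      sd(arr_delay, na.rm = TRUE)
  )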

13.6.7 Exercises

        1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. When is mean() useful? When is median() useful? When might you want to use something else? Should you use arrival delay or departure delay? Why might you want to use data from planes?

2. Which destinations show the greatest variation in air speed?

3. Create a plot to further explore the adventures of EGE. Can you find any evidence that the airport moved locations? Can you find another variable that might explain the difference?

13.7 Summary

        You’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.


        Over the next two chapters, we’ll dive into working with strings with the stringr package. Strings are a big topic so they get two chapters, one on the fundamentals of strings and one on regular expressions.


1. ggplot2 provides some helpers for common cases in cut_interval(), cut_number(), and cut_width(). ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.↩︎

2. The mode() function does something quite different!↩︎
diff --git a/program.html b/program.html

        Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.

In the following three chapters, you’ll learn skills to improve your programming:

1. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. Instead, in Chapter 25, you’ll learn how to write functions which let you extract out repeated tidyverse code so that it can be easily reused.

2. Functions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for iteration that let you do similar things again and again. These tools include for loops and functional programming, which you’ll learn about in Chapter 26.

3. As you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In Chapter 27, you’ll learn some of the most important base R functions that you’ll see in the wild.

The goal of these chapters is to teach you the minimum about programming that you need for data science. Once you have mastered the material here, we strongly recommend that you continue to invest in your programming skills. We’ve written two books that you might find helpful. Hands on Programming with R, by Garrett Grolemund, is an introduction to R as a programming language and is a great place to start if R is your first programming language. Advanced R by Hadley Wickham dives into the details of R the programming language; it’s a great place to start if you have existing programming experience and a great next step once you’ve internalized the ideas in these chapters.

diff --git a/quarto-formats.html b/quarto-formats.html

        29  Quarto formats

        +

29.1 Introduction


        So far, you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.


        There are two ways to set the output of a document:

1. Permanently, by modifying the YAML header:

   title: "Diamond sizes"
   format: html
2. Transiently, by calling quarto::quarto_render() by hand:

   quarto::quarto_render("diamond-sizes.qmd", output_format = "docx")

          This is useful if you want to programmatically produce multiple types of output since the output_format argument can also take a list of values.

   quarto::quarto_render("diamond-sizes.qmd", output_format = c("docx", "pdf"))

29.2 Output options

        Quarto offers a wide range of output formats. You can find the complete list at https://quarto.org/docs/output-formats/all-formats.html. Many formats share some output options (e.g., toc: true for including a table of contents), but others have options that are format specific (e.g., code-fold: true collapses code chunks into a <details> tag for HTML output so the user can display it on demand, it’s not applicable in a PDF or Word document).


        To override the default options, you need to use an expanded format field. For example, if you wanted to render an html with a floating table of contents, you’d use:

format:
  html:
    toc: true
    toc_float: true

        You can even render to multiple outputs by supplying a list of formats:

format:
  html:
    toc: true
    toc_float: true
  pdf: default
  docx: default

        Note the special syntax (pdf: default) if you don’t want to override any default options.


        To render to all formats specified in the YAML of a document, you can use output_format = "all".

quarto::quarto_render("diamond-sizes.qmd", output_format = "all")

29.3 Documents

        The previous chapter focused on the default html output. There are several basic variations on that theme, generating different types of documents. For example:

• pdf makes a PDF with LaTeX (an open-source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.
• docx for Microsoft Word (.docx) documents.
• odt for OpenDocument Text (.odt) documents.
• rtf for Rich Text Format (.rtf) documents.
• gfm for a GitHub Flavored Markdown (.md) document.
• ipynb for Jupyter Notebooks (.ipynb).

        Remember, when generating a document to share with decision-makers, you can turn off the default display of code by setting global options in the document YAML:

execute:
  echo: false

        For html documents another option is to make the code chunks hidden by default, but visible with a click:

format:
  html:
    code-fold: true

29.4 Presentations

        You can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (##) level header. Additionally, first (#) level headers indicate the beginning of a new section with a section title slide that is, by default, centered in the middle.
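For example, here’s a minimal sketch of a deck’s source (the titles and content are our own placeholders):

---
title: "Example slides"
format: revealjs
---

# Getting started

## First slide

Content for the first slide.

## Second slide

Content for the second slide.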


        Quarto supports a variety of presentation formats, including:

1. revealjs - HTML presentation with revealjs
2. pptx - PowerPoint presentation
3. beamer - PDF presentation with LaTeX Beamer

        You can read more about creating presentations with Quarto at https://quarto.org/docs/presentations.

29.5 Interactivity

        Just like any HTML document, HTML documents created with Quarto can contain interactive components as well. Here we introduce two options for including interactivity in your Quarto documents: htmlwidgets and Shiny.

29.5.1 htmlwidgets

        HTML is an interactive format, and you can take advantage of that interactivity with htmlwidgets, R functions that produce interactive HTML visualizations. For example, take the leaflet map below. If you’re viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can’t do that in a book, so Quarto automatically inserts a static screenshot for you.

library(leaflet)
leaflet() |>
  setView(174.764, -36.877, zoom = 16) |> 
  addTiles() |>
  addMarkers(174.764, -36.877, popup = "Maungawhau")

        The great thing about htmlwidgets is that you don’t need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don’t need to worry about it.


        There are many packages that provide htmlwidgets, including:

• dygraphs for interactive time series visualizations.
• DT for interactive tables.
• threejs for interactive 3d plots.
• DiagrammeR for diagrams (like flow charts and simple node-link diagrams).

        To learn more about htmlwidgets and see a complete list of packages that provide them visit https://www.htmlwidgets.org.

29.5.2 Shiny

        htmlwidgets provide client-side interactivity — all the interactivity happens in the browser, independently of R. On the one hand, that’s great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use shiny, a package that allows you to create interactivity using R code, not JavaScript.


        To call Shiny code from a Quarto document, add server: shiny to the YAML header:

title: "Shiny Web App"
format: html
server: shiny

        Then you can use the “input” functions to add interactive components to the document:

library(shiny)

textInput("name", "What is your name?")
numericInput("age", "How old are you?", NA, min = 0, max = 150)

        Two input boxes on top of each other. Top one says, "What is your name?", the bottom, "How old are you?".


And you also need a code chunk with chunk option context: server, which contains the code that needs to run in a Shiny server.


        You can then refer to the values with input$name and input$age, and the code that uses them will be automatically re-run whenever they change.
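For instance, a hypothetical server chunk might look like the sketch below (renderText() is a standard shiny function; a matching textOutput("greeting") would display the result in the document body):

```{r}
#| context: server
# Re-runs automatically whenever input$name changes
output$greeting <- renderText(paste0("Hello, ", input$name, "!"))
```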


        We can’t show you a live shiny app here because shiny interactions occur on the server-side. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public-facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.


        For learning more about Shiny, we recommend reading Mastering Shiny by Hadley Wickham, https://mastering-shiny.org.

29.6 Websites and books

        With a bit of additional infrastructure, you can use Quarto to generate a complete website or book:

• Put your .qmd files in a single directory. index.qmd will become the home page.

• Add a YAML file named _quarto.yml that provides the navigation for the site. In this file, set the project type to either book or website, e.g.:

  project:
    type: book

        For example, the following _quarto.yml file creates a website from three source files: index.qmd (the home page), viridis-colors.qmd, and terrain-colors.qmd.

project:
  type: website

website:
  title: "A website on color scales"
  navbar:
    left:
      - href: index.qmd
        text: Home
      - href: viridis-colors.qmd
        text: Viridis colors
      - href: terrain-colors.qmd
        text: Terrain colors

        The _quarto.yml file you need for a book is very similarly structured. The following example shows how you can create a book with four chapters that renders to three different outputs (html, pdf, and epub). Once again, the source files are .qmd files.

project:
  type: book

book:
  title: "A book on color scales"
  author: "Jane Coloriste"
  chapters:
    - index.qmd
    - intro.qmd
    - viridis-colors.qmd
    - terrain-colors.qmd

format:
  html:
    theme: cosmo
  pdf: default
  epub: default

        We recommend that you use an RStudio project for your websites and books. Based on the _quarto.yml file, RStudio will recognize the type of project you’re working on, and add a Build tab to the IDE that you can use to render and preview your websites and books. Both websites and books can also be rendered using quarto::render().


        Read more at https://quarto.org/docs/websites about Quarto websites and https://quarto.org/docs/books about books.

29.7 Other formats

Quarto offers even more output formats; see https://quarto.org/docs/output-formats/all-formats.html for the complete list.

29.8 Summary

In this chapter, we presented a variety of options for communicating your results with Quarto, from static and interactive documents to presentations, websites, and books.

        +

To learn more about effective communication in these different formats, we recommend the resources collected in the Quarto documentation.
diff --git a/quarto.html b/quarto.html

        28  Quarto

        + + +

        +28.1 Introduction

        +

        Quarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.

        +

        Quarto files are designed to be used in three ways:

        +
1. For communicating to decision-makers, who want to focus on the conclusions, not the code behind the analysis.

2. For collaborating with other data scientists (including future you!), who are interested in both your conclusions and how you reached them (i.e. the code).

3. As an environment in which to do data science, as a modern-day lab notebook where you can capture not only what you did, but also what you were thinking.
        Quarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through ?. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation.

        +

If you’re an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. You’re not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system and extends it with native support for multiple programming languages like Python and Julia, in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.

        +

        +28.1.1 Prerequisites

        +

        You need the Quarto command line interface (Quarto CLI), but you don’t need to explicitly install it or load it, as RStudio automatically does both when needed.

        +

        +28.2 Quarto basics

        +

        This is a Quarto file – a plain text file that has the extension .qmd:

        +
        +
        ---
        +title: "Diamond sizes"
        +date: 2022-09-12
        +format: html
        +---
        +
        +```{r}
        +#| label: setup
        +#| include: false
        +
        +library(tidyverse)
        +
        +smaller <- diamonds |> 
        +  filter(carat <= 2.5)
        +```
        +
        +We have data about `r nrow(diamonds)` diamonds.
        +Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats.
        +The distribution of the remainder is shown below:
        +
        +```{r}
        +#| label: plot-smaller-diamonds
        +#| echo: false
        +
        +smaller |> 
        +  ggplot(aes(x = carat)) + 
        +  geom_freqpoly(binwidth = 0.01)
        +```
        +
        +

        It contains three important types of content:

1. An (optional) YAML header surrounded by ---s.

2. Chunks of R code surrounded by ```.

3. Text mixed with simple text formatting like # heading and _italics_.

Figura 28.1 shows a .qmd document in RStudio with a notebook interface where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code.

[Figura 28.1: A Quarto document in RStudio. Code and output interleaved in the document, with the plot output appearing right underneath the code.]

        If you don’t like seeing your plots and output in your document and would rather make use of RStudio’s Console and Plot panes, you can click on the gear icon next to “Render” and switch to “Chunk Output in Console”, as shown in Figura 28.2.

[Figura 28.2: A Quarto document in RStudio with the plot output in the Plots pane.]

        To produce a complete report containing all text, code, and results, click “Render” or press Cmd/Ctrl + Shift + K. You can also do this programmatically with quarto::quarto_render("diamond-sizes.qmd"). This will display the report in the viewer pane as shown in Figura 28.3 and create an HTML file.

[Figura 28.3: A Quarto document in RStudio with the rendered document in the Viewer pane.]

When you render the document, Quarto sends the .qmd file to knitr, https://yihui.org/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, https://pandoc.org, which is responsible for creating the finished file. This process is shown in Figura 28.4. The advantage of this two-step workflow is that you can create a very wide range of output formats, as you’ll learn about in Capítulo 29.

[Figura 28.4: Diagram of Quarto workflow from qmd, to knitr, to md, to pandoc, to output in PDF, MS Word, or HTML formats.]

        To get started with your own .qmd file, select File > New File > Quarto Document… in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.

        +

        The following sections dive into the three components of a Quarto document in more details: the markdown text, the code chunks, and the YAML header.

        +

        +28.2.1 Exercises

1. Create a new Quarto document using File > New File > Quarto Document. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard shortcut. Verify that you can modify the code, re-run it, and see modified output.

2. Create one new Quarto document for each of the three built-in formats: HTML, PDF and Word. Render each of the three documents. How do the outputs differ? How do the inputs differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)

        +28.3 Visual editor

        +

        The Visual editor in RStudio provides a WYSIWYM interface for authoring Quarto documents. Under the hood, prose in Quarto documents (.qmd files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in Seção 28.4, it still requires learning new syntax. Therefore, if you’re new to computational documents like .qmd files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.

        +

        In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all ⌘ / shortcut to insert just about anything. If you are at the beginning of a line (as shown in Figura 28.5), you can also enter just / to invoke the shortcut.

[Figura 28.5: Quarto visual editor, showing text formatting options, headings, bulleted and numbered lists, links, images, tables, and the insert anything tool.]

Inserting images and customizing how they are displayed is also facilitated with the visual editor. You can either paste an image from your clipboard directly into the visual editor (and RStudio will place a copy of that image in the project directory and link to it) or you can use the visual editor’s Insert > Figure / Image menu to browse to the image you want to insert or paste its URL. In addition, using the same menu you can resize the image as well as add a caption, alternative text, and a link.

        +

        The visual editor has many more features that we haven’t enumerated here that you might find useful as you gain experience authoring with it.

        +

        Most importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.

        +

        +28.3.1 Exercises

1. Re-create the document in Figura 28.5 using the visual editor.

2. Using the visual editor, insert a code chunk using the Insert menu and then the insert anything tool.

3. Using the visual editor, figure out how to:

   a. Add a footnote.
   b. Add a horizontal rule.
   c. Add a block quote.

4. In the visual editor, go to Insert > Citation and insert a citation to the paper titled Welcome to the Tidyverse using its DOI (digital object identifier), which is 10.21105/joss.01686. Render the document and observe how the reference shows up in the document. What change do you observe in the YAML of your document?

        +28.4 Source editor

        +

You can also edit Quarto documents using the Source editor in RStudio, without the assistance of the Visual editor. While the Visual editor will feel familiar to those with experience writing in tools like Google Docs, the Source editor will feel familiar to those with experience writing R scripts or R Markdown documents. The Source editor can also be useful for debugging any Quarto syntax errors since it’s often easier to catch these in plain text.

        +

        The guide below shows how to use Pandoc’s Markdown for authoring Quarto documents in the source editor.

        +
        +
        ## Text formatting
        +
        +*italic* **bold** ~~strikeout~~ `code`
        +
        +superscript^2^ subscript~2~
        +
        +[underline]{.underline} [small caps]{.smallcaps}
        +
        +## Headings
        +
        +# 1st Level Header
        +
        +## 2nd Level Header
        +
        +### 3rd Level Header
        +
        +## Lists
        +
        +-   Bulleted list item 1
        +
        +-   Item 2
        +
        +    -   Item 2a
        +
        +    -   Item 2b
        +
        +1.  Numbered list item 1
        +
        +2.  Item 2.
        +    The numbers are incremented automatically in the output.
        +
        +## Links and images
        +
        +<http://example.com>
        +
        +[linked phrase](http://example.com)
        +
        +![optional caption text](quarto.png){fig-alt="Quarto logo and the word quarto spelled in small case letters"}
        +
        +## Tables
        +
        +| First Header | Second Header |
        +|--------------|---------------|
        +| Content Cell | Content Cell  |
        +| Content Cell | Content Cell  |
        +
        +

        The best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won’t need to think about them. If you forget, you can get to a handy reference sheet with Help > Markdown Quick Reference.

        +

        +28.4.1 Exercises

1. Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.

2. Using the source editor and the Markdown quick reference, figure out how to:

   a. Add a footnote.
   b. Add a horizontal rule.
   c. Add a block quote.

3. Copy and paste the contents of diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto into a local R Quarto document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.

4. Create a document in a Google doc or MS Word (or locate a document you have created previously) with some content in it such as headings, hyperlinks, formatted text, etc. Copy the contents of this document and paste it into a Quarto document in the visual editor. Then, switch over to the source editor and inspect the source code.

        +28.5 Code chunks

        +

        To run code inside a Quarto document, you need to insert a chunk. There are three ways to do so:

1. The keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.

2. The “Insert” button icon in the editor toolbar.

3. By manually typing the chunk delimiters ```{r} and ```.

        We’d recommend you learn the keyboard shortcut. It will save you a lot of time in the long run!

        +

        You can continue to run the code using the keyboard shortcut that by now (we hope!) you know and love: Cmd/Ctrl + Enter. However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk. Think of a chunk like a function. A chunk should be relatively self-contained, and focused around a single task.

        +

        The following sections describe the chunk header which consists of ```{r}, followed by an optional chunk label and various other chunk options, each on their own line, marked by #|.

        +

        +28.5.1 Chunk label

        +

        Chunks can be given an optional label, e.g.

        +
        +
        ```{r}
        +#| label: simple-addition
        +
        +1 + 1
        +```
        +
        #> [1] 2
        +
        +

        This has three advantages:

1. You can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:

   [Snippet of the RStudio IDE showing the drop-down code navigator with three chunks: setup, cars (in a section called Quarto), and pressure (in a section called Including plots).]

2. Graphics produced by the chunks will have useful names that make them easier to use elsewhere. More on that in Seção 28.6.

3. You can set up networks of cached chunks to avoid re-performing expensive computations on every run. More on that in Seção 28.8.

        Your chunk labels should be short but evocative and should not contain spaces. We recommend using dashes (-) to separate words (instead of underscores, _) and avoiding other special characters in chunk labels.

        +

You are generally free to label your chunk however you like, but there is one chunk name that imbues special behavior: setup. When you’re in notebook mode, the chunk named setup will be run automatically once, before any other code is run.

        +

Additionally, chunk labels must be unique; duplicated labels will cause rendering to fail.

        +

        +28.5.2 Chunk options

        +

Chunk output can be customized with options, fields supplied in the chunk header. Knitr provides almost 60 options that you can use to customize your code chunks. Here we’ll cover the most important chunk options that you’ll use frequently. You can see the full list at https://yihui.org/knitr/options.

        +

        The most important set of options controls if your code block is executed and what results are inserted in the finished report:

• eval: false prevents code from being evaluated. (And obviously if the code is not run, no results will be generated.) This is useful for displaying example code, or for disabling a large block of code without commenting each line.

• include: false runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.

• echo: false prevents the code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.

• message: false or warning: false prevents messages or warnings from appearing in the finished file.

• results: hide hides printed output; fig-show: hide hides plots.

• error: true causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your .qmd. It’s also useful if you’re teaching R and want to deliberately include an error. The default, error: false, causes rendering to fail if there is a single error in the document.

Each of these chunk options gets added to the header of the chunk, following #|, e.g., in the following chunk the result is not printed since eval is set to false.

        +
        +
        ```{r}
        +#| label: simple-multiplication
        +#| eval: false
        +
        +2 * 2
        +```
        +
        +

        The following table summarizes which types of output each option suppresses:

| Option         | Run code | Show code | Output | Plots | Messages | Warnings |
|----------------|----------|-----------|--------|-------|----------|----------|
| eval: false    | X        |           | X      | X     | X        | X        |
| include: false |          | X         | X      | X     | X        | X        |
| echo: false    |          | X         |        |       |          |          |
| results: hide  |          |           | X      |       |          |          |
| fig-show: hide |          |           |        | X     |          |          |
| message: false |          |           |        |       | X        |          |
| warning: false |          |           |        |       |          | X        |
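For instance, a report chunk aimed at non-technical readers might combine several of these options (a hypothetical example; the label and code are made up):

```{r}
#| label: mpg-summary
#| echo: false
#| message: false
#| warning: false

# The printed summary appears in the report; the code,
# messages, and warnings do not.
summary(mtcars$mpg)
```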

        +28.5.3 Global options

        +

        As you work more with knitr, you will discover that some of the default chunk options don’t fit your needs and you want to change them.

        +

You can do this by adding the preferred options in the document YAML, under execute. For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set echo: false at the document level. That will hide the code by default, showing only the chunks you deliberately choose to show (with echo: true). You might consider setting message: false and warning: false, but that would make it harder to debug problems because you wouldn’t see any messages in the final document.

        +
        title: "My report"
        +execute:
        +  echo: false
        +

Since Quarto is designed to be multi-lingual (it works with R as well as other languages like Python, Julia, etc.), not all of the knitr options are available at the document execution level, since some of them only work with knitr and not the other engines Quarto uses for running code in other languages (e.g., Jupyter). You can, however, still set these as global options for your document under the knitr field, under opts_chunk. For example, when writing books and tutorials we set:

        +
        title: "Tutorial"
        +knitr:
        +  opts_chunk:
        +    comment: "#>"
        +    collapse: true
        +

        This uses our preferred comment formatting and ensures that the code and output are kept closely entwined.

        +

        +28.5.4 Inline code

        +

        There is one other way to embed R code into a Quarto document: directly into the text, with: `r `. This can be very useful if you mention properties of your data in the text. For example, the example document used at the start of the chapter had:

        +
        +

        We have data about `r nrow(diamonds)` diamonds. Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats. The distribution of the remainder is shown below:

        +
        +

        When the report is rendered, the results of these computations are inserted into the text:

        +
        +

        We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:

        +
        +

        When inserting numbers into text, format() is your friend. It allows you to set the number of digits so you don’t print to a ridiculous degree of accuracy, and a big.mark to make numbers easier to read. You might combine these into a helper function:

        +
        +
        comma <- function(x) format(x, digits = 2, big.mark = ",")
        +comma(3452345)
        +#> [1] "3,452,345"
        +comma(.12358124331)
        +#> [1] "0.12"
        +
        +

        +28.5.5 Exercises

1. Add a section that explores how diamond sizes vary by cut, color, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting echo: false on each chunk, set a global option.

2. Download diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.

3. Modify diamond-sizes.qmd to use label_comma() to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats.

        +28.6 Figures

        +

        The figures in a Quarto document can be embedded (e.g., a PNG or JPEG file) or generated as a result of a code chunk.

        +

        To embed an image from an external file, you can use the Insert menu in the Visual Editor in RStudio and select Figure / Image. This will pop open a menu where you can browse to the image you want to insert as well as add alternative text or caption to it and adjust its size. In the visual editor you can also simply paste an image from your clipboard into your document and RStudio will place a copy of that image in your project folder.

        +

        If you include a code chunk that generates a figure (e.g., includes a ggplot() call), the resulting figure will be automatically included in your Quarto document.

        +

        +28.6.1 Figure sizing

        +

        The biggest challenge of graphics in Quarto is getting your figures the right size and shape. There are five main options that control figure sizing: fig-width, fig-height, fig-asp, out-width and out-height. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e. height, width, and aspect ratio: pick two of three).

        +

        We recommend three of the five options:

• Plots tend to be more aesthetically pleasing if they have consistent width. To enforce this, set fig-width: 6 (6”) and fig-asp: 0.618 (the golden ratio) in the defaults. Then in individual chunks, only adjust fig-asp.

• Control the output size with out-width and set it to a percentage of the body width of the output document. We suggest out-width: "70%" and fig-align: center. That gives plots room to breathe, without taking up too much space.

• To put multiple plots in a single row, set layout-ncol to 2 for two plots, 3 for three plots, etc. This effectively sets out-width to “50%” for each of your plots if layout-ncol is 2, “33%” if layout-ncol is 3, etc. Depending on what you’re trying to illustrate (e.g., show data or show plot variations), you might also tweak fig-width, as discussed below. A sketch combining these recommendations follows this list.
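The sketch below assumes ggplot2 is loaded and uses a hypothetical chunk label:

```{r}
#| label: mpg-side-by-side
#| layout-ncol: 2
#| fig-width: 4
#| fig-asp: 0.618

# Two plots in one row, each taking roughly half the body width.
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(se = FALSE)
```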

        If you find that you’re having to squint to read the text in your plot, you need to tweak fig-width. If fig-width is larger than the size the figure is rendered in the final doc, the text will be too small; if fig-width is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the fig-width and the eventual width in your document. To illustrate the principle, the following three plots have fig-width of 4, 6, and 8 respectively:

[Three scatterplots of highway mileage vs. displacement of cars, rendered with fig-width of 4, 6, and 8 respectively: as fig-width increases, the points, axis text, and labels become smaller relative to the surrounding text.]

        If you want to make sure the font size is consistent across all your figures, whenever you set out-width, you’ll also need to adjust fig-width to maintain the same ratio with your default out-width. For example, if your default fig-width is 6 and out-width is “70%”, when you set out-width: "50%" you’ll need to set fig-width to 4.3 (6 * 0.5 / 0.7).
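That arithmetic is easy to fumble, so you might wrap it in a small helper function (a sketch; this helper is ours, not part of any package):

# Rescale fig-width so text size stays consistent when out-width
# deviates from the document defaults (fig-width 6 at out-width "70%").
fig_width_for <- function(out_width, fig_width = 6, default_out_width = 0.7) {
  fig_width * out_width / default_out_width
}

fig_width_for(0.5)
#> [1] 4.285714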

        +

        Figure sizing and scaling is an art and science and getting things right can require an iterative trial-and-error approach. You can learn more about figure sizing in the taking control of plot scaling blog post.

        +

        +28.6.2 Other important options

        +

        When mingling code and text, like in this book, you can set fig-show: hold so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.

        +

        To add a caption to the plot, use fig-cap. In Quarto this will change the figure from inline to “floating”.

        +

        If you’re producing PDF output, the default graphics type is PDF. This is a good default because PDFs are high quality vector graphics. However, they can produce very large and slow plots if you are displaying thousands of points. In that case, set fig-format: "png" to force the use of PNGs. They are slightly lower quality, but will be much more compact.

        +

        It’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (e.g., if you want to quickly drop a single plot into an email).

        +

        +28.6.3 Exercises

        +
1. Open diamond-sizes.qmd in the visual editor, find an image of a diamond, copy it, and paste it into the document. Double click on the image and add a caption. Resize the image and render your document. Observe how the image is saved in your current working directory.

2. Edit the label of the code chunk in diamond-sizes.qmd that generates a plot to start with the prefix fig- and add a caption to the figure with the chunk option fig-cap. Then, edit the text above the code chunk to add a cross-reference to the figure with Insert > Cross Reference.

3. Change the size of the figure with the following chunk options, one at a time, render your document, and describe how the figure changes.

   a. fig-width: 10
   b. fig-height: 3
   c. out-width: "100%"
   d. out-width: "20%"

        +28.7 Tables

        +

        Similar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.

        +

        By default, Quarto prints data frames and matrices as you’d see them in the console:

        +
        +
        mtcars[1:5, ]
        +#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
        +#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
        +#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
        +#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
        +#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
        +#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
        +
        +

        If you prefer that data be displayed with additional formatting you can use the knitr::kable() function. The code below generates Tabela 28.1.

        +
        +
knitr::kable(mtcars[1:5, ])
        +
        +
Tabela 28.1: A knitr kable.

|                   |  mpg | cyl | disp |  hp | drat |    wt |  qsec | vs | am | gear | carb |
|-------------------|------|-----|------|-----|------|-------|-------|----|----|------|------|
| Mazda RX4         | 21.0 |   6 |  160 | 110 | 3.90 | 2.620 | 16.46 |  0 |  1 |    4 |    4 |
| Mazda RX4 Wag     | 21.0 |   6 |  160 | 110 | 3.90 | 2.875 | 17.02 |  0 |  1 |    4 |    4 |
| Datsun 710        | 22.8 |   4 |  108 |  93 | 3.85 | 2.320 | 18.61 |  1 |  1 |    4 |    1 |
| Hornet 4 Drive    | 21.4 |   6 |  258 | 110 | 3.08 | 3.215 | 19.44 |  1 |  0 |    3 |    1 |
| Hornet Sportabout | 18.7 |   8 |  360 | 175 | 3.15 | 3.440 | 17.02 |  0 |  0 |    3 |    2 |
        +
        +
        +

        Read the documentation for ?knitr::kable to see the other ways in which you can customize the table. For even deeper customization, consider the gt, huxtable, reactable, kableExtra, xtable, stargazer, pander, tables, and ascii packages. Each provides a set of tools for returning formatted tables from R code.
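For example, a minimal gt sketch of the same table (assuming the gt and tibble packages are installed; gt does not show data frame row names by default, so here we turn them into a column first):

library(gt)

mtcars[1:5, ] |>
  tibble::rownames_to_column("model") |>
  gt()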

        +

        +28.7.1 Exercises

1. Open diamond-sizes.qmd in the visual editor, insert a code chunk, and add a table with knitr::kable() that shows the first 5 rows of the diamonds data frame.

2. Display the same table with gt::gt() instead.

3. Add a chunk label that starts with the prefix tbl- and add a caption to the table with the chunk option tbl-cap. Then, edit the text above the code chunk to add a cross-reference to the table with Insert > Cross Reference.

        +28.8 Caching

        +

        Normally, each render of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is cache: true.

        +

        You can enable the Knitr cache at the document level for caching the results of all computations in a document using standard YAML options:

        +
        ---
        +title: "My Document"
        +execute: 
        +  cache: true
        +---
        +

        You can also enable caching at the chunk level for caching the results of computation in a specific chunk:

        +
        +
        ```{r}
        +#| cache: true
        +
        +# code for lengthy computation...
        +```
        +
        +

        When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.

        +

        The caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the processed_data chunk depends on the raw-data chunk:

        +
        ```{r}
        +#| label: raw-data
        +#| cache: true
        +
        +rawdata <- readr::read_csv("a_very_large_file.csv")
        +```
        +
        ```{r}
        +#| label: processed_data
        +#| cache: true
        +
        +processed_data <- rawdata |> 
        +  filter(!is.na(import_var)) |> 
        +  mutate(new_variable = complicated_transformation(x, y, z))
        +```
        +

        Caching the processed_data chunk means that it will get re-run if the dplyr pipeline is changed, but it won’t get rerun if the read_csv() call changes. You can avoid that problem with the dependson chunk option:

        +
        ```{r}
        +#| label: processed-data
        +#| cache: true
        +#| dependson: "raw-data"
        +
        +processed_data <- rawdata |> 
        +  filter(!is.na(import_var)) |> 
        +  mutate(new_variable = complicated_transformation(x, y, z))
        +```
        +

dependson should contain a character vector of every chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.

        +

Note that the chunks won’t update if a_very_large_file.csv changes, because knitr caching only tracks changes within the .qmd file. If you want to also track changes to that file you can use the cache.extra option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is file.mtime(): it returns the time at which the file was last modified. Then you can write:

        +
        ```{r}
        +#| label: raw-data
        +#| cache: true
        +#| cache.extra: !expr file.mtime("a_very_large_file.csv")
        +
        +rawdata <- readr::read_csv("a_very_large_file.csv")
        +```
        +

        We’ve followed the advice of David Robinson to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the dependson specification.

        +

        As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with knitr::clean_cache().

        +

        +28.8.1 Exercises

1. Set up a network of chunks where d depends on c and b, and both b and c depend on a. Have each chunk print lubridate::now(), set cache: true, then verify your understanding of caching.

        +28.9 Troubleshooting

        +

        Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.

        +

        One common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.

        +

        If the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu, under Run region or with the keyboard shortcut Ctrl + Alt + R. If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.

        +

If that doesn’t help, there must be something different between your interactive environment and the Quarto environment. You’re going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check that the working directory is what you expect by including getwd() in a chunk.
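For example, a throwaway diagnostic chunk might look like this (the output shown is hypothetical):

```{r}
getwd()
#> [1] "/home/user/reports"
```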

        +

        Next, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your Quarto session. The easiest way to do that is to set error: true on the chunk causing the problem, then use print() and str() to check that settings are as you expect.

        +

        +28.10 YAML header

        +

        You can control many other “whole document” settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it’s “YAML Ain’t Markup Language”, which is designed for representing hierarchical data in a way that’s easy for humans to read and write. Quarto uses it to control many details of the output. Here we’ll discuss three: self-contained documents, document parameters, and bibliographies.

        +

        +28.10.1 Self-contained

        +

        HTML documents typically have a number of external dependencies (e.g., images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a _files folder in the same directory as your .qmd file. If you publish the HTML file on a hosting platform (e.g., QuartoPub, https://quartopub.com/), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained, HTML document that embeds all of its dependencies. You can do this by specifying the embed-resources option:

        +
        format:
        +  html:
        +    embed-resources: true
        +

        The resulting file will be self-contained, such that it will need no external files and no internet access to be displayed properly by a browser.

        +

        +28.10.2 Parameters

        +

        Quarto documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the params field.

        +

        This example uses a my_class parameter to determine which class of cars to display:

        +
        +
        ---
        +format: html
        +params:
        +  my_class: "suv"
        +---
        +
        +```{r}
        +#| label: setup
        +#| include: false
        +
        +library(tidyverse)
        +
        +class <- mpg |> filter(class == params$my_class)
        +```
        +
        +# Fuel economy for `r params$my_class`s
        +
        +```{r}
        +#| message: false
        +
        +ggplot(class, aes(x = displ, y = hwy)) + 
        +  geom_point() + 
        +  geom_smooth(se = FALSE)
        +```
        +
        +

        As you can see, parameters are available within the code chunks as a read-only list named params.

        +

        You can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with !expr. This is a good way to specify date/time parameters.

        +
        params:
        +  start: !expr lubridate::ymd("2015-01-01")
        +  snapshot: !expr lubridate::ymd_hms("2015-01-01 12:30:00")
        +
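Parameters can also be supplied when rendering programmatically. A sketch using the quarto R package (the file name is hypothetical):

quarto::quarto_render(
  "fuel-economy.qmd",
  execute_params = list(my_class = "pickup")
)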

        +28.10.3 Bibliographies and Citations

        +

        Quarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.

        +

        To add a citation using the visual editor, go to Insert > Citation. Citations can be inserted from a variety of sources:

1. DOI (Digital Object Identifier) references.

2. Zotero personal or group libraries.

3. Searches of Crossref, DataCite, or PubMed.

4. Your document bibliography (a .bib file in the directory of your document).

        Under the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g., [@citation]).

        +

If you add a citation using one of the first three methods, the visual editor will automatically create a bibliography.bib file for you and add the reference to it. It will also add a bibliography field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats, including BibLaTeX, BibTeX, EndNote, and Medline.

        +

        To create a citation within your .qmd file in the source editor, use a key composed of ‘@’ + the citation identifier from the bibliography file. Then place the citation in square brackets. Here are some examples:

        +
        Separate multiple citations with a `;`: Blah blah [@smith04; @doe99].
        +
        +You can add arbitrary comments inside the square brackets: 
        +Blah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].
        +
        +Remove the square brackets to create an in-text citation: @smith04 
        +says blah, or @smith04 [p. 33] says blah.
        +
        +Add a `-` before the citation to suppress the author's name: 
        +Smith says blah [-@smith04].
        +

        When Quarto renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as # References or # Bibliography.

        +

        You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the csl field:

        +
        bibliography: rmarkdown.bib
        +csl: apa.csl
        +

As with the bibliography field, the csl field should contain a path to the file. Here we assume that the csl file is in the same directory as the .qmd file. A good place to find CSL style files for common bibliography styles is https://github.com/citation-style-language/styles.

        +

        +28.11 Workflow

        +

        Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the console, then capture what works in the script editor. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.

        +

        Quarto is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:

• Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!

• Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.

• Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.

        Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (https://colinpurrington.com/tips/lab-notebooks) to come up with the following tips:

• Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.

• Use the YAML header date field to record the date you started working on the notebook:

  date: 2016-08-23

  Use ISO8601 YYYY-MM-DD format so that there’s no ambiguity. Use it even if you don’t normally write dates that way!

• If you spend a lot of time on an analysis idea and it turns out to be a dead end, don’t delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.

• Generally, you’re better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using tibble::tribble() (see the sketch after this list).

• If you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.

• Before you finish for the day, make sure you can render the notebook. If you’re using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.

• If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you’ll need to track the versions of the packages that your code uses. A rigorous approach is to use renv, https://rstudio.github.io/renv/index.html, which stores packages in your project directory. A quick and dirty hack is to include a chunk that runs sessionInfo() — that won’t let you easily recreate your packages as they are today, but at least you’ll know what they were.

• You are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme.
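Here is the kind of tribble() layout the data-entry tip above has in mind (a sketch; the values are purely illustrative):

library(tibble)

deaths <- tribble(
  ~name,           ~occupation, ~age,
  "David Bowie",   "musician",  69,
  "Carrie Fisher", "actor",     60,
  "Richard Adams", "author",    96
)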

        +28.12 Summary

        +

In this chapter we introduced you to Quarto for authoring and publishing reproducible computational documents that include your code and your prose in one place. You’ve learned about writing Quarto documents in RStudio with the visual or the source editor, how code chunks work and how to customize their options, how to include figures and tables in your Quarto documents, and options for caching computations. Additionally, you’ve learned about adjusting YAML header options for creating self-contained or parameterized documents, as well as including citations and a bibliography. We have also given you some troubleshooting and workflow tips.

        +

        While this introduction should be sufficient to get you started with Quarto, there is still a lot more to learn. Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: https://quarto.org.

        +

        There are two important topics that we haven’t covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user friendly introduction to Git and GitHub from R users, by Jenny Bryan. The book is freely available online: https://happygitwithr.com.

        +

        We have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either Style: Lessons in Clarity and Grace by Joseph M. Williams & Joseph Bizup, or The Sense of Structure: Writing from the Reader’s Perspective by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they’re used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at https://www.georgegopen.com/the-litigation-articles.html. They are aimed at lawyers, but almost everything applies to data scientists too.

\ No newline at end of file
diff --git a/quarto/diamond-sizes-console-output.png b/quarto/diamond-sizes-console-output.png
new file mode 100644
index 000000000..35504cd02
Binary files /dev/null and b/quarto/diamond-sizes-console-output.png differ
diff --git a/quarto/diamond-sizes-notebook.png b/quarto/diamond-sizes-notebook.png
new file mode 100644
index 000000000..926973e60
Binary files /dev/null and b/quarto/diamond-sizes-notebook.png differ
diff --git a/quarto/diamond-sizes-report.png b/quarto/diamond-sizes-report.png
new file mode 100644
index 000000000..02704df2f
Binary files /dev/null and b/quarto/diamond-sizes-report.png differ
diff --git a/quarto/quarto-shiny.png b/quarto/quarto-shiny.png
new file mode 100644
index 000000000..804badad9
Binary files /dev/null and b/quarto/quarto-shiny.png differ
diff --git a/quarto/quarto-visual-editor.png b/quarto/quarto-visual-editor.png
new file mode 100644
index 000000000..69206d735
Binary files /dev/null and b/quarto/quarto-visual-editor.png differ
diff --git a/quarto_files/figure-html/unnamed-chunk-15-1.png b/quarto_files/figure-html/unnamed-chunk-15-1.png
new file mode 100644
index 000000000..b33760f73
Binary files /dev/null and b/quarto_files/figure-html/unnamed-chunk-15-1.png differ
diff --git a/quarto_files/figure-html/unnamed-chunk-16-1.png b/quarto_files/figure-html/unnamed-chunk-16-1.png
new file mode 100644
index 000000000..dff599615
Binary files /dev/null and b/quarto_files/figure-html/unnamed-chunk-16-1.png differ
diff --git a/quarto_files/figure-html/unnamed-chunk-17-1.png b/quarto_files/figure-html/unnamed-chunk-17-1.png
new file mode 100644
index 000000000..9548faf7d
Binary files /dev/null and b/quarto_files/figure-html/unnamed-chunk-17-1.png differ
diff --git a/rectangling.html b/rectangling.html
new file mode 100644
index 000000000..6fdbc3dd6
--- /dev/null
+++ b/rectangling.html
@@ -0,0 +1,1558 @@
+R para Ciência de Dados (2ª edição) - 23  Hierarchical data

        23  Hierarchical data


        +23.1 Introduction

        +

        In this chapter, you’ll learn the art of data rectangling: taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.

        +

        To learn about rectangling, you’ll need to first learn about lists, the data structure that makes hierarchical data possible. Then you’ll learn about two crucial tidyr functions: tidyr::unnest_longer() and tidyr::unnest_wider(). We’ll then show you a few case studies, applying these simple functions again and again to solve real problems. We’ll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.

        +

        +23.1.1 Prerequisites

        +

        In this chapter, we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.
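In code, the setup implied by that paragraph would load (a reconstruction based on the packages named above):

library(tidyverse)
library(repurrrsive)
library(jsonlite)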


        +23.2 Lists

        +

        So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you’ll need a list, which you create with list():

        +
        +
        x1 <- list(1:4, "a", TRUE)
        +x1
        +#> [[1]]
        +#> [1] 1 2 3 4
        +#> 
        +#> [[2]]
        +#> [1] "a"
        +#> 
        +#> [[3]]
        +#> [1] TRUE
        +
        +

        It’s often convenient to name the components, or children, of a list, which you can do in the same way as naming the columns of a tibble:

        +
        +
        x2 <- list(a = 1:2, b = 1:3, c = 1:4)
        +x2
        +#> $a
        +#> [1] 1 2
        +#> 
        +#> $b
        +#> [1] 1 2 3
        +#> 
        +#> $c
        +#> [1] 1 2 3 4
        +
        +

        Even for these very simple lists, printing takes up quite a lot of space. A useful alternative is str(), which generates a compact display of the structure, de-emphasizing the contents:

        +
        +
        str(x1)
        +#> List of 3
        +#>  $ : int [1:4] 1 2 3 4
        +#>  $ : chr "a"
        +#>  $ : logi TRUE
        +str(x2)
        +#> List of 3
        +#>  $ a: int [1:2] 1 2
        +#>  $ b: int [1:3] 1 2 3
        +#>  $ c: int [1:4] 1 2 3 4
        +
        +

        As you can see, str() displays each child of the list on its own line. It displays the name, if present, then an abbreviation of the type, then the first few values.

        +

        +23.2.1 Hierarchy

        +

        Lists can contain any type of object, including other lists. This makes them suitable for representing hierarchical (tree-like) structures:

        +
        +
        x3 <- list(list(1, 2), list(3, 4))
        +str(x3)
        +#> List of 2
        +#>  $ :List of 2
        +#>   ..$ : num 1
        +#>   ..$ : num 2
        +#>  $ :List of 2
        +#>   ..$ : num 3
        +#>   ..$ : num 4
        +
        +

        This is notably different to c(), which generates a flat vector:

        +
        +
        c(c(1, 2), c(3, 4))
        +#> [1] 1 2 3 4
        +
        +x4 <- c(list(1, 2), list(3, 4))
        +str(x4)
        +#> List of 4
        +#>  $ : num 1
        +#>  $ : num 2
        +#>  $ : num 3
        +#>  $ : num 4
        +
        +

        As lists get more complex, str() gets more useful, as it lets you see the hierarchy at a glance:

        +
        +
        x5 <- list(1, list(2, list(3, list(4, list(5)))))
        +str(x5)
        +#> List of 2
        +#>  $ : num 1
        +#>  $ :List of 2
        +#>   ..$ : num 2
        +#>   ..$ :List of 2
        +#>   .. ..$ : num 3
        +#>   .. ..$ :List of 2
        +#>   .. .. ..$ : num 4
        +#>   .. .. ..$ :List of 1
        +#>   .. .. .. ..$ : num 5
        +
        +

As lists get even larger and more complex, str() eventually starts to fail, and you’ll need to switch to View(). Figura 23.1 shows the result of calling View(x5). The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in Figura 23.2. RStudio will also show you the code you need to access that element, as in Figura 23.3. We’ll come back to how this code works in Seção 27.3.

        +
        +
        +
        +

        A screenshot of RStudio showing the list-viewer. It shows the two children of x5: the first child is a double vector and the second child is a list. A rightward facing triable indicates that the second child itself has children but you can't see them.

        +
        Figura 23.1: The RStudio view lets you interactively explore a complex list. The viewer opens showing only the top level of the list.
Figure 23.2: Clicking on the rightward facing triangle expands that component of the list so that you can also see its children.
Figure 23.3: You can repeat this operation as many times as needed to get to the data you’re interested in. Note the bottom-left corner: if you click an element of the list, RStudio will give you the subsetting code needed to access it, in this case x5[[2]][[2]][[2]].

23.2.2 List-columns

        Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to place objects in a tibble that wouldn’t usually belong in there. In particular, list-columns are used a lot in the tidymodels ecosystem, because they allow you to store things like model outputs or resamples in a data frame.


        Here’s a simple example of a list-column:

df <- tibble(
  x = 1:2, 
  y = c("a", "b"),
  z = list(list(1, 2), list(3, 4, 5))
)
df
#> # A tibble: 2 × 3
#>       x y     z         
#>   <int> <chr> <list>    
#> 1     1 a     <list [2]>
#> 2     2 b     <list [3]>

        There’s nothing special about lists in a tibble; they behave like any other column:

df |> 
  filter(x == 1)
#> # A tibble: 1 × 3
#>       x y     z         
#>   <int> <chr> <list>    
#> 1     1 a     <list [2]>

Computing with list-columns is harder, but that’s because computing with lists is harder in general; we’ll come back to that in Chapter 26. In this chapter, we’ll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.


        The default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull out just the one list-column and apply one of the techniques that you’ve learned above, like df |> pull(z) |> str() or df |> pull(z) |> View().
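
For example, here’s what the first of those looks like for the df defined above (a quick illustration; this is the display str() would give):

df |> pull(z) |> str()
#> List of 2
#>  $ :List of 2
#>   ..$ : num 1
#>   ..$ : num 2
#>  $ :List of 3
#>   ..$ : num 3
#>   ..$ : num 4
#>   ..$ : num 5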

Base R

        It’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:

data.frame(x = list(1:3, 3:5))
#>   x.1.3 x.3.5
#> 1     1     3
#> 2     2     4
#> 3     3     5

You can force data.frame() to treat a list as a list of rows by wrapping it in I(), but the result doesn’t print particularly well:

data.frame(
  x = I(list(1:2, 3:5)), 
  y = c("1, 2", "3, 4, 5")
)
#>         x       y
#> 1    1, 2    1, 2
#> 2 3, 4, 5 3, 4, 5

        It’s easier to use list-columns with tibbles because tibble() treats lists like vectors and the print method has been designed with lists in mind.

23.3 Unnesting

        Now that you’ve learned the basics of lists and list-columns, let’s explore how you can turn them back into regular rows and columns. Here we’ll use very simple sample data so you can get the basic idea; in the next section we’ll switch to real data.


        List-columns tend to come in two basic forms: named and unnamed. When the children are named, they tend to have the same names in every row. For example, in df1, every element of list-column y has two elements named a and b. Named list-columns naturally unnest into columns: each named element becomes a new named column.

df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)

        When the children are unnamed, the number of elements tends to vary from row-to-row. For example, in df2, the elements of list-column y are unnamed and vary in length from one to three. Unnamed list-columns naturally unnest into rows: you’ll get one row for each child.

df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)

        tidyr provides two functions for these two cases: unnest_wider() and unnest_longer(). The following sections explain how they work.

23.3.1 unnest_wider()

        When each row has the same number of elements with the same names, like df1, it’s natural to put each component into its own column with unnest_wider():

df1 |> 
  unnest_wider(y)
#> # A tibble: 3 × 3
#>       x     a     b
#>   <dbl> <dbl> <dbl>
#> 1     1    11    12
#> 2     2    21    22
#> 3     3    31    32

        By default, the names of the new columns come exclusively from the names of the list elements, but you can use the names_sep argument to request that they combine the column name and the element name. This is useful for disambiguating repeated names.

df1 |> 
  unnest_wider(y, names_sep = "_")
#> # A tibble: 3 × 3
#>       x   y_a   y_b
#>   <dbl> <dbl> <dbl>
#> 1     1    11    12
#> 2     2    21    22
#> 3     3    31    32

23.3.2 unnest_longer()

        When each row contains an unnamed list, it’s most natural to put each element into its own row with unnest_longer():

df2 |> 
  unnest_longer(y)
#> # A tibble: 6 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1    11
#> 2     1    12
#> 3     1    13
#> 4     2    21
#> 5     3    31
#> 6     3    32

        Note how x is duplicated for each element inside of y: we get one row of output for each element inside the list-column. But what happens if one of the elements is empty, as in the following example?

df6 <- tribble(
  ~x, ~y,
  "a", list(1, 2),
  "b", list(3),
  "c", list()
)
df6 |> unnest_longer(y)
#> # A tibble: 3 × 2
#>   x         y
#>   <chr> <dbl>
#> 1 a         1
#> 2 a         2
#> 3 b         3

The “c” row produces zero rows in the output, so it effectively disappears. If you want to preserve that row, with an NA in y, set keep_empty = TRUE, as shown below.
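
Here’s a minimal sketch of that fix (keep_empty is an existing unnest_longer() argument; the output is what we’d expect):

df6 |> unnest_longer(y, keep_empty = TRUE)
#> # A tibble: 4 × 2
#>   x         y
#>   <chr> <dbl>
#> 1 a         1
#> 2 a         2
#> 3 b         3
#> 4 c        NA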

23.3.3 Inconsistent types

        What happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column y contains two numbers, a character, and a logical, which can’t normally be mixed in a single column.

df4 <- tribble(
  ~x, ~y,
  "a", list(1),
  "b", list("a", TRUE, 5)
)

unnest_longer() always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does unnest_longer() produce four rows while keeping everything in y?

df4 |> 
  unnest_longer(y)
#> # A tibble: 4 × 2
#>   x     y        
#>   <chr> <list>   
#> 1 a     <dbl [1]>
#> 2 b     <chr [1]>
#> 3 b     <lgl [1]>
#> 4 b     <dbl [1]>

        As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because unnest_longer() can’t find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. It doesn’t: every element is a list, even though the contents are of different types.


Dealing with inconsistent types is challenging and the details depend on the precise nature of the problem and your goals, but you’ll most likely need tools from Chapter 26.

23.3.4 Other functions

        tidyr has a few other useful rectangling functions that we’re not going to cover in this book:

• unnest_auto() automatically picks between unnest_longer() and unnest_wider() based on the structure of the list-column. It’s great for rapid exploration, but ultimately it’s a bad idea because it doesn’t force you to understand how your data is structured, and makes your code harder to understand.

• unnest() expands both rows and columns. It’s useful when you have a list-column that contains a 2d structure like a data frame, which you don’t see in this book, but you might encounter if you use the tidymodels ecosystem.

        These functions are good to know about as you might encounter them when reading other people’s code or tackling rarer rectangling challenges yourself.

23.3.5 Exercises

1. What happens when you use unnest_wider() with unnamed list-columns like df2? What argument is now necessary? What happens to missing values?

2. What happens when you use unnest_longer() with named list-columns like df1? What additional information do you get in the output? How can you suppress that extra detail?

3. From time-to-time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of y and z are aligned (i.e. y and z will always have the same length within a row, and the first value of y corresponds to the first value of z). What happens if you apply two unnest_longer() calls to this data frame? How can you preserve the relationship between y and z? (Hint: carefully read the docs.)

   df4 <- tribble(
     ~x, ~y, ~z,
     "a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
     "b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
   )

23.4 Case studies

        The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to unnest_longer() and/or unnest_wider(). To show that in action, this section works through three real rectangling challenges using datasets from the repurrrsive package.

23.4.1 Very wide data

        We’ll start with gh_repos. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; we recommend exploring a little on your own with View(gh_repos) before we continue.


        gh_repos is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We call this column json for reasons we’ll get to later.

repos <- tibble(json = gh_repos)
repos
#> # A tibble: 6 × 1
#>   json       
#>   <list>     
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>

This tibble contains 6 rows, one row for each child of gh_repos. Each row contains an unnamed list with either 26 or 30 elements. Since these are unnamed, we’ll start with unnest_longer() to put each child in its own row:

repos |> 
  unnest_longer(json)
#> # A tibble: 176 × 1
#>   json             
#>   <list>           
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> # ℹ 170 more rows

        At first glance, it might seem like we haven’t improved the situation: while we have more rows (176 instead of 6) each element of json is still a list. However, there’s an important difference: now each element is a named list so we can use unnest_wider() to put each element into its own column:

repos |> 
  unnest_longer(json) |> 
  unnest_wider(json) 
#> # A tibble: 176 × 68
#>         id name        full_name         owner        private html_url       
#>      <int> <chr>       <chr>             <list>       <lgl>   <chr>          
#> 1 61160198 after       gaborcsardi/after <named list> FALSE   https://github…
#> 2 40500181 argufy      gaborcsardi/argu… <named list> FALSE   https://github…
#> 3 36442442 ask         gaborcsardi/ask   <named list> FALSE   https://github…
#> 4 34924886 baseimports gaborcsardi/base… <named list> FALSE   https://github…
#> 5 61620661 citest      gaborcsardi/cite… <named list> FALSE   https://github…
#> 6 33907457 clisymbols  gaborcsardi/clis… <named list> FALSE   https://github…
#> # ℹ 170 more rows
#> # ℹ 62 more variables: description <chr>, fork <lgl>, url <chr>, …

        This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with names(); and here we look at the first 10:

repos |> 
  unnest_longer(json) |> 
  unnest_wider(json) |> 
  names() |> 
  head(10)
#>  [1] "id"          "name"        "full_name"   "owner"       "private"    
#>  [6] "html_url"    "description" "fork"        "url"         "forks_url"

        Let’s pull out a few that look interesting:

repos |> 
  unnest_longer(json) |> 
  unnest_wider(json) |> 
  select(id, full_name, owner, description)
#> # A tibble: 176 × 4
#>         id full_name               owner             description             
#>      <int> <chr>                   <list>            <chr>                   
#> 1 61160198 gaborcsardi/after       <named list [17]> Run Code in the Backgro…
#> 2 40500181 gaborcsardi/argufy      <named list [17]> Declarative function ar…
#> 3 36442442 gaborcsardi/ask         <named list [17]> Friendly CLI interactio…
#> 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for …
#> 5 61620661 gaborcsardi/citest      <named list [17]> Test R package and repo…
#> 6 33907457 gaborcsardi/clisymbols  <named list [17]> Unicode symbols for CLI…
#> # ℹ 170 more rows

        You can use this to work back to understand how gh_repos was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.


        owner is another list-column, and since it contains a named list, we can use unnest_wider() to get at the values:

repos |> 
  unnest_longer(json) |> 
  unnest_wider(json) |> 
  select(id, full_name, owner, description) |> 
  unnest_wider(owner)
#> Error in `unnest_wider()`:
#> ! Can't duplicate names between the affected columns and the original
#>   data.
#> ✖ These names are duplicated:
#>   ℹ `id`, from `owner`.
#> ℹ Use `names_sep` to disambiguate using the column name.
#> ℹ Or use `names_repair` to specify a repair strategy.

Uh oh, this list column also contains an id column and we can’t have two id columns in the same data frame. As suggested, let’s use names_sep to resolve the problem:

repos |> 
  unnest_longer(json) |> 
  unnest_wider(json) |> 
  select(id, full_name, owner, description) |> 
  unnest_wider(owner, names_sep = "_")
#> # A tibble: 176 × 20
#>         id full_name               owner_login owner_id owner_avatar_url     
#>      <int> <chr>                   <chr>          <int> <chr>                
#> 1 61160198 gaborcsardi/after       gaborcsardi   660288 https://avatars.gith…
#> 2 40500181 gaborcsardi/argufy      gaborcsardi   660288 https://avatars.gith…
#> 3 36442442 gaborcsardi/ask         gaborcsardi   660288 https://avatars.gith…
#> 4 34924886 gaborcsardi/baseimports gaborcsardi   660288 https://avatars.gith…
#> 5 61620661 gaborcsardi/citest      gaborcsardi   660288 https://avatars.gith…
#> 6 33907457 gaborcsardi/clisymbols  gaborcsardi   660288 https://avatars.gith…
#> # ℹ 170 more rows
#> # ℹ 15 more variables: owner_gravatar_id <chr>, owner_url <chr>, …

        This gives another wide dataset, but you can get the sense that owner appears to contain a lot of additional data about the person who “owns” the repository.

23.4.2 Relational data

        Nested data is sometimes used to represent data that we’d usually spread across multiple data frames. For example, take got_chars which contains data about characters that appear in the Game of Thrones books and TV series. Like gh_repos it’s a list, so we start by turning it into a list-column of a tibble:

chars <- tibble(json = got_chars)
chars
#> # A tibble: 30 × 1
#>   json             
#>   <list>           
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> # ℹ 24 more rows

        The json column contains named elements, so we’ll start by widening it:

chars |> 
  unnest_wider(json)
#> # A tibble: 30 × 18
#>   url                    id name            gender culture    born           
#>   <chr>               <int> <chr>           <chr>  <chr>      <chr>          
#> 1 https://www.anapio…  1022 Theon Greyjoy   Male   "Ironborn" "In 278 AC or …
#> 2 https://www.anapio…  1052 Tyrion Lannist… Male   ""         "In 273 AC, at…
#> 3 https://www.anapio…  1074 Victarion Grey… Male   "Ironborn" "In 268 AC or …
#> 4 https://www.anapio…  1109 Will            Male   ""         ""             
#> 5 https://www.anapio…  1166 Areo Hotah      Male   "Norvoshi" "In 257 AC or …
#> 6 https://www.anapio…  1267 Chett           Male   ""         "At Hag's Mire"
#> # ℹ 24 more rows
#> # ℹ 12 more variables: died <chr>, alive <lgl>, titles <list>, …

        And selecting a few columns to make it easier to read:

characters <- chars |> 
  unnest_wider(json) |> 
  select(id, name, gender, culture, born, died, alive)
characters
#> # A tibble: 30 × 7
#>      id name              gender culture    born              died           
#>   <int> <chr>             <chr>  <chr>      <chr>             <chr>          
#> 1  1022 Theon Greyjoy     Male   "Ironborn" "In 278 AC or 27… ""             
#> 2  1052 Tyrion Lannister  Male   ""         "In 273 AC, at C… ""             
#> 3  1074 Victarion Greyjoy Male   "Ironborn" "In 268 AC or be… ""             
#> 4  1109 Will              Male   ""         ""                "In 297 AC, at…
#> 5  1166 Areo Hotah        Male   "Norvoshi" "In 257 AC or be… ""             
#> 6  1267 Chett             Male   ""         "At Hag's Mire"   "In 299 AC, at…
#> # ℹ 24 more rows
#> # ℹ 1 more variable: alive <lgl>

This dataset also contains many list-columns:

chars |> 
  unnest_wider(json) |> 
  select(id, where(is.list))
#> # A tibble: 30 × 8
#>      id titles    aliases    allegiances books     povBooks tvSeries playedBy
#>   <int> <list>    <list>     <list>      <list>    <list>   <list>   <list>  
#> 1  1022 <chr [2]> <chr [4]>  <chr [1]>   <chr [3]> <chr>    <chr>    <chr>   
#> 2  1052 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr>    <chr>    <chr>   
#> 3  1074 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr>    <chr>    <chr>   
#> 4  1109 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr>    <chr>    <chr>   
#> 5  1166 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr>    <chr>    <chr>   
#> 6  1267 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr>    <chr>    <chr>   
#> # ℹ 24 more rows

        Let’s explore the titles column. It’s an unnamed list-column, so we’ll unnest it into rows:

chars |> 
  unnest_wider(json) |> 
  select(id, titles) |> 
  unnest_longer(titles)
#> # A tibble: 59 × 2
#>      id titles                                              
#>   <int> <chr>                                               
#> 1  1022 Prince of Winterfell                                
#> 2  1022 Lord of the Iron Islands (by law of the green lands)
#> 3  1052 Acting Hand of the King (former)                    
#> 4  1052 Master of Coin (former)                             
#> 5  1074 Lord Captain of the Iron Fleet                      
#> 6  1074 Master of the Iron Victory                          
#> # ℹ 53 more rows

You might expect to see this data in its own table because it would be easy to join to the characters data as needed. Let’s do that, which requires a little cleaning: removing the rows containing empty strings and renaming titles to title since each row now only contains a single title.

titles <- chars |> 
  unnest_wider(json) |> 
  select(id, titles) |> 
  unnest_longer(titles) |> 
  filter(titles != "") |> 
  rename(title = titles)
titles
#> # A tibble: 52 × 2
#>      id title                                               
#>   <int> <chr>                                               
#> 1  1022 Prince of Winterfell                                
#> 2  1022 Lord of the Iron Islands (by law of the green lands)
#> 3  1052 Acting Hand of the King (former)                    
#> 4  1052 Master of Coin (former)                             
#> 5  1074 Lord Captain of the Iron Fleet                      
#> 6  1074 Master of the Iron Victory                          
#> # ℹ 46 more rows

        You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
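
For example, a quick sketch of such a join, using dplyr’s left_join() (characters without any titles would get an NA title):

characters |> 
  # one output row per character-title pair
  left_join(titles, by = "id") |> 
  select(name, title)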

23.4.3 Deeply nested

        We’ll finish off these case studies with a list-column that’s very deeply nested and requires repeated rounds of unnest_wider() and unnest_longer() to unravel: gmaps_cities. This is a two column tibble containing five city names and the results of using Google’s geocoding API to determine their location:

gmaps_cities
#> # A tibble: 5 × 2
#>   city       json            
#>   <chr>      <list>          
#> 1 Houston    <named list [2]>
#> 2 Washington <named list [2]>
#> 3 New York   <named list [2]>
#> 4 Chicago    <named list [2]>
#> 5 Arlington  <named list [2]>

        json is a list-column with internal names, so we start with an unnest_wider():

gmaps_cities |> 
  unnest_wider(json)
#> # A tibble: 5 × 3
#>   city       results    status
#>   <chr>      <list>     <chr> 
#> 1 Houston    <list [1]> OK    
#> 2 Washington <list [2]> OK    
#> 3 New York   <list [1]> OK    
#> 4 Chicago    <list [1]> OK    
#> 5 Arlington  <list [2]> OK

        This gives us the status and the results. We’ll drop the status column since they’re all OK; in a real analysis, you’d also want to capture all the rows where status != "OK" and figure out what went wrong. results is an unnamed list, with either one or two elements (we’ll see why shortly) so we’ll unnest it into rows:
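
As a quick sanity check, you could filter for the problem rows; here it returns an empty tibble, since every status is OK:

gmaps_cities |> 
  unnest_wider(json) |> 
  filter(status != "OK")
#> # A tibble: 0 × 3
#> # ℹ 3 variables: city <chr>, results <list>, status <chr>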

gmaps_cities |> 
  unnest_wider(json) |> 
  select(-status) |> 
  unnest_longer(results)
#> # A tibble: 7 × 2
#>   city       results         
#>   <chr>      <list>          
#> 1 Houston    <named list [5]>
#> 2 Washington <named list [5]>
#> 3 Washington <named list [5]>
#> 4 New York   <named list [5]>
#> 5 Chicago    <named list [5]>
#> 6 Arlington  <named list [5]>
#> # ℹ 1 more row

        Now results is a named list, so we’ll use unnest_wider():

locations <- gmaps_cities |> 
  unnest_wider(json) |> 
  select(-status) |> 
  unnest_longer(results) |> 
  unnest_wider(results)
locations
#> # A tibble: 7 × 6
#>   city       address_components formatted_address   geometry        
#>   <chr>      <list>             <chr>               <list>          
#> 1 Houston    <list [4]>         Houston, TX, USA    <named list [4]>
#> 2 Washington <list [2]>         Washington, USA     <named list [4]>
#> 3 Washington <list [4]>         Washington, DC, USA <named list [4]>
#> 4 New York   <list [3]>         New York, NY, USA   <named list [4]>
#> 5 Chicago    <list [4]>         Chicago, IL, USA    <named list [4]>
#> 6 Arlington  <list [4]>         Arlington, TX, USA  <named list [4]>
#> # ℹ 1 more row
#> # ℹ 2 more variables: place_id <chr>, types <list>

        Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.


        There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the geometry list-column:

locations |> 
  select(city, formatted_address, geometry) |> 
  unnest_wider(geometry)
#> # A tibble: 7 × 6
#>   city       formatted_address   bounds           location     location_type
#>   <chr>      <chr>               <list>           <list>       <chr>        
#> 1 Houston    Houston, TX, USA    <named list [2]> <named list> APPROXIMATE  
#> 2 Washington Washington, USA     <named list [2]> <named list> APPROXIMATE  
#> 3 Washington Washington, DC, USA <named list [2]> <named list> APPROXIMATE  
#> 4 New York   New York, NY, USA   <named list [2]> <named list> APPROXIMATE  
#> 5 Chicago    Chicago, IL, USA    <named list [2]> <named list> APPROXIMATE  
#> 6 Arlington  Arlington, TX, USA  <named list [2]> <named list> APPROXIMATE  
#> # ℹ 1 more row
#> # ℹ 1 more variable: viewport <list>

        That gives us new bounds (a rectangular region) and location (a point). We can unnest location to see the latitude (lat) and longitude (lng):

locations |> 
  select(city, formatted_address, geometry) |> 
  unnest_wider(geometry) |> 
  unnest_wider(location)
#> # A tibble: 7 × 7
#>   city       formatted_address   bounds             lat    lng location_type
#>   <chr>      <chr>               <list>           <dbl>  <dbl> <chr>        
#> 1 Houston    Houston, TX, USA    <named list [2]>  29.8  -95.4 APPROXIMATE  
#> 2 Washington Washington, USA     <named list [2]>  47.8 -121.  APPROXIMATE  
#> 3 Washington Washington, DC, USA <named list [2]>  38.9  -77.0 APPROXIMATE  
#> 4 New York   New York, NY, USA   <named list [2]>  40.7  -74.0 APPROXIMATE  
#> 5 Chicago    Chicago, IL, USA    <named list [2]>  41.9  -87.6 APPROXIMATE  
#> 6 Arlington  Arlington, TX, USA  <named list [2]>  32.7  -97.1 APPROXIMATE  
#> # ℹ 1 more row
#> # ℹ 1 more variable: viewport <list>

        Extracting the bounds requires a few more steps:

locations |> 
  select(city, formatted_address, geometry) |> 
  unnest_wider(geometry) |> 
  # focus on the variables of interest
  select(!location:viewport) |>
  unnest_wider(bounds)
#> # A tibble: 7 × 4
#>   city       formatted_address   northeast        southwest       
#>   <chr>      <chr>               <list>           <list>          
#> 1 Houston    Houston, TX, USA    <named list [2]> <named list [2]>
#> 2 Washington Washington, USA     <named list [2]> <named list [2]>
#> 3 Washington Washington, DC, USA <named list [2]> <named list [2]>
#> 4 New York   New York, NY, USA   <named list [2]> <named list [2]>
#> 5 Chicago    Chicago, IL, USA    <named list [2]> <named list [2]>
#> 6 Arlington  Arlington, TX, USA  <named list [2]> <named list [2]>
#> # ℹ 1 more row

        We then rename southwest and northeast (the corners of the rectangle) so we can use names_sep to create short but evocative names:

locations |> 
  select(city, formatted_address, geometry) |> 
  unnest_wider(geometry) |> 
  select(!location:viewport) |>
  unnest_wider(bounds) |> 
  rename(ne = northeast, sw = southwest) |> 
  unnest_wider(c(ne, sw), names_sep = "_") 
#> # A tibble: 7 × 6
#>   city       formatted_address   ne_lat ne_lng sw_lat sw_lng
#>   <chr>      <chr>                <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Houston    Houston, TX, USA      30.1  -95.0   29.5  -95.8
#> 2 Washington Washington, USA       49.0 -117.    45.5 -125. 
#> 3 Washington Washington, DC, USA   39.0  -76.9   38.8  -77.1
#> 4 New York   New York, NY, USA     40.9  -73.7   40.5  -74.3
#> 5 Chicago    Chicago, IL, USA      42.0  -87.5   41.6  -87.9
#> 6 Arlington  Arlington, TX, USA    32.8  -97.0   32.6  -97.2
#> # ℹ 1 more row

        Note how we unnest two columns simultaneously by supplying a vector of variable names to unnest_wider().


        Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using another tidyr function, hoist():

locations |> 
  select(city, formatted_address, geometry) |> 
  hoist(
    geometry,
    ne_lat = c("bounds", "northeast", "lat"),
    sw_lat = c("bounds", "southwest", "lat"),
    ne_lng = c("bounds", "northeast", "lng"),
    sw_lng = c("bounds", "southwest", "lng"),
  )

        If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in vignette("rectangling", package = "tidyr").

23.4.4 Exercises

1. Roughly estimate when gh_repos was created. Why can you only roughly estimate the date?

2. The owner column of gh_repos contains a lot of duplicated information because each owner can have many repos. Can you construct an owners data frame that contains one row for each owner? (Hint: does distinct() work with list-cols?)

3. Follow the steps used for titles to create similar tables for the aliases, allegiances, books, and TV series for the Game of Thrones characters.

4. Explain the following code line-by-line. Why is it interesting? Why does it work for got_chars but might not work in general?

   tibble(json = got_chars) |> 
     unnest_wider(json) |> 
     select(id, where(is.list)) |> 
     pivot_longer(
       where(is.list), 
       names_to = "name", 
       values_to = "value"
     ) |>  
     unnest_longer(value)

5. In gmaps_cities, what does address_components contain? Why does the length vary between rows? Unnest it appropriately to figure it out. (Hint: types always appears to contain two elements. Does unnest_wider() make it easier to work with than unnest_longer()?)

23.5 JSON

All of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for JavaScript Object Notation and is the way that most web APIs return data. It’s important to understand it because while JSON and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so it’s good to understand a bit about JSON if things go wrong.

23.5.1 Data types

        JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:

• The simplest type is a null (null) which plays the same role as NA in R. It represents the absence of data.

• A string is much like a string in R, but must always use double quotes.

• A number is similar to R’s numbers: they can use integer (e.g., 123), decimal (e.g., 123.45), or scientific (e.g., 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.

• A boolean is similar to R’s TRUE and FALSE, but uses lowercase true and false.

        JSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.


        Both arrays and objects are similar to lists in R; the difference is whether or not they’re named. An array is like an unnamed list, and is written with []. For example [1, 2, 3] is an array containing 3 numbers, and [null, 1, "string", false] is an array that contains a null, a number, a string, and a boolean. An object is like a named list, and is written with {}. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, {"x": 1, "y": 2} is an object that maps x to 1 and y to 2.


Note that JSON doesn’t have any native way to represent dates or date-times, so they’re often stored as strings, and you’ll need to use readr::parse_date() or readr::parse_datetime() to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply readr::parse_double() as needed to get the correct variable type.
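
For example (a quick illustration; the input strings are made up):

readr::parse_date("2022-05-01")
#> [1] "2022-05-01"
readr::parse_double("1.23")
#> [1] 1.23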

23.5.2 jsonlite

To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We’ll use only two jsonlite functions: read_json() and parse_json(). In real life, you’ll use read_json() to read a JSON file from disk. For example, the repurrrsive package also provides the source for gh_users as a JSON file and you can read it with read_json():

# A path to a json file inside the package:
gh_users_json()
#> [1] "/home/runner/work/_temp/Library/repurrrsive/extdata/gh_users.json"

# Read it with read_json()
gh_users2 <- read_json(gh_users_json())

# Check it's the same as the data we were using previously
identical(gh_users, gh_users2)
#> [1] TRUE

        In this book, we’ll also use parse_json(), since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:

str(parse_json('1'))
#>  int 1
str(parse_json('[1, 2, 3]'))
#> List of 3
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
str(parse_json('{"x": [1, 2, 3]}'))
#> List of 1
#>  $ x:List of 3
#>   ..$ : int 1
#>   ..$ : int 2
#>   ..$ : int 3

        jsonlite has another important function called fromJSON(). We don’t use it here because it performs automatic simplification (simplifyVector = TRUE). This often works well, particularly in simple cases, but we think you’re better off doing the rectangling yourself so you know exactly what’s happening and can more easily handle the most complicated nested structures.
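
To see the simplification in action, compare fromJSON() with the parse_json() output above (a quick illustration):

# fromJSON() collapses the JSON array into an atomic vector
jsonlite::fromJSON('[1, 2, 3]')
#> [1] 1 2 3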

23.5.3 Starting the rectangling process

        In most cases, JSON files contain a single top-level array, because they’re designed to provide data about multiple “things”, e.g., multiple pages, or multiple records, or multiple results. In this case, you’ll start your rectangling with tibble(json) so that each element becomes a row:

json <- '[
  {"name": "John", "age": 34},
  {"name": "Susan", "age": 27}
]'
df <- tibble(json = parse_json(json))
df
#> # A tibble: 2 × 1
#>   json            
#>   <list>          
#> 1 <named list [2]>
#> 2 <named list [2]>

df |> 
  unnest_wider(json)
#> # A tibble: 2 × 2
#>   name    age
#>   <chr> <int>
#> 1 John     34
#> 2 Susan    27

        In rarer cases, the JSON file consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.

json <- '{
  "status": "OK", 
  "results": [
    {"name": "John", "age": 34},
    {"name": "Susan", "age": 27}
 ]
}
'
df <- tibble(json = list(parse_json(json)))
df
#> # A tibble: 1 × 1
#>   json            
#>   <list>          
#> 1 <named list [2]>

df |> 
  unnest_wider(json) |> 
  unnest_longer(results) |> 
  unnest_wider(results)
#> # A tibble: 2 × 3
#>   status name    age
#>   <chr>  <chr> <int>
#> 1 OK     John     34
#> 2 OK     Susan    27

        Alternatively, you can reach inside the parsed JSON and start with the bit that you actually care about:

df <- tibble(results = parse_json(json)$results)
df |> 
  unnest_wider(results)
#> # A tibble: 2 × 2
#>   name    age
#>   <chr> <int>
#> 1 John     34
#> 2 Susan    27

23.5.4 Exercises

1. Rectangle the df_col and df_row below. They represent the two ways of encoding a data frame in JSON.

   json_col <- parse_json('
     {
       "x": ["a", "x", "z"],
       "y": [10, null, 3]
     }
   ')
   json_row <- parse_json('
     [
       {"x": "a", "y": 10},
       {"x": "x", "y": null},
       {"x": "z", "y": 3}
     ]
   ')

   df_col <- tibble(json = list(json_col)) 
   df_row <- tibble(json = json_row)

23.6 Summary

In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: unnest_longer() to put list elements into rows and unnest_wider() to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.


        JSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages.

1. This is an RStudio feature.

        15  Regular expressions

15.1 Introduction

In Chapter 14, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use regular expressions, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”1 or “regexp”.


        The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.

15.1.1 Prerequisites

        In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.


        Through this chapter, we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:

• fruit contains the names of 80 fruits.

• words contains 980 common English words.

• sentences contains 720 short sentences.

15.2 Pattern basics

        We’ll use str_view() to learn how regex patterns work. We used str_view() in the last chapter to better understand a string vs. its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, str_view() will show only the elements of the string vector that match, surrounding each match with <>, and, where possible, highlighting the match in blue.


        The simplest patterns consist of letters and numbers which match those characters exactly:

str_view(fruit, "berry")
#>  [6] │ bil<berry>
#>  [7] │ black<berry>
#> [10] │ blue<berry>
#> [11] │ boysen<berry>
#> [19] │ cloud<berry>
#> [21] │ cran<berry>
#> ... and 8 more

Letters and numbers match exactly and are called literal characters. Most punctuation characters, like ., +, *, [, ], and ?, have special meanings2 and are called metacharacters. For example, . will match any character3, so "a." will match any string that contains an “a” followed by another character:

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
#> [2] │ <ab>
#> [3] │ <ae>
#> [6] │ e<ab>

        Or we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”:

str_view(fruit, "a...e")
#>  [1] │ <apple>
#>  [7] │ bl<ackbe>rry
#> [48] │ mand<arine>
#> [51] │ nect<arine>
#> [62] │ pine<apple>
#> [64] │ pomegr<anate>
#> ... and 2 more

        Quantifiers control how many times a pattern can match:

• ? makes a pattern optional (i.e. it matches 0 or 1 times)

• + lets a pattern repeat (i.e. it matches at least once)

• * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).

# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")
#> [1] │ <a>
#> [2] │ <ab>
#> [3] │ <ab>b

# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")
#> [2] │ <ab>
#> [3] │ <abb>

# ab* matches an "a", followed by any number of "b"s.
str_view(c("a", "ab", "abb"), "ab*")
#> [1] │ <a>
#> [2] │ <ab>
#> [3] │ <abb>

        Character classes are defined by [] and let you match a set of characters, e.g., [abcd] matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”. We can use this idea to find the words containing an “x” surrounded by vowels, or a “y” surrounded by consonants:

str_view(words, "[aeiou]x[aeiou]")
#> [284] │ <exa>ct
#> [285] │ <exa>mple
#> [288] │ <exe>rcise
#> [289] │ <exi>st
str_view(words, "[^aeiou]y[^aeiou]")
#> [836] │ <sys>tem
#> [901] │ <typ>e

        You can use alternation, |, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “melon”, or “nut”, or a repeated vowel.

str_view(fruit, "apple|melon|nut")
#>  [1] │ <apple>
#> [13] │ canary <melon>
#> [20] │ coco<nut>
#> [52] │ <nut>
#> [62] │ pine<apple>
#> [72] │ rock <melon>
#> ... and 1 more
str_view(fruit, "aa|ee|ii|oo|uu")
#>  [9] │ bl<oo>d orange
#> [33] │ g<oo>seberry
#> [47] │ lych<ee>
#> [66] │ purple mangost<ee>n

        Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.

15.3 Key functions

        Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.

15.3.1 Detect matches

        str_detect() returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE otherwise:

str_detect(c("a", "b", "c"), "[aeiou]")
#> [1]  TRUE FALSE FALSE

        Since str_detect() returns a logical vector of the same length as the initial vector, it pairs well with filter(). For example, this code finds all the most popular names containing a lower-case “x”:

babynames |> 
  filter(str_detect(name, "x")) |> 
  count(name, wt = n, sort = TRUE)
#> # A tibble: 974 × 2
#>   name           n
#>   <chr>      <int>
#> 1 Alexander 665492
#> 2 Alexis    399551
#> 3 Alex      278705
#> 4 Alexandra 232223
#> 5 Max       148787
#> 6 Alexa     123032
#> # ℹ 968 more rows

        We can also use str_detect() with summarize() by pairing it with sum() or mean(): sum(str_detect(x, pattern)) tells you the number of observations that match and mean(str_detect(x, pattern)) tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names4 that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!

babynames |> 
  group_by(year) |> 
  summarize(prop_x = mean(str_detect(name, "x"))) |> 
  ggplot(aes(x = year, y = prop_x)) + 
  geom_line()

        A time series showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019.


        There are two functions that are closely related to str_detect(): str_subset() and str_which(). str_subset() returns a character vector containing only the strings that match. str_which() returns an integer vector giving the positions of the strings that match.
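
For example (a quick illustration):

str_subset(c("apple", "banana", "pear"), "p")
#> [1] "apple" "pear"
str_which(c("apple", "banana", "pear"), "p")
#> [1] 1 3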

15.3.2 Count matches

        The next step up in complexity from str_detect() is str_count(): rather than a true or false, it tells you how many matches there are in each string.

x <- c("apple", "banana", "pear")
str_count(x, "p")
#> [1] 2 0 1

        Note that each match starts at the end of the previous match, i.e. regex matches never overlap. For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three:

str_count("abababa", "aba")
#> [1] 2
str_view("abababa", "aba")
#> [1] │ <aba>b<aba>

        It’s natural to use str_count() with mutate(). The following example uses str_count() with character classes to count the number of vowels and consonants in each name.

babynames |> 
  count(name) |> 
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )
#> # A tibble: 97,310 × 4
#>   name          n vowels consonants
#>   <chr>     <int>  <int>      <int>
#> 1 Aaban        10      2          3
#> 2 Aabha         5      2          3
#> 3 Aabid         2      2          3
#> 4 Aabir         1      2          3
#> 5 Aabriella     5      4          5
#> 6 Aada          1      2          2
#> # ℹ 97,304 more rows

        If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:

• Add the upper case vowels to the character class: str_count(name, "[aeiouAEIOU]").

• Tell the regular expression to ignore case: str_count(name, regex("[aeiou]", ignore_case = TRUE)). We’ll talk about this more in Section 15.5.1.

• Use str_to_lower() to convert the names to lower case: str_count(str_to_lower(name), "[aeiou]").

        This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.


        In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:

babynames |> 
  count(name) |> 
  mutate(
    name = str_to_lower(name),
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )
#> # A tibble: 97,310 × 4
#>   name          n vowels consonants
#>   <chr>     <int>  <int>      <int>
#> 1 aaban        10      3          2
#> 2 aabha         5      3          2
#> 3 aabid         2      3          2
#> 4 aabir         1      3          2
#> 5 aabriella     5      5          4
#> 6 aada          1      3          1
#> # ℹ 97,304 more rows

15.3.3 Replace values

        As well as detecting and counting matches, we can also modify them with str_replace() and str_replace_all(). str_replace() replaces the first match, and as the name suggests, str_replace_all() replaces all matches.

x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

str_remove() and str_remove_all() are handy shortcuts for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""):

x <- c("apple", "pear", "banana")
str_remove_all(x, "[aeiou]")
#> [1] "ppl" "pr"  "bnn"

        These functions are naturally paired with mutate() when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.
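
For instance, here’s a minimal sketch of that pattern on some made-up messy data (the tibble and its contents are hypothetical):

# made-up example data with inconsistent phone formatting
dirty <- tibble(phone = c("555-1234", "(555) 5678"))
dirty |> 
  mutate(phone = str_remove_all(phone, "[^0-9]"))
#> # A tibble: 2 × 1
#>   phone  
#>   <chr>  
#> 1 5551234
#> 2 5555678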

15.3.4 Extract variables

The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex(). It’s a peer of the separate_wider_position() and separate_wider_delim() functions that you learned about in Section 14.4.2. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.


        Let’s create a simple dataset to show how it works. Here we have some data derived from babynames where we have the name, gender, and age of a bunch of people in a rather weird format5:

df <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F_45", 
  "<Brandon>-N_33",
  "<Sharon>-F_38", 
  "<Penny>-F_58",
  "<Justin>-M_41", 
  "<Patricia>-F_84", 
)

        To extract this data using separate_wider_regex() we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:

df |> 
  separate_wider_regex(
    str,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  )
#> # A tibble: 7 × 3
#>   name    gender age  
#>   <chr>   <chr>  <chr>
#> 1 Sheryl  F      34   
#> 2 Kisha   F      45   
#> 3 Brandon N      33   
#> 4 Sharon  F      38   
#> 5 Penny   F      58   
#> 6 Justin  M      41   
#> # ℹ 1 more row

If the match fails, you can use too_few = "debug" to figure out what went wrong, just like separate_wider_delim() and separate_wider_position().
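
As a quick sketch of what that looks like (the second row here is a deliberately malformed, made-up value), too_few = "debug" adds diagnostic columns instead of erroring, so you can see which rows failed to match:

df2 <- tribble(
  ~str,
  "<Ana>-F_30",
  "Mia-F_22"  # missing the <> wrapper, so the pattern can't match
)
df2 |> 
  separate_wider_regex(
    str,
    patterns = c("<", name = "[A-Za-z]+", ">-", gender = ".", "_", age = "[0-9]+"),
    too_few = "debug"
  )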


15.3.5 Exercises

1. What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

2. Replace all forward slashes in "a/b/c/d/e" with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We'll discuss the problem very soon.)

3. Implement a simple version of str_to_lower() using str_replace_all().

4. Create a regular expression that will match telephone numbers as commonly written in your country.

15.4 Pattern details

        Now that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it’s time to dig into more of the details. First, we’ll start with escaping, which allows you to match metacharacters that would otherwise be treated specially. Next, you’ll learn about anchors which allow you to match the start or end of the string. Then, you’ll learn more about character classes and their shortcuts which allow you to match any character from a set. Next, you’ll learn the final details of quantifiers which control how many times a pattern can match. Then, we have to cover the important (but complex) topic of operator precedence and parentheses. And we’ll finish off with some details of grouping components of the pattern.


        The terms we use here are the technical names for each component. They’re not always the most evocative of their purpose, but it’s very helpful to know the correct terms if you later want to Google for more details.


15.4.1 Escaping

        In order to match a literal ., you need an escape which tells the regular expression to match metacharacters6 literally. Like strings, regexps use the backslash for escaping. So, to match a ., you need the regexp \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.", as the following example shows.

# To create the regular expression \., we need to use \\.
dot <- "\\."

# But the expression itself only contains one \
str_view(dot)
#> [1] │ \.

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
#> [2] │ <a.c>

In this book, we'll usually write regular expressions without quotes, like \.. If we need to emphasize what you'll actually type, we'll surround it with quotes and add extra escapes, like "\\.".


        If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" — you need four backslashes to match one!

x <- "a\\b"
str_view(x)
#> [1] │ a\b
str_view(x, "\\\\")
#> [1] │ a<\>b

Alternatively, you might find it easier to use the raw strings you learned about in Section 14.2.2. That lets you avoid one layer of escaping:

str_view(x, r"{\\}")
#> [1] │ a<\>b

        If you’re trying to match a literal ., $, |, *, +, ?, {, }, (, ), there’s an alternative to using a backslash escape: you can use a character class: [.], [$], [|], ... all match the literal values.

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
#> [2] │ <a.c>
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
#> [3] │ <a*c>

15.4.2 Anchors

        By default, regular expressions will match any part of a string. If you want to match at the start or end you need to anchor the regular expression using ^ to match the start or $ to match the end:

str_view(fruit, "^a")
#> [1] │ <a>pple
#> [2] │ <a>pricot
#> [3] │ <a>vocado
str_view(fruit, "a$")
#>  [4] │ banan<a>
#> [15] │ cherimoy<a>
#> [30] │ feijo<a>
#> [36] │ guav<a>
#> [56] │ papay<a>
#> [74] │ satsum<a>

        It’s tempting to think that $ should match the start of a string, because that’s how we write dollar amounts, but that’s not what regular expressions want.


        To force a regular expression to match only the full string, anchor it with both ^ and $:

str_view(fruit, "apple")
#>  [1] │ <apple>
#> [62] │ pine<apple>
str_view(fruit, "^apple$")
#> [1] │ <apple>

You can also match the boundary between words (i.e., the start or end of a word) with \b. This can be particularly useful when using RStudio's find and replace tool. For example, if you want to find all uses of sum(), you can search for \bsum\b to avoid matching summarize, summary, rowsum and so on:

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
#> [1] │ <sum>mary(x)
#> [2] │ <sum>marize(df)
#> [3] │ row<sum>(x)
#> [4] │ <sum>(x)
str_view(x, "\\bsum\\b")
#> [4] │ <sum>(x)

        When used alone, anchors will produce a zero-width match:

str_view("abc", c("$", "^", "\\b"))
#> [1] │ abc<>
#> [2] │ <>abc
#> [3] │ <>abc<>

        This helps you understand what happens when you replace a standalone anchor:

str_replace_all("abc", c("$", "^", "\\b"), "--")
#> [1] "abc--"   "--abc"   "--abc--"

15.4.3 Character classes

        A character class, or character set, allows you to match any character in a set. As we discussed above, you can construct your own sets with [], where [abc] matches “a”, “b”, or “c” and [^abc] matches any character except “a”, “b”, or “c”. Apart from ^ there are two other characters that have special meaning inside of []:

• - defines a range, e.g., [a-z] matches any lower case letter and [0-9] matches any number.

• \ escapes special characters, so [\^\-\]] matches ^, -, or ].

Here are a few examples:

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "[abc]+")
#> [1] │ <abc>d ABCD 12345 -!@#%.
str_view(x, "[a-z]+")
#> [1] │ <abcd> ABCD 12345 -!@#%.
str_view(x, "[^a-z0-9]+")
#> [1] │ abcd< ABCD >12345< -!@#%.>

# You need an escape to match characters that are otherwise
# special inside of []
str_view("a-b-c", "[a-c]")
#> [1] │ <a>-<b>-<c>
str_view("a-b-c", "[a\\-c]")
#> [1] │ <a><->b<-><c>

        Some character classes are used so commonly that they get their own shortcut. You’ve already seen ., which matches any character apart from a newline. There are three other particularly useful pairs7:

• \d matches any digit; \D matches anything that isn't a digit.

• \s matches any whitespace (e.g., space, tab, newline); \S matches anything that isn't whitespace.

• \w matches any “word” character, i.e. letters and numbers; \W matches any “non-word” character.

        The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
#> [1] │ abcd ABCD <12345> -!@#%.
str_view(x, "\\D+")
#> [1] │ <abcd ABCD >12345< -!@#%.>
str_view(x, "\\s+")
#> [1] │ abcd< >ABCD< >12345< >-!@#%.
str_view(x, "\\S+")
#> [1] │ <abcd> <ABCD> <12345> <-!@#%.>
str_view(x, "\\w+")
#> [1] │ <abcd> <ABCD> <12345> -!@#%.
str_view(x, "\\W+")
#> [1] │ abcd< >ABCD< >12345< -!@#%.>

15.4.4 Quantifiers

Quantifiers control how many times a pattern matches. In Section 15.2 you learned about ? (0 or 1 matches), + (1 or more matches), and * (0 or more matches). For example, colou?r will match American or British spelling, \d+ will match one or more digits, and \s? will optionally match a single item of whitespace. You can also specify the number of matches precisely with {} (see the short example after this list):

• {n} matches exactly n times.

• {n,} matches at least n times.

• {n,m} matches between n and m times.
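
For example, here's a quick check of the three forms with str_view() (a minimal sketch, not from the book):

x <- c("x", "xx", "xxx", "xxxx")
str_view(x, "^x{2}$")
#> [2] │ <xx>
str_view(x, "^x{2,}$")
#> [2] │ <xx>
#> [3] │ <xxx>
#> [4] │ <xxxx>
str_view(x, "^x{1,3}$")
#> [1] │ <x>
#> [2] │ <xx>
#> [3] │ <xxx>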

15.4.5 Operator precedence and parentheses

        What does ab+ match? Does it match “a” followed by one or more “b”s, or does it match “ab” repeated any number of times? What does ^a|b$ match? Does it match the complete string a or the complete string b, or does it match a string starting with a or a string ending with b?


        The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school. You know that a + b * c is equivalent to a + (b * c) not (a + b) * c because * has higher precedence and + has lower precedence: you compute * before +.


        Similarly, regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence which means that ab+ is equivalent to a(b+), and ^a|b$ is equivalent to (^a)|(b$). Just like with algebra, you can use parentheses to override the usual order. But unlike algebra you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
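
To see the difference parentheses make, compare these two patterns (a small illustrative sketch):

str_view(c("abb", "abab"), "^ab+$")
#> [1] │ <abb>
str_view(c("abb", "abab"), "^(ab)+$")
#> [2] │ <abab>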


15.4.6 Grouping and capturing

        As well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.


The first way to use a capturing group is to refer back to it within a match with a back reference: \1 refers to the match contained in the first set of parentheses, \2 to the match in the second, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:

str_view(fruit, "(..)\\1")
#>  [4] │ b<anan>a
#> [20] │ <coco>nut
#> [22] │ <cucu>mber
#> [41] │ <juju>be
#> [56] │ <papa>ya
#> [73] │ s<alal> berry

        And this one finds all words that start and end with the same pair of letters:

str_view(words, "^(..).*\\1$")
#> [152] │ <church>
#> [217] │ <decide>
#> [617] │ <photograph>
#> [699] │ <require>
#> [739] │ <sense>

        You can also use back references in str_replace(). For example, this code switches the order of the second and third words in sentences:

sentences |> 
  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |> 
  str_view()
#> [1] │ The canoe birch slid on the smooth planks.
#> [2] │ Glue sheet the to the dark blue background.
#> [3] │ It's to easy tell the depth of a well.
#> [4] │ These a days chicken leg is a rare dish.
#> [5] │ Rice often is served in round bowls.
#> [6] │ The of juice lemons makes fine punch.
#> ... and 714 more

        If you want to extract the matches for each group you can use str_match(). But str_match() returns a matrix, so it’s not particularly easy to work with8:

sentences |> 
  str_match("the (\\w+) (\\w+)") |> 
  head()
#>      [,1]                [,2]     [,3]    
#> [1,] "the smooth planks" "smooth" "planks"
#> [2,] "the sheet to"      "sheet"  "to"    
#> [3,] "the depth of"      "depth"  "of"    
#> [4,] NA                  NA       NA      
#> [5,] NA                  NA       NA      
#> [6,] NA                  NA       NA

        You could convert to a tibble and name the columns:

sentences |> 
  str_match("the (\\w+) (\\w+)") |> 
  as_tibble(.name_repair = "minimal") |> 
  set_names("match", "word1", "word2")
#> # A tibble: 720 × 3
#>   match             word1  word2 
#>   <chr>             <chr>  <chr> 
#> 1 the smooth planks smooth planks
#> 2 the sheet to      sheet  to    
#> 3 the depth of      depth  of    
#> 4 <NA>              <NA>   <NA>  
#> 5 <NA>              <NA>   <NA>  
#> 6 <NA>              <NA>   <NA>  
#> # ℹ 714 more rows

        But then you’ve basically recreated your own version of separate_wider_regex(). Indeed, behind the scenes, separate_wider_regex() converts your vector of patterns to a single regex that uses grouping to capture the named components.


        Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with (?:).

x <- c("a gray cat", "a grey dog")
str_match(x, "gr(e|a)y")
#>      [,1]   [,2]
#> [1,] "gray" "a" 
#> [2,] "grey" "e"
str_match(x, "gr(?:e|a)y")
#>      [,1]  
#> [1,] "gray"
#> [2,] "grey"

15.4.7 Exercises

1. How would you match the literal string "'\? How about "$^$"?

2. Explain why each of these patterns doesn't match a \: "\", "\\", "\\\".

3. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

   a. Start with “y”.
   b. Don't start with “y”.
   c. End with “x”.
   d. Are exactly three letters long. (Don't cheat by using str_length()!)
   e. Have seven letters or more.
   f. Contain a vowel-consonant pair.
   g. Contain at least two vowel-consonant pairs in a row.
   h. Only consist of repeated vowel-consonant pairs.

4. Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!

5. Switch the first and last letters in words. Which of those strings are still words?

6. Describe in words what these regular expressions match (read carefully to see if each entry is a regular expression or a string that defines a regular expression):

   a. ^.*$
   b. "\\{.+\\}"
   c. \d{4}-\d{2}-\d{2}
   d. "\\\\{4}"
   e. \..\..\..
   f. (.)\1\1
   g. "(..)\\1"

7. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

15.5 Pattern control

        It’s possible to exercise extra control over the details of the match by using a pattern object instead of just a string. This allows you to control the so called regex flags and match various types of fixed strings, as described below.


15.5.1 Regex flags

        There are a number of settings that can be used to control the details of the regexp. These settings are often called flags in other programming languages. In stringr, you can use these by wrapping the pattern in a call to regex(). The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms:

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
#> [1] │ <banana>
str_view(bananas, regex("banana", ignore_case = TRUE))
#> [1] │ <banana>
#> [2] │ <Banana>
#> [3] │ <BANANA>

        If you’re doing a lot of work with multiline strings (i.e. strings that contain \n), dotalland multiline may also be useful:

• dotall = TRUE lets . match everything, including \n:

  x <- "Line 1\nLine 2\nLine 3"
  str_view(x, ".Line")
  str_view(x, regex(".Line", dotall = TRUE))
  #> [1] │ Line 1<
  #>     │ Line> 2<
  #>     │ Line> 3

• multiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the complete string:

  x <- "Line 1\nLine 2\nLine 3"
  str_view(x, "^Line")
  #> [1] │ <Line> 1
  #>     │ Line 2
  #>     │ Line 3
  str_view(x, regex("^Line", multiline = TRUE))
  #> [1] │ <Line> 1
  #>     │ <Line> 2
  #>     │ <Line> 3

        Finally, if you’re writing a complicated regular expression and you’re worried you might not understand it in the future, you might try comments = TRUE. It tweaks the pattern language to ignore spaces and new lines, as well as everything after #. This allows you to use comments and whitespace to make complex regular expressions more understandable9, as in the following example:

phone <- regex(
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code
    [)\-]?  # optional closing parens or dash
    \ ?     # optional space
    (\d{3}) # another three numbers
    [\ -]?  # optional space or dash
    (\d{4}) # four more numbers
  )", 
  comments = TRUE
)

str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
#> [1] "514-791-8141"   "(123) 456 7890" NA

        If you’re using comments and want to match a space, newline, or #, you’ll need to escape it with \.


15.5.2 Fixed matches

You can opt out of the regular expression rules by using fixed():

str_view(c("", "a", "."), fixed("."))
#> [3] │ <.>

        fixed() also gives you the ability to ignore case:

str_view("x X", "X")
#> [1] │ x <X>
str_view("x X", fixed("X", ignore_case = TRUE))
#> [1] │ <x> <X>

        If you’re working with non-English text, you will probably want coll() instead of fixed(), as it implements the full rules for capitalization as used by the locale you specify. See Seção 14.6 for more details on locales.

str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
#> [1] │ i <İ> ı I
str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
#> [1] │ <i> <İ> ı I

15.6 Practice

        To put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:

1. checking your work by creating simple positive and negative controls
2. combining regular expressions with Boolean algebra
3. creating complex patterns using string manipulation

15.6.1 Check your work

        First, let’s find all sentences that start with “The”. Using the ^ anchor alone is not enough:

str_view(sentences, "^The")
#>  [1] │ <The> birch canoe slid on the smooth planks.
#>  [4] │ <The>se days a chicken leg is a rare dish.
#>  [6] │ <The> juice of lemons makes fine punch.
#>  [7] │ <The> box was thrown beside the parked truck.
#>  [8] │ <The> hogs were fed chopped corn and garbage.
#> [11] │ <The> boy was there when the sun rose.
#> ... and 271 more

That pattern also matches sentences starting with words like They or These. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:

str_view(sentences, "^The\\b")
#>  [1] │ <The> birch canoe slid on the smooth planks.
#>  [6] │ <The> juice of lemons makes fine punch.
#>  [7] │ <The> box was thrown beside the parked truck.
#>  [8] │ <The> hogs were fed chopped corn and garbage.
#> [11] │ <The> boy was there when the sun rose.
#> [13] │ <The> source of the huge river is the clear spring.
#> ... and 250 more

        What about finding all sentences that begin with a pronoun?

str_view(sentences, "^She|He|It|They\\b")
#>  [3] │ <It>'s easy to tell the depth of a well.
#> [15] │ <He>lp the woman get back to her feet.
#> [27] │ <He>r purse was full of useless trash.
#> [29] │ <It> snowed, rained, and hailed the same morning.
#> [63] │ <He> ran half way to the hardware store.
#> [90] │ <He> lay prone and hardly moved a limb.
#> ... and 57 more

        A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:

str_view(sentences, "^(She|He|It|They)\\b")
#>   [3] │ <It>'s easy to tell the depth of a well.
#>  [29] │ <It> snowed, rained, and hailed the same morning.
#>  [63] │ <He> ran half way to the hardware store.
#>  [90] │ <He> lay prone and hardly moved a limb.
#> [116] │ <He> ordered peach pie with ice cream.
#> [127] │ <It> caught its hind paw in a rusty trap.
#> ... and 51 more

        You might wonder how you might spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:

pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")

pattern <- "^(She|He|It|They)\\b"
str_detect(pos, pattern)
#> [1] TRUE TRUE
str_detect(neg, pattern)
#> [1] FALSE FALSE

        It’s typically much easier to come up with good positive examples than negative examples, because it takes a while before you’re good enough with regular expressions to predict where your weaknesses are. Nevertheless, they’re still useful: as you work on the problem you can slowly accumulate a collection of your mistakes, ensuring that you never make the same mistake twice.


15.6.2 Boolean operations

        Imagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels ([^aeiou]), then allow that to match any number of letters ([^aeiou]+), then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$):

str_view(words, "^[^aeiou]+$")
#> [123] │ <by>
#> [249] │ <dry>
#> [328] │ <fly>
#> [538] │ <mrs>
#> [895] │ <try>
#> [952] │ <why>

        But you can make this problem a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels:

str_view(words[!str_detect(words, "[aeiou]")])
#> [1] │ by
#> [2] │ dry
#> [3] │ fly
#> [4] │ mrs
#> [5] │ try
#> [6] │ why

This is a useful technique whenever you're dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine if you want to find all words that contain “a” and “b”. There's no “and” operator built into regular expressions so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”:

str_view(words, "a.*b|b.*a")
#>  [2] │ <ab>le
#>  [3] │ <ab>out
#>  [4] │ <ab>solute
#> [62] │ <availab>le
#> [66] │ <ba>by
#> [67] │ <ba>ck
#> ... and 24 more

        It’s simpler to combine the results of two calls to str_detect():

words[str_detect(words, "a") & str_detect(words, "b")]
#>  [1] "able"      "about"     "absolute"  "available" "baby"      "back"     
#>  [7] "bad"       "bag"       "balance"   "ball"      "bank"      "bar"      
#> [13] "base"      "basis"     "bear"      "beat"      "beauty"    "because"  
#> [19] "black"     "board"     "boat"      "break"     "brilliant" "britain"  
#> [25] "debate"    "husband"   "labour"    "maybe"     "probable"  "table"

        What if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:

words[str_detect(words, "a.*e.*i.*o.*u")]
# ...
words[str_detect(words, "u.*o.*i.*e.*a")]

        It’s much simpler to combine five calls to str_detect():

words[
  str_detect(words, "a") &
  str_detect(words, "e") &
  str_detect(words, "i") &
  str_detect(words, "o") &
  str_detect(words, "u")
]
#> character(0)

        In general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.


15.6.3 Creating a pattern with code

        What if we wanted to find all sentences that mention a color? The basic idea is simple: we just combine alternation with word boundaries.

str_view(sentences, "\\b(red|green|blue)\\b")
#>   [2] │ Glue the sheet to the dark <blue> background.
#>  [26] │ Two <blue> fish swam in the tank.
#>  [92] │ A wisp of cloud hung in the <blue> air.
#> [148] │ The spot on the blotter was made by <green> ink.
#> [160] │ The sofa cushion is <red> and of light weight.
#> [174] │ The sky that morning was clear and bright <blue>.
#> ... and 20 more

        But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?

rgb <- c("red", "green", "blue")

        Well, we can! We’d just need to create the pattern from the vector using str_c() and str_flatten():

str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
#> [1] "\\b(red|green|blue)\\b"

        We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:

str_view(colors())
#> [1] │ white
#> [2] │ aliceblue
#> [3] │ antiquewhite
#> [4] │ antiquewhite1
#> [5] │ antiquewhite2
#> [6] │ antiquewhite3
#> ... and 651 more

But let's first eliminate the numbered variants:

cols <- colors()
cols <- cols[!str_detect(cols, "\\d")]
str_view(cols)
#> [1] │ white
#> [2] │ aliceblue
#> [3] │ antiquewhite
#> [4] │ aquamarine
#> [5] │ azure
#> [6] │ beige
#> ... and 137 more

        Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:

pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
str_view(sentences, pattern)
#>   [2] │ Glue the sheet to the dark <blue> background.
#>  [12] │ A rod is used to catch <pink> <salmon>.
#>  [26] │ Two <blue> fish swam in the tank.
#>  [66] │ Cars and busses stalled in <snow> drifts.
#>  [92] │ A wisp of cloud hung in the <blue> air.
#> [112] │ Leaves turn <brown> and <yellow> in the fall.
#> ... and 57 more

        In this example, cols only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through str_escape() to ensure they match literally.
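
For example (a small sketch with made-up color names), str_escape() neutralizes metacharacters like parentheses before you flatten the strings into a pattern:

user_cols <- c("dark red", "blue (navy)")
str_c("\\b(", str_flatten(str_escape(user_cols), "|"), ")\\b")
#> [1] "\\b(dark red|blue \\(navy\\))\\b"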


15.6.4 Exercises

1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

   a. Find all words that start or end with x.
   b. Find all words that start with a vowel and end with a consonant.
   c. Are there any words that contain at least one of each different vowel?

2. Construct patterns to find evidence for and against the rule “i before e except after c”.

3. colors() contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified.)

4. Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = "datasets")$results[, "Item"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you'll need to strip those off.

15.7 Regular expressions in other places

        Just like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions. The following sections describe some other useful functions in the wider tidyverse and base R.


15.7.1 tidyverse

There are three other particularly useful places where you might want to use a regular expression:

• matches(pattern) will select all variables whose name matches the supplied pattern. It's a “tidyselect” function that you can use anywhere in any tidyverse function that selects variables (e.g., select(), rename_with() and across()).

• pivot_longer()'s names_pattern argument takes a vector of regular expressions, just like separate_wider_regex(). It's useful when extracting data out of variable names with a complex structure.

• The delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e., regex(", ?"); see the small sketch after this list.
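
A minimal sketch of that last case, using a made-up column:

df <- tibble(x = "apple,banana, cherry")
df |> 
  separate_longer_delim(x, delim = regex(", ?"))
#> # A tibble: 3 × 1
#>   x     
#>   <chr> 
#> 1 apple 
#> 2 banana
#> 3 cherry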

15.7.2 Base R

        apropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:

apropos("replace")
#> [1] "%+replace%"       "replace"          "replace_na"      
#> [4] "setReplaceMethod" "str_replace"      "str_replace_all" 
#> [7] "str_replace_na"   "theme_replace"

        list.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the R Markdown files in the current directory with:

head(list.files(pattern = "\\.Rmd$"))
#> character(0)

        It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the stringi package, which is in turn built on top of the ICU engine, whereas base R functions use either the TRE engine or the PCRE engine, depending on whether or not you’ve set perl = TRUE. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the (?…) syntax.


15.8 Summary

        With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.


        In this chapter, you’ve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.


        A good place to start is vignette("regular-expressions", package = "stringr"): it documents the full set of syntax supported by stringr. Another useful reference is https://www.regular-expressions.info/. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.


        It’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the the same conventions as stringr.


        In the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.

1. You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).↩︎

2. You'll learn how to escape these special meanings in Section 15.4.1.↩︎

3. Well, any character apart from \n.↩︎

4. This gives us the proportion of names that contain an “x”; if you wanted the proportion of babies with a name containing an x, you'd need to perform a weighted mean.↩︎

5. We wish we could reassure you that you'd never see something this weird in real life, but unfortunately over the course of your career you're likely to see much weirder!↩︎

6. The complete set of metacharacters is .^$\|*+?{}[]()↩︎

7. Remember, to create a regular expression containing \d or \s, you'll need to escape the \ for the string, so you'll type "\\d" or "\\s".↩︎

8. Mostly because we never discuss matrices in this book!↩︎

9. comments = TRUE is particularly effective in combination with a raw string, as we use here.↩︎
        + + + \ No newline at end of file diff --git a/regexps_files/figure-html/unnamed-chunk-11-1.png b/regexps_files/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 000000000..86e9b5f9e Binary files /dev/null and b/regexps_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/screenshots/View-1.png b/screenshots/View-1.png new file mode 100644 index 000000000..8aeb78279 Binary files /dev/null and b/screenshots/View-1.png differ diff --git a/screenshots/View-2.png b/screenshots/View-2.png new file mode 100644 index 000000000..e80418c0e Binary files /dev/null and b/screenshots/View-2.png differ diff --git a/screenshots/View-3.png b/screenshots/View-3.png new file mode 100644 index 000000000..00cc92c78 Binary files /dev/null and b/screenshots/View-3.png differ diff --git a/screenshots/import-googlesheets-students.png b/screenshots/import-googlesheets-students.png new file mode 100644 index 000000000..d3ab5708b Binary files /dev/null and b/screenshots/import-googlesheets-students.png differ diff --git a/screenshots/import-spreadsheets-bake-sale.png b/screenshots/import-spreadsheets-bake-sale.png new file mode 100644 index 000000000..3790dfdeb Binary files /dev/null and b/screenshots/import-spreadsheets-bake-sale.png differ diff --git a/screenshots/import-spreadsheets-deaths.png b/screenshots/import-spreadsheets-deaths.png new file mode 100644 index 000000000..2ef7d495b Binary files /dev/null and b/screenshots/import-spreadsheets-deaths.png differ diff --git a/screenshots/import-spreadsheets-penguins-islands.png b/screenshots/import-spreadsheets-penguins-islands.png new file mode 100644 index 000000000..ca60db6ed Binary files /dev/null and b/screenshots/import-spreadsheets-penguins-islands.png differ diff --git a/screenshots/import-spreadsheets-roster.png b/screenshots/import-spreadsheets-roster.png new file mode 100644 index 000000000..2295c674d Binary files /dev/null and b/screenshots/import-spreadsheets-roster.png differ diff --git a/screenshots/import-spreadsheets-sales.png b/screenshots/import-spreadsheets-sales.png new file mode 100644 index 000000000..2b5c6fe30 Binary files /dev/null and b/screenshots/import-spreadsheets-sales.png differ diff --git a/screenshots/import-spreadsheets-students.png b/screenshots/import-spreadsheets-students.png new file mode 100644 index 000000000..7ce5f0b21 Binary files /dev/null and b/screenshots/import-spreadsheets-students.png differ diff --git a/screenshots/import-spreadsheets-survey.png b/screenshots/import-spreadsheets-survey.png new file mode 100644 index 000000000..f9cd2d79e Binary files /dev/null and b/screenshots/import-spreadsheets-survey.png differ diff --git a/screenshots/quarto-chunk-nav.png b/screenshots/quarto-chunk-nav.png new file mode 100644 index 000000000..00e29d752 Binary files /dev/null and b/screenshots/quarto-chunk-nav.png differ diff --git a/screenshots/rstudio-diagnostic-tip.png b/screenshots/rstudio-diagnostic-tip.png new file mode 100644 index 000000000..93038a5dc Binary files /dev/null and b/screenshots/rstudio-diagnostic-tip.png differ diff --git a/screenshots/rstudio-diagnostic-warn.png b/screenshots/rstudio-diagnostic-warn.png new file mode 100644 index 000000000..e83ed7c99 Binary files /dev/null and b/screenshots/rstudio-diagnostic-warn.png differ diff --git a/screenshots/rstudio-diagnostic.png b/screenshots/rstudio-diagnostic.png new file mode 100644 index 000000000..610e78d6d Binary files /dev/null and b/screenshots/rstudio-diagnostic.png differ diff --git a/screenshots/rstudio-nav.png 
b/screenshots/rstudio-nav.png new file mode 100644 index 000000000..927fceac5 Binary files /dev/null and b/screenshots/rstudio-nav.png differ diff --git a/screenshots/rstudio-palette.png b/screenshots/rstudio-palette.png new file mode 100644 index 000000000..2b448cada Binary files /dev/null and b/screenshots/rstudio-palette.png differ diff --git a/screenshots/rstudio-pipe-options.png b/screenshots/rstudio-pipe-options.png new file mode 100644 index 000000000..b389890ab Binary files /dev/null and b/screenshots/rstudio-pipe-options.png differ diff --git a/screenshots/rstudio-wd.png b/screenshots/rstudio-wd.png new file mode 100644 index 000000000..5401607c1 Binary files /dev/null and b/screenshots/rstudio-wd.png differ diff --git a/screenshots/scraping-imdb.png b/screenshots/scraping-imdb.png new file mode 100644 index 000000000..ac6eee57e Binary files /dev/null and b/screenshots/scraping-imdb.png differ diff --git a/screenshots/stringr-autocomplete.png b/screenshots/stringr-autocomplete.png new file mode 100644 index 000000000..e3fd36275 Binary files /dev/null and b/screenshots/stringr-autocomplete.png differ diff --git a/search.json b/search.json index de7a62cef..b88bc3643 100644 --- a/search.json +++ b/search.json @@ -74,14 +74,14 @@ "href": "intro.html#footnotes", "title": "Introdução", "section": "", - "text": "Nota de tradução: tidy é um verbo em inglês que quer dizer “arrumar/organizar”. Tidy data é uma forma de organizar os dados, que será abordado no capítulo ?sec-data-tidy.↩︎\nNota de tradução: Manipulação de dados é chamado em inglês de data wrangling, porque colocar seus dados em uma forma natural de trabalhar frequentemente parece uma luta (wrangle)!↩︎\nNota de tradução: “Caber na memória” se refere à memória RAM (random access memory) do computador, cuja função é guardar temporariamente toda a informação que o computador precisa (por exemplo, as bases de dados importadas).↩︎\nSe você deseja uma visão abrangente de todos os recursos do RStudio, consulte o Guia de uso do RStudio em https://docs.posit.co/ide/user.↩︎\nNota de tradução: tidyverse é a união das palavras tidy (arrumado) e universe (universo), sendo então a ideia de um “universo arrumado”. messy quer dizer desarrumado, e messyverse seria a ideia de um universo desarrumado.↩︎" + "text": "Nota de tradução: tidy é um verbo em inglês que quer dizer “arrumar/organizar”. Tidy data é uma forma de organizar os dados, que será abordado no capítulo Capítulo 5.↩︎\nNota de tradução: Manipulação de dados é chamado em inglês de data wrangling, porque colocar seus dados em uma forma natural de trabalhar frequentemente parece uma luta (wrangle)!↩︎\nNota de tradução: “Caber na memória” se refere à memória RAM (random access memory) do computador, cuja função é guardar temporariamente toda a informação que o computador precisa (por exemplo, as bases de dados importadas).↩︎\nSe você deseja uma visão abrangente de todos os recursos do RStudio, consulte o Guia de uso do RStudio em https://docs.posit.co/ide/user.↩︎\nNota de tradução: tidyverse é a união das palavras tidy (arrumado) e universe (universo), sendo então a ideia de um “universo arrumado”. 
messy quer dizer desarrumado, e messyverse seria a ideia de um universo desarrumado.↩︎" }, { "objectID": "whole-game.html", "href": "whole-game.html", "title": "Visão geral", "section": "", - "text": "O nosso objetivo nesta parte do livro é oferecer uma visão geral rápida das principais ferramentas da ciência de dados: importação, organização, transformação e visualização de dados, como mostrado na Figura 1. Queremos apresentar para você uma visão geral da ciência de dados, fornecendo apenas o suficiente de todos os principais elementos para que você possa lidar com conjuntos de dados reais, ainda que simples. As partes posteriores do livro abordarão cada um desses tópicos com mais profundidade, ampliando o leque de desafios da ciência de dados que você pode enfrentar.\n\n\n\n\nFigura 1: Nesta parte do livro, você aprenderá como importar, organizar, transformar e visualizar dados.\n\n\n\nQuatro capítulos se concentram nas ferramentas da ciência de dados:\n\nA visualização é um ótimo ponto de partida para a programação em R, porque os resultados são claros: você pode criar gráficos elegantes e informativos que te ajudam a entender os dados. No Capítulo 1, você mergulhará na visualização, aprendendo a estrutura básica de um gráfico ggplot2 e técnicas poderosas para transformar dados em gráficos.\nGeralmente, apenas a visualização não é suficiente. Portanto, no ?sec-data-transform, você aprenderá os principais verbos que permitem selecionar variáveis importantes, filtrar observações essenciais, criar novas variáveis e fazer sumarizações.\nNo ?sec-data-tidy, você aprenderá sobre dados organizados (tidy data), uma maneira consistente de armazenar seus dados que facilita a transformação, visualização e modelagem. Você aprenderá os princípios de tidy data e como deixar seus dados neste formato.\nAntes de poder transformar e visualizar seus dados, você precisa primeiro importá-los para o R. No ?sec-data-import, você aprenderá o básico de como importar arquivos .csv para o R.\n\nEntre esses capítulos, há outros quatro capítulos que se concentram no fluxo de trabalho no R. Em Capítulo 2, ?sec-workflow-style e ?sec-workflow-scripts-projects, você aprenderá boas práticas de fluxo de trabalho para escrever e organizar seu código R. Isso te preparará para o sucesso a longo prazo, pois fornecerá as ferramentas necessárias para manter a organização ao enfrentar projetos reais. Por fim, no ?sec-workflow-getting-help, você aprenderá como obter ajuda e continuar aprendendo." + "text": "O nosso objetivo nesta parte do livro é oferecer uma visão geral rápida das principais ferramentas da ciência de dados: importação, organização, transformação e visualização de dados, como mostrado na Figura 1. Queremos apresentar para você uma visão geral da ciência de dados, fornecendo apenas o suficiente de todos os principais elementos para que você possa lidar com conjuntos de dados reais, ainda que simples. As partes posteriores do livro abordarão cada um desses tópicos com mais profundidade, ampliando o leque de desafios da ciência de dados que você pode enfrentar.\n\n\n\n\nFigura 1: Nesta parte do livro, você aprenderá como importar, organizar, transformar e visualizar dados.\n\n\n\nQuatro capítulos se concentram nas ferramentas da ciência de dados:\n\nA visualização é um ótimo ponto de partida para a programação em R, porque os resultados são claros: você pode criar gráficos elegantes e informativos que te ajudam a entender os dados. 
No Capítulo 1, você mergulhará na visualização, aprendendo a estrutura básica de um gráfico ggplot2 e técnicas poderosas para transformar dados em gráficos.\nGeralmente, apenas a visualização não é suficiente. Portanto, no Capítulo 3, você aprenderá os principais verbos que permitem selecionar variáveis importantes, filtrar observações essenciais, criar novas variáveis e fazer sumarizações.\nNo Capítulo 5, você aprenderá sobre dados organizados (tidy data), uma maneira consistente de armazenar seus dados que facilita a transformação, visualização e modelagem. Você aprenderá os princípios de tidy data e como deixar seus dados neste formato.\nAntes de poder transformar e visualizar seus dados, você precisa primeiro importá-los para o R. No Capítulo 7, você aprenderá o básico de como importar arquivos .csv para o R.\n\nEntre esses capítulos, há outros quatro capítulos que se concentram no fluxo de trabalho no R. Em Capítulo 2, Capítulo 4 e Capítulo 6, você aprenderá boas práticas de fluxo de trabalho para escrever e organizar seu código R. Isso te preparará para o sucesso a longo prazo, pois fornecerá as ferramentas necessárias para manter a organização ao enfrentar projetos reais. Por fim, no Capítulo 8, você aprenderá como obter ajuda e continuar aprendendo." }, { "objectID": "data-visualize.html#introdução", @@ -95,35 +95,35 @@ "href": "data-visualize.html#primeiros-passos", "title": "1  Visualização de dados", "section": "\n1.2 Primeiros passos", - "text": "1.2 Primeiros passos\nOs pinguins com nadadeiras mais compridas pesam mais ou menos que pinguins com nadadeiras curtas? Você provavelmente já tem uma resposta, mas tente torná-la mais precisa. Como é a relação entre o comprimento da nadadeira e massa corporal? Ela é positiva? Negativa? Linear? Não linear? A relação varia com a espécie do pinguim? E quanto à ilha onde o pinguim vive? Vamos criar visualizações que podemos usar para responder essas perguntas.\n\n1.2.1 O data frame pinguins\n\nVocê pode testar suas respostas à essas questões usando o data frame pinguins encontrado no pacote dados (usando dados::pinguins). Um data frame é uma coleção tabular (formato de tabela) de variáveis (nas colunas) e observações (nas linhas). pinguins contém 344 observações coletadas e disponibilizadas pela Dra. Kristen Gorman e pelo PELD Estação Palmer, Antártica2.\nPara facilitar a discussão, vamos definir alguns termos:\n\nUma variável é uma quantidade, qualidade ou propriedade que você pode medir.\nUm valor é o estado de uma variável quando você a mede. O valor de uma variável pode mudar de medição para medição.\nUma observação é um conjunto de medições feitas em condições semelhantes (geralmente todas as medições em uma observação são feitas ao mesmo tempo e no mesmo objeto). Uma observação conterá vários valores, cada um associado a uma variável diferente. Às vezes, vamos nos referir a uma observação como um ponto de dados.\nDados tabulares são um conjunto de valores, cada um associado a uma variável e uma observação. Os dados tabulares estarão no formato tidy (arrumado) se cada valor estiver em sua própria “célula”, cada variável em sua própria coluna e cada observação em sua própria linha.\n\nNeste contexto, uma variável refere-se a um atributo de todos os pinguins, e uma observação refere-se a todos os atributos de um único pinguim.\nDigite o nome do data frame no console e o R mostrará uma visualização de seu conteúdo. Observe que aparece escrito tibble no topo desta visualização. 
No tidyverse, usamos data frames especiais chamados tibbles, dos quais você aprenderá mais em breve.\n\npinguins\n#> # A tibble: 344 × 8\n#> especie ilha comprimento_bico profundidade_bico\n#> <fct> <fct> <dbl> <dbl>\n#> 1 Pinguim-de-adélia Torgersen 39.1 18.7\n#> 2 Pinguim-de-adélia Torgersen 39.5 17.4\n#> 3 Pinguim-de-adélia Torgersen 40.3 18 \n#> 4 Pinguim-de-adélia Torgersen NA NA \n#> 5 Pinguim-de-adélia Torgersen 36.7 19.3\n#> 6 Pinguim-de-adélia Torgersen 39.3 20.6\n#> # ℹ 338 more rows\n#> # ℹ 4 more variables: comprimento_nadadeira <int>, massa_corporal <int>, …\n\nEste data frame contém 8 colunas. Para uma visualização alternativa, onde você pode ver todas as variáveis e as primeiras observações de cada variável, use glimpse(). Ou, se você estiver no RStudio, execute View(pinguins) para abrir um visualizador de dados interativo.\n\nglimpse(pinguins)\n#> Rows: 344\n#> Columns: 8\n#> $ especie <fct> Pinguim-de-adélia, Pinguim-de-adélia, Pinguim…\n#> $ ilha <fct> Torgersen, Torgersen, Torgersen, Torgersen, T…\n#> $ comprimento_bico <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2,…\n#> $ profundidade_bico <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6,…\n#> $ comprimento_nadadeira <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 1…\n#> $ massa_corporal <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675,…\n#> $ sexo <fct> macho, fêmea, fêmea, NA, fêmea, macho, fêmea,…\n#> $ ano <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…\n\nEntre as variáveis em pinguins estão:\n\nespecie: a espécie de um pinguim (Pinguim-de-adélia, Pinguim-de-barbicha e Pinguim-gentoo).\ncomprimento_nadadeira: comprimento da nadadeira de um pinguim, em milímetros.\nmassa_corporal: massa corporal de um pinguim, em gramas.\n\nPara saber mais sobre pinguins, abra sua página de ajuda executando ?pinguins.\n\n1.2.2 Objetivo final\nNosso objetivo final neste capítulo é recriar a seguinte visualização que exibe a relação entre o comprimento da nadadeira e a massa corporal desses pinguins, levando em consideração a espécie do pinguim.\n\n\n\n\n\n\n1.2.3 Criando um gráfico ggplot\nVamos recriar esse gráfico passo a passo.\nNo ggplot2, você inicia um gráfico com a função ggplot(), definindo um objeto de gráfico ao qual você adiciona camadas. O primeiro argumento da função ggplot() é o conjunto de dados a ser usado no gráfico e, portanto, ggplot(data = pinguins) cria um gráfico vazio que está preparado para exibir os dados dos pinguins, mas, como ainda não dissemos como fazer a visualização, por enquanto ele está vazio. Esse não é um gráfico muito interessante, mas você pode pensar nele como uma tela vazia na qual você pintará as camadas restantes do seu gráfico.\n\nggplot(data = pinguins)\n\n\n\n\nEm seguida, precisamos informar ao ggplot() como as informações dos nossos dados serão representadas visualmente. O argumento mapping (mapeamento) da função ggplot() define como as variáveis em seu conjunto de dados são mapeadas para as propriedades visuais (estética) do gráfico. O argumento mapping é sempre definido na função aes(), e os argumentos x e y de aes() especificam quais variáveis devem ser mapeadas nos eixos x e y. Por enquanto, mapearemos apenas o comprimento da nadadeira para o atributo estético x e a massa corporal para o atributo y. 
O ggplot2 procura as variáveis mapeadas no argumento data, nesse caso, pinguins.\nO gráfico a seguir mostra o resultado da adição desses mapeamentos.\n\nggplot(\n data = pinguins,\n mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n)\n\n\n\n\nNossa tela vazia agora está mais estruturada: está claro onde os comprimentos das nadadeiras serão exibidos (no eixo x) e onde as massas corporais serão exibidas (no eixo y). Mas os pinguins em si ainda não estão no gráfico. Isso ocorre porque ainda não definimos, em nosso código, como representar as observações de nosso data frame em nosso gráfico.\nPara isso, precisamos definir um geom: A geometria que um gráfico usa para representar os dados. Essas geometrias são disponibilizados no ggplot2 com funções que começam com geom_. As pessoas geralmente descrevem os gráficos pelo tipo de geom que o gráfico usa. Por exemplo, os gráficos de barras usam geometrias de barras (geom_bar()), os gráficos de linhas usam geometrias de linhas (geom_line()), os boxplots usam geometrias de boxplot (geom_boxplot()), os gráficos de dispersão usam geometrias de pontos (geom_point()) e assim por diante.\nA função geom_point() adiciona uma camada de pontos ao seu gráfico, o que cria um gráfico de dispersão. O ggplot2 vem com muitas funções de geometria, cada uma adicionando um tipo diferente de camada a um gráfico. Você aprenderá várias geometrias ao longo do livro, principalmente em ?sec-layers.\n\nggplot(\n data = pinguins,\n mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n) +\n geom_point()\n#> Warning: Removed 2 rows containing missing values (`geom_point()`).\n\n\n\n\nAgora temos algo que se parece com o que poderíamos chamar de “gráfico de dispersão”. Ele ainda não corresponde ao nosso gráfico mostrado no início da seção “objetivo final”, mas, usando esse gráfico, podemos começar a responder à pergunta que motivou nossa exploração: “Como é a relação entre o comprimento da nadadeira e a massa corporal?” A relação parece ser positiva (à medida que o comprimento da nadadeira aumenta, a massa corporal também aumenta), razoavelmente linear (os pontos estão agrupados em torno de uma linha em vez de uma curva) e moderadamente forte (não há muita dispersão em torno dessa linha). Os pinguins com nadadeiras mais longas geralmente são maiores em termos de massa corporal.\nAntes de adicionarmos mais camadas a esse gráfico, vamos parar por um momento e revisar a mensagem de aviso que recebemos:\n\nRemoved 2 rows containing missing values (geom_point()).\n\nEstamos vendo essa mensagem porque há dois pinguins em nosso conjunto de dados com valores faltantes (missing values - NA*) de massa corporal e/ou comprimento da nadadeira e o ggplot2 não tem como representá-los no gráfico sem esses dois valores. Assim como o R, o ggplot2 adota a filosofia de que os valores faltantes nunca devem desaparecer silenciosamente. Esse tipo de aviso é provavelmente um dos tipos mais comuns de avisos que você verá ao trabalhar com dados reais - os valores faltantes são um problema muito comum e você aprenderá mais sobre eles ao longo do livro, especialmente em ?sec-missing-values. 
Nos demais gráficos deste capítulo, vamos suprimir esse aviso para que ele não seja mostrado em cada gráfico que fizermos.\n\n1.2.4 Adicionando atributos estéticos e camadas\nGráficos de dispersão são úteis para exibir a relação entre duas variáveis numéricas, mas é sempre uma boa ideia ter uma postura cética em relação a qualquer relação aparente entre duas variáveis e perguntar se pode haver outras variáveis que expliquem ou mudem a natureza dessa relação aparente. Por exemplo, a relação entre o comprimento das nadadeiras e a massa corporal difere de acordo com a espécie? Vamos incluir as espécies em nosso gráfico e ver se isso revela alguma ideia adicional sobre a relação aparente entre essas variáveis. Faremos isso representando as espécies com pontos de cores diferentes.\nPara conseguir isso, precisaremos modificar o atributo estético ou a geometria? Se você pensou “no mapeamento estético, dentro de aes()”, você já está pegando o jeito de criar visualizações de dados com o ggplot2! Caso contrário, não se preocupe. Ao longo do livro, você criará muito mais visualizações com ggplot e terá muito mais oportunidades de verificar sua intuição.\n\nggplot(\n  data = pinguins,\n  mapping = aes(x = comprimento_nadadeira, y = massa_corporal, color = especie)\n) +\n  geom_point()\n\n\n\n\nQuando uma variável categórica é mapeada a um atributo estético, o ggplot2 atribui automaticamente um valor único da estética (aqui uma cor única) a cada nível único da variável (cada uma das três espécies), um processo conhecido como dimensionamento. O ggplot2 também adicionará uma legenda que explica quais valores correspondem a quais níveis.\nAgora vamos adicionar mais uma camada: uma curva suave que exibe a relação entre a massa corporal e o comprimento das nadadeiras. Antes de prosseguir, consulte o código acima e pense em como podemos adicionar isso ao nosso gráfico existente.\nComo essa é uma nova geometria que representa nossos dados, adicionaremos uma nova geometria como uma camada sobre a nossa geometria de pontos: geom_smooth(). E especificaremos que queremos desenhar a linha de melhor ajuste com base em um modelo linear (linear model em inglês) com method = \"lm\".\n\nggplot(\n  data = pinguins,\n  mapping = aes(x = comprimento_nadadeira, y = massa_corporal, color = especie)\n) +\n  geom_point() +\n  geom_smooth(method = \"lm\")\n\n\n\n\nAdicionamos linhas com sucesso, mas esse gráfico não se parece com o gráfico da Seção 1.2.2, que tem apenas uma linha para todo o conjunto de dados, em vez de linhas separadas para cada espécie de pinguim.\nQuando os mapeamentos estéticos são definidos em ggplot(), no nível global, eles são passados para cada uma das camadas de geometria (geom) subsequentes do gráfico. Entretanto, cada função geom no ggplot2 também pode receber um argumento mapping, que permite mapeamentos estéticos em nível local que são adicionados àqueles herdados do nível global. Como queremos que os pontos sejam coloridos com base na espécie, mas não queremos que as linhas sejam separadas para eles, devemos especificar color = especie somente para geom_point().\n\nggplot(\n  data = pinguins,\n  mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n) +\n  geom_point(mapping = aes(color = especie)) +\n  geom_smooth(method = \"lm\")\n\n\n\n\nPronto! Temos algo que se parece muito com nosso objetivo final, embora ainda não esteja perfeito. 
Ainda precisamos usar formas diferentes para cada espécie de pinguim e melhorar os rótulos.\nGeralmente, não é uma boa ideia representar informações usando apenas cores em um gráfico, pois as pessoas percebem as cores de forma diferente devido ao daltonismo ou a outras diferenças de visão de cores. Portanto, além da cor, também podemos mapear especie para a estética shape (forma).\n\nggplot(\n  data = pinguins,\n  mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n) +\n  geom_point(mapping = aes(color = especie, shape = especie)) +\n  geom_smooth(method = \"lm\")\n\n\n\n\nObserve que a legenda também é atualizada automaticamente para refletir as diferentes formas dos pontos.\nE, finalmente, podemos melhorar os rótulos do nosso gráfico usando a função labs() em uma nova camada. Alguns dos argumentos de labs() podem ser autoexplicativos: title adiciona um título e subtitle adiciona um subtítulo ao gráfico. Outros argumentos correspondem aos mapeamentos estéticos: x é o rótulo do eixo x, y é o rótulo do eixo y e color e shape definem o rótulo da legenda. Além disso, podemos aprimorar a paleta de cores para que seja segura para pessoas com daltonismo com a função scale_color_colorblind() do pacote ggthemes.\n\nggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +\n  geom_point(aes(color = especie, shape = especie)) +\n  geom_smooth(method = \"lm\") +\n  labs(\n    title = \"Massa corporal e comprimento da nadadeira\",\n    subtitle = \"Medidas para Pinguim-de-adélia, Pinguim-de-barbicha e Pinguim-gentoo\",\n    x = \"Comprimento da nadadeira (mm)\",\n    y = \"Massa corporal (g)\",\n    color = \"Espécie\",\n    shape = \"Espécie\"\n  ) +\n  scale_color_colorblind()\n\n\n\n\nFinalmente temos um gráfico que corresponde perfeitamente ao nosso “objetivo final”!\n\n1.2.5 Exercícios\n\nQuantas linhas existem em pinguins? E quantas colunas?\nO que a variável profundidade_bico no data frame pinguins descreve? Leia a documentação da base pinguins para descobrir, utilizando o comando ?pinguins.\nFaça um gráfico de dispersão de profundidade_bico em função de comprimento_bico. Ou seja, faça um gráfico de dispersão com profundidade_bico no eixo y e comprimento_bico no eixo x. Descreva a relação entre essas duas variáveis.\nO que acontece se você fizer um gráfico de dispersão de especie em função de profundidade_bico? Qual seria uma melhor escolha de geometria (geom)?\nPor que o seguinte erro ocorre e como você poderia corrigi-lo?\n\n\nggplot(data = pinguins) + \n  geom_point()\n\n\nO que o argumento na.rm faz em geom_point()? Qual é o valor padrão do argumento? Crie um gráfico de dispersão em que você use esse argumento definido como TRUE (verdadeiro).\nAdicione a seguinte legenda ao gráfico que você criou no exercício anterior: “Os dados são provenientes do pacote dados”. Dica: dê uma olhada na documentação da função labs().\nRecrie a visualização a seguir. Para qual atributo estético profundidade_bico deve ser mapeada? E ela deve ser mapeada no nível global ou no nível da geometria?\n\n\n\n\n\n\n\nExecute esse código em sua mente e preveja como será o resultado. Em seguida, execute o código no R e verifique suas previsões.\n\n\nggplot(\n  data = pinguins,\n  mapping = aes(x = comprimento_nadadeira, y = massa_corporal, color = ilha)\n) +\n  geom_point() +\n  geom_smooth(se = FALSE)\n\n\nEsses dois gráficos serão diferentes? 
Por que sim ou por que não?\n\n\nggplot(\n  data = pinguins,\n  mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n) +\n  geom_point() +\n  geom_smooth()\n\nggplot() +\n  geom_point(\n    data = pinguins,\n    mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n  ) +\n  geom_smooth(\n    data = pinguins,\n    mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n  )"
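Uma observação prática, como suposição nossa sobre o mecanismo (o texto acima promete suprimir o aviso de valores faltantes, mas não mostra como): em um documento Quarto isso costuma ser feito com uma opção de bloco, ou, dentro do próprio código, com o argumento na.rm de geom_point().

#| warning: false
ggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +
  geom_point(na.rm = TRUE)  # na.rm = TRUE descarta os NA silenciosamente, sem emitir o aviso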
  },
  {
    "objectID": "data-visualize.html#sec-ggplot2-calls",
    "href": "data-visualize.html#sec-ggplot2-calls",
    "title": "1  Visualização de dados",
    "section": "\n1.3 Chamadas ggplot2",
    "text": "1.3 Chamadas ggplot2\nÀ medida que passarmos dessas seções introdutórias, faremos a transição para uma expressão mais concisa do código do ggplot2. Até agora, temos sido muito explícitos, o que é útil quando se está aprendendo:\n\nggplot(\n  data = pinguins,\n  mapping = aes(x = comprimento_nadadeira, y = massa_corporal)\n) +\n  geom_point()\n\nNormalmente, o primeiro ou os dois primeiros argumentos de uma função são tão importantes que você logo saberá usá-los de cor. Os dois primeiros argumentos de ggplot() são data e mapping; no restante do livro, não escreveremos esses nomes. 
Isso economiza digitação e, ao reduzir a quantidade de texto extra, facilita a visualização das diferenças entre os gráficos. Essa é uma preocupação de programação realmente importante, à qual voltaremos no Capítulo 25.\nReescrevendo o gráfico anterior de forma mais concisa, temos:\n\nggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +\n  geom_point()\n\nNo futuro, você também aprenderá sobre o pipe (encadeamento), |>, que permitirá que você crie esse gráfico com a seguinte sintaxe:\n\npinguins |> \n  ggplot(aes(x = comprimento_nadadeira, y = massa_corporal)) + \n  geom_point()"
  },
  {
    "objectID": "data-visualize.html#visualizando-distribuições",
    "href": "data-visualize.html#visualizando-distribuições",
    "title": "1  Visualização de dados",
    "section": "\n1.4 Visualizando distribuições",
    "text": "1.4 Visualizando distribuições\nA forma como você visualiza a distribuição de uma variável depende do tipo de variável: categórica ou numérica.\n\n1.4.1 Uma variável categórica\nUma variável é categórica se puder assumir apenas um valor de um pequeno conjunto de valores. Para examinar a distribuição de uma variável categórica, você pode usar um gráfico de barras. A altura das barras exibe quantas observações ocorreram com cada valor x.\n\nggplot(pinguins, aes(x = especie)) +\n  geom_bar()\n\n\n\n\nEm gráficos de barras de variáveis categóricas com níveis não ordenados, como a especie de pinguim acima, geralmente é preferível reordenar as barras com base em suas frequências. Para isso, é necessário transformar a variável em um fator (como o R lida com dados categóricos) e, em seguida, reordenar os níveis desse fator.\n\nggplot(pinguins, aes(x = fct_infreq(especie))) +\n  geom_bar()\n\n\n\n\nVocê aprenderá mais sobre fatores e funções para lidar com fatores (como fct_infreq() mostrado acima) no Capítulo 16.\n\n1.4.2 Uma variável numérica\nUma variável é numérica (ou quantitativa) se puder assumir uma ampla gama de valores numéricos e se for possível adicionar, subtrair ou calcular médias com esses valores. 
As variáveis numéricas podem ser contínuas ou discretas.\nUma visualização comumente usada para distribuições de variáveis contínuas é um histograma.\n\nggplot(pinguins, aes(x = massa_corporal)) +\n geom_histogram(binwidth = 200)\n\n\n\n\nUm histograma divide o eixo x em intervalos igualmente espaçados e, em seguida, usa a altura de uma barra para exibir o número de observações que se enquadram em cada intervalo. No gráfico acima, a barra mais alta mostra que 39 observações têm um valor massa_corporal entre 3500 e 3700 gramas, que são as bordas esquerda e direita da barra.\nVocê pode definir a largura dos intervalos em um histograma com o argumento binwidth (largura do intervalo), que é medido nas unidades da variável x. Você deve sempre explorar uma variedade de larguras de intervalos ao trabalhar com histogramas, pois diferentes larguras de intervalos podem revelar padrões diferentes. Nos gráficos abaixo, uma largura de intervalo de 20 é muito estreita, resultando em muitas barras, o que dificulta a determinação da forma da distribuição. Da mesma forma, uma largura de intervalo de 2000 é muito alta, resultando em todos os dados sendo agrupados em apenas três barras, o que também dificulta a determinação da forma da distribuição. Uma largura de intervalo de 200 proporciona um balanço mais adequado.\n\nggplot(pinguins, aes(x = massa_corporal)) +\n geom_histogram(binwidth = 20)\nggplot(pinguins, aes(x = massa_corporal)) +\n geom_histogram(binwidth = 2000)\n\n\n\n\n\n\n\n\n\n\n\nUma visualização alternativa para distribuições de variáveis numéricas é um gráfico de densidade. Um gráfico de densidade é uma versão suavizada de um histograma e uma alternativa prática, especialmente para dados contínuos provenientes de uma distribuição suavizada subjacente. Não entraremos em detalhes sobre como geom_density() estima a densidade (você pode ler mais sobre isso na documentação da função), mas vamos explicar como a curva de densidade é desenhada com uma analogia. Imagine um histograma feito de blocos de madeira. Em seguida, imagine que você jogue um fio de espaguete cozido sobre ele. A forma que o espaguete assumirá sobre os blocos pode ser considerada como a forma da curva de densidade. Ela mostra menos detalhes do que um histograma, mas pode facilitar a obtenção rápida da forma da distribuição, principalmente com relação à moda (valor que ocorre com maior frequência) e à assimetria.\n\nggplot(pinguins, aes(x = massa_corporal)) +\n geom_density()\n#> Warning: Removed 2 rows containing non-finite values (`stat_density()`).\n\n\n\n\n\n1.4.3 Exercícios\n\nFaça um gráfico de barras de especie de pinguins, no qual você atribui especie ao atributo estético y. Como esse gráfico é diferente?\nComo os dois gráficos a seguir são diferentes? Qual atributo estético, color ou fill, é mais útil para alterar a cor das barras?\n\n\nggplot(pinguins, aes(x = especie)) +\n geom_bar(color = \"red\")\n\nggplot(pinguins, aes(x = especie)) +\n geom_bar(fill = \"red\")\n\n\nO que o argumento bins em geom_histogram() faz?\nFaça um histograma da variável quilate no conjunto de dados diamante que está disponível quando você carrega o pacote dados. Faça experiências com diferentes larguras de intervalo (binwidth). Qual largura de intervalo revela os padrões mais interessantes?" 
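Como verificação rápida (um esboço nosso, não presente no texto), a contagem da barra mais alta pode ser conferida com dplyr; o resultado exato depende de qual borda do intervalo o histograma trata como fechada.

pinguins |>
  filter(massa_corporal > 3500, massa_corporal <= 3700) |>
  count()
# espera-se algo próximo das 39 observações citadas acima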
  },
  {
    "objectID": "data-visualize.html#visualizando-relações",
    "href": "data-visualize.html#visualizando-relações",
    "title": "1  Visualização de dados",
    "section": "\n1.5 Visualizando relações",
    "text": "1.5 Visualizando relações\nPara visualizar uma relação, precisamos ter pelo menos duas variáveis mapeadas para os atributos estéticos de um gráfico. Nas seções a seguir, você aprenderá sobre os gráficos comumente usados para visualizar relações entre duas ou mais variáveis e as geometrias usadas para criá-los.\n\n1.5.1 Uma variável numérica e uma variável categórica\nPara visualizar a relação entre uma variável numérica e uma variável categórica, podemos usar diagramas de caixa (chamados boxplots) lado a lado. Um boxplot é um tipo de abreviação visual para medidas de posição (percentis) que descrevem uma distribuição. Também é útil para identificar possíveis outliers. Conforme mostrado na Figura 1.1, cada boxplot consiste em:\n\nUma caixa que indica o intervalo da metade intermediária dos dados, uma distância conhecida como intervalo interquartil (IIQ), que se estende do 25º percentil da distribuição até o 75º percentil. No meio da caixa há uma linha que exibe a mediana, ou seja, o 50º percentil, da distribuição. Essas três linhas lhe dão uma noção da dispersão da distribuição e se a distribuição é ou não simétrica em relação à mediana ou inclinada para um lado.\nPontos que apresentam observações com valores maiores que 1,5 vezes o IIQ de qualquer borda da caixa. 
Esse atributo assume valores entre 0 (completamente transparente) e 1 (completamente opaco). No gráfico a seguir, ela está definida como 0.5.\n\nggplot(pinguins, aes(x = massa_corporal, color = especie, fill = especie)) +\n geom_density(alpha = 0.5)\n\n\n\n\nObserve a terminologia que usamos aqui:\n\nNós mapeamos variáveis para atributos estéticos se quisermos que o atributo visual representado por esse atributo varie de acordo com os valores dessa variável.\nCaso contrário, nós definimos o valor de um atributo estético.\n\n1.5.2 Duas variáveis categóricas\nPodemos usar gráficos de barras empilhadas para visualizar a relação entre duas variáveis categóricas. Por exemplo, os dois gráficos de barras empilhadas a seguir exibem a relação entre ilha e espécie ou, especificamente, a visualização da distribuição de espécie em cada ilha.\nO primeiro gráfico mostra as frequências de cada espécie de pinguim em cada ilha. O gráfico de frequências mostra que há um número igual de Pinguim-de-adélia em cada ilha. Mas não temos uma boa noção do equilíbrio percentual em cada ilha.\n\nggplot(pinguins, aes(x = ilha, fill = especie)) +\n geom_bar()\n\n\n\n\nO segundo gráfico é um gráfico de frequência relativa, criado pela definição de position = \"fill\" na geometria, que é mais útil para comparar as distribuições de espécies entre as ilhas, pois não é afetado pelo número desigual de pinguins entre as ilhas. Usando esse gráfico, podemos ver que todos os Pinguim-gentoo vivem na ilha Biscoe e constituem aproximadamente 75% dos pinguins dessa ilha, todos os Pinguim-de-barbicha vivem na ilha Dream e constituem aproximadamente 50% dos pinguins dessa ilha, e os Pinguim-de-adélia vivem nas três ilhas e constituem todos os pinguins da ilha Torgersen.\n\nggplot(pinguins, aes(x = ilha, fill = especie)) +\n geom_bar(position = \"fill\")\n\n\n\n\nAo criar esses gráficos de barras, mapeamos a variável que será separada em barras para o atributo estético x e a variável que mudará as cores dentro das barras para a estética fill.\n\n1.5.3 Duas variáveis numéricas\nAté agora, você aprendeu sobre gráficos de dispersão (criados com geom_point()) e curvas suaves (criadas com geom_smooth()) para visualizar a relação entre duas variáveis numéricas. Um gráfico de dispersão é provavelmente o gráfico mais usado para visualizar a relação entre duas variáveis numéricas.\n\nggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +\n geom_point()\n\n\n\n\n\n1.5.4 Três ou mais variáveis\nComo vimos em Seção 1.2.4, podemos incorporar mais variáveis em um gráfico mapeando-as para atributos estéticos adicionais. Por exemplo, no gráfico de dispersão a seguir, as cores dos pontos (color) representam espécies e as formas dos pontos (shape) representam ilhas.\n\nggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +\n geom_point(aes(color = especie, shape = ilha))\n\n\n\n\nNo entanto, mapear muitos atributos estéticos a um gráfico faz com que ele fique desordenado e difícil de entender. Outra maneira, que é particularmente útil para variáveis categóricas, é dividir seu gráfico em facetas (facets), subdivisões ou janelas que exibem um subconjunto dos dados cada uma.\nPara separar seu gráfico em facetas por uma única variável, use facet_wrap(). O primeiro argumento de facet_wrap() é uma fórmula3, que você cria com ~ seguido do nome de uma variável. 
A variável que você passa para facet_wrap() deve ser categórica.\n\nggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +\n  geom_point(aes(color = especie, shape = especie)) +\n  facet_wrap(~ilha)\n\n\n\n\nVocê vai aprender sobre muitas outras geometrias para visualizar distribuições de variáveis e relações entre elas no Capítulo 9.\n\n1.5.5 Exercícios\n\nO data frame milhas que acompanha o pacote dados contém 234 observações coletadas pela Agência de Proteção Ambiental dos EUA sobre 38 modelos de carros. Quais variáveis em milhas são categóricas? Quais variáveis são numéricas? (Dica: digite ?milhas para ler a documentação do conjunto de dados.) Como você pode ver essas informações ao executar milhas?\nFaça um gráfico de dispersão de rodovia (Milhas rodoviárias por galão) em função de cilindrada usando o data frame milhas. Em seguida, mapeie uma terceira variável numérica para color (cor), depois size (tamanho), depois igualmente para color e size e, por fim, shape (forma). Como esses atributos estéticos se comportam de forma diferente para variáveis categóricas e numéricas?\nNo gráfico de dispersão de rodovia vs. cilindrada, o que acontece se você mapear uma terceira variável para linewidth (espessura da linha)?\nO que acontece se você mapear a mesma variável para vários atributos estéticos?\nFaça um gráfico de dispersão de profundidade_bico vs. comprimento_bico e pinte os pontos por especie. O que a adição da coloração por especie revela sobre a relação entre essas duas variáveis? E quanto à separação em facetas por especie?\nPor que o seguinte código produz duas legendas separadas? Como você corrigiria isso para combinar as duas legendas?\n\n\nggplot(\n  data = pinguins,\n  mapping = aes(\n    x = comprimento_bico, y = profundidade_bico, \n    color = especie, shape = especie\n  )\n) +\n  geom_point() +\n  labs(color = \"especie\")\n\n\nCrie os dois gráficos de barras empilhadas a seguir. Que pergunta você pode responder com o primeiro? Que pergunta você pode responder com o segundo?\n\n\nggplot(pinguins, aes(x = ilha, fill = especie)) +\n  geom_bar(position = \"fill\")\nggplot(pinguins, aes(x = especie, fill = ilha)) +\n  geom_bar(position = \"fill\")"
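Para deixar concreta a distinção entre mapear e definir apresentada na Seção 1.5.1, um esboço nosso (a cor "blue" é apenas ilustrativa):

# Mapear: dentro de aes(), a cor varia conforme os valores de especie
ggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +
  geom_point(aes(color = especie))

# Definir: fora de aes(), todos os pontos recebem uma única cor fixa
ggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +
  geom_point(color = "blue")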
  },
  {
    "objectID": "data-visualize.html#sec-ggsave",
    "href": "data-visualize.html#sec-ggsave",
    "title": "1  Visualização de dados",
    "section": "\n1.6 Salvando seus gráficos",
    "text": "1.6 Salvando seus gráficos\nDepois de criar um gráfico, talvez você queira tirá-lo do R salvando-o como uma imagem que possa ser usada em outro lugar. Esse é o objetivo da função ggsave(), que salvará no computador o gráfico criado mais recentemente:\n\nggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +\n  geom_point()\nggsave(filename = \"penguin-plot.png\")\n\nIsso salvará o gráfico no seu diretório de trabalho, um conceito sobre o qual você aprenderá mais no Capítulo 6.\nSe você não especificar a largura width e a altura height, elas serão tiradas das dimensões do dispositivo de plotagem atual. Para obter um código reprodutível, você deverá especificá-las. Você pode obter mais informações sobre a função ggsave() na documentação.\nDe modo geral, entretanto, recomendamos que você monte seus relatórios finais usando o Quarto, um sistema de escrita reprodutível que permite intercalar seu código e sua escrita e incluir automaticamente seus gráficos em seus relatórios. Você aprenderá mais sobre o Quarto no Capítulo 28.\n\n1.6.1 Exercícios\n\nExecute as seguintes linhas de código. Qual dos dois gráficos é salvo como grafico-milhas.png? Por quê?\n\n\nggplot(milhas, aes(x = classe)) +\n  geom_bar()\nggplot(milhas, aes(x = cidade, y = rodovia)) +\n  geom_point()\nggsave(\"grafico-milhas.png\")\n\n\nO que você precisa alterar no código acima para salvar o gráfico como PDF em vez de PNG? Como você poderia descobrir quais tipos de arquivos de imagem funcionariam em ggsave()?"
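Um esboço de como fica a chamada com dimensões explícitas (os valores de width e height abaixo são apenas ilustrativos; por padrão, ggsave() os interpreta em polegadas):

ggplot(pinguins, aes(x = comprimento_nadadeira, y = massa_corporal)) +
  geom_point()

# Dimensões fixas tornam o arquivo gerado reprodutível entre sessões
ggsave(filename = "penguin-plot.png", width = 6, height = 4)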
  },
  {
    "objectID": "data-visualize.html#problemas-comuns",
@@ -137,7 +137,7 @@
    "href": "data-visualize.html#resumo",
    "title": "1  Visualização de dados",
    "section": "\n1.8 Resumo",
    "text": "1.8 Resumo\nNeste capítulo, você aprendeu os fundamentos da visualização de dados com o ggplot2. Começamos com a ideia básica que sustenta o ggplot2: uma visualização é um mapeamento de variáveis em seus dados para atributos estéticos como posição (position), cor (color), tamanho (size) e forma (shape). Em seguida, você aprendeu a aumentar a complexidade e melhorar a apresentação de seus gráficos camada por camada. Você também aprendeu sobre gráficos comumente usados para visualizar a distribuição de uma única variável, bem como para visualizar relações entre duas ou mais variáveis ao utilizar mapeamentos de atributos estéticos adicionais e/ou dividindo seu gráfico em pequenos gráficos usando facetas.\nUsaremos as visualizações repetidamente ao longo deste livro, introduzindo novas técnicas à medida que precisarmos delas, além de nos aprofundarmos na criação de visualizações com o ggplot2 do Capítulo 9 ao Capítulo 11.\nCom as noções básicas de visualização em seu currículo, no próximo capítulo mudaremos um pouco a direção e daremos algumas orientações práticas sobre o fluxo de trabalho. Intercalamos conselhos sobre fluxo de trabalho com ferramentas de ciência de dados ao longo desta parte do livro, pois isso te ajudará a manter a organização à medida que você escreve quantidades cada vez maiores de código em R."
  },
  {
    "objectID": "data-visualize.html#footnotes",
@@ -165,7 +165,7 @@
    "href": "workflow-basics.html#sec-whats-in-a-name",
    "title": "2  Fluxo de Trabalho: básico",
    "section": "\n2.3 A importância dos nomes",
    "text": "2.3 A importância dos nomes\nNomes de objetos devem começar com uma letra e só podem conter letras, números, _ e .. Você quer que os nomes dos seus objetos sejam descritivos, então você precisará adotar uma convenção para várias palavras. 
Recomendamos snake_case, onde você separa palavras minúsculas com _.\n\neu_uso_snake_case\noutrasPessoasUsamCamelCase\nalgumas.pessoas.usam.pontos\nE_aLgumas.Pessoas_nAoUsamConvencao\n\nVamos voltar aos nomes quando discutirmos o estilo de código no ?sec-workflow-style.\nVocê pode ver o conteúdo de um objeto (chamaremos isso de inspecionar) digitando seu nome:\n\nx\n#> [1] 12\n\nFazendo outra atribuição:\n\nesse_e_um_nome_bem_longo <- 2.5\n\nPara inspecionar esse objeto, experimente o recurso de autocompletar (autocomplete) do RStudio: digite “esse”, pressione TAB, adicione caracteres até ter um prefixo único e pressione enter.\nVamos supor que você cometeu um erro e que o valor de esse_e_um_nome_bem_longo deveria ser 3.5, não 2.5. Você pode usar outro atalho de teclado para te ajudar a corrigi-lo. Por exemplo, você pode pressionar ↑ para recuperar o último comando que você digitou e editá-lo. Ou, digite “esse” e pressione Cmd/Ctrl + ↑ para listar todos os comandos que você digitou que começam com essas letras. Use as setas para navegar e, em seguida, pressione enter para digitar novamente o comando. Altere 2.5 para 3.5 e execute novamente.\nFazendo mais uma atribuição:\n\nr_rocks <- 2^3\n\nVamos tentar inspecioná-lo:\n\nr_rock\n#> Error: object 'r_rock' not found\nR_rocks\n#> Error: object 'R_rocks' not found\n\nIsso ilustra o contrato implícito entre você e o R: o R fará os cálculos chatos para você, mas em troca, você deve ser escrever suas instruções de forma precisa. Se não, você provavelmente receberá um erro que diz que o objeto que você está procurando não foi encontrado. Erros de digitação importam; o R não pode ler sua mente e dizer: “ah, você provavelmente quis dizer r_rocks quando digitou r_rock”. A caixa alta (letras maiúsculas) importa; da mesma forma, o R não pode ler sua mente e dizer: “ah, você provavelmente quis dizer r_rocks quando digitou R_rocks”." + "text": "2.3 A importância dos nomes\nNomes de objetos devem começar com uma letra e só podem conter letras, números, _ e .. Você quer que os nomes dos seus objetos sejam descritivos, então você precisará adotar uma convenção para várias palavras. Recomendamos snake_case, onde você separa palavras minúsculas com _.\n\neu_uso_snake_case\noutrasPessoasUsamCamelCase\nalgumas.pessoas.usam.pontos\nE_aLgumas.Pessoas_nAoUsamConvencao\n\nVamos voltar aos nomes quando discutirmos o estilo de código no Capítulo 4.\nVocê pode ver o conteúdo de um objeto (chamaremos isso de inspecionar) digitando seu nome:\n\nx\n#> [1] 12\n\nFazendo outra atribuição:\n\nesse_e_um_nome_bem_longo <- 2.5\n\nPara inspecionar esse objeto, experimente o recurso de autocompletar (autocomplete) do RStudio: digite “esse”, pressione TAB, adicione caracteres até ter um prefixo único e pressione enter.\nVamos supor que você cometeu um erro e que o valor de esse_e_um_nome_bem_longo deveria ser 3.5, não 2.5. Você pode usar outro atalho de teclado para te ajudar a corrigi-lo. Por exemplo, você pode pressionar ↑ para recuperar o último comando que você digitou e editá-lo. Ou, digite “esse” e pressione Cmd/Ctrl + ↑ para listar todos os comandos que você digitou que começam com essas letras. Use as setas para navegar e, em seguida, pressione enter para digitar novamente o comando. 
Altere 2.5 para 3.5 e execute novamente.\nFazendo mais uma atribuição:\n\nr_rocks <- 2^3\n\nVamos tentar inspecioná-lo:\n\nr_rock\n#> Error: object 'r_rock' not found\nR_rocks\n#> Error: object 'R_rocks' not found\n\nIsso ilustra o contrato implícito entre você e o R: o R fará os cálculos chatos para você, mas em troca, você deve escrever suas instruções de forma precisa. Se não, você provavelmente receberá um erro que diz que o objeto que você está procurando não foi encontrado. Erros de digitação importam; o R não pode ler sua mente e dizer: “ah, você provavelmente quis dizer r_rocks quando digitou r_rock”. A caixa alta (letras maiúsculas) importa; da mesma forma, o R não pode ler sua mente e dizer: “ah, você provavelmente quis dizer r_rocks quando digitou R_rocks”." }, { "objectID": "workflow-basics.html#chamando-funções", @@ -188,39 +188,1425 @@ "section": "\n2.6 Sumário", "text": "2.6 Sumário\nNesse capítulo você aprendeu um pouco mais sobre como o código R funciona e algumas dicas para te ajudar a entender seu código quando você voltar a ele no futuro. No próximo capítulo, continuaremos sua jornada de ciência de dados, ensinando-o sobre o dplyr, o pacote tidyverse que ajuda você a transformar dados, seja selecionando variáveis importantes, filtrando as linhas de interesse ou calculando estatísticas resumidas." }, + { + "objectID": "data-transform.html#introduction", + "href": "data-transform.html#introduction", + "title": "3  Data transformation", + "section": "\n3.1 Introduction", + "text": "3.1 Introduction\nVisualization is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need to make the graph you want. Often you'll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed from New York City in 2013.\nThe goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We'll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action and we'll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g., numbers, strings, dates).\n\n3.1.1 Prerequisites\nIn this chapter we'll focus on the dplyr package, another core member of the tidyverse. 
We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.\n\nlibrary(nycflights13)\nlibrary(tidyverse)\n#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──\n#> ✔ dplyr 1.1.3 ✔ readr 2.1.4\n#> ✔ forcats 1.0.0 ✔ stringr 1.5.1\n#> ✔ ggplot2 3.4.4 ✔ tibble 3.2.1\n#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.0\n#> ✔ purrr 1.0.2 \n#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──\n#> ✖ dplyr::filter() masks stats::filter()\n#> ✖ dplyr::lag() masks stats::lag()\n#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nTake careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag(). So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: packagename::functionname().\n\n3.1.2 nycflights13\nTo explore the basic dplyr verbs, we’re going to use nycflights13::flights. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.\n\nflights\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nflights is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably View(flights), which will open an interactive scrollable and filterable view. 
Otherwise you can use print(flights, width = Inf) to show all columns, or use glimpse():\n\nglimpse(flights)\n#> Rows: 336,776\n#> Columns: 19\n#> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…\n#> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…\n#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…\n#> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…\n#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…\n#> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…\n#> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…\n#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…\n#> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…\n#> $ carrier <chr> \"UA\", \"UA\", \"AA\", \"B6\", \"DL\", \"UA\", \"B6\", \"EV\", \"B6\"…\n#> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…\n#> $ tailnum <chr> \"N14228\", \"N24211\", \"N619AA\", \"N804JB\", \"N668DN\", \"N…\n#> $ origin <chr> \"EWR\", \"LGA\", \"JFK\", \"JFK\", \"LGA\", \"EWR\", \"EWR\", \"LG…\n#> $ dest <chr> \"IAH\", \"IAH\", \"MIA\", \"BQN\", \"ATL\", \"ORD\", \"FLL\", \"IA…\n#> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…\n#> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…\n#> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…\n#> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…\n#> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…\n\nIn both views, the variable names are followed by abbreviations that tell you the type of each variable: <int> is short for integer, <dbl> is short for double (aka real numbers), <chr> for character (aka strings), and <dttm> for date-time. These are important because the operations you can perform on a column depend so much on its “type”.\n\n3.1.3 dplyr basics\nYou’re about to learn the primary dplyr verbs (functions) which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:\n\nThe first argument is always a data frame.\nThe subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).\nThe output is always a new data frame.\n\nBecause each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we’ll do so with the pipe, |>. We’ll discuss the pipe more in Seção 3.4, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that x |> f(y) is equivalent to f(x, y), and x |> f(y) |> g(z) is equivalent to g(f(x, y), z). The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:\n\nflights |>\n filter(dest == \"IAH\") |> \n group_by(year, month, day) |> \n summarize(\n arr_delay = mean(arr_delay, na.rm = TRUE)\n )\n\ndplyr’s verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to the join verbs that work on tables in Capítulo 19. Let’s dive in!"
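A small base-R sketch (an editor's illustration, not book code; any R >= 4.1 session runs it) making the pipe equivalence above concrete:

x <- c(4, 9, 16)

sqrt(x)      # f(x)
#> [1] 2 3 4
x |> sqrt()  # the same call, written with the pipe
#> [1] 2 3 4

sum(x, 1)    # f(x, y)
#> [1] 30
x |> sum(1)  # the left-hand side becomes the first argument
#> [1] 30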
+ }, + { + "objectID": "data-transform.html#rows", + "href": "data-transform.html#rows", + "title": "3  Data transformation", + "section": "\n3.2 Rows", + "text": "3.2 Rows\nThe most important verbs that operate on rows of a dataset are filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss distinct() which finds rows with unique values but unlike arrange() and filter() it can also optionally modify the columns.\n\n3.2.1 filter()\n\nfilter() allows you to keep rows based on the values of the columns1. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that departed more than 120 minutes (two hours) late:\n\nflights |> \n filter(dep_delay > 120)\n#> # A tibble: 9,723 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 848 1835 853 1001 1950\n#> 2 2013 1 1 957 733 144 1056 853\n#> 3 2013 1 1 1114 900 134 1447 1222\n#> 4 2013 1 1 1540 1338 122 2020 1825\n#> 5 2013 1 1 1815 1325 290 2120 1542\n#> 6 2013 1 1 1842 1422 260 1958 1535\n#> # ℹ 9,717 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nAs well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also combine conditions with & or , to indicate “and” (check for both conditions) or with | to indicate “or” (check for either condition):\n\n# Flights that departed on January 1\nflights |> \n filter(month == 1 & day == 1)\n#> # A tibble: 842 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 836 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\n# Flights that departed in January or February\nflights |> \n filter(month == 1 | month == 2)\n#> # A tibble: 51,955 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 51,949 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nThere’s a useful shortcut when you’re combining | and ==: %in%. 
It keeps rows where the variable equals one of the values on the right:\n\n# A shorter way to select flights that departed in January or February\nflights |> \n filter(month %in% c(1, 2))\n#> # A tibble: 51,955 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 51,949 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nWe’ll come back to these comparisons and logical operators in more detail in Capítulo 12.\nWhen you run filter() dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-:\n\njan1 <- flights |> \n filter(month == 1 & day == 1)\n\n\n3.2.2 Common mistakes\nWhen you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. filter() will let you know when this happens:\n\nflights |> \n filter(month = 1)\n#> Error in `filter()`:\n#> ! We detected a named input.\n#> ℹ This usually means that you've used `=` instead of `==`.\n#> ℹ Did you mean `month == 1`?\n\nAnother mistake is to write “or” statements like you would in English:\n\nflights |> \n filter(month == 1 | 2)\n\nThis “works”, in the sense that it doesn’t throw an error, but it doesn’t do what you want because | first checks the condition month == 1 and then checks the condition 2, which is not a sensible condition to check. We’ll learn more about what’s happening here and why in Seção 15.6.2.\n\n3.2.3 arrange()\n\narrange() changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns. We get the earliest years first, then within a year the earliest months, etc.\n\nflights |> \n arrange(year, month, day, dep_time)\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nYou can use desc() on a column inside of arrange() to re-order the data frame based on that column in descending (big-to-small) order. 
For example, this code orders flights from most to least delayed:\n\nflights |> \n arrange(desc(dep_delay))\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 9 641 900 1301 1242 1530\n#> 2 2013 6 15 1432 1935 1137 1607 2120\n#> 3 2013 1 10 1121 1635 1126 1239 1810\n#> 4 2013 9 20 1139 1845 1014 1457 2210\n#> 5 2013 7 22 845 1600 1005 1044 1815\n#> 6 2013 4 10 1100 1900 960 1342 2211\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nNote that the number of rows has not changed – we’re only arranging the data, we’re not filtering it.\n\n3.2.4 distinct()\n\ndistinct() finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:\n\n# Remove duplicate rows, if any\nflights |> \n distinct()\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\n# Find all unique origin and destination pairs\nflights |> \n distinct(origin, dest)\n#> # A tibble: 224 × 2\n#> origin dest \n#> <chr> <chr>\n#> 1 EWR IAH \n#> 2 LGA IAH \n#> 3 JFK MIA \n#> 4 JFK BQN \n#> 5 LGA ATL \n#> 6 EWR ORD \n#> # ℹ 218 more rows\n\nAlternatively, if you want to keep the other columns when filtering for unique rows, you can use the .keep_all = TRUE option.\n\nflights |> \n distinct(origin, dest, .keep_all = TRUE)\n#> # A tibble: 224 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 218 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nIt’s not a coincidence that all of these distinct flights are on January 1: distinct() will find the first occurrence of a unique row in the dataset and discard the rest.\nIf you want to find the number of occurrences instead, you’re better off swapping distinct() for count(), and with the sort = TRUE argument you can arrange them in descending order of number of occurrences. You’ll learn more about count in Seção 13.3.\n\nflights |>\n count(origin, dest, sort = TRUE)\n#> # A tibble: 224 × 3\n#> origin dest n\n#> <chr> <chr> <int>\n#> 1 JFK LAX 11262\n#> 2 LGA ATL 10263\n#> 3 LGA ORD 8857\n#> 4 JFK SFO 8204\n#> 5 LGA CLT 6168\n#> 6 EWR ORD 6100\n#> # ℹ 218 more rows\n\n\n3.2.5 Exercises\n\n\nIn a single pipeline for each condition, find all flights that meet the condition:\n\nHad an arrival delay of two or more hours\nFlew to Houston (IAH or HOU)\nWere operated by United, American, or Delta\nDeparted in summer (July, August, and September)\nArrived more than two hours late, but didn’t leave late\nWere delayed by at least an hour, but made up over 30 minutes in flight\n\n\nSort flights to find the flights with the longest departure delays. 
Find the flights that left earliest in the morning.\nSort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)\nWas there a flight on every day of 2013?\nWhich flights traveled the farthest distance? Which traveled the least distance?\nDoes it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do." + }, + { + "objectID": "data-transform.html#columns", + "href": "data-transform.html#columns", + "title": "3  Data transformation", + "section": "\n3.3 Columns", + "text": "3.3 Columns\nThere are four important verbs that affect the columns without changing the rows: mutate() creates new columns that are derived from the existing columns, select() changes which columns are present, rename() changes the names of the columns, and relocate() changes the positions of the columns.\n\n3.3.1 mutate()\n\nThe job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n speed = distance / air_time * 60\n )\n#> # A tibble: 336,776 × 21\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nBy default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the .before argument to instead add the variables to the left hand side2:\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n speed = distance / air_time * 60,\n .before = 1\n )\n#> # A tibble: 336,776 × 21\n#> gain speed year month day dep_time sched_dep_time dep_delay arr_time\n#> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int>\n#> 1 -9 370. 2013 1 1 517 515 2 830\n#> 2 -16 374. 2013 1 1 533 529 4 850\n#> 3 -31 408. 2013 1 1 542 540 2 923\n#> 4 17 517. 2013 1 1 544 545 -1 1004\n#> 5 19 394. 2013 1 1 554 600 -6 812\n#> 6 -16 288. 2013 1 1 554 558 -4 740\n#> # ℹ 336,770 more rows\n#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …\n\nThe . is a sign that .before is an argument to the function, not the name of a third new variable we are creating. You can also use .after to add after a variable, and in both .before and .after you can use the variable name instead of a position. For example, we could add the new variables after day:\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n speed = distance / air_time * 60,\n .after = day\n )\n\nAlternatively, you can control which variables are kept with the .keep argument. A particularly useful argument is \"used\" which specifies that we only keep the columns that were involved or created in the mutate() step. 
For example, the following output will contain only the variables dep_delay, arr_delay, air_time, gain, hours, and gain_per_hour.\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n hours = air_time / 60,\n gain_per_hour = gain / hours,\n .keep = \"used\"\n )\n\nNote that since we haven’t assigned the result of the above computation back to flights, the new variables gain, hours, and gain_per_hour will only be printed but will not be stored in a data frame. And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to flights, overwriting the original data frame with many more variables, or to a new object. Often, the right answer is a new object that is named informatively to indicate its contents, e.g., delay_gain, but you might also have good reasons for overwriting flights.\n\n3.3.2 select()\n\nIt’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables:\n\n\nSelect columns by name:\n\nflights |> \n select(year, month, day)\n\n\n\nSelect all columns between year and day (inclusive):\n\nflights |> \n select(year:day)\n\n\n\nSelect all columns except those from year to day (inclusive):\n\nflights |> \n select(!year:day)\n\nHistorically this operation was done with - instead of !, so you’re likely to see that in the wild. These two operators serve the same purpose but with subtle differences in behavior. We recommend using ! because it reads as “not” and combines well with & and |.\n\n\nSelect all columns that are characters:\n\nflights |> \n select(where(is.character))\n\n\n\nThere are a number of helper functions you can use within select():\n\n\nstarts_with(\"abc\"): matches names that begin with “abc”.\n\nends_with(\"xyz\"): matches names that end with “xyz”.\n\ncontains(\"ijk\"): matches names that contain “ijk”.\n\nnum_range(\"x\", 1:3): matches x1, x2 and x3.\n\nSee ?select for more details. Once you know regular expressions (the topic of Capítulo 15) you’ll also be able to use matches() to select variables that match a pattern.\nYou can rename variables as you select() them by using =. 
The new name appears on the left hand side of the =, and the old variable appears on the right hand side:\n\nflights |> \n select(tail_num = tailnum)\n#> # A tibble: 336,776 × 1\n#> tail_num\n#> <chr> \n#> 1 N14228 \n#> 2 N24211 \n#> 3 N619AA \n#> 4 N804JB \n#> 5 N668DN \n#> 6 N39463 \n#> # ℹ 336,770 more rows\n\n\n3.3.3 rename()\n\nIf you want to keep all the existing variables and just want to rename a few, you can use rename() instead of select():\n\nflights |> \n rename(tail_num = tailnum)\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nIf you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names() which provides some useful automated cleaning.\n\n3.3.4 relocate()\n\nUse relocate() to move variables around. You might want to collect related variables together or move important variables to the front. By default relocate() moves variables to the front:\n\nflights |> \n relocate(time_hour, air_time)\n#> # A tibble: 336,776 × 19\n#> time_hour air_time year month day dep_time sched_dep_time\n#> <dttm> <dbl> <int> <int> <int> <int> <int>\n#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515\n#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529\n#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540\n#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545\n#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600\n#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558\n#> # ℹ 336,770 more rows\n#> # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, …\n\nYou can also specify where to put them using the .before and .after arguments, just like in mutate():\n\nflights |> \n relocate(year:dep_time, .after = time_hour)\nflights |> \n relocate(starts_with(\"arr\"), .before = dep_time)\n\n\n3.3.5 Exercises\n\nCompare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?\nBrainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.\nWhat happens if you specify the name of the same variable multiple times in a select() call?\n\nWhat does the any_of() function do? Why might it be helpful in conjunction with this vector?\n\nvariables <- c(\"year\", \"month\", \"day\", \"dep_delay\", \"arr_delay\")\n\n\n\nDoes the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?\n\nflights |> select(contains(\"TIME\"))\n\n\nRename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.\n\nWhy doesn’t the following work, and what does the error mean?\n\nflights |> \n select(tailnum) |> \n arrange(arr_delay)\n#> Error in `arrange()`:\n#> ℹ In argument: `..1 = arr_delay`.\n#> Caused by error:\n#> ! object 'arr_delay' not found" + }, + { + "objectID": "data-transform.html#sec-the-pipe", + "href": "data-transform.html#sec-the-pipe", + "title": "3  Data transformation", + "section": "\n3.4 The pipe", + "text": "3.4 The pipe\nWe’ve shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs. 
For example, imagine that you wanted to find the fast flights to Houston’s IAH airport: you need to combine filter(), mutate(), select(), and arrange():\n\nflights |> \n filter(dest == \"IAH\") |> \n mutate(speed = distance / air_time * 60) |> \n select(year:day, dep_time, carrier, flight, speed) |> \n arrange(desc(speed))\n#> # A tibble: 7,198 × 7\n#> year month day dep_time carrier flight speed\n#> <int> <int> <int> <int> <chr> <int> <dbl>\n#> 1 2013 7 9 707 UA 226 522.\n#> 2 2013 8 27 1850 UA 1128 521.\n#> 3 2013 8 28 902 UA 1711 519.\n#> 4 2013 8 28 2122 UA 1022 519.\n#> 5 2013 6 11 1628 UA 1178 515.\n#> 6 2013 8 27 1017 UA 333 515.\n#> # ℹ 7,192 more rows\n\nEven though this pipeline has four steps, it’s easy to skim because the verbs come at the start of each line: start with the flights data, then filter, then mutate, then select, then arrange.\nWhat would happen if we didn’t have the pipe? We could nest each function call inside the previous call:\n\narrange(\n select(\n mutate(\n filter(\n flights, \n dest == \"IAH\"\n ),\n speed = distance / air_time * 60\n ),\n year:day, dep_time, carrier, flight, speed\n ),\n desc(speed)\n)\n\nOr we could use a bunch of intermediate objects:\n\nflights1 <- filter(flights, dest == \"IAH\")\nflights2 <- mutate(flights1, speed = distance / air_time * 60)\nflights3 <- select(flights2, year:day, dep_time, carrier, flight, speed)\narrange(flights3, desc(speed))\n\nWhile both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.\nTo add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |> instead of %>% as shown in Figura 3.1; more on %>% shortly.\n\n\n\n\nFigura 3.1: To insert |>, make sure the “Use native pipe operator” option is checked.\n\n\n\n\n\n\n\n\n\nmagrittr\n\n\n\nIf you’ve been using the tidyverse for a while, you might be familiar with the %>% pipe provided by the magrittr package. The magrittr package is included in the core tidyverse, so you can use %>% whenever you load the tidyverse:\n\nlibrary(tidyverse)\n\nmtcars %>% \n group_by(cyl) %>%\n summarize(n = n())\n\nFor simple cases, |> and %>% behave identically. So why do we recommend the base pipe? Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse. Secondly, |> is quite a bit simpler than %>%: in the time between the invention of %>% in 2014 and the inclusion of |> in R 4.1.0 in 2021, we gained a better understanding of the pipe. This allowed the base implementation to jettison infrequently used and less important features." + }, + { + "objectID": "data-transform.html#groups", + "href": "data-transform.html#groups", + "title": "3  Data transformation", + "section": "\n3.5 Groups", + "text": "3.5 Groups\nSo far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. 
In this section, we’ll focus on the most important functions: group_by(), summarize(), and the slice family of functions.\n\n3.5.1 group_by()\n\nUse group_by() to divide your dataset into groups meaningful for your analysis:\n\nflights |> \n group_by(month)\n#> # A tibble: 336,776 × 19\n#> # Groups: month [12]\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\ngroup_by() doesn’t change the data but, if you look closely at the output, you’ll notice that the output indicates that it is “grouped by” month (Groups: month [12]). This means subsequent operations will now work “by month”. group_by() adds this grouped feature (referred to as class) to the data frame, which changes the behavior of the subsequent verbs applied to the data.\n\n3.5.2 summarize()\n\nThe most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group. In dplyr, this operation is performed by summarize()3, as shown by the following example, which computes the average departure delay by month:\n\nflights |> \n group_by(month) |> \n summarize(\n avg_delay = mean(dep_delay)\n )\n#> # A tibble: 12 × 2\n#> month avg_delay\n#> <int> <dbl>\n#> 1 1 NA\n#> 2 2 NA\n#> 3 3 NA\n#> 4 4 NA\n#> 5 5 NA\n#> 6 6 NA\n#> # ℹ 6 more rows\n\nUhoh! Something has gone wrong and all of our results are NAs (pronounced “N-A”), R’s symbol for missing values. This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an NA result. We’ll come back to discuss missing values in detail in Capítulo 18, but for now we’ll tell the mean() function to ignore all missing values by setting the argument na.rm to TRUE:\n\nflights |> \n group_by(month) |> \n summarize(\n delay = mean(dep_delay, na.rm = TRUE)\n )\n#> # A tibble: 12 × 2\n#> month delay\n#> <int> <dbl>\n#> 1 1 10.0\n#> 2 2 10.8\n#> 3 3 13.2\n#> 4 4 13.9\n#> 5 5 13.0\n#> 6 6 20.8\n#> # ℹ 6 more rows\n\nYou can create any number of summaries in a single call to summarize(). 
You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is n(), which returns the number of rows in each group:\n\nflights |> \n group_by(month) |> \n summarize(\n delay = mean(dep_delay, na.rm = TRUE), \n n = n()\n )\n#> # A tibble: 12 × 3\n#> month delay n\n#> <int> <dbl> <int>\n#> 1 1 10.0 27004\n#> 2 2 10.8 24951\n#> 3 3 13.2 28834\n#> 4 4 13.9 28330\n#> 5 5 13.0 28796\n#> 6 6 20.8 28243\n#> # ℹ 6 more rows\n\nMeans and counts can get you a surprisingly long way in data science!\n\n3.5.3 The slice_ functions\nThere are five handy functions that allow you to extract specific rows within each group:\n\n\ndf |> slice_head(n = 1) takes the first row from each group.\n\ndf |> slice_tail(n = 1) takes the last row in each group.\n\ndf |> slice_min(x, n = 1) takes the row with the smallest value of column x.\n\ndf |> slice_max(x, n = 1) takes the row with the largest value of column x.\n\ndf |> slice_sample(n = 1) takes one random row.\n\nYou can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group. For example, the following code finds the flights that are most delayed upon arrival at each destination:\n\nflights |> \n group_by(dest) |> \n slice_max(arr_delay, n = 1) |>\n relocate(dest)\n#> # A tibble: 108 × 19\n#> # Groups: dest [105]\n#> dest year month day dep_time sched_dep_time dep_delay arr_time\n#> <chr> <int> <int> <int> <int> <int> <dbl> <int>\n#> 1 ABQ 2013 7 22 2145 2007 98 132\n#> 2 ACK 2013 7 23 1139 800 219 1250\n#> 3 ALB 2013 1 25 123 2000 323 229\n#> 4 ANC 2013 8 17 1740 1625 75 2042\n#> 5 ATL 2013 7 22 2257 759 898 121\n#> 6 AUS 2013 7 10 2056 1505 351 2347\n#> # ℹ 102 more rows\n#> # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, …\n\nNote that there are 105 destinations but we get 108 rows here. What’s up? slice_min() and slice_max() keep tied values so n = 1 means give us all rows with the highest value. If you want exactly one row per group you can set with_ties = FALSE.\nThis is similar to computing the max delay with summarize(), but you get the whole corresponding row (or rows if there’s a tie) instead of the single summary statistic.\n\n3.5.4 Grouping by multiple variables\nYou can create groups using more than one variable. For example, we could make a group for each date.\n\ndaily <- flights |> \n group_by(year, month, day)\ndaily\n#> # A tibble: 336,776 × 19\n#> # Groups: year, month, day [365]\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nWhen you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:\n\ndaily_flights <- daily |> \n summarize(n = n())\n#> `summarise()` has grouped output by 'year', 'month'. 
You can override using\n#> the `.groups` argument.\n\nIf you’re happy with this behavior, you can explicitly request it in order to suppress the message:\n\ndaily_flights <- daily |> \n summarize(\n n = n(), \n .groups = \"drop_last\"\n )\n\nAlternatively, change the default behavior by setting a different value, e.g., \"drop\" to drop all grouping or \"keep\" to preserve the same groups.\n\n3.5.5 Ungrouping\nYou might also want to remove grouping from a data frame without using summarize(). You can do this with ungroup().\n\ndaily |> \n ungroup()\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nNow let’s see what happens when you summarize an ungrouped data frame.\n\ndaily |> \n ungroup() |>\n summarize(\n avg_delay = mean(dep_delay, na.rm = TRUE), \n flights = n()\n )\n#> # A tibble: 1 × 2\n#> avg_delay flights\n#> <dbl> <int>\n#> 1 12.6 336776\n\nYou get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.\n\n3.5.6 .by\n\ndplyr 1.1.0 includes a new, experimental, syntax for per-operation grouping, the .by argument. group_by() and ungroup() aren’t going away, but you can now also use the .by argument to group within a single operation:\n\nflights |> \n summarize(\n delay = mean(dep_delay, na.rm = TRUE), \n n = n(),\n .by = month\n )\n\nOr if you want to group by multiple variables:\n\nflights |> \n summarize(\n delay = mean(dep_delay, na.rm = TRUE), \n n = n(),\n .by = c(origin, dest)\n )\n\n.by works with all verbs and has the advantage that you don’t need to use the .groups argument to suppress the grouping message or ungroup() when you’re done.\nWe didn’t focus on this syntax in this chapter because it was very new when we wrote the book. We did want to mention it because we think it has a lot of promise and it’s likely to be quite popular. You can learn more about it in the dplyr 1.1.0 blog post.\n\n3.5.7 Exercises\n\nWhich carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))\nFind the flights that are most delayed upon departure from each destination.\nHow do delays vary over the course of the day? Illustrate your answer with a plot.\nWhat happens if you supply a negative n to slice_min() and friends?\nExplain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?\n\nSuppose we have the following tiny data frame:\n\ndf <- tibble(\n x = 1:5,\n y = c(\"a\", \"b\", \"a\", \"a\", \"b\"),\n z = c(\"K\", \"K\", \"L\", \"L\", \"K\")\n)\n\n\n\nWrite down what you think the output will look like, then check if you were correct, and describe what group_by() does.\n\ndf |>\n group_by(y)\n\n\n\nWrite down what you think the output will look like, then check if you were correct, and describe what arrange() does. 
Also comment on how it’s different from the group_by() in part (a).\n\ndf |>\n arrange(y)\n\n\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does.\n\ndf |>\n group_by(y) |>\n summarize(mean_x = mean(x))\n\n\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\n\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x), .groups = \"drop\")\n\n\n\nWrite down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\ndf |>\n group_by(y, z) |>\n mutate(mean_x = mean(x))" }, { "objectID": "data-transform.html#sec-sample-size", "href": "data-transform.html#sec-sample-size", "title": "3  Data transformation", "section": "\n3.6 Case study: aggregates and sample size", "text": "3.6 Case study: aggregates and sample size\nWhenever you do any aggregation, it’s always a good idea to include a count (n()). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. We’ll demonstrate this with some baseball data from the Lahman package. Specifically, we will compare what proportion of times a player gets a hit (H) vs. the number of times they try to put the ball in play (AB):\n\nbatters <- Lahman::Batting |> \n group_by(playerID) |> \n summarize(\n performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),\n n = sum(AB, na.rm = TRUE)\n )\nbatters\n#> # A tibble: 20,469 × 3\n#> playerID performance n\n#> <chr> <dbl> <int>\n#> 1 aardsda01 0 4\n#> 2 aaronha01 0.305 12364\n#> 3 aaronto01 0.229 944\n#> 4 aasedo01 0 5\n#> 5 abadan01 0.0952 21\n#> 6 abadfe01 0.111 9\n#> # ℹ 20,463 more rows\n\nWhen we plot the skill of the batter (measured by the batting average, performance) against the number of opportunities to hit the ball (measured by times at bat, n), you see two patterns:\n\nThe variation in performance is larger among players with fewer at-bats. The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you’ll see that the variation decreases as the sample size increases4.\nThere’s a positive correlation between skill (performance) and opportunities to hit the ball (n) because teams want to give their best batters the most opportunities to hit the ball.\n\n\nbatters |> \n filter(n > 100) |> \n ggplot(aes(x = n, y = performance)) +\n geom_point(alpha = 1 / 10) + \n geom_smooth(se = FALSE)\n\n\n\n\nNote the handy pattern for combining ggplot2 and dplyr. You just have to remember to switch from |>, for dataset processing, to + for adding layers to your plot.\nThis also has important implications for ranking. 
If you naively sort on desc(performance), the people with the best batting averages are clearly the ones who tried to put the ball in play very few times and happened to get a hit; they’re not necessarily the most skilled players:\n\nbatters |> \n arrange(desc(performance))\n#> # A tibble: 20,469 × 3\n#> playerID performance n\n#> <chr> <dbl> <int>\n#> 1 abramge01 1 1\n#> 2 alberan01 1 1\n#> 3 banisje01 1 1\n#> 4 bartocl01 1 1\n#> 5 bassdo01 1 1\n#> 6 birasst01 1 2\n#> # ℹ 20,463 more rows\n\nYou can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html." }, { "objectID": "data-transform.html#summary", "href": "data-transform.html#summary", "title": "3  Data transformation", "section": "\n3.7 Summary", "text": "3.7 Summary\nIn this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like filter() and arrange()), those that manipulate the columns (like select() and mutate()), and those that manipulate groups (like group_by() and summarize()). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with the individual variable. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.\nIn the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code well organized in order to make it easy for you and others to read and understand your code." }, { "objectID": "data-transform.html#footnotes", "href": "data-transform.html#footnotes", "title": "3  Data transformation", "section": "", "text": "Later, you’ll learn about the slice_*() family which allows you to choose rows based on their positions.↩︎\nRemember that in RStudio, the easiest way to see a dataset with many columns is View().↩︎\nOr summarise(), if you prefer British English.↩︎\n*cough* the law of large numbers *cough*.↩︎" }, { "objectID": "workflow-style.html#names", "href": "workflow-style.html#names", "title": "4  Workflow: code style", "section": "\n4.1 Names", "text": "4.1 Names\nWe talked briefly about names in Seção 2.3. Remember that variable names (those created by <- and those created by mutate()) should use only lowercase letters, numbers, and _. Use _ to separate words within a name.\n\n# Strive for:\nshort_flights <- flights |> filter(air_time < 60)\n\n# Avoid:\nSHORTFLIGHTS <- flights |> filter(air_time < 60)\n\nAs a general rule of thumb, it’s better to prefer long, descriptive names that are easy to understand rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but it can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.\nIf you have a bunch of names for related things, do your best to be consistent. It’s easy for inconsistencies to arise when you forget a previous convention, so don’t feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme, you’re better off giving them a common prefix rather than a common suffix because autocomplete works best on the start of a variable."
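A brief sketch of the common-prefix advice above (an editor's illustration using flights from nycflights13 with dplyr loaded; the object names are invented): typing the shared prefix delay_ lets autocomplete surface every related summary at once.

# Strive for: a shared prefix for objects that vary on a theme
delay_by_month <- flights |>
  group_by(month) |>
  summarize(delay = mean(dep_delay, na.rm = TRUE))

delay_by_carrier <- flights |>
  group_by(carrier) |>
  summarize(delay = mean(dep_delay, na.rm = TRUE))

# Avoid: shared suffixes (month_delay, carrier_delay), which autocomplete
# cannot exploit because matching starts at the front of the name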
+ }, { "objectID": "workflow-style.html#spaces", "href": "workflow-style.html#spaces", "title": "4  Workflow: code style", "section": "\n4.2 Spaces", "text": "4.2 Spaces\nPut spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, <, …), and around the assignment operator (<-).\n\n# Strive for\nz <- (a + b)^2 / d\n\n# Avoid\nz<-( a + b ) ^ 2/d\n\nDon’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in standard English.\n\n# Strive for\nmean(x, na.rm = TRUE)\n\n# Avoid\nmean (x ,na.rm=TRUE)\n\nIt’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in mutate(), you might want to add spaces so that all the = line up.1 This makes it easier to skim the code.\n\nflights |> \n mutate(\n speed = distance / air_time,\n dep_hour = dep_time %/% 100,\n dep_minute = dep_time %% 100\n )" }, { "objectID": "workflow-style.html#sec-pipes", "href": "workflow-style.html#sec-pipes", "title": "4  Workflow: code style", "section": "\n4.3 Pipes", "text": "4.3 Pipes\n|> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 10,000 ft view by skimming the verbs on the left-hand side.\n\n# Strive for \nflights |> \n filter(!is.na(arr_delay), !is.na(tailnum)) |> \n count(dest)\n\n# Avoid\nflights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)\n\nIf the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.\n\n# Strive for\nflights |> \n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE),\n n = n()\n )\n\n# Avoid\nflights |>\n group_by(\n tailnum\n ) |> \n summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())\n\nAfter the first step of the pipeline, indent each line by two spaces. RStudio will automatically put the spaces in for you after a line break following a |> . If you’re putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.\n\n# Strive for \nflights |> \n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE),\n n = n()\n )\n\n# Avoid\nflights|>\n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE), \n n = n()\n )\n\n# Avoid\nflights|>\n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE), \n n = n()\n )\n\nIt’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.\n\n# This fits compactly on one line\ndf |> mutate(y = x + 1)\n\n# While this takes up 4x as many lines, it's easily extended to \n# more variables and more steps in the future\ndf |> \n mutate(\n y = x + 1\n )\n\nFinally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into what’s happening and make it easier to check that intermediate results are as expected. 
Whenever you can give something an informative name, you should give it an informative name, for example when you fundamentally change the structure of the data, e.g., after pivoting or summarizing. Don’t expect to get it right the first time! This means breaking up long pipelines if there are intermediate states that can get good names." + }, + { + "objectID": "workflow-style.html#ggplot2", + "href": "workflow-style.html#ggplot2", + "title": "4  Workflow: code style", + "section": "\n4.4 ggplot2", + "text": "4.4 ggplot2\nThe same basic rules that apply to the pipe also apply to ggplot2; just treat + the same way as |>.\n\nflights |> \n group_by(month) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE)\n ) |> \n ggplot(aes(x = month, y = delay)) +\n geom_point() + \n geom_line()\n\nAgain, if you can’t fit all of the arguments to a function on to a single line, put each argument on its own line:\n\nflights |> \n group_by(dest) |> \n summarize(\n distance = mean(distance),\n speed = mean(distance / air_time, na.rm = TRUE)\n ) |> \n ggplot(aes(x = distance, y = speed)) +\n geom_smooth(\n method = \"loess\",\n span = 0.5,\n se = FALSE, \n color = \"white\", \n linewidth = 4\n ) +\n geom_point()\n\nWatch for the transition from |> to +. We wish this transition wasn’t necessary, but unfortunately, ggplot2 was written before the pipe was discovered." + }, + { + "objectID": "workflow-style.html#sectioning-comments", + "href": "workflow-style.html#sectioning-comments", + "title": "4  Workflow: code style", + "section": "\n4.5 Sectioning comments", + "text": "4.5 Sectioning comments\nAs your scripts get longer, you can use sectioning comments to break up your file into manageable pieces:\n\n# Load data --------------------------------------\n\n# Plot data --------------------------------------\n\nRStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in Figura 4.2.\n\n\n\n\nFigura 4.2: After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor." + }, + { + "objectID": "workflow-style.html#exercises", + "href": "workflow-style.html#exercises", + "title": "4  Workflow: code style", + "section": "\n4.6 Exercises", + "text": "4.6 Exercises\n\n\nRestyle the following pipelines following the guidelines above.\n\nflights|>filter(dest==\"IAH\")|>group_by(year,month,day)|>summarize(n=n(),\ndelay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)\n\nflights|>filter(carrier==\"UA\",dest%in%c(\"IAH\",\"HOU\"),sched_dep_time>\n0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(\narr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)" + }, + { + "objectID": "workflow-style.html#summary", + "href": "workflow-style.html#summary", + "title": "4  Workflow: code style", + "section": "\n4.7 Summary", + "text": "4.7 Summary\nIn this chapter, you’ve learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.\nIn the next chapter, we switch back to data science tools, learning about tidy data. 
Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data." + }, + { + "objectID": "workflow-style.html#footnotes", + "href": "workflow-style.html#footnotes", + "title": "4  Workflow: code style", + "section": "", + "text": "Since dep_time is in HMM or HHMM format, we use integer division (%/%) to get hour and remainder (also known as modulo, %%) to get minute.↩︎" + }, + { + "objectID": "data-tidy.html#introduction", + "href": "data-tidy.html#introduction", + "title": "5  Data tidying", + "section": "\n5.1 Introduction", + "text": "5.1 Introduction\n\n“Happy families are all alike; every unhappy family is unhappy in its own way.”\n— Leo Tolstoy\n\n\n“Tidy datasets are all alike, but every messy dataset is messy in its own way.”\n— Hadley Wickham\n\nIn this chapter, you will learn a consistent way to organize your data in R using a system called tidy data. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.\nIn this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values.\n\n5.1.1 Prerequisites\nIn this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.\n\nlibrary(tidyverse)\n\nFrom this chapter on, we’ll suppress the loading message from library(tidyverse)." + }, + { + "objectID": "data-tidy.html#sec-tidy-data", + "href": "data-tidy.html#sec-tidy-data", + "title": "5  Data tidying", + "section": "\n5.2 Tidy data", + "text": "5.2 Tidy data\nYou can represent the same underlying data in multiple ways. The example below shows the same data organized in three different ways. 
Each dataset shows the same values of four variables: country, year, population, and number of documented cases of TB (tuberculosis), but each dataset organizes the values in a different way.\n\ntable1\n#> # A tibble: 6 × 4\n#> country year cases population\n#> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan 1999 745 19987071\n#> 2 Afghanistan 2000 2666 20595360\n#> 3 Brazil 1999 37737 172006362\n#> 4 Brazil 2000 80488 174504898\n#> 5 China 1999 212258 1272915272\n#> 6 China 2000 213766 1280428583\n\ntable2\n#> # A tibble: 12 × 4\n#> country year type count\n#> <chr> <dbl> <chr> <dbl>\n#> 1 Afghanistan 1999 cases 745\n#> 2 Afghanistan 1999 population 19987071\n#> 3 Afghanistan 2000 cases 2666\n#> 4 Afghanistan 2000 population 20595360\n#> 5 Brazil 1999 cases 37737\n#> 6 Brazil 1999 population 172006362\n#> # ℹ 6 more rows\n\ntable3\n#> # A tibble: 6 × 3\n#> country year rate \n#> <chr> <dbl> <chr> \n#> 1 Afghanistan 1999 745/19987071 \n#> 2 Afghanistan 2000 2666/20595360 \n#> 3 Brazil 1999 37737/172006362 \n#> 4 Brazil 2000 80488/174504898 \n#> 5 China 1999 212258/1272915272\n#> 6 China 2000 213766/1280428583\n\nThese are all representations of the same underlying data, but they are not equally easy to use. One of them, table1, will be much easier to work with inside the tidyverse because it’s tidy.\nThere are three interrelated rules that make a dataset tidy:\n\nEach variable is a column; each column is a variable.\nEach observation is a row; each row is an observation.\nEach value is a cell; each cell is a single value.\n\nFigura 5.1 shows the rules visually.\n\n\n\n\nFigura 5.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.\n\n\n\nWhy ensure that your data is tidy? There are two main advantages:\n\nThere’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.\nThere’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in Seção 3.3.1 and Seção 3.5.2, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.\n\ndplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.\n\n# Compute rate per 10,000\ntable1 |>\n mutate(rate = cases / population * 10000)\n#> # A tibble: 6 × 5\n#> country year cases population rate\n#> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 Afghanistan 1999 745 19987071 0.373\n#> 2 Afghanistan 2000 2666 20595360 1.29 \n#> 3 Brazil 1999 37737 172006362 2.19 \n#> 4 Brazil 2000 80488 174504898 4.61 \n#> 5 China 1999 212258 1272915272 1.67 \n#> 6 China 2000 213766 1280428583 1.67\n\n# Compute total cases per year\ntable1 |> \n group_by(year) |> \n summarize(total_cases = sum(cases))\n#> # A tibble: 2 × 2\n#> year total_cases\n#> <dbl> <dbl>\n#> 1 1999 250740\n#> 2 2000 296920\n\n# Visualize changes over time\nggplot(table1, aes(x = year, y = cases)) +\n geom_line(aes(group = country), color = \"grey50\") +\n geom_point(aes(color = country, shape = country)) +\n scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000\n\n\n\n\n\n5.2.1 Exercises\n\nFor each of the sample tables, describe what each observation and each column represents.\n\nSketch out the process you’d use to calculate the rate for table2 and table3. 
You will need to perform four operations:\n\nExtract the number of TB cases per country per year.\nExtract the matching population per country per year.\nDivide cases by population, and multiply by 10000.\nStore back in the appropriate place.\n\nYou haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need." + }, + { + "objectID": "data-tidy.html#sec-pivoting", + "href": "data-tidy.html#sec-pivoting", + "title": "5  Data tidying", + "section": "\n5.3 Lengthening data", + "text": "5.3 Lengthening data\nThe principles of tidy data might seem so obvious that you wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most real data is untidy. There are two main reasons:\n\nData is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.\nMost people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.\n\nThis means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll pivot your data into a tidy form, with variables in the columns and observations in the rows.\ntidyr provides two functions for pivoting data: pivot_longer() and pivot_wider(). We’ll first start with pivot_longer() because it’s the most common case. Let’s dive into some examples.\n\n5.3.1 Data in column names\nThe billboard dataset records the billboard rank of songs in the year 2000:\n\nbillboard\n#> # A tibble: 317 × 79\n#> artist track date.entered wk1 wk2 wk3 wk4 wk5\n#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 2 Pac Baby Don't Cry (Ke… 2000-02-26 87 82 72 77 87\n#> 2 2Ge+her The Hardest Part O… 2000-09-02 91 87 92 NA NA\n#> 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66\n#> 4 3 Doors Down Loser 2000-10-21 76 76 72 69 67\n#> 5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17\n#> 6 98^0 Give Me Just One N… 2000-08-19 51 39 34 26 26\n#> # ℹ 311 more rows\n#> # ℹ 71 more variables: wk6 <dbl>, wk7 <dbl>, wk8 <dbl>, wk9 <dbl>, …\n\nIn this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week1. Here, the column names are one variable (the week) and the cell values are another (the rank).\nTo tidy this data, we’ll use pivot_longer():\n\nbillboard |> \n pivot_longer(\n cols = starts_with(\"wk\"), \n names_to = \"week\", \n values_to = \"rank\"\n )\n#> # A tibble: 24,092 × 5\n#> artist track date.entered week rank\n#> <chr> <chr> <date> <chr> <dbl>\n#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87\n#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82\n#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72\n#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77\n#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87\n#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94\n#> 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99\n#> 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA\n#> 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA\n#> 10 2 Pac Baby Don't Cry (Keep... 
2000-02-26 wk10 NA\n#> # ℹ 24,082 more rows\n\nAfter the data, there are three key arguments:\n\n\ncols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select() so here we could use !c(artist, track, date.entered) or starts_with(\"wk\").\n\nnames_to names the variable stored in the column names; here we named that variable week.\n\nvalues_to names the variable stored in the cell values; here we named that variable rank.\n\nNote that in the code \"week\" and \"rank\" are quoted because those are new variables we’re creating; they don’t yet exist in the data when we run the pivot_longer() call.\nNow let’s turn our attention to the resulting, longer data frame. What happens if a song is in the top 100 for less than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These NAs don’t really represent unknown observations; they were forced to exist by the structure of the dataset2, so we can ask pivot_longer() to get rid of them by setting values_drop_na = TRUE:\n\nbillboard |> \n pivot_longer(\n cols = starts_with(\"wk\"), \n names_to = \"week\", \n values_to = \"rank\",\n values_drop_na = TRUE\n )\n#> # A tibble: 5,307 × 5\n#> artist track date.entered week rank\n#> <chr> <chr> <date> <chr> <dbl>\n#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87\n#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82\n#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72\n#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77\n#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87\n#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94\n#> # ℹ 5,301 more rows\n\nThe number of rows is now much lower, indicating that many rows with NAs were dropped.\nYou might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns wk77, wk78, … would be added to the dataset.\nThis data is now tidy, but we could make future computation a bit easier by converting values of week from character strings to numbers using mutate() and readr::parse_number(). parse_number() is a handy function that will extract the first number from a string, ignoring all other text.\n\nbillboard_longer <- billboard |> \n pivot_longer(\n cols = starts_with(\"wk\"), \n names_to = \"week\", \n values_to = \"rank\",\n values_drop_na = TRUE\n ) |> \n mutate(\n week = parse_number(week)\n )\nbillboard_longer\n#> # A tibble: 5,307 × 5\n#> artist track date.entered week rank\n#> <chr> <chr> <date> <dbl> <dbl>\n#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87\n#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82\n#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72\n#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77\n#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87\n#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94\n#> # ℹ 5,301 more rows\n\nNow that we have all the week numbers in one variable and all the rank values in another, we’re in a good position to visualize how song ranks vary over time. The code is shown below and the result is in Figura 5.2. 
We can see that very few songs stay in the top 100 for more than 20 weeks.\n\nbillboard_longer |> \n ggplot(aes(x = week, y = rank, group = track)) + \n geom_line(alpha = 0.25) + \n scale_y_reverse()\n\n\n\nFigura 5.2: A line plot showing how the rank of a song changes over time.\n\n\n\n\n5.3.2 How does pivoting work?\nNow that you’ve seen how we can use pivoting to reshape our data, let’s take a little time to gain some intuition about what pivoting does to the data. Let’s start with a very simple dataset to make it easier to see what’s happening. Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient. We’ll create the data with tribble(), a handy function for constructing small tibbles by hand:\n\ndf <- tribble(\n ~id, ~bp1, ~bp2,\n \"A\", 100, 120,\n \"B\", 140, 115,\n \"C\", 120, 125\n)\n\nWe want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we need to pivot df longer:\n\ndf |> \n pivot_longer(\n cols = bp1:bp2,\n names_to = \"measurement\",\n values_to = \"value\"\n )\n#> # A tibble: 6 × 3\n#> id measurement value\n#> <chr> <chr> <dbl>\n#> 1 A bp1 100\n#> 2 A bp2 120\n#> 3 B bp1 140\n#> 4 B bp2 115\n#> 5 C bp1 120\n#> 6 C bp2 125\n\nHow does the reshaping work? It’s easier to see if we think about it column by column. As shown in Figura 5.3, the values in a column that was already a variable in the original dataset (id) need to be repeated, once for each column that is pivoted.\n\n\n\n\nFigura 5.3: Columns that are already variables need to be repeated, once for each column that is pivoted.\n\n\n\nThe column names become values in a new variable, whose name is defined by names_to, as shown in Figura 5.4. They need to be repeated once for each row in the original dataset.\n\n\n\n\nFigura 5.4: The column names of pivoted columns become values in a new column. The values need to be repeated once for each row of the original dataset.\n\n\n\nThe cell values also become values in a new variable, with a name defined by values_to. They are unwound row by row. Figura 5.5 illustrates the process.\n\n\n\n\nFigura 5.5: The number of values is preserved (not repeated), but unwound row-by-row.\n\n\n\n\n5.3.3 Many variables in column names\nA more challenging situation occurs when you have multiple pieces of information crammed into the column names, and you would like to store these in separate new variables. For example, take the who2 dataset, the source of table1 and friends that you saw above:\n\nwho2\n#> # A tibble: 7,240 × 58\n#> country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554\n#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 Afghanistan 1980 NA NA NA NA NA\n#> 2 Afghanistan 1981 NA NA NA NA NA\n#> 3 Afghanistan 1982 NA NA NA NA NA\n#> 4 Afghanistan 1983 NA NA NA NA NA\n#> 5 Afghanistan 1984 NA NA NA NA NA\n#> 6 Afghanistan 1985 NA NA NA NA NA\n#> # ℹ 7,234 more rows\n#> # ℹ 51 more variables: sp_m_5564 <dbl>, sp_m_65 <dbl>, sp_f_014 <dbl>, …\n\nThis dataset, collected by the World Health Organisation, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by _. 
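For example, splitting one of these names on the underscores shows the three pieces (a quick check using base R’s strsplit(); this snippet is an illustration of ours, not part of the original analysis):\n\nstrsplit(\"sp_m_014\", \"_\")[[1]]\n#> [1] \"sp\" \"m\" \"014\"\n\n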
The first piece, sp/rel/ep, describes the method used for the diagnosis, the second piece, m/f, is the gender (coded as a binary variable in this dataset), and the third piece, 014/1524/2534/3544/4554/5564/65, is the age range (014 represents 0-14, for example).\nSo in this case we have six pieces of information recorded in who2: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values). To organize these six pieces of information in six separate columns, we use pivot_longer() with a vector of column names for names_to and instructions for splitting the original variable names into pieces for names_sep, as well as a column name for values_to:\n\nwho2 |> \n pivot_longer(\n cols = !(country:year),\n names_to = c(\"diagnosis\", \"gender\", \"age\"), \n names_sep = \"_\",\n values_to = \"count\"\n )\n#> # A tibble: 405,440 × 6\n#> country year diagnosis gender age count\n#> <chr> <dbl> <chr> <chr> <chr> <dbl>\n#> 1 Afghanistan 1980 sp m 014 NA\n#> 2 Afghanistan 1980 sp m 1524 NA\n#> 3 Afghanistan 1980 sp m 2534 NA\n#> 4 Afghanistan 1980 sp m 3544 NA\n#> 5 Afghanistan 1980 sp m 4554 NA\n#> 6 Afghanistan 1980 sp m 5564 NA\n#> # ℹ 405,434 more rows\n\nAn alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in Capítulo 15.\nConceptually, this is only a minor variation on the simpler case you’ve already seen. Figura 5.6 shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that’s faster.\n\n\n\n\nFigura 5.6: Pivoting columns with multiple pieces of information in the names means that each column name now fills in values in multiple output columns.\n\n\n\n\n5.3.4 Data and variable names in the column headers\nThe next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the household dataset:\n\nhousehold\n#> # A tibble: 5 × 5\n#> family dob_child1 dob_child2 name_child1 name_child2\n#> <int> <date> <date> <chr> <chr> \n#> 1 1 1998-11-26 2000-01-29 Susan Jose \n#> 2 2 1996-06-22 NA Mark <NA> \n#> 3 3 2002-07-11 2004-04-05 Sam Seth \n#> 4 4 2004-10-10 2009-08-27 Craig Khai \n#> 5 5 2000-12-05 2005-02-28 Parker Gracie\n\nThis dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (dob, name) and the values of another (child, with values 1 or 2). To solve this problem we again need to supply a vector to names_to but this time we use the special \".value\" sentinel; this isn’t the name of a variable but a unique value that tells pivot_longer() to do something different. 
This overrides the usual values_to argument to use the first component of the pivoted column name as a variable name in the output.\n\nhousehold |> \n pivot_longer(\n cols = !family, \n names_to = c(\".value\", \"child\"), \n names_sep = \"_\", \n values_drop_na = TRUE\n )\n#> # A tibble: 9 × 4\n#> family child dob name \n#> <int> <chr> <date> <chr>\n#> 1 1 child1 1998-11-26 Susan\n#> 2 1 child2 2000-01-29 Jose \n#> 3 2 child1 1996-06-22 Mark \n#> 4 3 child1 2002-07-11 Sam \n#> 5 3 child2 2004-04-05 Seth \n#> 6 4 child1 2004-10-10 Craig\n#> # ℹ 3 more rows\n\nWe again use values_drop_na = TRUE, since the shape of the input forces the creation of explicit missing variables (e.g., for families with only one child).\nFigura 5.7 illustrates the basic idea with a simpler example. When you use \".value\" in names_to, the column names in the input contribute to both values and variable names in the output.\n\n\n\n\nFigura 5.7: Pivoting with names_to = c(\".value\", \"num\") splits the column names into two components: the first part determines the output column name (x or y), and the second part determines the value of the num column." + }, + { + "objectID": "data-tidy.html#widening-data", + "href": "data-tidy.html#widening-data", + "title": "5  Data tidying", + "section": "\n5.4 Widening data", + "text": "5.4 Widening data\nSo far we’ve used pivot_longer() to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to pivot_wider(), which makes datasets wider by increasing columns and reducing rows, and helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.\nWe’ll start by looking at cms_patient_experience, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:\n\ncms_patient_experience\n#> # A tibble: 500 × 5\n#> org_pac_id org_nm measure_cd measure_title prf_rate\n#> <chr> <chr> <chr> <chr> <dbl>\n#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS… 63\n#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS… 87\n#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS… 86\n#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS… 57\n#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS… 85\n#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS… 24\n#> # ℹ 494 more rows\n\nThe core unit being studied is an organization, but each organization is spread across six rows, with one row for each measurement taken in the survey. We can see the complete set of values for measure_cd and measure_title by using distinct():\n\ncms_patient_experience |> \n distinct(measure_cd, measure_title)\n#> # A tibble: 6 × 2\n#> measure_cd measure_title \n#> <chr> <chr> \n#> 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…\n#> 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate \n#> 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider \n#> 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education \n#> 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff \n#> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources\n\nNeither of these columns will make particularly great variable names: measure_cd doesn’t hint at the meaning of the variable and measure_title is a long sentence containing spaces. 
We’ll use measure_cd as the source for our new column names for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.\npivot_wider() has the opposite interface to pivot_longer(): instead of choosing new column names, we need to provide the existing columns that define the values (values_from) and the column name (names_from):\n\ncms_patient_experience |> \n pivot_wider(\n names_from = measure_cd,\n values_from = prf_rate\n )\n#> # A tibble: 500 × 9\n#> org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2\n#> <chr> <chr> <chr> <dbl> <dbl>\n#> 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… 63 NA\n#> 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA 87\n#> 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> # ℹ 494 more rows\n#> # ℹ 4 more variables: CAHPS_GRP_3 <dbl>, CAHPS_GRP_5 <dbl>, …\n\nThe output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because we also need to tell pivot_wider() which column or columns have values that uniquely identify each row; in this case those are the variables starting with \"org\":\n\ncms_patient_experience |> \n pivot_wider(\n id_cols = starts_with(\"org\"),\n names_from = measure_cd,\n values_from = prf_rate\n )\n#> # A tibble: 95 × 8\n#> org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 0446157747 USC CARE MEDICA… 63 87 86 57\n#> 2 0446162697 ASSOCIATION OF … 59 85 83 63\n#> 3 0547164295 BEAVER MEDICAL … 49 NA 75 44\n#> 4 0749333730 CAPE PHYSICIANS… 67 84 85 65\n#> 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64\n#> 6 0840109864 REX HOSPITAL INC 73 87 84 67\n#> # ℹ 89 more rows\n#> # ℹ 2 more variables: CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl>\n\nThis gives us the output that we’re looking for.\n\n5.4.1 How does pivot_wider() work?\nTo understand how pivot_wider() works, let’s again start with a very simple dataset. This time we have two patients with ids A and B, with three blood pressure measurements on patient A and two on patient B:\n\ndf <- tribble(\n ~id, ~measurement, ~value,\n \"A\", \"bp1\", 100,\n \"B\", \"bp1\", 140,\n \"B\", \"bp2\", 115, \n \"A\", \"bp2\", 120,\n \"A\", \"bp3\", 105\n)\n\nWe’ll take the values from the value column and the names from the measurement column:\n\ndf |> \n pivot_wider(\n names_from = measurement,\n values_from = value\n )\n#> # A tibble: 2 × 4\n#> id bp1 bp2 bp3\n#> <chr> <dbl> <dbl> <dbl>\n#> 1 A 100 120 105\n#> 2 B 140 115 NA\n\nTo begin the process, pivot_wider() first needs to figure out what will go in the rows and columns. The new column names will be the unique values of measurement.\n\ndf |> \n distinct(measurement) |> \n pull()\n#> [1] \"bp1\" \"bp2\" \"bp3\"\n\nBy default, the rows in the output are determined by all the variables that aren’t going into the new names or values. These are called the id_cols. 
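pivot_wider() infers them for us, but we could also name them explicitly; this sketch (ours, not from the original text) gives the same result as the call above:\n\ndf |> \n pivot_wider(\n id_cols = id,\n names_from = measurement,\n values_from = value\n )\n\n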
Here there is only one column, but in general there can be any number.\n\ndf |> \n select(-measurement, -value) |> \n distinct()\n#> # A tibble: 2 × 1\n#> id \n#> <chr>\n#> 1 A \n#> 2 B\n\npivot_wider() then combines these results to generate an empty data frame:\n\ndf |> \n select(-measurement, -value) |> \n distinct() |> \n mutate(bp1 = NA, bp2 = NA, bp3 = NA)\n#> # A tibble: 2 × 4\n#> id bp1 bp2 bp3 \n#> <chr> <lgl> <lgl> <lgl>\n#> 1 A NA NA NA \n#> 2 B NA NA NA\n\nIt then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no third blood pressure measurement for patient B, so that cell remains missing. We’ll come back to this idea that pivot_wider() can “make” missing values in Capítulo 18.\nYou might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and measurement “bp1”:\n\ndf <- tribble(\n ~id, ~measurement, ~value,\n \"A\", \"bp1\", 100,\n \"A\", \"bp1\", 102,\n \"A\", \"bp2\", 120,\n \"B\", \"bp1\", 140, \n \"B\", \"bp2\", 115\n)\n\nIf we attempt to pivot this we get an output that contains list-columns, which you’ll learn more about in Capítulo 23:\n\ndf |>\n pivot_wider(\n names_from = measurement,\n values_from = value\n )\n#> Warning: Values from `value` are not uniquely identified; output will contain\n#> list-cols.\n#> • Use `values_fn = list` to suppress this warning.\n#> • Use `values_fn = {summary_fun}` to summarise duplicates.\n#> • Use the following dplyr code to identify duplicates.\n#> {data} %>%\n#> dplyr::group_by(id, measurement) %>%\n#> dplyr::summarise(n = dplyr::n(), .groups = \"drop\") %>%\n#> dplyr::filter(n > 1L)\n#> # A tibble: 2 × 3\n#> id bp1 bp2 \n#> <chr> <list> <list> \n#> 1 A <dbl [2]> <dbl [1]>\n#> 2 B <dbl [1]> <dbl [1]>\n\nSince you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:\n\ndf |> \n group_by(id, measurement) |> \n summarize(n = n(), .groups = \"drop\") |> \n filter(n > 1)\n#> # A tibble: 1 × 3\n#> id measurement n\n#> <chr> <chr> <int>\n#> 1 A bp1 2\n\nIt’s then up to you to figure out what’s gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row." + }, + { + "objectID": "data-tidy.html#summary", + "href": "data-tidy.html#summary", + "title": "5  Data tidying", + "section": "\n5.5 Summary", + "text": "5.5 Summary\nIn this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions. The main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about pivot_longer() and pivot_wider(), which allow you to tidy up many untidy datasets. The examples we presented here are a selection of those from vignette(\"pivot\", package = \"tidyr\"), so if you encounter a problem that this chapter doesn’t help you with, that vignette is a good place to try next.\nAnother challenge is that, for a given dataset, it can be impossible to label the longer or the wider version as the “tidy” one. 
This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn’t actually define what a variable is (and it’s surprisingly hard to do so). It’s totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest. So if you’re stuck figuring out how to do some computation, consider switching up the organisation of your data; don’t be afraid to untidy, transform, and re-tidy as needed!\nIf you enjoyed this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the Tidy Data paper published in the Journal of Statistical Software.\nNow that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier." + }, + { + "objectID": "data-tidy.html#footnotes", + "href": "data-tidy.html#footnotes", + "title": "5  Data tidying", + "section": "", + "text": "The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 76 weeks after it appears.↩︎\nWe’ll come back to this idea in Capítulo 18.↩︎" + }, + { + "objectID": "workflow-scripts.html#scripts", + "href": "workflow-scripts.html#scripts", + "title": "6  Workflow: scripts and projects", + "section": "\n6.1 Scripts", + "text": "6.1 Scripts\nSo far, you have used the console to run code. That’s a great place to start, but you’ll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. To give yourself more room to work, use the script editor. Open it up by clicking the File menu, selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes, as in Figura 6.1. The script editor is a great place to experiment with your code. When you want to change something, you don’t have to re-type the whole thing; you can just edit the script and re-run it. And once you have written code that works and does what you want, you can save it as a script file to easily return to later.\n\n\n\n\nFigura 6.1: Opening the script editor adds a new pane at the top-left of the IDE.\n\n\n\n\n6.1.1 Running code\nThe script editor is an excellent place for building complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below.\n\nlibrary(dplyr)\nlibrary(nycflights13)\n\nnot_cancelled <- flights |> \n filter(!is.na(dep_delay)█, !is.na(arr_delay))\n\nnot_cancelled |> \n group_by(year, month, day) |> \n summarize(mean = mean(dep_delay))\n\nIf your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates not_cancelled. It will also move the cursor to the following statement (beginning with not_cancelled |>). That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.\nInstead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you’ve captured all the important parts of your code in the script.\nWe recommend you always start your script with the packages you need. 
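For example, the top of a script might look something like this (a minimal sketch; the particular packages are just an illustration):\n\n# Packages ----------------------------------------------------------------\nlibrary(tidyverse)\nlibrary(nycflights13)\n\n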
That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include install.packages() in a script you share. It’s inconsiderate to hand off a script that will change something on their computer if they’re not being careful!\nWhen working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won’t even think about it.\n\n6.1.2 RStudio diagnostics\nIn the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:\n\nHover over the cross to see what the problem is:\n\nRStudio will also let you know about potential problems:\n\n6.1.3 Saving and naming\nRStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open. Nevertheless, it’s a good idea to avoid Untitled1, Untitled2, Untitled3, and so on, and instead save your scripts with informative names.\nIt might be tempting to name your files code.R or myscript.R, but you should think a bit harder before choosing a name for your file. Three important principles for file naming are as follows:\n\nFile names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.\nFile names should be human readable: use file names to describe what’s in the file.\nFile names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.\n\nFor example, suppose you have the following files in a project folder.\nalternative model.R\ncode for exploratory analysis.r\nfinalreport.qmd\nFinalReport.qmd\nfig 1.png\nFigure_02.png\nmodel_first_try.R\nrun-first.r\ntemp.txt\nThere are a variety of problems here: it’s hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. FinalReport1), and some names don’t describe their contents (run-first and temp).\nHere’s a better way of naming and organizing the same set of files:\n01-load-data.R\n02-exploratory-analysis.R\n03-model-approach-1.R\n04-model-approach-2.R\nfig-01.png\nfig-02.png\nreport-2022-03-20.qmd\nreport-2022-04-02.qmd\nreport-draft-notes.txt\nNumbering the key scripts makes it obvious in which order to run them, and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and temp is renamed to report-draft-notes to better describe its contents. If you have a lot of files in a directory, taking organization one step further and placing different types of files (scripts, figures, etc.) in different directories is recommended." + }, + { + "objectID": "workflow-scripts.html#projects", + "href": "workflow-scripts.html#projects", + "title": "6  Workflow: scripts and projects", + "section": "\n6.2 Projects", + "text": "6.2 Projects\nOne day, you will need to quit R, go do something else, and return to your analysis later. One day, you will be working on multiple analyses simultaneously and you want to keep them separate. 
One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.\nTo handle these real life situations, you need to make two decisions:\n\nWhat is the source of truth? What will you save as your lasting record of what happened?\nWhere does your analysis live?\n\n\n6.2.1 What is the source of truth?\nAs a beginner, it’s okay to rely on your current Environment to contain all the objects you have created throughout your analysis. However, to make it easier to work on larger projects or collaborate with others, your source of truth should be the R scripts. With your R scripts (and your data files), you can recreate the environment. With only your environment, it’s much harder to recreate your R scripts: you’ll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you’ll have to carefully mine your R history.\nTo help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running usethis::use_blank_slate()2 or by mimicking the options shown in Figura 6.2. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time nor will the objects you created or the datasets you read be available to use. But this short-term pain saves you long-term agony because it forces you to capture all important procedures in your code. There’s nothing worse than discovering three months after the fact that you’ve only stored the results of an important calculation in your environment, not the calculation itself in your code.\n\n\n\n\nFigura 6.2: Copy these options in your RStudio options to always start your RStudio session with a clean slate.\n\n\n\nThere is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:\n\nPress Cmd/Ctrl + Shift + 0/F10 to restart R.\nPress Cmd/Ctrl + Shift + S to re-run the current script.\n\nWe collectively use this pattern hundreds of times a week.\nAlternatively, if you don’t use keyboard shortcuts, you can go to Session > Restart R and then highlight and re-run your current script.\n\n\n\n\n\n\nRStudio server\n\n\n\nIf you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a clean slate.\n\n\n\n6.2.2 Where does your analysis live?\nR has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:\n\n\n\n\n\nAnd you can print this out in R code by running getwd():\n\ngetwd()\n#> [1] \"/Users/hadley/Documents/r4ds\"\n\nIn this R session, the current working directory (think of it as “home”) is in hadley’s Documents folder, in a subfolder called r4ds. 
This code will return a different result when you run it, because your computer has a different directory structure than Hadley’s!\nAs a beginning R user, it’s OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you’re seven chapters into this book, and you’re no longer a beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, set R’s working directory to the associated directory.\nYou can set the working directory from within R but we do not recommend it:\n\nsetwd(\"/path/to/my/CoolProject\")\n\nThere’s a better way; a way that also puts you on the path to managing your R work like an expert. That way is the RStudio project.\n\n6.2.3 RStudio projects\nKeeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via projects. Let’s make a project for you to use while you’re working through the rest of this book. Click File > New Project, then follow the steps shown in Figura 6.3.\n\n\n\n\nFigura 6.3: To create new project: (top) first click New Directory, then (middle) click New Project, then (bottom) fill in the directory (project) name, choose a good subdirectory for its home and click Create Project.\n\n\n\nCall your project r4ds and think carefully about which subdirectory you put the project in. If you don’t store it somewhere sensible, it will be hard to find it in the future!\nOnce this process is complete, you’ll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:\n\ngetwd()\n#> [1] /Users/hadley/Documents/r4ds\n\nNow enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Then, create a new folder called “data”. You can do this by clicking on the “New Folder” button in the Files pane in RStudio. Finally, run the complete script which will save a PNG and CSV file into your project directory. Don’t worry about the details, you’ll learn them later in the book.\n\nlibrary(tidyverse)\n\nggplot(diamonds, aes(x = carat, y = price)) + \n geom_hex()\nggsave(\"diamonds.png\")\n\nwrite_csv(diamonds, \"data/diamonds.csv\")\n\nQuit RStudio. Inspect the folder associated with your project — notice the .Rproj file. Double-click that file to re-open the project. Notice you get back to where you left off: it’s the same working directory and command history, and all the files you were working on are still open. Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you’re starting with a clean slate.\nIn your favorite OS-specific way, search your computer for diamonds.png and you will find the PNG (no surprise) but also the script that created it (diamonds.R). This is a huge win! One day, you will want to remake a figure or just understand where it came from. If you rigorously save figures to files with R code and never with the mouse or the clipboard, you will be able to reproduce old work with ease!\n\n6.2.4 Relative and absolute paths\nOnce you’re inside a project, you should only ever use relative paths not absolute paths. What’s the difference? A relative path is relative to the working directory, i.e. the project’s home. When Hadley wrote data/diamonds.csv above it was a shortcut for /Users/hadley/Documents/r4ds/data/diamonds.csv. 
But importantly, if Mine ran this code on her computer, it would point to /Users/Mine/Documents/r4ds/data/diamonds.csv. This is why relative paths are important: they’ll work regardless of where the R project folder ends up.\nAbsolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g., C:) or two backslashes (e.g., \\\\servername), and on Mac/Linux they start with a slash “/” (e.g., /users/hadley). You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.\nThere’s another important difference between operating systems: how you separate the components of the path. Mac and Linux use slashes (e.g., data/diamonds.csv) and Windows uses backslashes (e.g., data\\diamonds.csv). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes." + }, + { + "objectID": "workflow-scripts.html#exercises", + "href": "workflow-scripts.html#exercises", + "title": "6  Workflow: scripts and projects", + "section": "\n6.3 Exercises", + "text": "\n6.3 Exercises\n\nGo to the RStudio Tips Twitter account, https://twitter.com/rstudiotips, and find one tip that looks interesting. Practice using it!\nWhat other common mistakes will RStudio diagnostics report? Read https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics to find out." + }, + { + "objectID": "workflow-scripts.html#summary", + "href": "workflow-scripts.html#summary", + "title": "6  Workflow: scripts and projects", + "section": "\n6.4 Summary", + "text": "\n6.4 Summary\nIn this chapter, you’ve learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you’ll learn to appreciate how a little up-front organisation can save you a bunch of time down the road.\nIn summary, scripts and projects give you a solid workflow that will serve you well in the future:\n\nCreate one RStudio project for each data analysis project.\nSave your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you’ve captured everything in your scripts.\nOnly ever use relative paths, not absolute paths.\n\nThen everything you need is in one place and cleanly separated from all the other projects that you are working on.\nSo far, we’ve worked with datasets bundled inside of R packages. This makes it easier to get some practice on pre-prepared data, but obviously your data won’t be available in this way. So in the next chapter, you’re going to learn how to load data from disk into your R session using the readr package." 
+ }, + { + "objectID": "workflow-scripts.html#footnotes", + "href": "workflow-scripts.html#footnotes", + "title": "6  Workflow: scripts and projects", + "section": "", + "text": "Not to mention that you’re tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a fun strip on this.↩︎\nIf you don’t have usethis installed, you can install it with install.packages(\"usethis\").↩︎" + }, + { + "objectID": "data-import.html#introduction", + "href": "data-import.html#introduction", + "title": "7  Data import", + "section": "\n7.1 Introduction", + "text": "7.1 Introduction\nWorking with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point. In this chapter, you’ll learn the basics of reading data files into R.\nSpecifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.\n\n7.1.1 Prerequisites\nIn this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.\n\nlibrary(tidyverse)" + }, + { + "objectID": "data-import.html#reading-data-from-a-file", + "href": "data-import.html#reading-data-from-a-file", + "title": "7  Data import", + "section": "\n7.2 Reading data from a file", + "text": "7.2 Reading data from a file\nTo begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data. The columns are separated, aka delimited, by commas.\n\nStudent ID,Full Name,favourite.food,mealPlan,AGE\n1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4\n2,Barclay Lynn,French fries,Lunch only,5\n3,Jayendra Lyne,N/A,Breakfast and lunch,7\n4,Leon Rossini,Anchovies,Lunch only,\n5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five\n6,Güvenç Attila,Ice cream,Lunch only,6\n\nTabela 7.1 shows a representation of the same data as a table.\n\n\n\n\nTabela 7.1: Data from the students.csv file as a table.\n\n\n\n\n\n\n\n\nStudent ID\nFull Name\nfavourite.food\nmealPlan\nAGE\n\n\n\n1\nSunil Huffmann\nStrawberry yoghurt\nLunch only\n4\n\n\n2\nBarclay Lynn\nFrench fries\nLunch only\n5\n\n\n3\nJayendra Lyne\nN/A\nBreakfast and lunch\n7\n\n\n4\nLeon Rossini\nAnchovies\nLunch only\nNA\n\n\n5\nChidiegwu Dunkel\nPizza\nBreakfast and lunch\nfive\n\n\n6\nGüvenç Attila\nIce cream\nLunch only\n6\n\n\n\n\n\n\nWe can read this file into R using read_csv(). The first argument is the most important: the path to the file. You can think about the path as the address of the file: the file is called students.csv and that it lives in the data folder.\n\nstudents <- read_csv(\"data/students.csv\")\n#> Rows: 6 Columns: 5\n#> ── Column specification ─────────────────────────────────────────────────────\n#> Delimiter: \",\"\n#> chr (4): Full Name, favourite.food, mealPlan, AGE\n#> dbl (1): Student ID\n#> \n#> ℹ Use `spec()` to retrieve the full column specification for this data.\n#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\nThe code above will work if you have the students.csv file in a data folder in your project. 
You can download the students.csv file from https://pos.it/r4ds-students-csv or you can read it directly from that URL with:\n\nstudents <- read_csv(\"https://pos.it/r4ds-students-csv\")\n\nWhen you run read_csv(), it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we’ll return to it in Seção 7.3.\n\n7.2.1 Practical advice\nOnce you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the students data with that in mind.\n\nstudents\n#> # A tibble: 6 × 5\n#> `Student ID` `Full Name` favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne N/A Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nIn the favourite.food column, there are a bunch of food items, and then the character string N/A, which should have been a real NA that R will recognize as “not available”. This is something we can address using the na argument. By default, read_csv() only recognizes empty strings (\"\") in this dataset as NAs; we want it to also recognize the character string \"N/A\".\n\nstudents <- read_csv(\"data/students.csv\", na = c(\"N/A\", \"\"))\n\nstudents\n#> # A tibble: 6 × 5\n#> `Student ID` `Full Name` favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nYou might also notice that the Student ID and Full Name columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names. 
To refer to these variables, you need to surround them with backticks, `:\n\nstudents |> \n rename(\n student_id = `Student ID`,\n full_name = `Full Name`\n )\n#> # A tibble: 6 × 5\n#> student_id full_name favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nAn alternative approach is to use janitor::clean_names(), which uses some heuristics to turn them all into snake case at once1.\n\nstudents |> janitor::clean_names()\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nAnother common task after reading in data is to consider variable types. For example, meal_plan is a categorical variable with a known set of possible values, which in R should be represented as a factor:\n\nstudents |>\n janitor::clean_names() |>\n mutate(meal_plan = factor(meal_plan))\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <dbl> <chr> <chr> <fct> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nNote that the values in the meal_plan variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<chr>) to factor (<fct>). You’ll learn more about factors in Capítulo 16.\nBefore you analyze these data, you’ll probably want to fix the age and id columns. Currently, age is a character variable because one of the observations is typed out as five instead of a numeric 5. We discuss the details of fixing this issue in Capítulo 20.\n\nstudents <- students |>\n janitor::clean_names() |>\n mutate(\n meal_plan = factor(meal_plan),\n age = parse_number(if_else(age == \"five\", \"5\", age))\n )\n\nstudents\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nA new function here is if_else(), which has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is FALSE. Here we’re saying if age is the character string \"five\", make it \"5\", and if not leave it as age. 
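To see the same logic outside of a pipeline, here is if_else() applied to a small made-up vector (an illustration of ours, not part of the original analysis):\n\nx <- c(\"five\", \"4\", \"6\")\nif_else(x == \"five\", \"5\", x)\n#> [1] \"5\" \"4\" \"6\"\n\n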
You will learn more about if_else() and logical vectors in Capítulo 12.\n\n7.2.2 Other arguments\nThere are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: read_csv() can read text strings that you’ve created and formatted like a CSV file:\n\nread_csv(\n \"a,b,c\n 1,2,3\n 4,5,6\"\n)\n#> # A tibble: 2 × 3\n#> a b c\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n#> 2 4 5 6\n\nUsually, read_csv() uses the first line of the data for the column names, which is a very common convention. But it’s not uncommon for a few lines of metadata to be included at the top of the file. You can use skip = n to skip the first n lines or use comment = \"#\" to drop all lines that start with (e.g.) #:\n\nread_csv(\n \"The first line of metadata\n The second line of metadata\n x,y,z\n 1,2,3\",\n skip = 2\n)\n#> # A tibble: 1 × 3\n#> x y z\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n\nread_csv(\n \"# A comment I want to skip\n x,y,z\n 1,2,3\",\n comment = \"#\"\n)\n#> # A tibble: 1 × 3\n#> x y z\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n\nIn other cases, the data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings and instead label them sequentially from X1 to Xn:\n\nread_csv(\n \"1,2,3\n 4,5,6\",\n col_names = FALSE\n)\n#> # A tibble: 2 × 3\n#> X1 X2 X3\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n#> 2 4 5 6\n\nAlternatively, you can pass col_names a character vector which will be used as the column names:\n\nread_csv(\n \"1,2,3\n 4,5,6\",\n col_names = c(\"x\", \"y\", \"z\")\n)\n#> # A tibble: 2 × 3\n#> x y z\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n#> 2 4 5 6\n\nThese arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your .csv file and read the documentation for read_csv()’s many other arguments.)\n\n7.2.3 Other file types\nOnce you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:\n\nread_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.\nread_tsv() reads tab-delimited files.\nread_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.\nread_fwf() reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().\nread_table() reads a common variation of fixed-width files where columns are separated by white space.\nread_log() reads Apache-style log files.\n\n7.2.4 Exercises\n\nWhat function would you use to read a file where fields were separated with “|”?\nApart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?\nWhat are the most important arguments to read_fwf()?\n\nSometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like \" or '. By default, read_csv() assumes that the quoting character will be \". To read the following text into a data frame, what argument to read_csv() do you need to specify?\n\n\"x,y\\n1,'a,b'\"\n\n\n\nIdentify what is wrong with each of the following inline CSV files. 
What happens when you run the code?\n\nread_csv(\"a,b\\n1,2,3\\n4,5,6\")\nread_csv(\"a,b,c\\n1,2\\n1,2,3,4\")\nread_csv(\"a,b\\n\\\"1\")\nread_csv(\"a,b\\n1,2\\na,b\")\nread_csv(\"a;b\\n1;3\")\n\n\n\nPractice referring to non-syntactic names in the following data frame by:\n\nExtracting the variable called 1.\nPlotting a scatterplot of 1 vs. 2.\nCreating a new column called 3, which is 2 divided by 1.\nRenaming the columns to one, two, and three.\n\n\nannoying <- tibble(\n `1` = 1:10,\n `2` = `1` * 2 + rnorm(length(`1`))\n)" + }, + { + "objectID": "data-import.html#sec-col-types", + "href": "data-import.html#sec-col-types", + "title": "7  Data import", + "section": "\n7.3 Controlling column types", + "text": "7.3 Controlling column types\nA CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.\n\n7.3.1 Guessing types\nreadr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,0002 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:\n\nDoes it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.\nDoes it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it’s a number.\nDoes it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in Seção 17.2).\nOtherwise, it must be a string.\n\nYou can see that behavior in action in this simple example:\n\nread_csv(\"\n logical,numeric,date,string\n TRUE,1,2021-01-15,abc\n false,4.5,2021-02-15,def\n T,Inf,2021-02-16,ghi\n\")\n#> # A tibble: 3 × 4\n#> logical numeric date string\n#> <lgl> <dbl> <date> <chr> \n#> 1 TRUE 1 2021-01-15 abc \n#> 2 FALSE 4.5 2021-02-15 def \n#> 3 TRUE Inf 2021-02-16 ghi\n\nThis heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and beautiful failures.\n\n7.3.2 Missing values, column types, and problems\nThe most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.\nTake this simple 1 column CSV file as an example:\n\nsimple_csv <- \"\n x\n 10\n .\n 20\n 30\"\n\nIf we read it without any additional arguments, x becomes a character column:\n\nread_csv(simple_csv)\n#> # A tibble: 4 × 1\n#> x \n#> <chr>\n#> 1 10 \n#> 2 . \n#> 3 20 \n#> 4 30\n\nIn this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s sprinkled among them? One approach is to tell readr that x is a numeric column, and then see where it fails. 
You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:\n\ndf <- read_csv(\n simple_csv, \n col_types = list(x = col_double())\n)\n#> Warning: One or more parsing issues, call `problems()` on your data frame for\n#> details, e.g.:\n#> dat <- vroom(...)\n#> problems(dat)\n\nNow read_csv() reports that there was a problem, and tells us we can find out more with problems():\n\nproblems(df)\n#> # A tibble: 1 × 5\n#> row col expected actual file \n#> <int> <int> <chr> <chr> <chr> \n#> 1 3 1 a double . /tmp/Rtmp7ye2gf/file228416ab4e78\n\nThis tells us that there was a problem in row 3, col 1 where readr expected a double but got a .. That suggests this dataset uses . for missing values. So if we then set na = \".\", the automatic guessing succeeds, giving us the numeric column that we want:\n\nread_csv(simple_csv, na = \".\")\n#> # A tibble: 4 × 1\n#> x\n#> <dbl>\n#> 1 10\n#> 2 NA\n#> 3 20\n#> 4 30\n\n\n7.3.3 Column types\nreadr provides a total of nine column types for you to use:\n\n\ncol_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.\n\ncol_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.\n\ncol_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies an object but doesn’t make sense to apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.\n\ncol_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in Capítulo 16 and Capítulo 17.\n\ncol_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in Capítulo 13.\n\ncol_skip() skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.\n\nIt’s also possible to override the default column type by switching from list() to cols() and specifying .default:\n\nanother_csv <- \"\nx,y,z\n1,2,3\"\n\nread_csv(\n another_csv, \n col_types = cols(.default = col_character())\n)\n#> # A tibble: 1 × 3\n#> x y z \n#> <chr> <chr> <chr>\n#> 1 1 2 3\n\nAnother useful helper is cols_only(), which will read in only the columns you specify:\n\nread_csv(\n another_csv,\n col_types = cols_only(x = col_character())\n)\n#> # A tibble: 1 × 1\n#> x \n#> <chr>\n#> 1 1" + }, + { + "objectID": "data-import.html#sec-readr-directory", + "href": "data-import.html#sec-readr-directory", + "title": "7  Data import", + "section": "\n7.4 Reading data from multiple files", + "text": "7.4 Reading data from multiple files\nSometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. 
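Before stacking them, it can be worth skimming a single file on its own to check its columns; a minimal sketch (it assumes you have already downloaded the files into a local data folder, as described just below):\n\nread_csv(\"data/01-sales.csv\") |>\n glimpse() # expect the columns month, year, brand, item, and n\n\n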
With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.\n\nsales_files <- c(\"data/01-sales.csv\", \"data/02-sales.csv\", \"data/03-sales.csv\")\nread_csv(sales_files, id = \"file\")\n#> # A tibble: 19 × 6\n#> file month year brand item n\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 data/01-sales.csv January 2019 1 1234 3\n#> 2 data/01-sales.csv January 2019 1 8721 9\n#> 3 data/01-sales.csv January 2019 1 1822 2\n#> 4 data/01-sales.csv January 2019 2 3333 1\n#> 5 data/01-sales.csv January 2019 2 2156 9\n#> 6 data/01-sales.csv January 2019 2 3987 6\n#> # ℹ 13 more rows\n\nOnce again, the code above will work if you have the CSV files in a data folder in your project. You can download these files from https://pos.it/r4ds-01-sales, https://pos.it/r4ds-02-sales, and https://pos.it/r4ds-03-sales or you can read them directly with:\n\nsales_files <- c(\n \"https://pos.it/r4ds-01-sales\",\n \"https://pos.it/r4ds-02-sales\",\n \"https://pos.it/r4ds-03-sales\"\n)\nread_csv(sales_files, id = \"file\")\n\nThe id argument adds a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.\nIf you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in Capítulo 15.\n\nsales_files <- list.files(\"data\", pattern = \"sales\\\\.csv$\", full.names = TRUE)\nsales_files\n#> [1] \"data/01-sales.csv\" \"data/02-sales.csv\" \"data/03-sales.csv\"" + }, + { + "objectID": "data-import.html#sec-writing-to-a-file", + "href": "data-import.html#sec-writing-to-a-file", + "title": "7  Data import", + "section": "\n7.5 Writing to a file", + "text": "7.5 Writing to a file\nreadr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). The most important arguments to these functions are x (the data frame to save) and file (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.\n\nwrite_csv(students, \"students.csv\")\n\nNow let’s read that csv file back in. 
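As a quick sketch of what re-importing looks like, including how to ask for the column types explicitly again (this reuses the students.csv written above; the col_types specification is the only addition here):\n\nread_csv(\n \"students.csv\",\n col_types = cols(meal_plan = col_factor())\n)\n# without col_types, meal_plan would come back as a plain character column\n\n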
Note that the variable type information that you just set up is lost when you save to CSV because you’re starting over with reading from a plain text file again:\n\nstudents\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\nwrite_csv(students, \"students-2.csv\")\nread_csv(\"students-2.csv\")\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <chr> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nThis makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load it in. There are two main alternatives:\n\n\nwrite_rds() and read_rds() are uniform wrappers around the base functions saveRDS() and readRDS(). These store data in R’s custom binary format called RDS. This means that when you reload the object, you are loading the exact same R object that you stored.\n\nwrite_rds(students, \"students.rds\")\nread_rds(\"students.rds\")\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\n\n\nThe arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in Capítulo 22.\n\nlibrary(arrow)\nwrite_parquet(students, \"students.parquet\")\nread_parquet(\"students.parquet\")\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne NA Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\n\n\nParquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package." + }, + { + "objectID": "data-import.html#data-entry", + "href": "data-import.html#data-entry", + "title": "7  Data import", + "section": "\n7.6 Data entry", + "text": "7.6 Data entry\nSometimes you’ll need to assemble a tibble “by hand” by doing a little data entry in your R script. There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows. 
tibble() works by column:\n\ntibble(\n x = c(1, 2, 5), \n y = c(\"h\", \"m\", \"g\"),\n z = c(0.08, 0.83, 0.60)\n)\n#> # A tibble: 3 × 3\n#> x y z\n#> <dbl> <chr> <dbl>\n#> 1 1 h 0.08\n#> 2 2 m 0.83\n#> 3 5 g 0.6\n\nLaying out the data by column can make it hard to see how the rows are related, so an alternative is tribble(), short for transposed tibble, which lets you lay out your data row by row. tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:\n\ntribble(\n ~x, ~y, ~z,\n 1, \"h\", 0.08,\n 2, \"m\", 0.83,\n 5, \"g\", 0.60\n)\n#> # A tibble: 3 × 3\n#> x y z\n#> <dbl> <chr> <dbl>\n#> 1 1 h 0.08\n#> 2 2 m 0.83\n#> 3 5 g 0.6" + }, + { + "objectID": "data-import.html#summary", + "href": "data-import.html#summary", + "title": "7  Data import", + "section": "\n7.7 Summary", + "text": "7.7 Summary\nIn this chapter, you’ve learned how to load CSV files with read_csv() and to do your own data entry with tibble() and tribble(). You’ve learned how csv files work, some of the problems you might encounter, and how to overcome them. We’ll come to data import a few times in this book: Capítulo 20 from Excel and Google Sheets, Capítulo 21 will show you how to load data from databases, Capítulo 22 from parquet files, Capítulo 23 from JSON, and Capítulo 24 from websites.\nWe’re just about at the end of this section of the book, but there’s one important last topic to cover: how to get help. So in the next chapter, you’ll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R." + }, + { + "objectID": "data-import.html#footnotes", + "href": "data-import.html#footnotes", + "title": "7  Data import", + "section": "", + "text": "The janitor package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use |>.↩︎\nYou can override the default of 1000 with the guess_max argument.↩︎" + }, + { + "objectID": "workflow-help.html#google-is-your-friend", + "href": "workflow-help.html#google-is-your-friend", + "title": "8  Workflow: getting help", + "section": "\n8.1 Google is your friend", + "text": "8.1 Google is your friend\nIf you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Additionally, adding package names like “tidyverse” or “ggplot2” will help narrow down the results to code that will feel more familiar to you as well, e.g., “how to make a boxplot in R” vs. “how to make a boxplot in R with ggplot2”. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn’t in English, run Sys.setenv(LANGUAGE = \"en\") and re-run the code; you’re more likely to find help for English error messages.)\nIf Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [R], to restrict your search to questions and answers that use R." 
+ }, + { + "objectID": "workflow-help.html#making-a-reprex", + "href": "workflow-help.html#making-a-reprex", + "title": "8  Workflow: getting help", + "section": "\n8.2 Making a reprex", + "text": "8.2 Making a reprex\nIf your googling doesn’t find anything useful, it’s a really good idea to prepare a reprex, short for minimal reproducible example. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:\n\nFirst, you need to make your code reproducible. This means that you need to capture everything, i.e., include any library() calls and create all necessary objects. The easiest way to make sure you’ve done this is using the reprex package.\nSecond, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you’re facing in real life or even using built-in data.\n\nThat sounds like a lot of work! And it can be, but it has a great payoff:\n\n80% of the time, creating an excellent reprex reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.\nThe other 20% of the time, you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!\n\nWhen creating a reprex by hand, it’s easy to accidentally miss something, meaning your code can’t be run on someone else’s computer. Avoid this problem by using the reprex package, which is installed as part of the tidyverse. Let’s say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):\n\ny <- 1:4\nmean(y)\n\nThen call reprex(), where the default output is formatted for GitHub:\nreprex::reprex()\nA nicely rendered HTML preview will display in RStudio’s Viewer (if you’re in RStudio) or your default browser otherwise. The reprex is automatically copied to your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):\n``` r\ny <- 1:4\nmean(y)\n#> [1] 2.5\n```\nThis text is formatted in a special way, called Markdown, which can be pasted to sites like Stack Overflow or GitHub, and they will automatically render it to look like code. Here’s what that Markdown would look like rendered on GitHub:\n\ny <- 1:4\nmean(y)\n#> [1] 2.5\n\nAnyone else can copy, paste, and run this immediately.\nThere are three things you need to include to make your example reproducible: required packages, data, and code.\n\nPackages should be loaded at the top of the script so it’s easy to see which ones the example needs. This is a good time to check that you’re using the latest version of each package; you may have discovered a bug that’s been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run tidyverse_update().\n\nThe easiest way to include data is to use dput() to generate the R code needed to recreate it. 
For example, to recreate the mtcars dataset in R, perform the following steps:\n\nRun dput(mtcars) in R\nCopy the output\nIn reprex, type mtcars <-, then paste.\n\nTry to use the smallest subset of your data that still reveals the problem; a short sketch of this recipe appears below.\n\n\nSpend a little bit of time ensuring that your code is easy for others to read:\n\nMake sure you’ve used spaces and your variable names are concise yet informative.\nUse comments to indicate where your problem lies.\nDo your best to remove everything that is not related to the problem.\n\nThe shorter your code is, the easier it is to understand and the easier it is to fix.\n\n\nFinish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script.\nCreating reprexes is not trivial, and it will take some practice to learn to create good, truly minimal reprexes. However, learning to ask questions that include the code, and investing the time to make it reproducible, will continue to pay off as you learn and master R." + }, + { + "objectID": "workflow-help.html#investing-in-yourself", + "href": "workflow-help.html#investing-in-yourself", + "title": "8  Workflow: getting help", + "section": "\n8.3 Investing in yourself", + "text": "8.3 Investing in yourself\nYou should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the tidyverse blog. To keep up with the R community more broadly, we recommend reading R Weekly: it’s a community effort to aggregate the most interesting news in the R community each week." + }, + { + "objectID": "workflow-help.html#summary", + "href": "workflow-help.html#summary", + "title": "8  Workflow: getting help", + "section": "\n8.4 Summary", + "text": "8.4 Summary\nThis chapter concludes the Whole Game part of the book. You’ve now seen the most important parts of the data science process: visualization, transformation, tidying, and importing. Now that you’ve got a holistic view of the whole process, we can start to get into the details of the individual pieces.\nThe next part of the book, Visualize, does a deeper dive into the grammar of graphics and creating data visualizations with ggplot2, showcases how to use the tools you’ve learned so far to conduct exploratory data analysis, and introduces good practices for creating plots for communication." + },
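To make the dput() recipe described above concrete, here is a minimal sketch; the object name and the subset size are arbitrary choices:\n\nsmall <- head(mtcars, 3) # smallest slice that still shows the problem\ndput(small)              # copy the printed structure(...) into your reprex as small <- ...\n\n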
+ { + "objectID": "visualize.html", + "href": "visualize.html", + "title": "Visualizar", + "section": "", + "text": "After reading the first part of the book, you understand (at least superficially) the most important tools for doing data science. Now it’s time to start digging into the details. In this part of the book, you will learn how to visualize data in more depth.\n\n\n\n\nFigura 1: Data visualization is usually the first step in data exploration.\n\n\n\nEach chapter addresses one or more aspects of creating a data visualization.\n\nIn Capítulo 9, you will get to know the grammar of graphics.\nIn Capítulo 10, you will combine visualization with your curiosity and skepticism to ask and answer interesting questions about your data.\nFinally, in Capítulo 11, you will learn how to take your exploratory graphics, improve them, and turn them into expository graphics, graphics that help a newcomer to your analysis understand what is going on as quickly and easily as possible.\n\nThese three chapters get you started in the world of visualization, but there is much more to learn. The best place to learn more is the ggplot2 book: ggplot2: Elegant graphics for data analysis. It goes much deeper into the underlying theory and has many examples of how to combine the package’s many functions to solve practical problems. Another great resource is the ggplot2 extensions gallery, https://exts.ggplot2.tidyverse.org/gallery/. This site lists many packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2." + }, + { + "objectID": "layers.html#introduction", + "href": "layers.html#introduction", + "title": "9  Layers", + "section": "\n9.1 Introduction", + "text": "9.1 Introduction\nIn Capítulo 1, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2.\nIn this chapter, you’ll expand on that foundation as you learn about the layered grammar of graphics. We’ll start with a deeper dive into aesthetic mappings, geometric objects, and facets. 
Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we’ll briefly introduce coordinate systems.\nWe will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.\n\n9.1.1 Prerequisites\nThis chapter focuses on ggplot2. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:\n\nlibrary(tidyverse)" + }, + { + "objectID": "layers.html#aesthetic-mappings", + "href": "layers.html#aesthetic-mappings", + "title": "9  Layers", + "section": "\n9.2 Aesthetic mappings", + "text": "9.2 Aesthetic mappings\n\n“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey\n\nRemember that the mpg data frame bundled with the ggplot2 package contains 234 observations on 38 car models.\n\nmpg\n#> # A tibble: 234 × 11\n#> manufacturer model displ year cyl trans drv cty hwy fl \n#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>\n#> 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p \n#> 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p \n#> 3 audi a4 2 2008 4 manual(m6) f 20 31 p \n#> 4 audi a4 2 2008 4 auto(av) f 21 30 p \n#> 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p \n#> 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p \n#> # ℹ 228 more rows\n#> # ℹ 1 more variable: class <chr>\n\nAmong the variables in mpg are:\n\ndispl: A car’s engine size, in liters. A numerical variable.\nhwy: A car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. A numerical variable.\nclass: Type of car. A categorical variable.\n\nLet’s start by visualizing the relationship between displ and hwy for various classes of cars. We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy, color = class)) +\n geom_point()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, shape = class)) +\n geom_point()\n#> Warning: The shape palette can deal with a maximum of 6 discrete values\n#> because more than 6 becomes difficult to discriminate; you have 7.\n#> Consider specifying shapes manually if you must have them.\n#> Warning: Removed 62 rows containing missing values (`geom_point()`).\n\n\n\n\n\n\n\n\n\n\n\nWhen class is mapped to shape, we get two warnings:\n\n1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.\n2: Removed 62 rows containing missing values (geom_point()).\n\nSince ggplot2 will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. 
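If you really do need all seven classes, here is a sketch of the manual route that the first warning suggests (the particular shape numbers are arbitrary):\n\nggplot(mpg, aes(x = displ, y = hwy, shape = class)) +\n geom_point() +\n scale_shape_manual(values = 1:7) # one shape per class\n\n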
The second warning is related: there are 62 SUVs in the dataset and they’re not plotted.\nSimilarly, we can map class to size or alpha aesthetics as well, which control the size and the transparency of the points, respectively.\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy, size = class)) +\n geom_point()\n#> Warning: Using size for a discrete variable is not advised.\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +\n geom_point()\n#> Warning: Using alpha for a discrete variable is not advised.\n\n\n\n\n\n\n\n\n\n\n\nBoth of these produce warnings as well:\n\nUsing alpha for a discrete variable is not advised.\n\nMapping an unordered discrete (categorical) variable (class) to an ordered aesthetic (size or alpha) is generally not a good idea because it implies a ranking that does not in fact exist.\nOnce you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line provides the same information as a legend; it explains the mapping between locations and values.\nYou can also set the visual properties of your geom manually as an argument of your geom function (outside of aes()) instead of relying on a variable mapping to determine the appearance. For example, we can make all of the points in our plot blue:\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point(color = \"blue\")\n\n\n\n\nHere, the color doesn’t convey information about a variable, but only changes the appearance of the plot. You’ll need to pick a value that makes sense for that aesthetic:\n\nThe name of a color as a character string, e.g., color = \"blue\"\n\nThe size of a point in mm, e.g., size = 1\n\nThe shape of a point as a number, e.g., shape = 1, as shown in Figura 9.1.\n\n\n\n\n\nFigura 9.1: R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–24) have a border of color and are filled with fill. Shapes are arranged to keep similar shapes next to each other.\n\n\n\nSo far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at https://ggplot2.tidyverse.org/articles/ggplot2-specs.html.\nThe specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.\n\n9.2.1 Exercises\n\nCreate a scatterplot of hwy vs. displ where the points are pink filled-in triangles.\n\nWhy did the following code not result in a plot with blue points?\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy, color = \"blue\"))\n\n\nWhat does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)\nWhat happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note, you’ll also need to specify x and y."
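To tie the mapped-versus-set distinction from this section together, one small sketch (the particular values are arbitrary choices):\n\nggplot(mpg, aes(x = displ, y = hwy, color = class)) + # mapped: color varies with class\n geom_point(size = 2, shape = 17)                     # set: one size and shape for all points\n\n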
+ }, + { + "objectID": "layers.html#sec-geometric-objects", + "href": "layers.html#sec-geometric-objects", + "title": "9  Layers", + "section": "\n9.3 Geometric objects", + "text": "9.3 Geometric objects\nHow are these two plots similar?\n\n\n\n\n\n\n\n\n\n\nBoth plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, geom, to represent the data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.\nTo change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use the following code:\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_smooth()\n#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\nEvery geom function in ggplot2 takes a mapping argument, either defined locally in the geom layer or globally in the ggplot() layer. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. If you try, ggplot2 will silently ignore that aesthetic mapping. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + \n geom_smooth()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + \n geom_smooth()\n\n\n\n\n\n\n\n\n\n\n\nHere, geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drive train. One line describes all of the points that have a 4 value, one line describes all of the points that have an f value, and one line describes all of the points that have an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.\nIf this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then coloring everything according to drv.\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) + \n geom_point() +\n geom_smooth(aes(linetype = drv))\n\n\n\n\nNotice that this plot contains two geoms in the same graph.\nMany geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth()\n\n# Middle\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth(aes(group = drv))\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth(aes(color = drv), show.legend = FALSE)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. 
This makes it possible to display different aesthetics in different layers.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point(aes(color = class)) + \n geom_smooth()\n\n\n\n\nYou can use the same idea to specify different data for each layer. Here, we use red points as well as open circles to highlight two-seater cars. The local data argument in geom_point() overrides the global data argument in ggplot() for that layer only.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n geom_point(\n data = mpg |> filter(class == \"2seater\"), \n color = \"red\"\n ) +\n geom_point(\n data = mpg |> filter(class == \"2seater\"), \n shape = \"circle open\", size = 3, color = \"red\"\n )\n\n\n\n\nGeoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed while the boxplot reveals two potential outliers.\n\n# Left\nggplot(mpg, aes(x = hwy)) +\n geom_histogram(binwidth = 2)\n\n# Middle\nggplot(mpg, aes(x = hwy)) +\n geom_density()\n\n# Right\nggplot(mpg, aes(x = hwy)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nggplot2 provides more than 40 geoms but these don’t cover all possible plots one could make. If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). For example, the ggridges package (https://wilkelab.org/ggridges) is useful for making ridgeline plots, which can be useful for visualizing the density of a numerical variable for different levels of a categorical variable. In the following plot not only did we use a new geom (geom_density_ridges()), but we have also mapped the same variable to multiple aesthetics (drv to y, fill, and color) as well as set an aesthetic (alpha = 0.5) to make the density curves transparent.\n\nlibrary(ggridges)\n\nggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +\n geom_density_ridges(alpha = 0.5, show.legend = FALSE)\n#> Picking joint bandwidth of 1.28\n\n\n\n\nThe best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: https://ggplot2.tidyverse.org/reference. To learn more about any single geom, use the help (e.g., ?geom_smooth).\n\n9.3.1 Exercises\n\nWhat geom would you use to draw a line chart? A boxplot? A histogram? An area chart?\n\nEarlier in this chapter we used show.legend without explaining it:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth(aes(color = drv), show.legend = FALSE)\n\nWhat does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?\n\nWhat does the se argument to geom_smooth() do?\n\nRecreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s drv." 
+ }, + { + "objectID": "layers.html#facets", + "href": "layers.html#facets", + "title": "9  Layers", + "section": "\n9.4 Facets", + "text": "9.4 Facets\nIn Capítulo 1 you learned about faceting with facet_wrap(), which splits a plot into subplots that each display one subset of the data based on a categorical variable.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n facet_wrap(~cyl)\n\n\n\n\nTo facet your plot with the combination of two variables, switch from facet_wrap() to facet_grid(). The first argument of facet_grid() is also a formula, but now it’s a double-sided formula: rows ~ cols.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n facet_grid(drv ~ cyl)\n\n\n\n\nBy default, each of the facets shares the same scale and range for the x and y axes. This is useful when you want to compare data across facets, but it can be limiting when you want to visualize the relationship within each facet better. Setting the scales argument in a faceting function to \"free\" will allow for different axis scales across both rows and columns, \"free_x\" will allow for different x-axis scales across columns, and \"free_y\" will allow for different y-axis scales across rows.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n facet_grid(drv ~ cyl, scales = \"free_y\")\n\n\n\n\n\n9.4.1 Exercises\n\nWhat happens if you facet on a continuous variable?\n\nWhat do the empty cells in the plot above with facet_grid(drv ~ cyl) mean? Run the following code. How do they relate to the resulting plot?\n\nggplot(mpg) + \n geom_point(aes(x = drv, y = cyl))\n\n\n\nWhat plots does the following code make? What does . do?\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) +\n facet_grid(drv ~ .)\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) +\n facet_grid(. ~ cyl)\n\n\n\nTake the first faceted plot in this section:\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) + \n facet_wrap(~ class, nrow = 2)\n\nWhat are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?\n\nRead ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?\n\nWhich of the following plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?\n\nggplot(mpg, aes(x = displ)) + \n geom_histogram() + \n facet_grid(drv ~ .)\n\nggplot(mpg, aes(x = displ)) + \n geom_histogram() +\n facet_grid(. ~ drv)\n\n\nRecreate the following plot using facet_wrap() instead of facet_grid(). How do the positions of the facet labels change?\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) +\n facet_grid(drv ~ .)" + }, + { + "objectID": "layers.html#statistical-transformations", + "href": "layers.html#statistical-transformations", + "title": "9  Layers", + "section": "\n9.5 Statistical transformations", + "text": "9.5 Statistical transformations\nConsider a basic bar chart, drawn with geom_bar() or geom_col(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. 
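To get oriented before reading the chart, you might peek at the raw data first (a quick sketch):\n\nglimpse(diamonds) # one row per diamond: carat, cut, color, clarity, price, ...\n\n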
The chart shows that more diamonds are available with high quality cuts than with low quality cuts.\n\nggplot(diamonds, aes(x = cut)) + \n geom_bar()\n\n\n\n\nOn the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:\n\nBar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.\nSmoothers fit a model to your data and then plot predictions from the model.\nBoxplots compute the five-number summary of the distribution and then display that summary as a specially formatted box.\n\nThe algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. Figura 9.2 shows how this process works with geom_bar().\n\n\n\n\nFigura 9.2: When creating a bar chart we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.\n\n\n\nYou can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(). If you scroll down, the section called “Computed variables” explains that it computes two new variables: count and prop.\nEvery geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:\n\n\nYou might want to override the default stat. In the code below, we change the stat of geom_bar() from count (the default) to identity. This lets us map the height of the bars to the raw values of a y variable.\n\ndiamonds |>\n count(cut) |>\n ggplot(aes(x = cut, y = n)) +\n geom_bar(stat = \"identity\")\n\n\n\n\n\n\nYou might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:\n\nggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) + \n geom_bar()\n\n\n\n\nTo find the possible variables that can be computed by the stat, look for the section titled “Computed variables” in the help for geom_bar().\n\n\nYou might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:\n\nggplot(diamonds) + \n stat_summary(\n aes(x = cut, y = depth),\n fun.min = min,\n fun.max = max,\n fun = median\n )\n\n\n\n\n\n\nggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g., ?stat_bin.\n\n9.5.1 Exercises\n\nWhat is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?\nWhat does geom_col() do? How is it different from geom_bar()?\nMost geoms and stats come in pairs that are almost always used in concert. Make a list of all the pairs. What do they have in common? (Hint: Read through the documentation.)\nWhat variables does stat_smooth() compute? 
What arguments control its behavior?\n\nIn our proportion bar chart, we needed to set group = 1. Why? In other words, what is the problem with these two graphs?\n\nggplot(diamonds, aes(x = cut, y = after_stat(prop))) + \n geom_bar()\nggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + \n geom_bar()" + }, + { + "objectID": "layers.html#position-adjustments", + "href": "layers.html#position-adjustments", + "title": "9  Layers", + "section": "\n9.6 Position adjustments", + "text": "9.6 Position adjustments\nThere’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic, or, more usefully, the fill aesthetic:\n\n# Left\nggplot(mpg, aes(x = drv, color = drv)) + \n geom_bar()\n\n# Right\nggplot(mpg, aes(x = drv, fill = drv)) + \n geom_bar()\n\n\n\n\n\n\n\n\n\n\n\nNote what happens if you map the fill aesthetic to another variable, like class: the bars are automatically stacked. Each colored rectangle represents a combination of drv and class.\n\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar()\n\n\n\n\nThe stacking is performed automatically using the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: \"identity\", \"dodge\" or \"fill\".\n\n\nposition = \"identity\" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.\n\n# Left\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar(alpha = 1/5, position = \"identity\")\n\n# Right\nggplot(mpg, aes(x = drv, color = class)) + \n geom_bar(fill = NA, position = \"identity\")\n\n\n\n\n\n\n\n\n\n\n\nThe identity position adjustment is more useful for 2d geoms, like points, where it is the default.\n\nposition = \"fill\" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.\n\nposition = \"dodge\" places overlapping objects directly beside one another. This makes it easier to compare individual values.\n\n# Left\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar(position = \"fill\")\n\n# Right\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar(position = \"dodge\")\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?\n\n\n\n\n\nThe underlying values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?\nYou can avoid this gridding by setting the position adjustment to “jitter”. position = \"jitter\" adds a small amount of random noise to each point. 
This spreads the points out because no two points are likely to receive the same amount of random noise.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point(position = \"jitter\")\n\n\n\n\nAdding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = \"jitter\"): geom_jitter().\nTo learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.\n\n9.6.1 Exercises\n\n\nWhat is the problem with the following plot? How could you improve it?\n\nggplot(mpg, aes(x = cty, y = hwy)) + \n geom_point()\n\n\n\nWhat, if anything, is the difference between the two plots? Why?\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point()\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(position = \"identity\")\n\n\nWhat parameters to geom_jitter() control the amount of jittering?\nCompare and contrast geom_jitter() with geom_count().\nWhat’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it." + }, + { + "objectID": "layers.html#coordinate-systems", + "href": "layers.html#coordinate-systems", + "title": "9  Layers", + "section": "\n9.7 Coordinate systems", + "text": "9.7 Coordinate systems\nCoordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are two other coordinate systems that are occasionally helpful.\n\n\ncoord_quickmap() sets the aspect ratio correctly for geographic maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the Maps chapter of ggplot2: Elegant graphics for data analysis.\n\nnz <- map_data(\"nz\")\n\nggplot(nz, aes(x = long, y = lat, group = group)) +\n geom_polygon(fill = \"white\", color = \"black\")\n\nggplot(nz, aes(x = long, y = lat, group = group)) +\n geom_polygon(fill = \"white\", color = \"black\") +\n coord_quickmap()\n\n\n\n\n\n\n\n\n\n\n\n\n\ncoord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.\n\nbar <- ggplot(data = diamonds) + \n geom_bar(\n mapping = aes(x = clarity, fill = clarity), \n show.legend = FALSE,\n width = 1\n ) + \n theme(aspect.ratio = 1)\n\nbar + coord_flip()\nbar + coord_polar()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n9.7.1 Exercises\n\nTurn a stacked bar chart into a pie chart using coord_polar().\nWhat’s the difference between coord_quickmap() and coord_map()?\n\nWhat does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? 
What does geom_abline() do?\n\nggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +\n geom_point() + \n geom_abline() +\n coord_fixed()" + }, + { + "objectID": "layers.html#the-layered-grammar-of-graphics", + "href": "layers.html#the-layered-grammar-of-graphics", + "title": "9  Layers", + "section": "\n9.8 The layered grammar of graphics", + "text": "9.8 The layered grammar of graphics\nWe can expand on the graphing template you learned in Seção 1.3 by adding position adjustments, stats, coordinate systems, and faceting:\nggplot(data = <DATA>) + \n <GEOM_FUNCTION>(\n mapping = aes(<MAPPINGS>),\n stat = <STAT>, \n position = <POSITION>\n ) +\n <COORDINATE_FUNCTION> +\n <FACET_FUNCTION>\nOur new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.\nThe seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, a faceting scheme, and a theme.\nTo see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic. These steps are illustrated in Figura 9.3. You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.\n\n\n\n\nFigura 9.3: Steps for going from raw data to a table of frequencies to a bar plot where the heights of the bar represent the frequencies.\n\n\n\nAt this point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.\nYou could use this method to build any plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.\nIf you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “The Layered Grammar of Graphics”, the scientific paper that describes the theory of ggplot2 in detail." + }, + { + "objectID": "layers.html#summary", + "href": "layers.html#summary", + "title": "9  Layers", + "section": "\n9.9 Summary", + "text": "9.9 Summary\nIn this chapter you learned about the layered grammar of graphics starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems which allow you to fundamentally change what x and y mean. 
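As a recap, here is the full template from Seção 9.8 instantiated once, end to end; every choice below is just one possibility among many:\n\nggplot(data = mpg) +\n geom_bar(\n mapping = aes(x = drv, fill = class),\n stat = \"count\",    # the default stat for geom_bar()\n position = \"dodge\" # side-by-side bars instead of stacking\n ) +\n coord_flip() +       # horizontal bars\n facet_wrap(~year)    # one panel per model year\n\n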
One layer we have not yet touched on is theme, which we will introduce in Seção 11.5.\nTwo very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at https://posit.co/resources/cheatsheets) and the ggplot2 package website (https://ggplot2.tidyverse.org).\nAn important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it’s always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom." + }, + { + "objectID": "EDA.html#introduction", + "href": "EDA.html#introduction", + "title": "10  Exploratory data analysis", + "section": "\n10.1 Introduction", + "text": "10.1 Introduction\nThis chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:\n\nGenerate questions about your data.\nSearch for answers by visualizing, transforming, and modelling your data.\nUse what you learn to refine your questions and/or generate new questions.\n\nEDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive insights that you’ll eventually write up and communicate to others.\nEDA is an important part of any data analysis, even if the primary research questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.\n\n10.1.1 Prerequisites\nIn this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.\n\nlibrary(tidyverse)" + }, + { + "objectID": "EDA.html#questions", + "href": "EDA.html#questions", + "title": "10  Exploratory data analysis", + "section": "\n10.2 Questions", + "text": "10.2 Questions\n\n“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox\n\n\n“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey\n\nYour goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.\nEDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. 
You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.\nThere is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:\n\nWhat type of variation occurs within my variables?\nWhat type of covariation occurs between my variables?\n\nThe rest of this chapter will look at these two questions. We’ll explain what variation and covariation are, and we’ll show you several ways to answer each question." + }, + { + "objectID": "EDA.html#variation", + "href": "EDA.html#variation", + "title": "10  Exploratory data analysis", + "section": "\n10.3 Variation", + "text": "10.3 Variation\nVariation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values, which you’ve learned about in Capítulo 1.\nWe’ll start our exploration by visualizing the distribution of weights (carat) of ~54,000 diamonds from the diamonds dataset. Since carat is a numerical variable, we can use a histogram:\n\nggplot(diamonds, aes(x = carat)) +\n geom_histogram(binwidth = 0.5)\n\n\n\n\nNow that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).\n\n10.3.1 Typical values\nIn both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:\n\nWhich values are the most common? Why?\nWhich values are rare? Why? Does that match your expectations?\nCan you see any unusual patterns? 
What might explain them?\n\nLet’s take a look at the distribution of carat for smaller diamonds.\n\nsmaller <- diamonds |> \n filter(carat < 3)\n\nggplot(smaller, aes(x = carat)) +\n geom_histogram(binwidth = 0.01)\n\n\n\n\nThis histogram suggests several interesting questions:\n\nWhy are there more diamonds at whole carats and common fractions of carats?\nWhy are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?\n\nVisualizations can also reveal clusters, which suggest that subgroups exist in your data. To understand the subgroups, ask:\n\nHow are the observations within each subgroup similar to each other?\nHow are the observations in separate clusters different from each other?\nHow can you explain or describe the clusters?\nWhy might the appearance of clusters be misleading?\n\nSome of these questions can be answered with the data while some will require domain expertise about the data. Many of them will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable. We’ll get to that shortly.\n\n10.3.2 Unusual values\nOutliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors, sometimes they are simply values at the extremes that happened to be observed in this data collection, and other times they suggest important new discoveries. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.\n\nggplot(diamonds, aes(x = y)) + \n geom_histogram(binwidth = 0.5)\n\n\n\n\nThere are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian():\n\nggplot(diamonds, aes(x = y)) + \n geom_histogram(binwidth = 0.5) +\n coord_cartesian(ylim = c(0, 50))\n\n\n\n\ncoord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.\nThis allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:\n\nunusual <- diamonds |> \n filter(y < 3 | y > 20) |> \n select(price, x, y, z) |>\n arrange(y)\nunusual\n#> # A tibble: 9 × 4\n#> price x y z\n#> <int> <dbl> <dbl> <dbl>\n#> 1 5139 0 0 0 \n#> 2 6381 0 0 0 \n#> 3 12800 0 0 0 \n#> 4 15686 0 0 0 \n#> 5 18034 0 0 0 \n#> 6 2130 0 0 0 \n#> 7 2130 0 0 0 \n#> 8 2075 5.15 31.8 5.12\n#> 9 12210 8.09 58.9 8.06\n\nThe y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. By doing EDA, we have discovered missing data that was coded as 0, which we never would have found by simply searching for NAs. Going forward we might choose to re-code these values as NAs in order to prevent misleading calculations. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!\nIt’s good practice to repeat your analysis with and without the outliers. 
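One lightweight way to run that check is to compute the same summary twice, once on all rows and once with the suspect rows filtered out. A minimal sketch, reusing the y-based cutoffs chosen above (3 and 20 are specific to this dataset, not universal thresholds):

# Summary of y with the outliers included...
diamonds |> 
  summarize(mean_y = mean(y), sd_y = sd(y), n = n())

# ...and with them excluded; if the two barely differ, omitting the
# outliers is unlikely to change your downstream conclusions
diamonds |> 
  filter(y >= 3, y <= 20) |> 
  summarize(mean_y = mean(y), sd_y = sd(y), n = n())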
If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g., a data entry error) and disclose that you removed them in your write-up.\n\n10.3.3 Exercises\n\nExplore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.\nExplore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)\nHow many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?\nCompare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?"
  },
  {
    "objectID": "EDA.html#sec-unusual-values-eda",
    "href": "EDA.html#sec-unusual-values-eda",
    "title": "10  Exploratory data analysis",
    "section": "\n10.4 Unusual values",
    "text": "10.4 Unusual values\nIf you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.\n\n\nDrop the entire row with the strange values:\n\ndiamonds2 <- diamonds |> \n filter(between(y, 3, 20))\n\nWe don’t recommend this option because one invalid value doesn’t imply that all the other values for that observation are also invalid. Additionally, if you have low-quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!\n\n\nInstead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the if_else() function to replace unusual values with NA:\n\ndiamonds2 <- diamonds |> \n mutate(y = if_else(y < 3 | y > 20, NA, y))\n\n\n\nIt’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:\n\nggplot(diamonds2, aes(x = x, y = y)) + \n geom_point()\n#> Warning: Removed 9 rows containing missing values (`geom_point()`).\n\n\n\n\nTo suppress that warning, set na.rm = TRUE:\n\nggplot(diamonds2, aes(x = x, y = y)) + \n geom_point(na.rm = TRUE)\n\nOther times you want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable, using is.na() to check if dep_time is missing.\n\nnycflights13::flights |> \n mutate(\n cancelled = is.na(dep_time),\n sched_hour = sched_dep_time %/% 100,\n sched_min = sched_dep_time %% 100,\n sched_dep_time = sched_hour + (sched_min / 60)\n ) |> \n ggplot(aes(x = sched_dep_time)) + \n geom_freqpoly(aes(color = cancelled), binwidth = 1/4)\n\n\n\n\nHowever, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.\n\n10.4.1 Exercises\n\nWhat happens to missing values in a histogram? 
What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?\nWhat does na.rm = TRUE do in mean() and sum()?\nRecreate the frequency plot of sched_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights."
  },
  {
    "objectID": "EDA.html#covariation",
    "href": "EDA.html#covariation",
    "title": "10  Exploratory data analysis",
    "section": "\n10.5 Covariation",
    "text": "10.5 Covariation\nIf variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.\n\n10.5.1 A categorical and a numerical variable\nFor example, let’s explore how the price of a diamond varies with its quality (measured by cut) using geom_freqpoly():\n\nggplot(diamonds, aes(x = price)) + \n geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)\n\n\n\n\nNote that ggplot2 uses an ordered color scale for cut because it’s defined as an ordered factor variable in the data. You’ll learn more about these in Seção 16.6.\nThe default appearance of geom_freqpoly() is not that useful here because the height, determined by the overall count, differs so much across cuts, making it hard to see the differences in the shapes of their distributions.\nTo make the comparison easier, we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.\n\nggplot(diamonds, aes(x = price, y = after_stat(density))) + \n geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)\n\n\n\n\nNote that we’re mapping the density to y, but since density is not a variable in the diamonds dataset, we need to first calculate it. We use the after_stat() function to do so.\nThere’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot.\nA visually simpler way to explore this relationship is with side-by-side boxplots.\n\nggplot(diamonds, aes(x = cut, y = price)) +\n geom_boxplot()\n\n\n\n\nWe see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are typically cheaper! In the exercises, you’ll be challenged to figure out why.\ncut is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with fct_reorder(). You’ll learn more about that function in Seção 16.4, but we want to give you a quick preview here because it’s so useful. For example, take the class variable in the mpg dataset. 
You might be interested to know how highway mileage varies across classes:\n\nggplot(mpg, aes(x = class, y = hwy)) +\n geom_boxplot()\n\n\n\n\nTo make the trend easier to see, we can reorder class based on the median value of hwy:\n\nggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +\n geom_boxplot()\n\n\n\n\nIf you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.\n\nggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) +\n geom_boxplot()\n\n\n\n\n\n10.5.1.1 Exercises\n\nUse what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.\nBased on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?\nInstead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?\nOne problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?\nCreate a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?\nIf you have a small dataset, it’s sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.\n\n10.5.2 Two categorical variables\nTo visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in geom_count():\n\nggplot(diamonds, aes(x = cut, y = color)) +\n geom_count()\n\n\n\n\nThe size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.\nAnother approach for exploring the relationship between these variables is computing the counts with dplyr:\n\ndiamonds |> \n count(color, cut)\n#> # A tibble: 35 × 3\n#> color cut n\n#> <ord> <ord> <int>\n#> 1 D Fair 163\n#> 2 D Good 662\n#> 3 D Very Good 1513\n#> 4 D Premium 1603\n#> 5 D Ideal 2834\n#> 6 E Fair 224\n#> # ℹ 29 more rows\n\nThen visualize with geom_tile() and the fill aesthetic:\n\ndiamonds |> \n count(color, cut) |> \n ggplot(aes(x = color, y = cut)) +\n geom_tile(aes(fill = n))\n\n\n\n\nIf the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. 
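Short of full seriation, a rough approximation is to reorder both factors by their marginal counts so that the largest cells collect in one corner. A hedged sketch with fct_reorder(), where sum totals the counts for each level (cut and color in diamonds are already ordered, so here it is purely illustrative):

diamonds |> 
  count(color, cut) |> 
  ggplot(aes(x = fct_reorder(color, n, sum), y = fct_reorder(cut, n, sum))) +
  geom_tile(aes(fill = n)) +
  labs(x = "color", y = "cut")  # fct_reorder() otherwise produces long axis titles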
For larger plots, you might want to try the heatmaply package, which creates interactive plots.\n\n10.5.2.1 Exercises\n\nHow could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?\nWhat different data insights do you get with a segmented bar chart if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic? Calculate the counts that fall into each of the segments.\nUse geom_tile() together with dplyr to explore how average flight departure delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?\n\n10.5.3 Two numerical variables\nYou’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. For example, you can see a positive relationship between the carat size and price of a diamond: diamonds with more carats have a higher price. The relationship is exponential.\n\nggplot(smaller, aes(x = carat, y = price)) +\n geom_point()\n\n\n\n\n(In this section we’ll use the smaller dataset to stay focused on the bulk of the diamonds that are smaller than 3 carats.)\nScatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black, making it hard to judge differences in the density of the data across the 2-dimensional space as well as making it hard to spot the trend. You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency.\n\nggplot(smaller, aes(x = carat, y = price)) + \n geom_point(alpha = 1 / 100)\n\n\n\n\nBut using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.\ngeom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to install the hexbin package to use geom_hex().\n\nggplot(smaller, aes(x = carat, y = price)) +\n geom_bin2d()\n\n# install.packages(\"hexbin\")\nggplot(smaller, aes(x = carat, y = price)) +\n geom_hex()\n\n\n\n\n\n\n\n\n\n\n\nAnother option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then, for each group, display a boxplot:\n\nggplot(smaller, aes(x = carat, y = price)) + \n geom_boxplot(aes(group = cut_width(carat, 0.1)))\n\n\n\n\ncut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.\n\n10.5.3.1 Exercises\n\nInstead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs. cut_number()? 
How does that impact a visualization of the 2d distribution of carat and price?\nVisualize the distribution of carat, partitioned by price.\nHow does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?\nCombine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.\n\nTwo-dimensional plots reveal outliers that are not visible in one-dimensional plots. For example, some points in the following plot have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately. Why is a scatterplot a better display than a binned plot for this case?\n\ndiamonds |> \n filter(x >= 4) |> \n ggplot(aes(x = x, y = y)) +\n geom_point() +\n coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))\n\n\n\nInstead of creating boxes of equal width with cut_width(), we could create boxes that contain a roughly equal number of points with cut_number(). What are the advantages and disadvantages of this approach?\n\nggplot(smaller, aes(x = carat, y = price)) + \n geom_boxplot(aes(group = cut_number(carat, 20)))"
  },
  {
    "objectID": "EDA.html#patterns-and-models",
    "href": "EDA.html#patterns-and-models",
    "title": "10  Exploratory data analysis",
    "section": "\n10.6 Patterns and models",
    "text": "10.6 Patterns and models\nIf a systematic relationship exists between two variables, it will appear as a pattern in the data. If you spot a pattern, ask yourself:\n\nCould this pattern be due to coincidence (i.e., random chance)?\nHow can you describe the relationship implied by the pattern?\nHow strong is the relationship implied by the pattern?\nWhat other variables might affect the relationship?\nDoes the relationship change if you look at individual subgroups of the data?\n\nPatterns in your data provide clues about relationships, i.e., they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.\nModels are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals (the difference between the actual value and the predicted value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of price and carat, we log transform them first, and fit a model to the log-transformed values. 
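To see why exponentiating helps, note the form of the model: log(price) = b0 + b1 * log(carat) + residual, so exp(residual) equals the actual price divided by the price predicted for a diamond of that size. A value of 1 means the diamond costs exactly what its weight predicts; 2 means twice as much. A minimal sketch of the same idea using base R's lm() (an assumption for illustration; the code below uses tidymodels):

# Fit the log-log model and convert residuals to price ratios
fit <- lm(log(price) ~ log(carat), data = ggplot2::diamonds)
price_ratio <- exp(resid(fit))  # actual price / size-predicted price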
Then, we exponentiate the residuals to put them back in the scale of raw prices.\n\nlibrary(tidymodels)\n\ndiamonds <- diamonds |>\n mutate(\n log_price = log(price),\n log_carat = log(carat)\n )\n\ndiamonds_fit <- linear_reg() |>\n fit(log_price ~ log_carat, data = diamonds)\n\ndiamonds_aug <- augment(diamonds_fit, new_data = diamonds) |>\n mutate(.resid = exp(.resid))\n\nggplot(diamonds_aug, aes(x = carat, y = .resid)) + \n geom_point()\n\n\n\n\nOnce you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.\n\nggplot(diamonds_aug, aes(x = cut, y = .resid)) + \n geom_boxplot()\n\n\n\n\nWe’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have the tools of data wrangling and programming in hand."
  },
  {
    "objectID": "EDA.html#summary",
    "href": "EDA.html#summary",
    "title": "10  Exploratory data analysis",
    "section": "\n10.7 Summary",
    "text": "10.7 Summary\nIn this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.\nIn the next chapter, we’ll focus on the tools we can use to communicate our results."
  },
  {
    "objectID": "EDA.html#footnotes",
    "href": "EDA.html#footnotes",
    "title": "10  Exploratory data analysis",
    "section": "",
    "text": "Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function() or package::dataset.↩︎"
  },
  {
    "objectID": "communication.html#introduction",
    "href": "communication.html#introduction",
    "title": "11  Communication",
    "section": "\n11.1 Introduction",
    "text": "11.1 Introduction\nIn Capítulo 10, you learned how to use plots as tools for exploration. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.\nNow that you understand your data, you need to communicate your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.\nThis chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like The Truthful Art, by Alberto Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.\n\n11.1.1 Prerequisites\nIn this chapter, we’ll focus once again on ggplot2. 
We’ll also use a little dplyr for data manipulation, scales to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including ggrepel (https://ggrepel.slowkow.com) by Kamil Slowikowski and patchwork (https://patchwork.data-imaginist.com) by Thomas Lin Pedersen. Don’t forget that you’ll need to install those packages with install.packages() if you don’t already have them.\n\nlibrary(tidyverse)\nlibrary(scales)\nlibrary(ggrepel)\nlibrary(patchwork)" + }, + { + "objectID": "communication.html#labels", + "href": "communication.html#labels", + "title": "11  Communication", + "section": "\n11.2 Labels", + "text": "11.2 Labels\nThe easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the labs() function.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = class)) +\n geom_smooth(se = FALSE) +\n labs(\n x = \"Engine displacement (L)\",\n y = \"Highway fuel economy (mpg)\",\n color = \"Car type\",\n title = \"Fuel efficiency generally decreases with engine size\",\n subtitle = \"Two seaters (sports cars) are an exception because of their light weight\",\n caption = \"Data from fueleconomy.gov\"\n )\n\n\n\n\nThe purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g., “A scatterplot of engine displacement vs. fuel economy”.\nIf you need to add more text, there are two other useful labels: subtitle adds additional detail in a smaller font beneath the title and caption adds text at the bottom right of the plot, often used to describe the source of the data. You can also use labs() to replace the axis and legend titles. It’s usually a good idea to replace short variable names with more detailed descriptions, and to include the units.\nIt’s possible to use mathematical equations instead of text strings. Just switch \"\" out for quote() and read about the available options in ?plotmath:\n\ndf <- tibble(\n x = 1:10,\n y = cumsum(x^2)\n)\n\nggplot(df, aes(x, y)) +\n geom_point() +\n labs(\n x = quote(x[i]),\n y = quote(sum(x[i] ^ 2, i == 1, n))\n )\n\n\n\n\n\n11.2.1 Exercises\n\nCreate one plot on the fuel economy data with customized title, subtitle, caption, x, y, and color labels.\n\nRecreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.\n\n\n\n\n\n\nTake an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand." + }, + { + "objectID": "communication.html#annotations", + "href": "communication.html#annotations", + "title": "11  Communication", + "section": "\n11.3 Annotations", + "text": "11.3 Annotations\nIn addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is geom_text(). geom_text() is similar to geom_point(), but it has an additional aesthetic: label. This makes it possible to add textual labels to your plots.\nThere are two possible sources of labels. First, you might have a tibble that provides labels. 
In the following plot we pull out the cars with the highest engine size in each drive type and save their information as a new data frame called label_info.\n\nlabel_info <- mpg |>\n group_by(drv) |>\n arrange(desc(displ)) |>\n slice_head(n = 1) |>\n mutate(\n drive_type = case_when(\n drv == \"f\" ~ \"front-wheel drive\",\n drv == \"r\" ~ \"rear-wheel drive\",\n drv == \"4\" ~ \"4-wheel drive\"\n )\n ) |>\n select(displ, hwy, drv, drive_type)\n\nlabel_info\n#> # A tibble: 3 × 4\n#> # Groups: drv [3]\n#> displ hwy drv drive_type \n#> <dbl> <int> <chr> <chr> \n#> 1 6.5 17 4 4-wheel drive \n#> 2 5.3 25 f front-wheel drive\n#> 3 7 24 r rear-wheel drive\n\nThen, we use this new data frame to label the three groups directly, replacing the legend with labels placed right on the plot. Using the fontface and size arguments we can customize the look of the text labels. They’re larger than the rest of the text on the plot and bolded. (theme(legend.position = \"none\") turns all the legends off — we’ll talk about it more shortly.)\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n geom_point(alpha = 0.3) +\n geom_smooth(se = FALSE) +\n geom_text(\n data = label_info, \n aes(x = displ, y = hwy, label = drive_type),\n fontface = \"bold\", size = 5, hjust = \"right\", vjust = \"bottom\"\n ) +\n theme(legend.position = \"none\")\n#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\nNote the use of hjust (horizontal justification) and vjust (vertical justification) to control the alignment of the label.\nHowever, the annotated plot we made above is hard to read because the labels overlap with each other, and with the points. We can use the geom_label_repel() function from the ggrepel package to address both of these issues. This useful package will automatically adjust labels so that they don’t overlap:\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n geom_point(alpha = 0.3) +\n geom_smooth(se = FALSE) +\n geom_label_repel(\n data = label_info, \n aes(x = displ, y = hwy, label = drive_type),\n fontface = \"bold\", size = 5, nudge_y = 2\n ) +\n theme(legend.position = \"none\")\n#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\nYou can also use the same idea to highlight certain points on a plot with geom_text_repel() from the ggrepel package. Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points.\n\npotential_outliers <- mpg |>\n filter(hwy > 40 | (hwy > 20 & displ > 5))\n \nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point() +\n geom_text_repel(data = potential_outliers, aes(label = model)) +\n geom_point(data = potential_outliers, color = \"red\") +\n geom_point(\n data = potential_outliers,\n color = \"red\", size = 3, shape = \"circle open\"\n )\n\n\n\n\nRemember, in addition to geom_text() and geom_label(), you have many other geoms in ggplot2 available to help annotate your plot. A couple of ideas:\n\nUse geom_hline() and geom_vline() to add reference lines. We often make them thick (linewidth = 2) and white (color = \"white\"), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.\nUse geom_rect() to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics xmin, xmax, ymin, ymax. 
Alternatively, look into the ggforce package, specifically geom_mark_hull(), which allows you to annotate subsets of points with hulls.\nUse geom_segment() with the arrow argument to draw attention to a point with an arrow. Use aesthetics x and y to define the starting location, and xend and yend to define the end location.\n\nAnother handy function for adding annotations to plots is annotate(). As a rule of thumb, geoms are generally useful for highlighting a subset of the data, while annotate() is useful for adding one or a few annotation elements to a plot.\nTo demonstrate using annotate(), let’s create some text to add to our plot. The text is a bit long, so we’ll use stringr::str_wrap() to automatically add line breaks to it given the number of characters you want per line:\n\ntrend_text <- \"Larger engine sizes tend to have lower fuel economy.\" |>\n str_wrap(width = 30)\ntrend_text\n#> [1] \"Larger engine sizes tend to\\nhave lower fuel economy.\"\n\nThen, we add two layers of annotation: one with a label geom and the other with a segment geom. The x and y aesthetics in both define where the annotation should start, and the xend and yend aesthetics in the segment annotation define the end location of the segment. Note also that the segment is styled as an arrow.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point() +\n annotate(\n geom = \"label\", x = 3.5, y = 38,\n label = trend_text,\n hjust = \"left\", color = \"red\"\n ) +\n annotate(\n geom = \"segment\",\n x = 3, y = 35, xend = 5, yend = 25, color = \"red\",\n arrow = arrow(type = \"closed\")\n )\n\n\n\n\nAnnotation is a powerful tool for communicating main takeaways and interesting features of your visualizations. The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!\n\n11.3.1 Exercises\n\nUse geom_text() with infinite positions to place text at the four corners of the plot.\nUse annotate() to add a point geom in the middle of your last plot without having to create a tibble. Customize the shape, size, or color of the point.\nHow do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the dataset that is being passed to geom_text().)\nWhat arguments to geom_label() control the appearance of the background box?\nWhat are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options."
  },
  {
    "objectID": "communication.html#scales",
    "href": "communication.html#scales",
    "title": "11  Communication",
    "section": "\n11.4 Scales",
    "text": "11.4 Scales\nThe third way you can make your plot better for communication is to adjust the scales. Scales control how the aesthetic mappings manifest visually.\n\n11.4.1 Default scales\nNormally, ggplot2 automatically adds scales for you. For example, when you type:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = class))\n\nggplot2 automatically adds default scales behind the scenes:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = class)) +\n scale_x_continuous() +\n scale_y_continuous() +\n scale_color_discrete()\n\nNote the naming scheme for scales: scale_ followed by the name of the aesthetic, then _, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. 
scale_x_continuous() puts the numeric values from displ on a continuous number line on the x-axis, scale_color_discrete() chooses colors for each class of car, etc. There are lots of non-default scales, which you’ll learn about below.\nThe default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:\n\nYou might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.\nYou might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.\n\n11.4.2 Axis ticks and legend keys\nCollectively, axes and legends are called guides. Axes are used for x and y aesthetics; legends are used for everything else.\nThere are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: breaks and labels. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of breaks is to override the default choice:\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n geom_point() +\n scale_y_continuous(breaks = seq(15, 40, by = 5)) \n\n\n\n\nYou can use labels in the same way (a character vector the same length as breaks), but you can also set it to NULL to suppress the labels altogether. This can be useful for maps, or for publishing plots where you can’t share the absolute numbers. You can also use breaks and labels to control the appearance of legends. For discrete scales for categorical variables, labels can be a named list of the existing level names and the desired labels for them.\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n geom_point() +\n scale_x_continuous(labels = NULL) +\n scale_y_continuous(labels = NULL) +\n scale_color_discrete(labels = c(\"4\" = \"4-wheel\", \"f\" = \"front\", \"r\" = \"rear\"))\n\n\n\n\nThe labels argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc. The plot on the left shows default labelling with label_dollar(), which adds a dollar sign as well as a thousand separator comma. The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix “K” (for “thousands”) as well as adding custom breaks. Note that breaks is in the original scale of the data.\n\n# Left\nggplot(diamonds, aes(x = price, y = cut)) +\n geom_boxplot(alpha = 0.05) +\n scale_x_continuous(labels = label_dollar())\n\n# Right\nggplot(diamonds, aes(x = price, y = cut)) +\n geom_boxplot(alpha = 0.05) +\n scale_x_continuous(\n labels = label_dollar(scale = 1/1000, suffix = \"K\"), \n breaks = seq(1000, 19000, by = 6000)\n )\n\n\n\n\n\n\n\n\n\n\n\nAnother handy label function is label_percent():\n\nggplot(diamonds, aes(x = cut, fill = clarity)) +\n geom_bar(position = \"fill\") +\n scale_y_continuous(name = \"Percentage\", labels = label_percent())\n\n\n\n\nAnother use of breaks is when you have relatively few data points and want to highlight exactly where the observations occur. 
For example, take this plot that shows when each US president started and ended their term.\n\npresidential |>\n mutate(id = 33 + row_number()) |>\n ggplot(aes(x = start, y = id)) +\n geom_point() +\n geom_segment(aes(xend = end, yend = id)) +\n scale_x_date(name = NULL, breaks = presidential$start, date_labels = \"'%y\")\n\n\n\n\nNote that for the breaks argument we pulled out the start variable as a vector with presidential$start because we can’t do an aesthetic mapping for this argument. Also note that the specification of breaks and labels for date and datetime scales is a little different:\n\ndate_labels takes a format specification, in the same form as parse_datetime().\ndate_breaks (not shown here) takes a string like “2 days” or “1 month”.\n\n11.4.3 Legend layout\nYou will most often use breaks and labels to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.\nTo control the overall position of the legend, you need to use a theme() setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting legend.position controls where the legend is drawn:\n\nbase <- ggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = class))\n\nbase + theme(legend.position = \"right\") # the default\nbase + theme(legend.position = \"left\")\nbase + \n theme(legend.position = \"top\") +\n guides(color = guide_legend(nrow = 3))\nbase + \n theme(legend.position = \"bottom\") +\n guides(color = guide_legend(nrow = 3))\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf your plot is short and wide, place the legend at the top or bottom, and if it’s tall and narrow, place the legend at the left or right. You can also use legend.position = \"none\" to suppress the display of the legend altogether.\nTo control the display of individual legends, use guides() along with guide_legend() or guide_colorbar(). The following example shows two important settings: controlling the number of rows the legend uses with nrow, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low alpha to display many points on a plot.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = class)) +\n geom_smooth(se = FALSE) +\n theme(legend.position = \"bottom\") +\n guides(color = guide_legend(nrow = 2, override.aes = list(size = 4)))\n#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\nNote that the name of the argument in guides() matches the name of the aesthetic, just like in labs().\n\n11.4.4 Replacing a scale\nInstead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and color scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and color, you’ll be able to quickly pick up other scale replacements.\nIt’s very useful to plot transformations of your variable. For example, it’s easier to see the precise relationship between carat and price if we log transform them:\n\n# Left\nggplot(diamonds, aes(x = carat, y = price)) +\n geom_bin2d()\n\n# Right\nggplot(diamonds, aes(x = log10(carat), y = log10(price))) +\n geom_bin2d()\n\n\n\n\n\n\n\n\n\n\n\nHowever, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. 
Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.\n\nggplot(diamonds, aes(x = carat, y = price)) +\n geom_bin2d() + \n scale_x_log10() + \n scale_y_log10()\n\n\n\n\nAnother scale that is frequently customized is color. The default categorical scale picks colors that are evenly spaced around the color wheel. Useful alternatives are the ColorBrewer scales, which have been hand-tuned to work better for people with common types of color blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = drv))\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = drv)) +\n scale_color_brewer(palette = \"Set1\")\n\n\n\n\n\n\n\n\n\n\n\nDon’t forget simpler techniques for improving accessibility. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = drv, shape = drv)) +\n scale_color_brewer(palette = \"Set1\")\n\n\n\n\nThe ColorBrewer scales are documented online at https://colorbrewer2.org/ and made available in R via the RColorBrewer package, by Erich Neuwirth. Figura 11.1 shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used cut() to make a continuous variable into a categorical variable.\n\n\n\n\nFigura 11.1: All ColorBrewer scales.\n\n\n\nWhen you have a predefined mapping between values and colors, use scale_color_manual(). For example, if we map presidential party to color, we want to use the standard mapping of red for Republicans and blue for Democrats. One approach for assigning these colors is using hex color codes:\n\npresidential |>\n mutate(id = 33 + row_number()) |>\n ggplot(aes(x = start, y = id, color = party)) +\n geom_point() +\n geom_segment(aes(xend = end, yend = id)) +\n scale_color_manual(values = c(Republican = \"#E81B23\", Democratic = \"#00AEF3\"))\n\n\n\n\nFor continuous color, you can use the built-in scale_color_gradient() or scale_fill_gradient(). If you have a diverging scale, you can use scale_color_gradient2(). That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.\nAnother option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. 
These scales are available as continuous (c), discrete (d), and binned (b) palettes in ggplot2.\n\ndf <- tibble(\n x = rnorm(10000),\n y = rnorm(10000)\n)\n\nggplot(df, aes(x, y)) +\n geom_hex() +\n coord_fixed() +\n labs(title = \"Default, continuous\", x = NULL, y = NULL)\n\nggplot(df, aes(x, y)) +\n geom_hex() +\n coord_fixed() +\n scale_fill_viridis_c() +\n labs(title = \"Viridis, continuous\", x = NULL, y = NULL)\n\nggplot(df, aes(x, y)) +\n geom_hex() +\n coord_fixed() +\n scale_fill_viridis_b() +\n labs(title = \"Viridis, binned\", x = NULL, y = NULL)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote that all color scales come in two varieties: scale_color_*() and scale_fill_*() for the color and fill aesthetics respectively (the color scales are available in both UK and US spellings).\n\n11.4.5 Zooming\nThere are three ways to control the plot limits:\n\nAdjusting what data are plotted.\nSetting the limits in each scale.\nSetting xlim and ylim in coord_cartesian().\n\nWe’ll demonstrate these options in a series of plots. The plot on the left shows the relationship between engine size and fuel efficiency, colored by type of drive train. The plot on the right shows the same variables, but subsets the data that are plotted. Subsetting the data has affected the x and y scales as well as the smooth curve.\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = drv)) +\n geom_smooth()\n\n# Right\nmpg |>\n filter(displ >= 5 & displ <= 6 & hwy >= 10 & hwy <= 25) |>\n ggplot(aes(x = displ, y = hwy)) +\n geom_point(aes(color = drv)) +\n geom_smooth()\n\n\n\n\n\n\n\n\n\n\n\nLet’s compare these to the two plots below where the plot on the left sets the limits on individual scales and the plot on the right sets them in coord_cartesian(). We can see that reducing the limits is equivalent to subsetting the data. Therefore, to zoom in on a region of the plot, it’s generally best to use coord_cartesian().\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = drv)) +\n geom_smooth() +\n scale_x_continuous(limits = c(5, 6)) +\n scale_y_continuous(limits = c(10, 25))\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = drv)) +\n geom_smooth() +\n coord_cartesian(xlim = c(5, 6), ylim = c(10, 25))\n\n\n\n\n\n\n\n\n\n\n\nOn the other hand, setting the limits on individual scales is generally more useful if you want to expand the limits, e.g., to match scales across different plots. 
For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.\n\nsuv <- mpg |> filter(class == \"suv\")\ncompact <- mpg |> filter(class == \"compact\")\n\n# Left\nggplot(suv, aes(x = displ, y = hwy, color = drv)) +\n geom_point()\n\n# Right\nggplot(compact, aes(x = displ, y = hwy, color = drv)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\n\nOne way to overcome this problem is to share scales across multiple plots, training the scales with the limits of the full data.\n\nx_scale <- scale_x_continuous(limits = range(mpg$displ))\ny_scale <- scale_y_continuous(limits = range(mpg$hwy))\ncol_scale <- scale_color_discrete(limits = unique(mpg$drv))\n\n# Left\nggplot(suv, aes(x = displ, y = hwy, color = drv)) +\n geom_point() +\n x_scale +\n y_scale +\n col_scale\n\n# Right\nggplot(compact, aes(x = displ, y = hwy, color = drv)) +\n geom_point() +\n x_scale +\n y_scale +\n col_scale\n\n\n\n\n\n\n\n\n\n\n\nIn this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.\n\n11.4.6 Exercises\n\n\nWhy doesn’t the following code override the default scale?\n\ndf <- tibble(\n x = rnorm(10000),\n y = rnorm(10000)\n)\n\nggplot(df, aes(x, y)) +\n geom_hex() +\n scale_color_gradient(low = \"white\", high = \"red\") +\n coord_fixed()\n\n\nWhat is the first argument to every scale? How does it compare to labs()?\n\nChange the display of the presidential terms by:\n\nCombining the two variants that customize colors and x axis breaks.\nImproving the display of the y axis.\nLabelling each term with the name of the president.\nAdding informative plot labels.\nPlacing breaks every 4 years (this is trickier than it seems!).\n\n\n\nFirst, create the following plot. Then, modify the code using override.aes to make the legend easier to see.\n\nggplot(diamonds, aes(x = carat, y = price)) +\n geom_point(aes(color = cut), alpha = 1/20)"
  },
  {
    "objectID": "communication.html#sec-themes",
    "href": "communication.html#sec-themes",
    "title": "11  Communication",
    "section": "\n11.5 Themes",
    "text": "11.5 Themes\nFinally, you can customize the non-data elements of your plot with a theme:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(aes(color = class)) +\n geom_smooth(se = FALSE) +\n theme_bw()\n\n\n\n\nggplot2 includes the eight themes shown in Figura 11.2, with theme_gray() as the default. Many more are included in add-on packages like ggthemes (https://jrnold.github.io/ggthemes), by Jeffrey Arnold. You can also create your own themes, if you are trying to match a particular corporate or journal style.\n\n\n\n\nFigura 11.2: The eight themes built-in to ggplot2.\n\n\n\nIt’s also possible to control individual components of each theme, like the size and color of the font used for the y axis. We’ve already seen that legend.position controls where the legend is drawn. There are many other aspects of the legend that can be customized with theme(). For example, in the plot below we change the direction of the legend as well as put a black border around it. Note that customization of the legend box and plot title elements of the theme are done with element_*() functions. 
These functions specify the styling of non-data components, e.g., the title text is bolded in the face argument of element_text() and the legend border color is defined in the color argument of element_rect(). The theme elements that control the position of the title and the caption are plot.title.position and plot.caption.position, respectively. In the following plot these are set to \"plot\" to indicate these elements are aligned to the entire plot area, instead of the plot panel (the default). A few other helpful theme() components are used to change the placement and format of the title and caption text.\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n geom_point() +\n labs(\n title = \"Larger engine sizes tend to have lower fuel economy\",\n caption = \"Source: https://fueleconomy.gov.\"\n ) +\n theme(\n legend.position = c(0.6, 0.7),\n legend.direction = \"horizontal\",\n legend.box.background = element_rect(color = \"black\"),\n plot.title = element_text(face = \"bold\"),\n plot.title.position = \"plot\",\n plot.caption.position = \"plot\",\n plot.caption = element_text(hjust = 0)\n )\n\n\n\n\nFor an overview of all theme() components, see help with ?theme. The ggplot2 book is also a great place to go for the full details on theming.\n\n11.5.1 Exercises\n\nPick a theme offered by the ggthemes package and apply it to the last plot you made.\nMake the axis labels of your plot blue and bolded."
  },
  {
    "objectID": "communication.html#layout",
    "href": "communication.html#layout",
    "title": "11  Communication",
    "section": "\n11.6 Layout",
    "text": "11.6 Layout\nSo far we’ve talked about how to create and modify a single plot. What if you have multiple plots you want to lay out in a certain way? The patchwork package allows you to combine separate plots into the same graphic. We loaded this package earlier in the chapter.\nTo place two plots next to each other, you can simply add them to each other. Note that you first need to create the plots and save them as objects (in the following example they’re called p1 and p2). Then, you place them next to each other with +.\n\np1 <- ggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n labs(title = \"Plot 1\")\np2 <- ggplot(mpg, aes(x = drv, y = hwy)) + \n geom_boxplot() + \n labs(title = \"Plot 2\")\np1 + p2\n\n\n\n\nIt’s important to note that in the above code chunk we did not use a new function from the patchwork package. Instead, the package added new functionality to the + operator.\nYou can also create complex plot layouts with patchwork. In the following, | places p1 and p3 next to each other and / moves p2 to the next line.\n\np3 <- ggplot(mpg, aes(x = cty, y = hwy)) + \n geom_point() + \n labs(title = \"Plot 3\")\n(p1 | p3) / p2\n\n\n\n\nAdditionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as the dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots. Below we create 5 plots. We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with & theme(legend.position = \"top\"). Note the use of the & operator here instead of the usual +. This is because we’re modifying the theme for the patchwork plot as opposed to the individual ggplots. The legend is placed on top, inside the guide_area(). 
Finally, we have also customized the heights of the various components of our patchwork – the guide has a height of 1, the box plots 3, density plots 2, and the faceted scatterplot 4. Patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly.\n\np1 <- ggplot(mpg, aes(x = drv, y = cty, color = drv)) + \n geom_boxplot(show.legend = FALSE) + \n labs(title = \"Plot 1\")\n\np2 <- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) + \n geom_boxplot(show.legend = FALSE) + \n labs(title = \"Plot 2\")\n\np3 <- ggplot(mpg, aes(x = cty, color = drv, fill = drv)) + \n geom_density(alpha = 0.5) + \n labs(title = \"Plot 3\")\n\np4 <- ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) + \n geom_density(alpha = 0.5) + \n labs(title = \"Plot 4\")\n\np5 <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) + \n geom_point(show.legend = FALSE) + \n facet_wrap(~drv) +\n labs(title = \"Plot 5\")\n\n(guide_area() / (p1 + p2) / (p3 + p4) / p5) +\n plot_annotation(\n title = \"City and highway mileage for cars with different drive trains\",\n caption = \"Source: https://fueleconomy.gov.\"\n ) +\n plot_layout(\n guides = \"collect\",\n heights = c(1, 3, 2, 4)\n ) &\n theme(legend.position = \"top\")\n\n\n\n\nIf you’d like to learn more about combining and laying out multiple plots with patchwork, we recommend looking through the guides on the package website: https://patchwork.data-imaginist.com.\n\n11.6.1 Exercises\n\n\nWhat happens if you omit the parentheses in the following plot layout? Can you explain why this happens?\n\np1 <- ggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n labs(title = \"Plot 1\")\np2 <- ggplot(mpg, aes(x = drv, y = hwy)) + \n geom_boxplot() + \n labs(title = \"Plot 2\")\np3 <- ggplot(mpg, aes(x = cty, y = hwy)) + \n geom_point() + \n labs(title = \"Plot 3\")\n\n(p1 | p2) / p3\n\n\n\nUsing the three plots from the previous exercise, recreate the following patchwork."
  },
  {
    "objectID": "communication.html#summary",
    "href": "communication.html#summary",
    "title": "11  Communication",
    "section": "\n11.7 Summary",
    "text": "11.7 Summary\nIn this chapter you’ve learned about adding plot labels such as title, subtitle, and caption, as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot. You’ve also learned about combining multiple plots in a single graph using both simple and complex plot layouts.\nWhile you’ve so far learned about how to make many different types of plots and how to customize them using a variety of techniques, we’ve barely scratched the surface of what you can create with ggplot2. If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, ggplot2: Elegant Graphics for Data Analysis. Other useful resources are the R Graphics Cookbook by Winston Chang and Fundamentals of Data Visualization by Claus Wilke."
  },
  {
    "objectID": "communication.html#footnotes",
    "href": "communication.html#footnotes",
    "title": "11  Communication",
    "section": "",
    "text": "You can use a tool like SimDaltonism to simulate color blindness to test these images.↩︎\nMany people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. 
The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the gray background creates a continuous field of color which ensures that the plot is perceived as a single visual entity.↩︎" }, { "objectID": "transform.html", "href": "transform.html", "title": "Transform", "section": "", - "text": "The second part of the book was a deep dive into data visualization. In this part of the book, you’ll learn about the most important types of variables that you’ll encounter inside a data frame and learn the tools you can use to work with them.\n\n\n\n\nFigura 1: The options for data transformation depends heavily on the type of data involved, the subject of this part of the book.\n\n\n\nYou can read these chapters as you need them; they’re designed to be largely standalone so that they can be read out of order.\n\n?sec-logicals teaches you about logical vectors. These are the simplest types of vectors, but are extremely powerful. You’ll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for condition transformations.\n?sec-numbers dives into tools for vectors of numbers, the powerhouse of data science. You’ll learn more about counting and a bunch of important transformation and summary functions.\n?sec-strings will give you the tools to work with strings: you’ll slice them, you’ll dice them, and you’ll stick them back together again. This chapter mostly focuses on the stringr package, but you’ll also learn some more tidyr functions devoted to extracting data from character strings.\n?sec-regular-expressions introduces you to regular expressions, a powerful tool for manipulating strings. This chapter will take you from thinking that a cat walked over your keyboard to reading and writing complex string patterns.\n?sec-factors introduces factors: the data type that R uses to store categorical data. You use a factor when variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.\n?sec-dates-and-times will give you the key tools for working with dates and date-times. Unfortunately, the more you learn about date-times, the more complicated they seem to get, but with the help of the lubridate package, you’ll learn to how to overcome the most common challenges.\n?sec-missing-values discusses missing values in depth. We’ve discussed them a couple of times in isolation, but now it’s time to discuss them holistically, helping you come to grips with the difference between implicit and explicit missing values, and how and why you might convert between them.\n?sec-joins finishes up this part of the book by giving you tools to join two (or more) data frames together. Learning about joins will force you to grapple with the idea of keys, and think about how you identify each row in a dataset." + "text": "The second part of the book was a deep dive into data visualization. 
In this part of the book, you’ll learn about the most important types of variables that you’ll encounter inside a data frame and learn the tools you can use to work with them.\n\n\n\n\nFigura 1: The options for data transformation depend heavily on the type of data involved, the subject of this part of the book.\n\n\n\nYou can read these chapters as you need them; they’re designed to be largely standalone so that they can be read out of order.\n\nCapítulo 12 teaches you about logical vectors. These are the simplest types of vectors, but are extremely powerful. You’ll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for conditional transformations.\nCapítulo 13 dives into tools for vectors of numbers, the powerhouse of data science. You’ll learn more about counting and a bunch of important transformation and summary functions.\nCapítulo 14 will give you the tools to work with strings: you’ll slice them, you’ll dice them, and you’ll stick them back together again. This chapter mostly focuses on the stringr package, but you’ll also learn some more tidyr functions devoted to extracting data from character strings.\nCapítulo 15 introduces you to regular expressions, a powerful tool for manipulating strings. This chapter will take you from thinking that a cat walked over your keyboard to reading and writing complex string patterns.\nCapítulo 16 introduces factors: the data type that R uses to store categorical data. You use a factor when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.\nCapítulo 17 will give you the key tools for working with dates and date-times. Unfortunately, the more you learn about date-times, the more complicated they seem to get, but with the help of the lubridate package, you’ll learn how to overcome the most common challenges.\nCapítulo 18 discusses missing values in depth. We’ve discussed them a couple of times in isolation, but now it’s time to discuss them holistically, helping you come to grips with the difference between implicit and explicit missing values, and how and why you might convert between them.\nCapítulo 19 finishes up this part of the book by giving you tools to join two (or more) data frames together. Learning about joins will force you to grapple with the idea of keys, and think about how you identify each row in a dataset." + }, + { + "objectID": "logicals.html#introduction", + "href": "logicals.html#introduction", + "title": "12  Logical vectors", + "section": "\n12.1 Introduction", + "text": "12.1 Introduction\nIn this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: TRUE, FALSE, and NA. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.\nWe’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. 
We’ll finish off with if_else() and case_when(), two useful functions for making conditional changes powered by logical vectors.\n\n12.1.1 Prerequisites\nMost of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use mutate(), filter(), and friends to work with data frames. We’ll also continue to draw examples from the nycflights13::flights dataset.\n\nlibrary(tidyverse)\nlibrary(nycflights13)\n\nHowever, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with c():\n\nx <- c(1, 2, 3, 5, 7, 11, 13)\nx * 2\n#> [1] 2 4 6 10 14 22 26\n\nThis makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with mutate() and friends.\n\ndf <- tibble(x)\ndf |> \n mutate(y = x * 2)\n#> # A tibble: 7 × 2\n#> x y\n#> <dbl> <dbl>\n#> 1 1 2\n#> 2 2 4\n#> 3 3 6\n#> 4 5 10\n#> 5 7 14\n#> 6 11 22\n#> # ℹ 1 more row" + }, + { + "objectID": "logicals.html#comparisons", + "href": "logicals.html#comparisons", + "title": "12  Logical vectors", + "section": "\n12.2 Comparisons", + "text": "12.2 Comparisons\nA very common way to create a logical vector is via a numeric comparison with <, <=, >, >=, !=, and ==. So far, we’ve mostly created logical variables transiently within filter() — they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that arrive roughly on time:\n\nflights |> \n filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)\n#> # A tibble: 172,286 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 601 600 1 844 850\n#> 2 2013 1 1 602 610 -8 812 820\n#> 3 2013 1 1 602 605 -3 821 805\n#> 4 2013 1 1 606 610 -4 858 910\n#> 5 2013 1 1 606 610 -4 837 845\n#> 6 2013 1 1 607 607 0 858 915\n#> # ℹ 172,280 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nIt’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with mutate():\n\nflights |> \n mutate(\n daytime = dep_time > 600 & dep_time < 2000,\n approx_ontime = abs(arr_delay) < 20,\n .keep = \"used\"\n )\n#> # A tibble: 336,776 × 4\n#> dep_time arr_delay daytime approx_ontime\n#> <int> <dbl> <lgl> <lgl> \n#> 1 517 11 FALSE TRUE \n#> 2 533 20 FALSE FALSE \n#> 3 542 33 FALSE FALSE \n#> 4 544 -18 FALSE TRUE \n#> 5 554 -25 FALSE FALSE \n#> 6 554 12 FALSE TRUE \n#> # ℹ 336,770 more rows\n\nThis is particularly useful for more complicated logic because naming the intermediate steps makes it easier to both read your code and check that each step has been computed correctly.\nAll up, the initial filter is equivalent to:\n\nflights |> \n mutate(\n daytime = dep_time > 600 & dep_time < 2000,\n approx_ontime = abs(arr_delay) < 20,\n ) |> \n filter(daytime & approx_ontime)\n\n\n12.2.1 Floating point comparison\nBeware of using == with numbers. For example, it looks like this vector contains the numbers 1 and 2:\n\nx <- c(1 / 49 * 49, sqrt(2) ^ 2)\nx\n#> [1] 1 2\n\nBut if you test them for equality, you get FALSE:\n\nx == c(1, 2)\n#> [1] FALSE FALSE\n\nWhat’s going on? 
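One way to make the problem visible (a quick check of our own, not part of the original text) is to subtract the values we expected; on a standard IEEE-double platform you should see tiny non-zero differences:\n\nx - c(1, 2)\n#> [1] -1.110223e-16 4.440892e-16\n\n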
Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or sqrt(2) and subsequent computations will be very slightly off. We can see the exact values by calling print() with the digits argument:\n\nprint(x, digits = 16)\n#> [1] 0.9999999999999999 2.0000000000000004\n\nYou can see why R defaults to rounding these numbers; they really are very close to what you expect.\nNow that you’ve seen why == is failing, what can you do about it? One option is to use dplyr::near() which ignores small differences:\n\nnear(x, c(1, 2))\n#> [1] TRUE TRUE\n\n\n12.2.2 Missing values\nMissing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown:\n\nNA > 5\n#> [1] NA\n10 == NA\n#> [1] NA\n\nThe most confusing result is this one:\n\nNA == NA\n#> [1] NA\n\nIt’s easiest to understand why this is true if we artificially supply a little more context:\n\n# We don't know how old Mary is\nage_mary <- NA\n\n# We don't know how old John is\nage_john <- NA\n\n# Are Mary and John the same age?\nage_mary == age_john\n#> [1] NA\n# We don't know!\n\nSo if you want to find all flights where dep_time is missing, the following code doesn’t work because dep_time == NA will yield NA for every single row, and filter() automatically drops missing values:\n\nflights |> \n filter(dep_time == NA)\n#> # A tibble: 0 × 19\n#> # ℹ 19 variables: year <int>, month <int>, day <int>, dep_time <int>,\n#> # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>, …\n\nInstead we’ll need a new tool: is.na().\n\n12.2.3 is.na()\n\nis.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:\n\nis.na(c(TRUE, NA, FALSE))\n#> [1] FALSE TRUE FALSE\nis.na(c(1, NA, 3))\n#> [1] FALSE TRUE FALSE\nis.na(c(\"a\", NA, \"b\"))\n#> [1] FALSE TRUE FALSE\n\nWe can use is.na() to find all the rows with a missing dep_time:\n\nflights |> \n filter(is.na(dep_time))\n#> # A tibble: 8,255 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 NA 1630 NA NA 1815\n#> 2 2013 1 1 NA 1935 NA NA 2240\n#> 3 2013 1 1 NA 1500 NA NA 1825\n#> 4 2013 1 1 NA 600 NA NA 901\n#> 5 2013 1 2 NA 1540 NA NA 1747\n#> 6 2013 1 2 NA 1620 NA NA 1746\n#> # ℹ 8,249 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nis.na() can also be useful in arrange(). 
arrange() usually puts all the missing values at the end but you can override this default by first sorting by is.na():\n\nflights |> \n filter(month == 1, day == 1) |> \n arrange(dep_time)\n#> # A tibble: 842 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 836 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nflights |> \n filter(month == 1, day == 1) |> \n arrange(desc(is.na(dep_time)), dep_time)\n#> # A tibble: 842 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 NA 1630 NA NA 1815\n#> 2 2013 1 1 NA 1935 NA NA 2240\n#> 3 2013 1 1 NA 1500 NA NA 1825\n#> 4 2013 1 1 NA 600 NA NA 901\n#> 5 2013 1 1 517 515 2 830 819\n#> 6 2013 1 1 533 529 4 850 830\n#> # ℹ 836 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nWe’ll come back to cover missing values in more depth in Capítulo 18.\n\n12.2.4 Exercises\n\nHow does dplyr::near() work? Type near to see the source code. Is sqrt(2)^2 near 2?\nUse mutate(), is.na(), and count() together to describe how the missing values in dep_time, sched_dep_time and dep_delay are connected." + }, + { + "objectID": "logicals.html#boolean-algebra", + "href": "logicals.html#boolean-algebra", + "title": "12  Logical vectors", + "section": "\n12.3 Boolean algebra", + "text": "12.3 Boolean algebra\nOnce you have multiple logical vectors, you can combine them together using Boolean algebra. In R, & is “and”, | is “or”, ! is “not”, and xor() is exclusive or. For example, df |> filter(!is.na(x)) finds all rows where x is not missing and df |> filter(x < -10 | x > 0) finds all rows where x is smaller than -10 or bigger than 0. Figura 12.1 shows the complete set of Boolean operations and how they work.\n\n\n\n\nFigura 12.1: The complete set of Boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.\n\n\n\nAs well as & and |, R also has && and ||. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE or FALSE. They’re important for programming, not data science.\n\n12.3.1 Missing values\nThe rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:\n\ndf <- tibble(x = c(TRUE, FALSE, NA))\n\ndf |> \n mutate(\n and = x & NA,\n or = x | NA\n )\n#> # A tibble: 3 × 3\n#> x and or \n#> <lgl> <lgl> <lgl>\n#> 1 TRUE NA TRUE \n#> 2 FALSE FALSE NA \n#> 3 NA NA NA\n\nTo understand what’s going on, think about NA | TRUE (NA or TRUE). A missing value in a logical vector means that the value could either be TRUE or FALSE. TRUE | TRUE and FALSE | TRUE are both TRUE because at least one of them is TRUE. NA | TRUE must also be TRUE because NA can either be TRUE or FALSE. However, NA | FALSE is NA because we don’t know if NA is TRUE or FALSE. Similar reasoning applies with NA & FALSE.\n\n12.3.2 Order of operations\nNote that the order of operations doesn’t work like English. 
Take the following code that finds all flights that departed in November or December:\n\nflights |> \n filter(month == 11 | month == 12)\n\nYou might be tempted to write it like you’d say in English: “Find all flights that departed in November or December.”:\n\nflights |> \n filter(month == 11 | 12)\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nThis code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates month == 11 creating a logical vector, which we call nov. It computes nov | 12. When you use a number with a logical operator it converts everything apart from 0 to TRUE, so this is equivalent to nov | TRUE which will always be TRUE, so every row will be selected:\n\nflights |> \n mutate(\n nov = month == 11,\n final = nov | 12,\n .keep = \"used\"\n )\n#> # A tibble: 336,776 × 3\n#> month nov final\n#> <int> <lgl> <lgl>\n#> 1 1 FALSE TRUE \n#> 2 1 FALSE TRUE \n#> 3 1 FALSE TRUE \n#> 4 1 FALSE TRUE \n#> 5 1 FALSE TRUE \n#> 6 1 FALSE TRUE \n#> # ℹ 336,770 more rows\n\n\n12.3.3 %in%\n\nAn easy way to avoid the problem of getting your ==s and |s in the right order is to use %in%. x %in% y returns a logical vector the same length as x that is TRUE whenever a value in x is anywhere in y.\n\n1:12 %in% c(1, 5, 11)\n#> [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE\nletters[1:10] %in% c(\"a\", \"e\", \"i\", \"o\", \"u\")\n#> [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE\n\nSo to find all flights in November or December we could write:\n\nflights |> \n filter(month %in% c(11, 12))\n\nNote that %in% obeys different rules for NA than ==, as NA %in% NA is TRUE.\n\nc(1, 2, NA) == NA\n#> [1] NA NA NA\nc(1, 2, NA) %in% NA\n#> [1] FALSE FALSE TRUE\n\nThis can make for a useful shortcut:\n\nflights |> \n filter(dep_time %in% c(NA, 0800))\n#> # A tibble: 8,803 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 800 800 0 1022 1014\n#> 2 2013 1 1 800 810 -10 949 955\n#> 3 2013 1 1 NA 1630 NA NA 1815\n#> 4 2013 1 1 NA 1935 NA NA 2240\n#> 5 2013 1 1 NA 1500 NA NA 1825\n#> 6 2013 1 1 NA 600 NA NA 901\n#> # ℹ 8,797 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\n\n12.3.4 Exercises\n\nFind all flights where arr_delay is missing but dep_delay is not. Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.\nHow many flights have a missing dep_time? What other variables are missing in these rows? What might these rows represent?\nAssuming that a missing dep_time implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?"
+ }, + { + "objectID": "logicals.html#sec-logical-summaries", + "href": "logicals.html#sec-logical-summaries", + "title": "12  Logical vectors", + "section": "\n12.4 Summaries", + "text": "12.4 Summaries\nThe following sections describe some useful techniques for summarizing logical vectors. As well as functions that only work specifically with logical vectors, you can also use functions that work with numeric vectors.\n\n12.4.1 Logical summaries\nThere are two main logical summaries: any() and all(). any(x) is the equivalent of |; it’ll return TRUE if there are any TRUE’s in x. all(x) is the equivalent of &; it’ll return TRUE only if all values of x are TRUE’s. Like all summary functions, they’ll return NA if there are any missing values present, and as usual you can make the missing values go away with na.rm = TRUE.\nFor example, we could use all() and any() to find out if every flight was delayed on departure by at most an hour or if any flights were delayed on arrival by five hours or more. And using group_by() allows us to do that by day:\n\nflights |> \n group_by(year, month, day) |> \n summarize(\n all_delayed = all(dep_delay <= 60, na.rm = TRUE),\n any_long_delay = any(arr_delay >= 300, na.rm = TRUE),\n .groups = \"drop\"\n )\n#> # A tibble: 365 × 5\n#> year month day all_delayed any_long_delay\n#> <int> <int> <int> <lgl> <lgl> \n#> 1 2013 1 1 FALSE TRUE \n#> 2 2013 1 2 FALSE TRUE \n#> 3 2013 1 3 FALSE FALSE \n#> 4 2013 1 4 FALSE FALSE \n#> 5 2013 1 5 FALSE TRUE \n#> 6 2013 1 6 FALSE FALSE \n#> # ℹ 359 more rows\n\nIn most cases, however, any() and all() are a little too crude, and it would be nice to be able to get a little more detail about how many values are TRUE or FALSE. That leads us to the numeric summaries.\n\n12.4.2 Numeric summaries of logical vectors\nWhen you use a logical vector in a numeric context, TRUE becomes 1 and FALSE becomes 0. This makes sum() and mean() very useful with logical vectors because sum(x) gives the number of TRUEs and mean(x) gives the proportion of TRUEs (because mean() is just sum() divided by length()).\nThat, for example, allows us to see the proportion of flights that were delayed on departure by at most an hour and the number of flights that were delayed on arrival by five hours or more:\n\nflights |> \n group_by(year, month, day) |> \n summarize(\n all_delayed = mean(dep_delay <= 60, na.rm = TRUE),\n any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),\n .groups = \"drop\"\n )\n#> # A tibble: 365 × 5\n#> year month day all_delayed any_long_delay\n#> <int> <int> <int> <dbl> <int>\n#> 1 2013 1 1 0.939 3\n#> 2 2013 1 2 0.914 3\n#> 3 2013 1 3 0.941 0\n#> 4 2013 1 4 0.953 0\n#> 5 2013 1 5 0.964 1\n#> 6 2013 1 6 0.959 0\n#> # ℹ 359 more rows\n\n\n12.4.3 Logical subsetting\nThere’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base [ (pronounced subset) operator, which you’ll learn more about in Seção 27.2.\nImagine we wanted to look at the average delay just for flights that were actually delayed. 
One way to do so would be to first filter the flights and then calculate the average delay:\n\nflights |> \n filter(arr_delay > 0) |> \n group_by(year, month, day) |> \n summarize(\n behind = mean(arr_delay),\n n = n(),\n .groups = \"drop\"\n )\n#> # A tibble: 365 × 5\n#> year month day behind n\n#> <int> <int> <int> <dbl> <int>\n#> 1 2013 1 1 32.5 461\n#> 2 2013 1 2 32.0 535\n#> 3 2013 1 3 27.7 460\n#> 4 2013 1 4 28.3 297\n#> 5 2013 1 5 22.6 238\n#> 6 2013 1 6 24.4 381\n#> # ℹ 359 more rows\n\nThis works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together. Instead you could use [ to perform an inline filtering: arr_delay[arr_delay > 0] will yield only the positive arrival delays.\nThis leads to:\n\nflights |> \n group_by(year, month, day) |> \n summarize(\n behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),\n ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),\n n = n(),\n .groups = \"drop\"\n )\n#> # A tibble: 365 × 6\n#> year month day behind ahead n\n#> <int> <int> <int> <dbl> <dbl> <int>\n#> 1 2013 1 1 32.5 -12.5 842\n#> 2 2013 1 2 32.0 -14.3 943\n#> 3 2013 1 3 27.7 -18.2 914\n#> 4 2013 1 4 28.3 -17.0 915\n#> 5 2013 1 5 22.6 -14.0 720\n#> 6 2013 1 6 24.4 -13.6 832\n#> # ℹ 359 more rows\n\nAlso note the difference in the group size: in the first chunk n() gives the number of delayed flights per day; in the second, n() gives the total number of flights.\n\n12.4.4 Exercises\n\nWhat will sum(is.na(x)) tell you? How about mean(is.na(x))?\nWhat does prod() return when applied to a logical vector? What logical summary function is it equivalent to? What does min() return when applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments." + }, + { + "objectID": "logicals.html#conditional-transformations", + "href": "logicals.html#conditional-transformations", + "title": "12  Logical vectors", + "section": "\n12.5 Conditional transformations", + "text": "12.5 Conditional transformations\nOne of the most powerful features of logical vectors is their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: if_else() and case_when().\n\n12.5.1 if_else()\n\nIf you want to use one value when a condition is TRUE and another value when it’s FALSE, you can use dplyr::if_else(). You’ll always use the first three arguments of if_else(). The first argument, condition, is a logical vector, the second, true, gives the output when the condition is true, and the third, false, gives the output if the condition is false.\nLet’s begin with a simple example of labeling a numeric vector as either “+ve” (positive) or “-ve” (negative):\n\nx <- c(-3:3, NA)\nif_else(x > 0, \"+ve\", \"-ve\")\n#> [1] \"-ve\" \"-ve\" \"-ve\" \"-ve\" \"+ve\" \"+ve\" \"+ve\" NA\n\nThere’s an optional fourth argument, missing, which will be used if the input is NA:\n\nif_else(x > 0, \"+ve\", \"-ve\", \"???\")\n#> [1] \"-ve\" \"-ve\" \"-ve\" \"-ve\" \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nYou can also use vectors for the true and false arguments. For example, this allows us to create a minimal implementation of abs():\n\nif_else(x < 0, -x, x)\n#> [1] 3 2 1 0 1 2 3 NA\n\nSo far all the arguments have used the same vectors, but you can of course mix and match. 
For example, you could implement a simple version of coalesce() like this:\n\nx1 <- c(NA, 1, 2, NA)\ny1 <- c(3, NA, 4, 6)\nif_else(is.na(x1), y1, x1)\n#> [1] 3 1 2 6\n\nYou might have noticed a small infelicity in our labeling example above: zero is neither positive nor negative. We could resolve this by adding an additional if_else():\n\nif_else(x == 0, \"0\", if_else(x < 0, \"-ve\", \"+ve\"), \"???\")\n#> [1] \"-ve\" \"-ve\" \"-ve\" \"0\" \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nThis is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to dplyr::case_when().\n\n12.5.2 case_when()\n\ndplyr’s case_when() is inspired by SQL’s CASE statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like condition ~ output. condition must be a logical vector; when it’s TRUE, output will be used.\nThis means we could recreate our previous nested if_else() as follows:\n\nx <- c(-3:3, NA)\ncase_when(\n x == 0 ~ \"0\",\n x < 0 ~ \"-ve\", \n x > 0 ~ \"+ve\",\n is.na(x) ~ \"???\"\n)\n#> [1] \"-ve\" \"-ve\" \"-ve\" \"0\" \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nThis is more code, but it’s also more explicit.\nTo explain how case_when() works, let’s explore some simpler cases. If none of the cases match, the output gets an NA:\n\ncase_when(\n x < 0 ~ \"-ve\",\n x > 0 ~ \"+ve\"\n)\n#> [1] \"-ve\" \"-ve\" \"-ve\" NA \"+ve\" \"+ve\" \"+ve\" NA\n\nUse .default if you want to create a “default”/catch all value:\n\ncase_when(\n x < 0 ~ \"-ve\",\n x > 0 ~ \"+ve\",\n .default = \"???\"\n)\n#> [1] \"-ve\" \"-ve\" \"-ve\" \"???\" \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nAnd note that if multiple conditions match, only the first will be used:\n\ncase_when(\n x > 0 ~ \"+ve\",\n x > 2 ~ \"big\"\n)\n#> [1] NA NA NA NA \"+ve\" \"+ve\" \"+ve\" NA\n\nJust like with if_else() you can use variables on both sides of the ~ and you can mix and match variables as needed for your problem. For example, we could use case_when() to provide some human readable labels for the arrival delay:\n\nflights |> \n mutate(\n status = case_when(\n is.na(arr_delay) ~ \"cancelled\",\n arr_delay < -30 ~ \"very early\",\n arr_delay < -15 ~ \"early\",\n abs(arr_delay) <= 15 ~ \"on time\",\n arr_delay < 60 ~ \"late\",\n arr_delay < Inf ~ \"very late\",\n ),\n .keep = \"used\"\n )\n#> # A tibble: 336,776 × 2\n#> arr_delay status \n#> <dbl> <chr> \n#> 1 11 on time\n#> 2 20 late \n#> 3 33 late \n#> 4 -18 early \n#> 5 -25 early \n#> 6 12 on time\n#> # ℹ 336,770 more rows\n\nBe wary when writing this sort of complex case_when() statement; my first two attempts used a mix of < and > and I kept accidentally creating overlapping conditions.\n\n12.5.3 Compatible types\nNote that both if_else() and case_when() require compatible types in the output. If they’re not compatible, you’ll see errors like this:\n\nif_else(TRUE, \"a\", 1)\n#> Error in `if_else()`:\n#> ! Can't combine `true` <character> and `false` <double>.\n\ncase_when(\n x < -1 ~ TRUE, \n x > 0 ~ now()\n)\n#> Error in `case_when()`:\n#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>.\n\nOverall, relatively few types are compatible, because automatically converting one type of vector to another is a common source of errors. 
Here are the most important cases that are compatible:\n\nNumeric and logical vectors are compatible, as we discussed in Seção 12.4.2.\nStrings and factors (Capítulo 16) are compatible, because you can think of a factor as a string with a restricted set of values.\nDates and date-times, which we’ll discuss in Capítulo 17, are compatible because you can think of a date as a special case of date-time.\n\nNA, which is technically a logical vector, is compatible with everything because every vector has some way of representing a missing value.\n\nWe don’t expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.\n\n12.5.4 Exercises\n\nA number is even if it’s divisible by two, which in R you can find out with x %% 2 == 0. Use this fact and if_else() to determine whether each number between 0 and 20 is even or odd.\nGiven a vector of days like x <- c(\"Monday\", \"Saturday\", \"Wednesday\"), use an ifelse() statement to label them as weekends or weekdays.\nUse ifelse() to compute the absolute value of a numeric vector called x.\nWrite a case_when() statement that uses the month and day columns from flights to label a selection of important US holidays (e.g., New Years Day, 4th of July, Thanksgiving, and Christmas). First create a logical column that is either TRUE or FALSE, and then create a character column that either gives the name of the holiday or is NA." + }, + { + "objectID": "logicals.html#summary", + "href": "logicals.html#summary", + "title": "12  Logical vectors", + "section": "\n12.6 Summary", + "text": "12.6 Summary\nThe definition of a logical vector is simple because each value must be either TRUE, FALSE, or NA. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with >, <, <=, >=, ==, !=, and is.na(), how to combine them with !, &, and |, and how to summarize them with any(), all(), sum(), and mean(). You also learned the powerful if_else() and case_when() functions that allow you to return values depending on the value of a logical vector.\nWe’ll see logical vectors again and again in the following chapters. For example in Capítulo 14 you’ll learn about str_detect(x, pattern) which returns a logical vector that’s TRUE for the elements of x that match the pattern, and in Capítulo 17 you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors." + }, + { + "objectID": "logicals.html#footnotes", + "href": "logicals.html#footnotes", + "title": "12  Logical vectors", + "section": "", + "text": "R normally calls print for you (i.e. x is a shortcut for print(x)), but calling it explicitly is useful if you want to provide other arguments.↩︎\nThat is, xor(x, y) is true if x is true, or y is true, but not both. This is how we usually use “or” in English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.↩︎\nWe’ll cover this in Capítulo 19.↩︎\ndplyr’s if_else() is very similar to base R’s ifelse(). 
There are two main advantages of if_else() over ifelse(): you can choose what should happen to missing values, and if_else() is much more likely to give you a meaningful error if your variables have incompatible types.↩︎" + }, + { + "objectID": "numbers.html#introduction", + "href": "numbers.html#introduction", + "title": "13  Numbers", + "section": "\n13.1 Introduction", + "text": "13.1 Introduction\nNumeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.\nWe’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of count(). Then we’ll dive into various numeric transformations that pair well with mutate(), including more general transformations that can be applied to other types of vectors, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with summarize() and show you how they can also be used with mutate().\n\n13.1.1 Prerequisites\nThis chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like mutate() and filter(). Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with c() and tribble().\n\nlibrary(tidyverse)\nlibrary(nycflights13)" + }, + { + "objectID": "numbers.html#making-numbers", + "href": "numbers.html#making-numbers", + "title": "13  Numbers", + "section": "\n13.2 Making numbers", + "text": "13.2 Making numbers\nIn most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.\nreadr provides two useful functions for parsing strings into numbers: parse_double() and parse_number(). Use parse_double() when you have numbers that have been written as strings:\n\nx <- c(\"1.2\", \"5.6\", \"1e3\")\nparse_double(x)\n#> [1] 1.2 5.6 1000.0\n\nUse parse_number() when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:\n\nx <- c(\"$1,234\", \"USD 3,513\", \"59%\")\nparse_number(x)\n#> [1] 1234 3513 59" + }, + { + "objectID": "numbers.html#sec-counts", + "href": "numbers.html#sec-counts", + "title": "13  Numbers", + "section": "\n13.3 Counts", + "text": "13.3 Counts\nIt’s surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with count(). 
This function is great for quick exploration and checks during analysis:\n\nflights |> count(dest)\n#> # A tibble: 105 × 2\n#> dest n\n#> <chr> <int>\n#> 1 ABQ 254\n#> 2 ACK 265\n#> 3 ALB 439\n#> 4 ANC 8\n#> 5 ATL 17215\n#> 6 AUS 2439\n#> # ℹ 99 more rows\n\n(Despite the advice in Capítulo 4, we usually put count() on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)\nIf you want to see the most common values, add sort = TRUE:\n\nflights |> count(dest, sort = TRUE)\n#> # A tibble: 105 × 2\n#> dest n\n#> <chr> <int>\n#> 1 ORD 17283\n#> 2 ATL 17215\n#> 3 LAX 16174\n#> 4 BOS 15508\n#> 5 MCO 14082\n#> 6 CLT 14064\n#> # ℹ 99 more rows\n\nAnd remember that if you want to see all the values, you can use |> View() or |> print(n = Inf).\nYou can perform the same computation “by hand” with group_by(), summarize() and n(). This is useful because it allows you to compute other summaries at the same time:\n\nflights |> \n group_by(dest) |> \n summarize(\n n = n(),\n delay = mean(arr_delay, na.rm = TRUE)\n )\n#> # A tibble: 105 × 3\n#> dest n delay\n#> <chr> <int> <dbl>\n#> 1 ABQ 254 4.38\n#> 2 ACK 265 4.85\n#> 3 ALB 439 14.4 \n#> 4 ANC 8 -2.5 \n#> 5 ATL 17215 11.3 \n#> 6 AUS 2439 6.02\n#> # ℹ 99 more rows\n\nn() is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:\n\nn()\n#> Error in `n()`:\n#> ! Must only be used inside data-masking verbs like `mutate()`,\n#> `filter()`, and `group_by()`.\n\nThere are a couple of variants of n() and count() that you might find useful:\n\n\nn_distinct(x) counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:\n\nflights |> \n group_by(dest) |> \n summarize(carriers = n_distinct(carrier)) |> \n arrange(desc(carriers))\n#> # A tibble: 105 × 2\n#> dest carriers\n#> <chr> <int>\n#> 1 ATL 7\n#> 2 BOS 7\n#> 3 CLT 7\n#> 4 ORD 7\n#> 5 TPA 7\n#> 6 AUS 6\n#> # ℹ 99 more rows\n\n\n\nA weighted count is a sum. For example you could “count” the number of miles each plane flew:\n\nflights |> \n group_by(tailnum) |> \n summarize(miles = sum(distance))\n#> # A tibble: 4,044 × 2\n#> tailnum miles\n#> <chr> <dbl>\n#> 1 D942DN 3418\n#> 2 N0EGMQ 250866\n#> 3 N10156 115966\n#> 4 N102UW 25722\n#> 5 N103US 24619\n#> 6 N104UW 25157\n#> # ℹ 4,038 more rows\n\nWeighted counts are a common problem so count() has a wt argument that does the same thing:\n\nflights |> count(tailnum, wt = distance)\n\n\n\nYou can count missing values by combining sum() and is.na(). 
In the flights dataset this represents flights that are cancelled:\n\nflights |> \n group_by(dest) |> \n summarize(n_cancelled = sum(is.na(dep_time))) \n#> # A tibble: 105 × 2\n#> dest n_cancelled\n#> <chr> <int>\n#> 1 ABQ 0\n#> 2 ACK 0\n#> 3 ALB 20\n#> 4 ANC 0\n#> 5 ATL 317\n#> 6 AUS 21\n#> # ℹ 99 more rows\n\n\n13.3.1 Exercises\n\nHow can you use count() to count the number of rows with a missing value for a given variable?\nExpand the following calls to count() to instead use group_by(), summarize(), and arrange():\n\nflights |> count(dest, sort = TRUE)\nflights |> count(tailnum, wt = distance)" + }, + { + "objectID": "numbers.html#numeric-transformations", + "href": "numbers.html#numeric-transformations", + "title": "13  Numbers", + "section": "\n13.4 Numeric transformations", + "text": "13.4 Numeric transformations\nTransformation functions work well with mutate() because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It’s impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don’t list them here because they’re rarely needed for data science.\n\n13.4.1 Arithmetic and recycling rules\nWe introduced the basics of arithmetic (+, -, *, /, ^) in Capítulo 2 and have used them a bunch since. These functions don’t need a huge amount of explanation because they do what you learned in grade school. But we need to briefly talk about the recycling rules which determine what happens when the left and right hand sides have different lengths. This is important for operations like flights |> mutate(air_time = air_time / 60) because there are 336,776 numbers on the left of / but only one on the right.\nR handles mismatched lengths by recycling, or repeating, the short vector. We can see this in operation more easily if we create some vectors outside of a data frame:\n\nx <- c(1, 2, 10, 20)\nx / 5\n#> [1] 0.2 0.4 2.0 4.0\n# is shorthand for\nx / c(5, 5, 5, 5)\n#> [1] 0.2 0.4 2.0 4.0\n\nGenerally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector. It usually (but not always) gives you a warning if the longer vector isn’t a multiple of the shorter:\n\nx * c(1, 2)\n#> [1] 1 4 10 40\nx * c(1, 2, 3)\n#> Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter\n#> object length\n#> [1] 1 4 30 20\n\nThese recycling rules are also applied to logical comparisons (==, <, <=, >, >=, !=) and can lead to a surprising result if you accidentally use == instead of %in% and the data frame has an unfortunate number of rows. For example, take this code which attempts to find all flights in January and February:\n\nflights |> \n filter(month == c(1, 2))\n#> # A tibble: 25,977 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 542 540 2 923 850\n#> 3 2013 1 1 554 600 -6 812 837\n#> 4 2013 1 1 555 600 -5 913 854\n#> 5 2013 1 1 557 600 -3 838 846\n#> 6 2013 1 1 558 600 -2 849 851\n#> # ℹ 25,971 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nThe code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. 
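You can see the same silent recycling with a toy comparison of our own (not part of the original text); the shorter vector is recycled to c(1, 2, 1, 2):\n\nc(1, 2, 1, 3) == c(1, 2)\n#> [1] TRUE TRUE TRUE FALSE\n\n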
And unfortunately there’s no warning because flights has an even number of rows.\nTo protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function ==, not filter().\n\n13.4.2 Minimum and maximum\nThe arithmetic functions work with pairs of variables. Two closely related functions are pmin() and pmax(), which when given two or more variables will return the smallest or largest value in each row:\n\ndf <- tribble(\n ~x, ~y,\n 1, 3,\n 5, 2,\n 7, NA,\n)\n\ndf |> \n mutate(\n min = pmin(x, y, na.rm = TRUE),\n max = pmax(x, y, na.rm = TRUE)\n )\n#> # A tibble: 3 × 4\n#> x y min max\n#> <dbl> <dbl> <dbl> <dbl>\n#> 1 1 3 1 3\n#> 2 5 2 2 5\n#> 3 7 NA 7 7\n\nNote that these are different to the summary functions min() and max() which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value:\n\ndf |> \n mutate(\n min = min(x, y, na.rm = TRUE),\n max = max(x, y, na.rm = TRUE)\n )\n#> # A tibble: 3 × 4\n#> x y min max\n#> <dbl> <dbl> <dbl> <dbl>\n#> 1 1 3 1 7\n#> 2 5 2 1 7\n#> 3 7 NA 1 7\n\n\n13.4.3 Modular arithmetic\nModular arithmetic is the technical name for the type of math you did before you learned about decimal places, i.e. division that yields a whole number and a remainder. In R, %/% does integer division and %% computes the remainder:\n\n1:10 %/% 3\n#> [1] 0 0 1 1 1 2 2 2 3 3\n1:10 %% 3\n#> [1] 1 2 0 1 2 0 1 2 0 1\n\nModular arithmetic is handy for the flights dataset, because we can use it to unpack the sched_dep_time variable into hour and minute:\n\nflights |> \n mutate(\n hour = sched_dep_time %/% 100,\n minute = sched_dep_time %% 100,\n .keep = \"used\"\n )\n#> # A tibble: 336,776 × 3\n#> sched_dep_time hour minute\n#> <int> <dbl> <dbl>\n#> 1 515 5 15\n#> 2 529 5 29\n#> 3 540 5 40\n#> 4 545 5 45\n#> 5 600 6 0\n#> 6 558 5 58\n#> # ℹ 336,770 more rows\n\nWe can combine that with the mean(is.na(x)) trick from Seção 12.4 to see how the proportion of cancelled flights varies over the course of the day. The results are shown in Figura 13.1.\n\nflights |> \n group_by(hour = sched_dep_time %/% 100) |> \n summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |> \n filter(hour > 1) |> \n ggplot(aes(x = hour, y = prop_cancelled)) +\n geom_line(color = \"grey50\") + \n geom_point(aes(size = n))\n\n\n\nFigura 13.1: A line plot with scheduled departure hour on the x-axis, and proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.\n\n\n\n\n13.4.4 Logarithms\nLogarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude and converting exponential growth to linear growth. In R, you have a choice of three logarithms: log() (the natural log, base e), log2() (base 2), and log10() (base 10). We recommend using log2() or log10(). log2() is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas log10() is easy to back-transform because (e.g.) a 3 on the log10 scale corresponds to 10^3 = 1000. 
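To make both properties concrete, here is a tiny illustration of our own (not from the original text):\n\nlog2(c(1, 2, 4, 8))\n#> [1] 0 1 2 3\nlog10(c(1, 10, 100, 1000))\n#> [1] 0 1 2 3\n\nEach step of one on the log2 scale doubles the original value, and each step of one on the log10 scale multiplies it by ten.\n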
The inverse of log() is exp(); to compute the inverse of log2() or log10() you’ll need to use 2^ or 10^.\n\n13.4.5 Rounding\nUse round(x) to round a number to the nearest integer:\n\nround(123.456)\n#> [1] 123\n\nYou can control the precision of the rounding with the second argument, digits. round(x, digits) rounds to the nearest 10^-digits so digits = 2 will round to the nearest 0.01. This definition is useful because it implies round(x, -3) will round to the nearest thousand, which indeed it does:\n\nround(123.456, 2) # two digits\n#> [1] 123.46\nround(123.456, 1) # one digit\n#> [1] 123.5\nround(123.456, -1) # round to nearest ten\n#> [1] 120\nround(123.456, -2) # round to nearest hundred\n#> [1] 100\n\nThere’s one weirdness with round() that seems surprising at first glance:\n\nround(c(1.5, 2.5))\n#> [1] 2 2\n\nround() uses what’s known as “round half to even” or Banker’s rounding: if a number is half way between two integers, it will be rounded to the even integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.\nround() is paired with floor() which always rounds down and ceiling() which always rounds up:\n\nx <- 123.456\n\nfloor(x)\n#> [1] 123\nceiling(x)\n#> [1] 124\n\nThese functions don’t have a digits argument, so you can instead scale down, round, and then scale back up:\n\n# Round down to nearest two digits\nfloor(x / 0.01) * 0.01\n#> [1] 123.45\n# Round up to nearest two digits\nceiling(x / 0.01) * 0.01\n#> [1] 123.46\n\nYou can use the same technique if you want to round() to a multiple of some other number:\n\n# Round to nearest multiple of 4\nround(x / 4) * 4\n#> [1] 124\n\n# Round to nearest 0.25\nround(x / 0.25) * 0.25\n#> [1] 123.5\n\n\n13.4.6 Cutting numbers into ranges\nUse cut() to break up (aka bin) a numeric vector into discrete buckets:\n\nx <- c(1, 2, 5, 10, 15, 20)\ncut(x, breaks = c(0, 5, 10, 15, 20))\n#> [1] (0,5] (0,5] (0,5] (5,10] (10,15] (15,20]\n#> Levels: (0,5] (5,10] (10,15] (15,20]\n\nThe breaks don’t need to be evenly spaced:\n\ncut(x, breaks = c(0, 5, 10, 100))\n#> [1] (0,5] (0,5] (0,5] (5,10] (10,100] (10,100]\n#> Levels: (0,5] (5,10] (10,100]\n\nYou can optionally supply your own labels. Note that there should be one fewer label than breaks.\n\ncut(x, \n breaks = c(0, 5, 10, 15, 20), \n labels = c(\"sm\", \"md\", \"lg\", \"xl\")\n)\n#> [1] sm sm sm md lg xl\n#> Levels: sm md lg xl\n\nAny values outside of the range of the breaks will become NA:\n\ny <- c(NA, -10, 5, 10, 30)\ncut(y, breaks = c(0, 5, 10, 15, 20))\n#> [1] <NA> <NA> (0,5] (5,10] <NA> \n#> Levels: (0,5] (5,10] (10,15] (15,20]\n\nSee the documentation for other useful arguments like right and include.lowest, which control if the intervals are [a, b) or (a, b] and if the lowest interval should be [a, b].\n\n13.4.7 Cumulative and rolling aggregates\nBase R provides cumsum(), cumprod(), cummin(), cummax() for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean() for cumulative means. Cumulative sums tend to come up the most in practice:\n\nx <- 1:10\ncumsum(x)\n#> [1] 1 3 6 10 15 21 28 36 45 55\n\nIf you need more complex rolling or sliding aggregates, try the slider package.\n\n13.4.8 Exercises\n\nExplain in words what each line of the code used to generate Figura 13.1 does.\nWhat trigonometric functions does R provide? Guess some names and look up the documentation. 
Do they use degrees or radians?\n\nCurrently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem by running the code below: there’s a gap between each hour.\n\nflights |> \n filter(month == 1, day == 1) |> \n ggplot(aes(x = sched_dep_time, y = dep_delay)) +\n geom_point()\n\nConvert them to a more truthful representation of time (either fractional hours or minutes since midnight).\n\nRound dep_time and arr_time to the nearest five minutes." + }, + { + "objectID": "numbers.html#general-transformations", + "href": "numbers.html#general-transformations", + "title": "13  Numbers", + "section": "\n13.5 General transformations", + "text": "13.5 General transformations\nThe following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.\n\n13.5.1 Ranks\ndplyr provides a number of ranking functions inspired by SQL, but you should always start with dplyr::min_rank(). It uses the typical method for dealing with ties, e.g., 1st, 2nd, 2nd, 4th.\n\nx <- c(1, 2, 2, 3, 4, NA)\nmin_rank(x)\n#> [1] 1 2 2 4 5 NA\n\nNote that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks:\n\nmin_rank(desc(x))\n#> [1] 5 3 3 2 1 NA\n\nIf min_rank() doesn’t do what you need, look at the variants dplyr::row_number(), dplyr::dense_rank(), dplyr::percent_rank(), and dplyr::cume_dist(). See the documentation for details.\n\ndf <- tibble(x = x)\ndf |> \n mutate(\n row_number = row_number(x),\n dense_rank = dense_rank(x),\n percent_rank = percent_rank(x),\n cume_dist = cume_dist(x)\n )\n#> # A tibble: 6 × 5\n#> x row_number dense_rank percent_rank cume_dist\n#> <dbl> <int> <int> <dbl> <dbl>\n#> 1 1 1 1 0 0.2\n#> 2 2 2 2 0.25 0.6\n#> 3 2 3 2 0.25 0.6\n#> 4 3 4 3 0.75 0.8\n#> 5 4 5 4 1 1 \n#> 6 NA NA NA NA NA\n\nYou can achieve many of the same results by picking the appropriate ties.method argument to base R’s rank(); you’ll probably also want to set na.last = \"keep\" to keep NAs as NA.\nrow_number() can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row. When combined with %% or %/% this can be a useful tool for dividing data into similarly sized groups:\n\ndf <- tibble(id = 1:10)\n\ndf |> \n mutate(\n row0 = row_number() - 1,\n three_groups = row0 %% 3,\n three_in_each_group = row0 %/% 3\n )\n#> # A tibble: 10 × 4\n#> id row0 three_groups three_in_each_group\n#> <int> <dbl> <dbl> <dbl>\n#> 1 1 0 0 0\n#> 2 2 1 1 0\n#> 3 3 2 2 0\n#> 4 4 3 0 1\n#> 5 5 4 1 1\n#> 6 6 5 2 1\n#> # ℹ 4 more rows\n\n\n13.5.2 Offsets\ndplyr::lead() and dplyr::lag() allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with NAs at the start or end:\n\nx <- c(2, 5, 11, 11, 19, 35)\nlag(x)\n#> [1] NA 2 5 11 11 19\nlead(x)\n#> [1] 5 11 11 19 35 NA\n\n\n\nx - lag(x) gives you the difference between the current and previous value.\n\nx - lag(x)\n#> [1] NA 3 6 0 8 16\n\n\n\nx == lag(x) tells you when the current value is the same as the previous one.\n\nx == lag(x)\n#> [1] NA FALSE FALSE TRUE FALSE FALSE\n\n\n\nYou can lead or lag by more than one position by using the second argument, n.\n\n13.5.3 Consecutive identifiers\nSometimes you want to start a new group every time some event occurs. 
For example, when you’re looking at website data, it’s common to want to break up events into sessions, where you begin a new session after a gap of more than x minutes since the last activity. For example, imagine you have the times when someone visited a website:\n\nevents <- tibble(\n time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)\n)\n\nAnd you’ve computed the time between each event, and figured out if there’s a gap that’s big enough to qualify:\n\nevents <- events |> \n mutate(\n diff = time - lag(time, default = first(time)),\n has_gap = diff >= 5\n )\nevents\n#> # A tibble: 14 × 3\n#> time diff has_gap\n#> <dbl> <dbl> <lgl> \n#> 1 0 0 FALSE \n#> 2 1 1 FALSE \n#> 3 2 1 FALSE \n#> 4 3 1 FALSE \n#> 5 5 2 FALSE \n#> 6 10 5 TRUE \n#> # ℹ 8 more rows\n\nBut how do we go from that logical vector to something that we can group_by()? cumsum(), from Seção 13.4.7, comes to the rescue: each gap, i.e. each row where has_gap is TRUE, will increment group by one (Seção 12.4.2):\n\nevents |> mutate(\n group = cumsum(has_gap)\n)\n#> # A tibble: 14 × 4\n#> time diff has_gap group\n#> <dbl> <dbl> <lgl> <int>\n#> 1 0 0 FALSE 0\n#> 2 1 1 FALSE 0\n#> 3 2 1 FALSE 0\n#> 4 3 1 FALSE 0\n#> 5 5 2 FALSE 0\n#> 6 10 5 TRUE 1\n#> # ℹ 8 more rows\n\nAnother approach for creating grouping variables is consecutive_id(), which starts a new group every time one of its arguments changes. For example, inspired by this stackoverflow question, imagine you have a data frame with a bunch of repeated values:\n\ndf <- tibble(\n x = c(\"a\", \"a\", \"a\", \"b\", \"c\", \"c\", \"d\", \"e\", \"a\", \"a\", \"b\", \"b\"),\n y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)\n)\n\nIf you want to keep the first row from each repeated x, you could use group_by(), consecutive_id(), and slice_head():\n\ndf |> \n group_by(id = consecutive_id(x)) |> \n slice_head(n = 1)\n#> # A tibble: 7 × 3\n#> # Groups: id [7]\n#> x y id\n#> <chr> <dbl> <int>\n#> 1 a 1 1\n#> 2 b 2 2\n#> 3 c 4 3\n#> 4 d 3 4\n#> 5 e 9 5\n#> 6 a 4 6\n#> # ℹ 1 more row\n\n\n13.5.4 Exercises\n\nFind the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().\nWhich plane (tailnum) has the worst on-time record?\nWhat time of day should you fly if you want to avoid delays as much as possible?\nWhat does flights |> group_by(dest) |> filter(row_number() < 4) do? What does flights |> group_by(dest) |> filter(row_number(dep_delay) < 4) do?\nFor each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.\n\nDelays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the average flight delay for an hour is related to the average delay for the previous hour.\n\nflights |> \n mutate(hour = dep_time %/% 100) |> \n group_by(year, month, day, hour) |> \n summarize(\n dep_delay = mean(dep_delay, na.rm = TRUE),\n n = n(),\n .groups = \"drop\"\n ) |> \n filter(n > 5)\n\n\nLook at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?\nFind all destinations that are flown by at least two carriers. 
Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination." + }, + { + "objectID": "numbers.html#numeric-summaries", + "href": "numbers.html#numeric-summaries", + "title": "13  Numbers", + "section": "\n13.6 Numeric summaries", + "text": "13.6 Numeric summaries\nJust using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.\n\n13.6.1 Center\nSo far, we’ve mostly used mean() to summarize the center of a vector of values. As we’ve seen in Seção 3.6, because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the median(), which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.\nFigura 13.2 compares the mean vs. the median departure delay (in minutes) for each day. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.\n\nflights |>\n group_by(year, month, day) |>\n summarize(\n mean = mean(dep_delay, na.rm = TRUE),\n median = median(dep_delay, na.rm = TRUE),\n n = n(),\n .groups = \"drop\"\n ) |> \n ggplot(aes(x = mean, y = median)) + \n geom_abline(slope = 1, intercept = 0, color = \"white\", linewidth = 2) +\n geom_point()\n\n\n\nFigura 13.2: A scatterplot showing the difference between summarizing daily departure delay with the median instead of the mean.\n\n\n\nYou might also wonder about the mode, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn’t work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there’s no mode function included in base R.\n\n13.6.2 Minimum, maximum, and quantiles\nWhat if you’re interested in locations other than the center? min() and max() will give you the smallest and largest values. 
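For example, with a toy vector of our own (note that, like most summary functions, they return NA when missing values are present unless you set na.rm = TRUE):\n\nx <- c(27, 2, NA, 19)\nmin(x)\n#> [1] NA\nmin(x, na.rm = TRUE)\n#> [1] 2\nmax(x, na.rm = TRUE)\n#> [1] 27\n\n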
Another powerful tool is quantile() which is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values, quantile(x, 0.5) is equivalent to the median, and quantile(x, 0.95) will find the value that’s greater than 95% of the values.\nFor the flights data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.\n\nflights |>\n group_by(year, month, day) |>\n summarize(\n max = max(dep_delay, na.rm = TRUE),\n q95 = quantile(dep_delay, 0.95, na.rm = TRUE),\n .groups = \"drop\"\n )\n#> # A tibble: 365 × 5\n#> year month day max q95\n#> <int> <int> <int> <dbl> <dbl>\n#> 1 2013 1 1 853 70.1\n#> 2 2013 1 2 379 85 \n#> 3 2013 1 3 291 68 \n#> 4 2013 1 4 288 60 \n#> 5 2013 1 5 327 41 \n#> 6 2013 1 6 202 51 \n#> # ℹ 359 more rows\n\n\n13.6.3 Spread\nSometimes you’re not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, sd(x), and the inter-quartile range, IQR(). We won’t explain sd() here since you’re probably already familiar with it, but IQR() might be new — it’s quantile(x, 0.75) - quantile(x, 0.25) and gives you the range that contains the middle 50% of the data.\nWe can use this to reveal a small oddity in the flights data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below reveals a data oddity for airport EGE:\n\nflights |> \n group_by(origin, dest) |> \n summarize(\n distance_iqr = IQR(distance), \n n = n(),\n .groups = \"drop\"\n ) |> \n filter(distance_iqr > 0)\n#> # A tibble: 2 × 4\n#> origin dest distance_iqr n\n#> <chr> <chr> <dbl> <int>\n#> 1 EWR EGE 1 110\n#> 2 JFK EGE 1 103\n\n\n13.6.4 Distributions\nIt’s worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they’re fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That’s why it’s always a good idea to visualize the distribution before committing to your summary statistics.\nFigura 13.3 shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.\n\n\n\n\nFigura 13.3: (Left) The histogram of the full data is extremely skewed, making it hard to get any details. (Right) Zooming into delays of less than two hours makes it possible to see what’s happening with the bulk of the observations.\n\n\n\nIt’s also a good idea to check that distributions for subgroups resemble the whole. In the following plot 365 frequency polygons of dep_delay, one for each day, are overlaid. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.\n\nflights |>\n filter(dep_delay < 120) |> \n ggplot(aes(x = dep_delay, group = interaction(day, month))) + \n geom_freqpoly(binwidth = 5, alpha = 1/5)\n\n\n\n\nDon’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs. the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. 
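As a sketch of what the first idea might look like (our own illustration, reusing the logical subsetting trick from Seção 12.4.3; output omitted):\n\nflights |> \n group_by(year, month, day) |> \n summarize(\n behind = mean(dep_delay[dep_delay > 0], na.rm = TRUE),\n ahead = mean(dep_delay[dep_delay <= 0], na.rm = TRUE),\n .groups = \"drop\"\n )\n\n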
Finally, don’t forget what you learned in Section 3.6: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.\n\n13.6.5 Positions\nThere’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position: first(x), last(x), and nth(x, n).\nFor example, we can find the first and last departure for each day:\n\nflights |> \n group_by(year, month, day) |> \n summarize(\n first_dep = first(dep_time, na_rm = TRUE), \n fifth_dep = nth(dep_time, 5, na_rm = TRUE),\n last_dep = last(dep_time, na_rm = TRUE)\n )\n#> `summarise()` has grouped output by 'year', 'month'. You can override using\n#> the `.groups` argument.\n#> # A tibble: 365 × 6\n#> # Groups: year, month [12]\n#> year month day first_dep fifth_dep last_dep\n#> <int> <int> <int> <int> <int> <int>\n#> 1 2013 1 1 517 554 2356\n#> 2 2013 1 2 42 535 2354\n#> 3 2013 1 3 32 520 2349\n#> 4 2013 1 4 25 531 2358\n#> 5 2013 1 5 14 534 2357\n#> 6 2013 1 6 16 555 2355\n#> # ℹ 359 more rows\n\n(NB: Because dplyr functions use _ to separate components of function and argument names, these functions use na_rm instead of na.rm.)\nIf you’re familiar with [, which we’ll come back to in Section 27.2, you might wonder if you ever need these functions. There are three reasons: the default argument allows you to provide a default if the specified position doesn’t exist, the order_by argument allows you to locally override the order of the rows, and the na_rm argument allows you to drop missing values.\nExtracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:\n\nflights |> \n group_by(year, month, day) |> \n mutate(r = min_rank(sched_dep_time)) |> \n filter(r %in% c(1, max(r)))\n#> # A tibble: 1,195 × 20\n#> # Groups: year, month, day [365]\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 2353 2359 -6 425 445\n#> 3 2013 1 1 2353 2359 -6 418 442\n#> 4 2013 1 1 2356 2359 -3 425 437\n#> 5 2013 1 2 42 2359 43 518 442\n#> 6 2013 1 2 458 500 -2 703 650\n#> # ℹ 1,189 more rows\n#> # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\n\n13.6.6 With mutate()\n\nAs the names suggest, the summary functions are typically paired with summarize(). However, because of the recycling rules we discussed in Section 13.4.1, they can also be usefully paired with mutate(), particularly when you want to do some sort of group standardization. For example:\n\n\nx / sum(x) calculates the proportion of a total.\n\n(x - mean(x)) / sd(x) computes a Z-score (standardized to mean 0 and sd 1).\n\n(x - min(x)) / (max(x) - min(x)) standardizes to range [0, 1].\n\nx / first(x) computes an index based on the first observation.\n\n(A sketch of the Z-score case appears just after the exercises below.)\n\n13.6.7 Exercises\n\nBrainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. When is mean() useful? When is median() useful? When might you want to use something else? Should you use arrival delay or departure delay? Why might you want to use data from planes?\nWhich destinations show the greatest variation in air speed?\nCreate a plot to further explore the adventures of EGE. Can you find any evidence that the airport moved locations? Can you find another variable that might explain the difference?"
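Here is the promised sketch of the Z-score case: pairing the summary functions with mutate() standardizes each flight’s departure delay relative to the other flights on the same day (the column name delay_z is our own):

flights |>
  group_by(year, month, day) |>
  mutate(
    # Z-score of each delay within its day: mean 0, sd 1
    delay_z = (dep_delay - mean(dep_delay, na.rm = TRUE)) /
      sd(dep_delay, na.rm = TRUE)
  ) |>
  ungroup()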
+ }, + { + "objectID": "numbers.html#summary", + "href": "numbers.html#summary", + "title": "13  Numbers", + "section": "\n13.7 Summary", + "text": "13.7 Summary\nYou’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors, like ranks and offsets. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.\nOver the next two chapters, we’ll dive into working with strings with the stringr package. Strings are a big topic, so they get two chapters: one on the fundamentals of strings and one on regular expressions." + }, + { + "objectID": "numbers.html#footnotes", + "href": "numbers.html#footnotes", + "title": "13  Numbers", + "section": "", + "text": "ggplot2 provides some helpers for common cases in cut_interval(), cut_number(), and cut_width(). ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.↩︎\nThe mode() function does something quite different!↩︎" + }, + { + "objectID": "strings.html#introduction", + "href": "strings.html#introduction", + "title": "14  Strings", + "section": "\n14.1 Introduction", + "text": "14.1 Introduction\nSo far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.\nWe’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite: extracting strings from data. We’ll then discuss tools that work with individual letters. The chapter finishes with a brief discussion of where your expectations from English might steer you wrong when working with other languages.\nWe’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.\n\n14.1.1 Prerequisites\nIn this chapter, we’ll use functions from the stringr package, which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.\n\nlibrary(tidyverse)\nlibrary(babynames)\n\nYou can quickly tell when you’re using a stringr function because all stringr functions start with str_. This is particularly useful if you use RStudio because typing str_ will trigger autocomplete, allowing you to jog your memory of the available functions." + }, + { + "objectID": "strings.html#creating-a-string", + "href": "strings.html#creating-a-string", + "title": "14  Strings", + "section": "\n14.2 Creating a string", + "text": "14.2 Creating a string\nWe’ve created strings in passing earlier in the book but didn’t discuss the details. Firstly, you can create a string using either single quotes (') or double quotes (\"). 
There’s no difference in behavior between the two, so in the interests of consistency, the tidyverse style guide recommends using \", unless the string contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\n\nIf you forget to close a quote, you’ll see +, the continuation prompt:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK IN A STRING\nIf this happens to you and you can’t figure out which quote to close, press Escape to cancel and try again.\n\n14.2.1 Escapes\nTo include a literal single or double quote in a string, you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\n\nSo if you want to include a literal backslash in your string, you’ll need to escape it: \"\\\\\":\n\nbackslash <- \"\\\\\"\n\nBeware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use str_view()1:\n\nx <- c(single_quote, double_quote, backslash)\nx\n#> [1] \"'\" \"\\\"\" \"\\\\\"\n\nstr_view(x)\n#> [1] │ '\n#> [2] │ \"\n#> [3] │ \\\n\n\n14.2.2 Raw strings\nCreating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let’s create a string that contains the contents of the code block where we define the double_quote and single_quote variables:\n\ntricky <- \"double_quote <- \\\"\\\\\\\"\\\" # or '\\\"'\nsingle_quote <- '\\\\'' # or \\\"'\\\"\"\nstr_view(tricky)\n#> [1] │ double_quote <- \"\\\"\" # or '\"'\n#> │ single_quote <- '\\'' # or \"'\"\n\nThat’s a lot of backslashes! (This is sometimes called leaning toothpick syndrome.) To eliminate the escaping, you can instead use a raw string2:\n\ntricky <- r\"(double_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\")\"\nstr_view(tricky)\n#> [1] │ double_quote <- \"\\\"\" # or '\"'\n#> │ single_quote <- '\\'' # or \"'\"\n\nA raw string usually starts with r\"( and finishes with )\". But if your string contains )\" you can instead use r\"[]\" or r\"{}\", and if that’s still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., r\"--()--\", r\"---()---\", etc. Raw strings are flexible enough to handle any text.\n\n14.2.3 Other special characters\nAs well as \\\", \\', and \\\\, there are a handful of other special characters that may come in handy. The most common are \\n, a new line, and \\t, tab. You’ll also sometimes see strings containing Unicode escapes that start with \\u or \\U. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in ?Quotes.\n\nx <- c(\"one\\ntwo\", \"one\\ttwo\", \"\\u00b5\", \"\\U0001f604\")\nx\n#> [1] \"one\\ntwo\" \"one\\ttwo\" \"µ\" \"😄\"\nstr_view(x)\n#> [1] │ one\n#> │ two\n#> [2] │ one{\\t}two\n#> [3] │ µ\n#> [4] │ 😄\n\nNote that str_view() uses curly braces for tabs to make them easier to spot3. One of the challenges of working with text is that there’s a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.\n\n14.2.4 Exercises\n\n\nCreate strings that contain the following values:\n\nHe said \"That's amazing!\"\n\\a\\b\\c\\d\n\\\\\\\\\\\\\n\n\n\nCreate the string in your R session and print it. 
What happens to the special “\\u00a0”? How does str_view() display it? Can you do a little googling to figure out what this special character is?\n\nx <- \"This\\u00a0is\\u00a0tricky\"" + }, + { + "objectID": "strings.html#creating-many-strings-from-data", + "href": "strings.html#creating-many-strings-from-data", + "title": "14  Strings", + "section": "\n14.3 Creating many strings from data", + "text": "14.3 Creating many strings from data\nNow that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame. For example, you might combine “Hello” with a name variable to create a greeting. We’ll show you how to do this with str_c() and str_glue() and how you can use them with mutate(). That naturally raises the question of what stringr functions you might use with summarize(), so we’ll finish this section with a discussion of str_flatten(), which is a summary function for strings.\n\n14.3.1 str_c()\n\nstr_c() takes any number of vectors as arguments and returns a character vector:\n\nstr_c(\"x\", \"y\")\n#> [1] \"xy\"\nstr_c(\"x\", \"y\", \"z\")\n#> [1] \"xyz\"\nstr_c(\"Hello \", c(\"John\", \"Susan\"))\n#> [1] \"Hello John\" \"Hello Susan\"\n\nstr_c() is very similar to the base paste0(), but is designed to be used with mutate() by obeying the usual tidyverse rules for recycling and propagating missing values:\n\ndf <- tibble(name = c(\"Flora\", \"David\", \"Terra\", NA))\ndf |> mutate(greeting = str_c(\"Hi \", name, \"!\"))\n#> # A tibble: 4 × 2\n#> name greeting \n#> <chr> <chr> \n#> 1 Flora Hi Flora!\n#> 2 David Hi David!\n#> 3 Terra Hi Terra!\n#> 4 <NA> <NA>\n\nIf you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c():\n\ndf |> \n mutate(\n greeting1 = str_c(\"Hi \", coalesce(name, \"you\"), \"!\"),\n greeting2 = coalesce(str_c(\"Hi \", name, \"!\"), \"Hi!\")\n )\n#> # A tibble: 4 × 3\n#> name greeting1 greeting2\n#> <chr> <chr> <chr> \n#> 1 Flora Hi Flora! Hi Flora!\n#> 2 David Hi David! Hi David!\n#> 3 Terra Hi Terra! Hi Terra!\n#> 4 <NA> Hi you! Hi!\n\n\n14.3.2 str_glue()\n\nIf you are mixing many fixed and variable strings with str_c(), you’ll notice that you type a lot of \"s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue()4. You give it a single string that has a special feature: anything inside {} will be evaluated like it’s outside of the quotes:\n\ndf |> mutate(greeting = str_glue(\"Hi {name}!\"))\n#> # A tibble: 4 × 2\n#> name greeting \n#> <chr> <glue> \n#> 1 Flora Hi Flora!\n#> 2 David Hi David!\n#> 3 Terra Hi Terra!\n#> 4 <NA> Hi NA!\n\nAs you can see, str_glue() currently converts missing values to the string \"NA\", unfortunately making it inconsistent with str_c().\nYou also might wonder what happens if you need to include a regular { or } in your string. You’re on the right track if you guess you’ll need to escape it somehow. 
The trick is that glue uses a slightly different escaping technique: instead of prefixing with a special character like \\, you double up the special characters:\n\ndf |> mutate(greeting = str_glue(\"{{Hi {name}!}}\"))\n#> # A tibble: 4 × 2\n#> name greeting \n#> <chr> <glue> \n#> 1 Flora {Hi Flora!}\n#> 2 David {Hi David!}\n#> 3 Terra {Hi Terra!}\n#> 4 <NA> {Hi NA!}\n\n\n14.3.3 str_flatten()\n\nstr_c() and str_glue() work well with mutate() because their output is the same length as their inputs. What if you want a function that works well with summarize(), i.e. something that always returns a single string? That’s the job of str_flatten()5: it takes a character vector and combines each element of the vector into a single string:\n\nstr_flatten(c(\"x\", \"y\", \"z\"))\n#> [1] \"xyz\"\nstr_flatten(c(\"x\", \"y\", \"z\"), \", \")\n#> [1] \"x, y, z\"\nstr_flatten(c(\"x\", \"y\", \"z\"), \", \", last = \", and \")\n#> [1] \"x, y, and z\"\n\nThis makes it work well with summarize():\n\ndf <- tribble(\n ~ name, ~ fruit,\n \"Carmen\", \"banana\",\n \"Carmen\", \"apple\",\n \"Marvin\", \"nectarine\",\n \"Terence\", \"cantaloupe\",\n \"Terence\", \"papaya\",\n \"Terence\", \"mandarin\"\n)\ndf |>\n group_by(name) |> \n summarize(fruits = str_flatten(fruit, \", \"))\n#> # A tibble: 3 × 2\n#> name fruits \n#> <chr> <chr> \n#> 1 Carmen banana, apple \n#> 2 Marvin nectarine \n#> 3 Terence cantaloupe, papaya, mandarin\n\n\n14.3.4 Exercises\n\n\nCompare and contrast the results of paste0() with str_c() for the following inputs:\n\nstr_c(\"hi \", NA)\nstr_c(letters[1:2], letters[1:3])\n\n\nWhat’s the difference between paste() and paste0()? How can you recreate the equivalent of paste() with str_c()?\n\nConvert the following expressions from str_c() to str_glue() or vice versa:\n\nstr_c(\"The price of \", food, \" is \", price)\nstr_glue(\"I'm {age} years old and live in {country}\")\nstr_c(\"\\\\section{\", title, \"}\")" + }, + { + "objectID": "strings.html#extracting-data-from-strings", + "href": "strings.html#extracting-data-from-strings", + "title": "14  Strings", + "section": "\n14.4 Extracting data from strings", + "text": "14.4 Extracting data from strings\nIt’s very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:\n\ndf |> separate_longer_delim(col, delim)\ndf |> separate_longer_position(col, width)\ndf |> separate_wider_delim(col, delim, names)\ndf |> separate_wider_position(col, widths)\n\nIf you look closely, you can see there’s a common pattern here: separate_, then longer or wider, then _, then delim or position. That’s because these four functions are composed of two simpler primitives:\n\nJust like with pivot_longer() and pivot_wider(), _longer functions make the input data frame longer by creating new rows and _wider functions make the input data frame wider by generating new columns.\n\ndelim splits up a string with a delimiter like \", \" or \" \"; position splits at specified widths, like c(3, 5, 2).\n\nWe’ll return to the last member of this family, separate_wider_regex(), in Chapter 15. It’s the most flexible of the wider functions, but you need to know something about regular expressions before you can use it.\nThe following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns. 
We’ll finish off by discussing the tools that the wider functions give you to diagnose problems.\n\n14.4.1 Separating into rows\nSeparating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is using separate_longer_delim() to split based on a delimiter:\n\ndf1 <- tibble(x = c(\"a,b,c\", \"d,e\", \"f\"))\ndf1 |> \n separate_longer_delim(x, delim = \",\")\n#> # A tibble: 6 × 1\n#> x \n#> <chr>\n#> 1 a \n#> 2 b \n#> 3 c \n#> 4 d \n#> 5 e \n#> 6 f\n\nIt’s rarer to see separate_longer_position() in the wild, but some older datasets do use a very compact format where each character is used to record a value:\n\ndf2 <- tibble(x = c(\"1211\", \"131\", \"21\"))\ndf2 |> \n separate_longer_position(x, width = 1)\n#> # A tibble: 9 × 1\n#> x \n#> <chr>\n#> 1 1 \n#> 2 2 \n#> 3 1 \n#> 4 1 \n#> 5 1 \n#> 6 3 \n#> # ℹ 3 more rows\n\n\n14.4.2 Separating into columns\nSeparating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their longer equivalents because you need to name the columns. For example, in the following dataset, x is made up of a code, an edition number, and a year, separated by \".\". To use separate_wider_delim(), we supply the delimiter and the names in two arguments:\n\ndf3 <- tibble(x = c(\"a10.1.2022\", \"b10.2.2011\", \"e15.1.2015\"))\ndf3 |> \n separate_wider_delim(\n x,\n delim = \".\",\n names = c(\"code\", \"edition\", \"year\")\n )\n#> # A tibble: 3 × 3\n#> code edition year \n#> <chr> <chr> <chr>\n#> 1 a10 1 2022 \n#> 2 b10 2 2011 \n#> 3 e15 1 2015\n\nIf a specific piece is not useful, you can use an NA name to omit it from the results:\n\ndf3 |> \n separate_wider_delim(\n x,\n delim = \".\",\n names = c(\"code\", NA, \"year\")\n )\n#> # A tibble: 3 × 2\n#> code year \n#> <chr> <chr>\n#> 1 a10 2022 \n#> 2 b10 2011 \n#> 3 e15 2015\n\nseparate_wider_position() works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them:\n\ndf4 <- tibble(x = c(\"202215TX\", \"202122LA\", \"202325CA\")) \ndf4 |> \n separate_wider_position(\n x,\n widths = c(year = 4, age = 2, state = 2)\n )\n#> # A tibble: 3 × 3\n#> year age state\n#> <chr> <chr> <chr>\n#> 1 2022 15 TX \n#> 2 2021 22 LA \n#> 3 2023 25 CA\n\n\n14.4.3 Diagnosing widening problems\nseparate_wider_delim()6 requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so separate_wider_delim() provides two arguments to help: too_few and too_many. Let’s first look at the too_few case with the following sample dataset:\n\ndf <- tibble(x = c(\"1-1-1\", \"1-1-2\", \"1-3\", \"1-3-2\", \"1\"))\n\ndf |> \n separate_wider_delim(\n x,\n delim = \"-\",\n names = c(\"x\", \"y\", \"z\")\n )\n#> Error in `separate_wider_delim()`:\n#> ! Expected 3 pieces in each element of `x`.\n#> ! 2 values were too short.\n#> ℹ Use `too_few = \"debug\"` to diagnose the problem.\n#> ℹ Use `too_few = \"align_start\"/\"align_end\"` to silence this message.\n\nYou’ll notice that we get an error, but the error gives us some suggestions on how to proceed. 
Let’s start by debugging the problem:\n\ndebug <- df |> \n separate_wider_delim(\n x,\n delim = \"-\",\n names = c(\"x\", \"y\", \"z\"),\n too_few = \"debug\"\n )\n#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and\n#> `x_remainder`.\ndebug\n#> # A tibble: 5 × 6\n#> x y z x_ok x_pieces x_remainder\n#> <chr> <chr> <chr> <lgl> <int> <chr> \n#> 1 1-1-1 1 1 TRUE 3 \"\" \n#> 2 1-1-2 1 2 TRUE 3 \"\" \n#> 3 1-3 3 <NA> FALSE 2 \"\" \n#> 4 1-3-2 3 2 TRUE 3 \"\" \n#> 5 1 <NA> <NA> FALSE 1 \"\"\n\nWhen you use the debug mode, you get three extra columns added to the output: x_ok, x_pieces, and x_remainder (if you separate a variable with a different name, you’ll get a different prefix). Here, x_ok lets you quickly find the inputs that failed:\n\ndebug |> filter(!x_ok)\n#> # A tibble: 2 × 6\n#> x y z x_ok x_pieces x_remainder\n#> <chr> <chr> <chr> <lgl> <int> <chr> \n#> 1 1-3 3 <NA> FALSE 2 \"\" \n#> 2 1 <NA> <NA> FALSE 1 \"\"\n\nx_pieces tells us how many pieces were found, compared to the expected 3 (the length of names). x_remainder isn’t useful when there are too few pieces, but we’ll see it again shortly.\nSometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove too_few = \"debug\" to ensure that new problems become errors.\nIn other cases, you may want to fill in the missing pieces with NAs and move on. That’s the job of too_few = \"align_start\" and too_few = \"align_end\" which allow you to control where the NAs should go:\n\ndf |> \n separate_wider_delim(\n x,\n delim = \"-\",\n names = c(\"x\", \"y\", \"z\"),\n too_few = \"align_start\"\n )\n#> # A tibble: 5 × 3\n#> x y z \n#> <chr> <chr> <chr>\n#> 1 1 1 1 \n#> 2 1 1 2 \n#> 3 1 3 <NA> \n#> 4 1 3 2 \n#> 5 1 <NA> <NA>\n\nThe same principles apply if you have too many pieces:\n\ndf <- tibble(x = c(\"1-1-1\", \"1-1-2\", \"1-3-5-6\", \"1-3-2\", \"1-3-5-7-9\"))\n\ndf |> \n separate_wider_delim(\n x,\n delim = \"-\",\n names = c(\"x\", \"y\", \"z\")\n )\n#> Error in `separate_wider_delim()`:\n#> ! Expected 3 pieces in each element of `x`.\n#> ! 
2 values were too long.\n#> ℹ Use `too_many = \"debug\"` to diagnose the problem.\n#> ℹ Use `too_many = \"drop\"/\"merge\"` to silence this message.\n\nBut now, when we debug the result, you can see the purpose of x_remainder:\n\ndebug <- df |> \n separate_wider_delim(\n x,\n delim = \"-\",\n names = c(\"x\", \"y\", \"z\"),\n too_many = \"debug\"\n )\n#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and\n#> `x_remainder`.\ndebug |> filter(!x_ok)\n#> # A tibble: 2 × 6\n#> x y z x_ok x_pieces x_remainder\n#> <chr> <chr> <chr> <lgl> <int> <chr> \n#> 1 1-3-5-6 3 5 FALSE 4 -6 \n#> 2 1-3-5-7-9 3 5 FALSE 5 -7-9\n\nYou have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:\n\ndf |> \n separate_wider_delim(\n x,\n delim = \"-\",\n names = c(\"x\", \"y\", \"z\"),\n too_many = \"drop\"\n )\n#> # A tibble: 5 × 3\n#> x y z \n#> <chr> <chr> <chr>\n#> 1 1 1 1 \n#> 2 1 1 2 \n#> 3 1 3 5 \n#> 4 1 3 2 \n#> 5 1 3 5\n\n\ndf |> \n separate_wider_delim(\n x,\n delim = \"-\",\n names = c(\"x\", \"y\", \"z\"),\n too_many = \"merge\"\n )\n#> # A tibble: 5 × 3\n#> x y z \n#> <chr> <chr> <chr>\n#> 1 1 1 1 \n#> 2 1 1 2 \n#> 3 1 3 5-6 \n#> 4 1 3 2 \n#> 5 1 3 5-7-9" + }, + { + "objectID": "strings.html#letters", + "href": "strings.html#letters", + "title": "14  Strings", + "section": "\n14.5 Letters", + "text": "14.5 Letters\nIn this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.\n\n14.5.1 Length\nstr_length() tells you the number of letters in the string:\n\nstr_length(c(\"a\", \"R for data science\", NA))\n#> [1] 1 18 NA\n\nYou could use this with count() to find the distribution of lengths of US babynames and then with filter() to look at the longest names, which happen to have 15 letters7:\n\nbabynames |>\n count(length = str_length(name), wt = n)\n#> # A tibble: 14 × 2\n#> length n\n#> <int> <int>\n#> 1 2 338150\n#> 2 3 8589596\n#> 3 4 48506739\n#> 4 5 87011607\n#> 5 6 90749404\n#> 6 7 72120767\n#> # ℹ 8 more rows\n\nbabynames |> \n filter(str_length(name) == 15) |> \n count(name, wt = n, sort = TRUE)\n#> # A tibble: 34 × 2\n#> name n\n#> <chr> <int>\n#> 1 Franciscojavier 123\n#> 2 Christopherjohn 118\n#> 3 Johnchristopher 118\n#> 4 Christopherjame 108\n#> 5 Christophermich 52\n#> 6 Ryanchristopher 45\n#> # ℹ 28 more rows\n\n\n14.5.2 Subsetting\nYou can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end. 
The start and end arguments are inclusive, so the length of the returned string will be end - start + 1:\n\nx <- c(\"Apple\", \"Banana\", \"Pear\")\nstr_sub(x, 1, 3)\n#> [1] \"App\" \"Ban\" \"Pea\"\n\nYou can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.\n\nstr_sub(x, -3, -1)\n#> [1] \"ple\" \"ana\" \"ear\"\n\nNote that str_sub() won’t fail if the string is too short: it will just return as much as possible:\n\nstr_sub(\"a\", 1, 5)\n#> [1] \"a\"\n\nWe could use str_sub() with mutate() to find the first and last letter of each name:\n\nbabynames |> \n mutate(\n first = str_sub(name, 1, 1),\n last = str_sub(name, -1, -1)\n )\n#> # A tibble: 1,924,665 × 7\n#> year sex name n prop first last \n#> <dbl> <chr> <chr> <int> <dbl> <chr> <chr>\n#> 1 1880 F Mary 7065 0.0724 M y \n#> 2 1880 F Anna 2604 0.0267 A a \n#> 3 1880 F Emma 2003 0.0205 E a \n#> 4 1880 F Elizabeth 1939 0.0199 E h \n#> 5 1880 F Minnie 1746 0.0179 M e \n#> 6 1880 F Margaret 1578 0.0162 M t \n#> # ℹ 1,924,659 more rows\n\n\n14.5.3 Exercises\n\nWhen computing the distribution of the length of babynames, why did we use wt = n?\nUse str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?\nAre there any major trends in the length of babynames over time? What about the popularity of first and last letters?" + }, + { + "objectID": "strings.html#sec-other-languages", + "href": "strings.html#sec-other-languages", + "title": "14  Strings", + "section": "\n14.6 Non-English text", + "text": "14.6 Non-English text\nSo far, we’ve focused on English language text which is particularly easy to work with for two reasons. Firstly, the English alphabet is relatively simple: there are just 26 letters. Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers. Unfortunately, we don’t have room for a full treatment of non-English languages. Still, we wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.\n\n14.6.1 Encoding\nWhen working with non-English text, the first challenge is often the encoding. To understand what’s going on, we need to dive into how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():\n\ncharToRaw(\"Hadley\")\n#> [1] 48 61 64 6c 65 79\n\nEach of these six hexadecimal numbers represents one letter: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case, the encoding is called ASCII. ASCII does a great job of representing English characters because it’s the American Standard Code for Information Interchange.\nThings aren’t so easy for languages other than English. In the early days of computing, there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte b1 is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.\nreadr uses UTF-8 everywhere. 
This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times, you’ll get complete gibberish. For example here are two inline CSVs with unusual encodings8:\n\nx1 <- \"text\\nEl Ni\\xf1o was particularly bad this year\"\nread_csv(x1)$text\n#> [1] \"El Ni\\xf1o was particularly bad this year\"\n\nx2 <- \"text\\n\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd\"\nread_csv(x2)$text\n#> [1] \"\\x82\\xb1\\x82\\xf1\\x82ɂ\\xbf\\x82\\xcd\"\n\nTo read these correctly, you specify the encoding via the locale argument:\n\nread_csv(x1, locale = locale(encoding = \"Latin1\"))$text\n#> [1] \"El Niño was particularly bad this year\"\n\nread_csv(x2, locale = locale(encoding = \"Shift-JIS\"))$text\n#> [1] \"こんにちは\"\n\nHow do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof and works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.\nEncodings are a rich and complex topic; we’ve only scratched the surface here. If you’d like to learn more, we recommend reading the detailed explanation at http://kunststube.net/encoding/.\n\n14.6.2 Letter variations\nWorking in languages with accents poses a significant challenge when determining the position of letters (e.g., with str_length() and str_sub()) as accented letters might be encoded as a single individual character (e.g., ü) or as two characters by combining an unaccented letter (e.g., u) with a diacritic mark (e.g., ¨). For example, this code shows two ways of representing ü that look identical:\n\nu <- c(\"\\u00fc\", \"u\\u0308\")\nstr_view(u)\n#> [1] │ ü\n#> [2] │ ü\n\nBut both strings differ in length, and their first characters are different:\n\nstr_length(u)\n#> [1] 1 2\nstr_sub(u, 1, 1)\n#> [1] \"ü\" \"u\"\n\nFinally, note that a comparison of these strings with == interprets these strings as different, while the handy str_equal() function in stringr recognizes that both have the same appearance:\n\nu[[1]] == u[[2]]\n#> [1] FALSE\n\nstr_equal(u[[1]], u[[2]])\n#> [1] TRUE\n\n\n14.6.3 Locale-dependent functions\nFinally, there are a handful of stringr functions whose behavior depends on your locale. A locale is similar to a language but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a _ and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, Wikipedia has a good list, and you can see which are supported in stringr by looking at stringi::stri_locale_list().\nBase R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to English rules by using the “en” locale and requires you to specify the locale argument to override it. 
Fortunately, there are two sets of functions where the locale really matters: changing case and sorting.\nThe rules for changing case differ among languages. For example, Turkish has two i’s: with and without a dot. Since they’re two distinct letters, they’re capitalized differently:\n\nstr_to_upper(c(\"i\", \"ı\"))\n#> [1] \"I\" \"I\"\nstr_to_upper(c(\"i\", \"ı\"), locale = \"tr\")\n#> [1] \"İ\" \"I\"\n\nSorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language9! Here’s an example: in Czech, “ch” is a compound letter that appears after h in the alphabet.\n\nstr_sort(c(\"a\", \"c\", \"ch\", \"h\", \"z\"))\n#> [1] \"a\" \"c\" \"ch\" \"h\" \"z\"\nstr_sort(c(\"a\", \"c\", \"ch\", \"h\", \"z\"), locale = \"cs\")\n#> [1] \"a\" \"c\" \"h\" \"ch\" \"z\"\n\nThis also comes up when sorting strings with dplyr::arrange(), which is why it also has a locale argument." + }, + { + "objectID": "strings.html#summary", + "href": "strings.html#summary", + "title": "14  Strings", + "section": "\n14.7 Summary", + "text": "14.7 Summary\nIn this chapter, you’ve learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter." + }, + { + "objectID": "strings.html#footnotes", + "href": "strings.html#footnotes", + "title": "14  Strings", + "section": "", + "text": "Or use the base R function writeLines().↩︎\nAvailable in R 4.0.0 and above.↩︎\nstr_view() also uses color to bring tabs, spaces, matches, etc. to your attention. The colors don’t currently show up in the book, but you’ll notice them when running code interactively.↩︎\nIf you’re not using stringr, you can also access it directly with glue::glue().↩︎\nThe base R equivalent is paste() used with the collapse argument.↩︎\nThe same principles apply to separate_wider_position() and separate_wider_regex().↩︎\nLooking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.↩︎\nHere I’m using the special \\x to encode binary data directly into a string.↩︎\nSorting in languages that don’t have an alphabet, like Chinese, is more complicated still.↩︎" + }, + { + "objectID": "regexps.html#introduction", + "href": "regexps.html#introduction", + "title": "15  Regular expressions", + "section": "\n15.1 Introduction", + "text": "15.1 Introduction\nIn Chapter 14, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use regular expressions, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”1 or “regexp”.\nThe chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. 
We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.\n\n15.1.1 Prerequisites\nIn this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.\n\nlibrary(tidyverse)\nlibrary(babynames)\n\nThroughout this chapter, we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:\n\n\nfruit contains the names of 80 fruits.\n\nwords contains 980 common English words.\n\nsentences contains 720 short sentences." + }, + { + "objectID": "regexps.html#sec-reg-basics", + "href": "regexps.html#sec-reg-basics", + "title": "15  Regular expressions", + "section": "\n15.2 Pattern basics", + "text": "15.2 Pattern basics\nWe’ll use str_view() to learn how regex patterns work. We used str_view() in the last chapter to better understand a string vs. its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, str_view() will show only the elements of the string vector that match, surrounding each match with <>, and, where possible, highlighting the match in blue.\nThe simplest patterns consist of letters and numbers which match those characters exactly:\n\nstr_view(fruit, \"berry\")\n#> [6] │ bil<berry>\n#> [7] │ black<berry>\n#> [10] │ blue<berry>\n#> [11] │ boysen<berry>\n#> [19] │ cloud<berry>\n#> [21] │ cran<berry>\n#> ... and 8 more\n\nLetters and numbers match exactly and are called literal characters. Most punctuation characters, like ., +, *, [, ], and ?, have special meanings2 and are called metacharacters. For example, . will match any character3, so \"a.\" will match any string that contains an “a” followed by another character:\n\nstr_view(c(\"a\", \"ab\", \"ae\", \"bd\", \"ea\", \"eab\"), \"a.\")\n#> [2] │ <ab>\n#> [3] │ <ae>\n#> [6] │ e<ab>\n\nOr we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”:\n\nstr_view(fruit, \"a...e\")\n#> [1] │ <apple>\n#> [7] │ bl<ackbe>rry\n#> [48] │ mand<arine>\n#> [51] │ nect<arine>\n#> [62] │ pine<apple>\n#> [64] │ pomegr<anate>\n#> ... and 2 more\n\nQuantifiers control how many times a pattern can match:\n\n\n? makes a pattern optional (i.e. it matches 0 or 1 times)\n\n+ lets a pattern repeat (i.e. it matches at least once)\n\n* lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).\n\n\n# ab? matches an \"a\", optionally followed by a \"b\".\nstr_view(c(\"a\", \"ab\", \"abb\"), \"ab?\")\n#> [1] │ <a>\n#> [2] │ <ab>\n#> [3] │ <ab>b\n\n# ab+ matches an \"a\", followed by at least one \"b\".\nstr_view(c(\"a\", \"ab\", \"abb\"), \"ab+\")\n#> [2] │ <ab>\n#> [3] │ <abb>\n\n# ab* matches an \"a\", followed by any number of \"b\"s.\nstr_view(c(\"a\", \"ab\", \"abb\"), \"ab*\")\n#> [1] │ <a>\n#> [2] │ <ab>\n#> [3] │ <abb>\n\nCharacter classes are defined by [] and let you match a set of characters, e.g., [abcd] matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”. 
We can use this idea to find the words containing an “x” surrounded by vowels, or a “y” surrounded by consonants:\n\nstr_view(words, \"[aeiou]x[aeiou]\")\n#> [284] │ <exa>ct\n#> [285] │ <exa>mple\n#> [288] │ <exe>rcise\n#> [289] │ <exi>st\nstr_view(words, \"[^aeiou]y[^aeiou]\")\n#> [836] │ <sys>tem\n#> [901] │ <typ>e\n\nYou can use alternation, |, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “melon”, or “nut”, or a repeated vowel.\n\nstr_view(fruit, \"apple|melon|nut\")\n#> [1] │ <apple>\n#> [13] │ canary <melon>\n#> [20] │ coco<nut>\n#> [52] │ <nut>\n#> [62] │ pine<apple>\n#> [72] │ rock <melon>\n#> ... and 1 more\nstr_view(fruit, \"aa|ee|ii|oo|uu\")\n#> [9] │ bl<oo>d orange\n#> [33] │ g<oo>seberry\n#> [47] │ lych<ee>\n#> [66] │ purple mangost<ee>n\n\nRegular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions." + }, + { + "objectID": "regexps.html#sec-stringr-regex-funs", + "href": "regexps.html#sec-stringr-regex-funs", + "title": "15  Regular expressions", + "section": "\n15.3 Key functions", + "text": "15.3 Key functions\nNow that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.\n\n15.3.1 Detect matches\nstr_detect() returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE otherwise:\n\nstr_detect(c(\"a\", \"b\", \"c\"), \"[aeiou]\")\n#> [1] TRUE FALSE FALSE\n\nSince str_detect() returns a logical vector of the same length as the initial vector, it pairs well with filter(). For example, this code finds all the most popular names containing a lower-case “x”:\n\nbabynames |> \n filter(str_detect(name, \"x\")) |> \n count(name, wt = n, sort = TRUE)\n#> # A tibble: 974 × 2\n#> name n\n#> <chr> <int>\n#> 1 Alexander 665492\n#> 2 Alexis 399551\n#> 3 Alex 278705\n#> 4 Alexandra 232223\n#> 5 Max 148787\n#> 6 Alexa 123032\n#> # ℹ 968 more rows\n\nWe can also use str_detect() with summarize() by pairing it with sum() or mean(): sum(str_detect(x, pattern)) tells you the number of observations that match and mean(str_detect(x, pattern)) tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names4 that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!\n\nbabynames |> \n group_by(year) |> \n summarize(prop_x = mean(str_detect(name, \"x\"))) |> \n ggplot(aes(x = year, y = prop_x)) + \n geom_line()\n\n\n\n\nThere are two functions that are closely related to str_detect(): str_subset() and str_which(). str_subset() returns a character vector containing only the strings that match. 
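For example, a quick sketch on a toy vector:

str_subset(c(\"apple\", \"banana\", \"pear\"), \"p\")
#> [1] \"apple\" \"pear\"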
str_which() returns an integer vector giving the positions of the strings that match.\n\n15.3.2 Count matches\nThe next step up in complexity from str_detect() is str_count(): rather than a true or false, it tells you how many matches there are in each string.\n\nx <- c(\"apple\", \"banana\", \"pear\")\nstr_count(x, \"p\")\n#> [1] 2 0 1\n\nNote that each match starts at the end of the previous match, i.e. regex matches never overlap. For example, in \"abababa\", how many times will the pattern \"aba\" match? Regular expressions say two, not three:\n\nstr_count(\"abababa\", \"aba\")\n#> [1] 2\nstr_view(\"abababa\", \"aba\")\n#> [1] │ <aba>b<aba>\n\nIt’s natural to use str_count() with mutate(). The following example uses str_count() with character classes to count the number of vowels and consonants in each name.\n\nbabynames |> \n count(name) |> \n mutate(\n vowels = str_count(name, \"[aeiou]\"),\n consonants = str_count(name, \"[^aeiou]\")\n )\n#> # A tibble: 97,310 × 4\n#> name n vowels consonants\n#> <chr> <int> <int> <int>\n#> 1 Aaban 10 2 3\n#> 2 Aabha 5 2 3\n#> 3 Aabid 2 2 3\n#> 4 Aabir 1 2 3\n#> 5 Aabriella 5 4 5\n#> 6 Aada 1 2 2\n#> # ℹ 97,304 more rows\n\nIf you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:\n\nAdd the upper case vowels to the character class: str_count(name, \"[aeiouAEIOU]\").\nTell the regular expression to ignore case: str_count(name, regex(\"[aeiou]\", ignore_case = TRUE)). We’ll talk about this more in Section 15.5.1.\nUse str_to_lower() to convert the names to lower case: str_count(str_to_lower(name), \"[aeiou]\").\n\nThis variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.\nIn this case, since we’re applying two functions to the name, I think it’s easier to transform it first:\n\nbabynames |> \n count(name) |> \n mutate(\n name = str_to_lower(name),\n vowels = str_count(name, \"[aeiou]\"),\n consonants = str_count(name, \"[^aeiou]\")\n )\n#> # A tibble: 97,310 × 4\n#> name n vowels consonants\n#> <chr> <int> <int> <int>\n#> 1 aaban 10 3 2\n#> 2 aabha 5 3 2\n#> 3 aabid 2 3 2\n#> 4 aabir 1 3 2\n#> 5 aabriella 5 5 4\n#> 6 aada 1 3 1\n#> # ℹ 97,304 more rows\n\n\n15.3.3 Replace values\nAs well as detecting and counting matches, we can also modify them with str_replace() and str_replace_all(). str_replace() replaces the first match, and as the name suggests, str_replace_all() replaces all matches.\n\nx <- c(\"apple\", \"pear\", \"banana\")\nstr_replace_all(x, \"[aeiou]\", \"-\")\n#> [1] \"-ppl-\" \"p--r\" \"b-n-n-\"\n\nstr_remove() and str_remove_all() are handy shortcuts for str_replace(x, pattern, \"\"):\n\nx <- c(\"apple\", \"pear\", \"banana\")\nstr_remove_all(x, \"[aeiou]\")\n#> [1] \"ppl\" \"pr\" \"bnn\"\n\nThese functions are naturally paired with mutate() when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.\n\n15.3.4 Extract variables\nThe last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex(). 
It’s a peer of the separate_wider_position() and separate_wider_delim() functions that you learned about in Section 14.4.2. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.\nLet’s create a simple dataset to show how it works. Here we have some data derived from babynames where we have the name, gender, and age of a bunch of people in a rather weird format5:\n\ndf <- tribble(\n ~str,\n \"<Sheryl>-F_34\",\n \"<Kisha>-F_45\", \n \"<Brandon>-N_33\",\n \"<Sharon>-F_38\", \n \"<Penny>-F_58\",\n \"<Justin>-M_41\", \n \"<Patricia>-F_84\", \n)\n\nTo extract this data using separate_wider_regex() we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:\n\ndf |> \n separate_wider_regex(\n str,\n patterns = c(\n \"<\", \n name = \"[A-Za-z]+\", \n \">-\", \n gender = \".\",\n \"_\",\n age = \"[0-9]+\"\n )\n )\n#> # A tibble: 7 × 3\n#> name gender age \n#> <chr> <chr> <chr>\n#> 1 Sheryl F 34 \n#> 2 Kisha F 45 \n#> 3 Brandon N 33 \n#> 4 Sharon F 38 \n#> 5 Penny F 58 \n#> 6 Justin M 41 \n#> # ℹ 1 more row\n\nIf the match fails, you can use too_few = \"debug\" to figure out what went wrong, just like separate_wider_delim() and separate_wider_position().\n\n15.3.5 Exercises\n\nWhat baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)\nReplace all forward slashes in \"a/b/c/d/e\" with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We’ll discuss the problem very soon.)\nImplement a simple version of str_to_lower() using str_replace_all().\nCreate a regular expression that will match telephone numbers as commonly written in your country." + }, + { + "objectID": "regexps.html#pattern-details", + "href": "regexps.html#pattern-details", + "title": "15  Regular expressions", + "section": "\n15.4 Pattern details", + "text": "15.4 Pattern details\nNow that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it’s time to dig into more of the details. First, we’ll start with escaping, which allows you to match metacharacters that would otherwise be treated specially. Next, you’ll learn about anchors which allow you to match the start or end of the string. Then, you’ll learn more about character classes and their shortcuts which allow you to match any character from a set. Next, you’ll learn the final details of quantifiers which control how many times a pattern can match. Then, we have to cover the important (but complex) topic of operator precedence and parentheses. And we’ll finish off with some details of grouping components of the pattern.\nThe terms we use here are the technical names for each component. They’re not always the most evocative of their purpose, but it’s very helpful to know the correct terms if you later want to Google for more details.\n\n15.4.1 Escaping\nIn order to match a literal ., you need an escape which tells the regular expression to match metacharacters6 literally. Like strings, regexps use the backslash for escaping. So, to match a ., you need the regexp \\.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \\ is also used as an escape symbol in strings. So to create the regular expression \\. 
we need the string \"\\\\.\", as the following example shows.\n\n# To create the regular expression \\., we need to use \\\\.\ndot <- \"\\\\.\"\n\n# But the expression itself only contains one \\\nstr_view(dot)\n#> [1] │ \\.\n\n# And this tells R to look for an explicit .\nstr_view(c(\"abc\", \"a.c\", \"bef\"), \"a\\\\.c\")\n#> [2] │ <a.c>\n\nIn this book, we’ll usually write regular expressions without quotes, like \\.. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like \"\\\\.\".\nIf \\ is used as an escape character in regular expressions, how do you match a literal \\? Well, you need to escape it, creating the regular expression \\\\. To create that regular expression, you need to use a string, which also needs to escape \\. That means to match a literal \\ you need to write \"\\\\\\\\\" — you need four backslashes to match one!\n\nx <- \"a\\\\b\"\nstr_view(x)\n#> [1] │ a\\b\nstr_view(x, \"\\\\\\\\\")\n#> [1] │ a<\\>b\n\nAlternatively, you might find it easier to use the raw strings you learned about in Section 14.2.2. That lets you avoid one layer of escaping:\n\nstr_view(x, r\"{\\\\}\")\n#> [1] │ a<\\>b\n\nIf you’re trying to match a literal ., $, |, *, +, ?, {, }, (, ), there’s an alternative to using a backslash escape: you can use a character class: [.], [$], [|], ... all match the literal values.\n\nstr_view(c(\"abc\", \"a.c\", \"a*c\", \"a c\"), \"a[.]c\")\n#> [2] │ <a.c>\nstr_view(c(\"abc\", \"a.c\", \"a*c\", \"a c\"), \".[*]c\")\n#> [3] │ <a*c>\n\n\n15.4.2 Anchors\nBy default, regular expressions will match any part of a string. If you want to match at the start or end you need to anchor the regular expression using ^ to match the start or $ to match the end:\n\nstr_view(fruit, \"^a\")\n#> [1] │ <a>pple\n#> [2] │ <a>pricot\n#> [3] │ <a>vocado\nstr_view(fruit, \"a$\")\n#> [4] │ banan<a>\n#> [15] │ cherimoy<a>\n#> [30] │ feijo<a>\n#> [36] │ guav<a>\n#> [56] │ papay<a>\n#> [74] │ satsum<a>\n\nIt’s tempting to think that $ should match the start of a string, because that’s how we write dollar amounts, but that’s not what regular expressions want.\nTo force a regular expression to match only the full string, anchor it with both ^ and $:\n\nstr_view(fruit, \"apple\")\n#> [1] │ <apple>\n#> [62] │ pine<apple>\nstr_view(fruit, \"^apple$\")\n#> [1] │ <apple>\n\nYou can also match the boundary between words (i.e. the start or end of a word) with \\b. This can be particularly useful when using RStudio’s find and replace tool. For example, if you want to find all uses of sum(), you can search for \\bsum\\b to avoid matching summarize, summary, rowsum and so on:\n\nx <- c(\"summary(x)\", \"summarize(df)\", \"rowsum(x)\", \"sum(x)\")\nstr_view(x, \"sum\")\n#> [1] │ <sum>mary(x)\n#> [2] │ <sum>marize(df)\n#> [3] │ row<sum>(x)\n#> [4] │ <sum>(x)\nstr_view(x, \"\\\\bsum\\\\b\")\n#> [4] │ <sum>(x)\n\nWhen used alone, anchors will produce a zero-width match:\n\nstr_view(\"abc\", c(\"$\", \"^\", \"\\\\b\"))\n#> [1] │ abc<>\n#> [2] │ <>abc\n#> [3] │ <>abc<>\n\nThis helps you understand what happens when you replace a standalone anchor:\n\nstr_replace_all(\"abc\", c(\"$\", \"^\", \"\\\\b\"), \"--\")\n#> [1] \"abc--\" \"--abc\" \"--abc--\"\n\n\n15.4.3 Character classes\nA character class, or character set, allows you to match any character in a set. As we discussed above, you can construct your own sets with [], where [abc] matches “a”, “b”, or “c” and [^abc] matches any character except “a”, “b”, or “c”. 
Apart from ^ there are two other characters that have special meaning inside of []:\n\n\n- defines a range, e.g., [a-z] matches any lower case letter and [0-9] matches any number.\n\n\\ escapes special characters, so [\\^\\-\\]] matches ^, -, or ].\n\nHere are a few examples:\n\nx <- \"abcd ABCD 12345 -!@#%.\"\nstr_view(x, \"[abc]+\")\n#> [1] │ <abc>d ABCD 12345 -!@#%.\nstr_view(x, \"[a-z]+\")\n#> [1] │ <abcd> ABCD 12345 -!@#%.\nstr_view(x, \"[^a-z0-9]+\")\n#> [1] │ abcd< ABCD >12345< -!@#%.>\n\n# You need an escape to match characters that are otherwise\n# special inside of []\nstr_view(\"a-b-c\", \"[a-c]\")\n#> [1] │ <a>-<b>-<c>\nstr_view(\"a-b-c\", \"[a\\\\-c]\")\n#> [1] │ <a><->b<-><c>\n\nSome character classes are used so commonly that they get their own shortcut. You’ve already seen ., which matches any character apart from a newline. There are three other particularly useful pairs7:\n\n\n\\d matches any digit; \\D matches anything that isn’t a digit.\n\n\\s matches any whitespace (e.g., space, tab, newline); \\S matches anything that isn’t whitespace.\n\n\\w matches any “word” character, i.e. letters and numbers; \\W matches any “non-word” character.\n\nThe following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.\n\nx <- \"abcd ABCD 12345 -!@#%.\"\nstr_view(x, \"\\\\d+\")\n#> [1] │ abcd ABCD <12345> -!@#%.\nstr_view(x, \"\\\\D+\")\n#> [1] │ <abcd ABCD >12345< -!@#%.>\nstr_view(x, \"\\\\s+\")\n#> [1] │ abcd< >ABCD< >12345< >-!@#%.\nstr_view(x, \"\\\\S+\")\n#> [1] │ <abcd> <ABCD> <12345> <-!@#%.>\nstr_view(x, \"\\\\w+\")\n#> [1] │ <abcd> <ABCD> <12345> -!@#%.\nstr_view(x, \"\\\\W+\")\n#> [1] │ abcd< >ABCD< >12345< -!@#%.>\n\n\n15.4.4 Quantifiers\nQuantifiers control how many times a pattern matches. In Section 15.2, you learned about ? (0 or 1 matches), + (1 or more matches), and * (0 or more matches). For example, colou?r will match American or British spelling, \\d+ will match one or more digits, and \\s? will optionally match a single item of whitespace. You can also specify the number of matches precisely with {}:\n\n\n{n} matches exactly n times.\n\n{n,} matches at least n times.\n\n{n,m} matches between n and m times.\n\n15.4.5 Operator precedence and parentheses\nWhat does ab+ match? Does it match “a” followed by one or more “b”s, or does it match “ab” repeated any number of times? What does ^a|b$ match? Does it match the complete string a or the complete string b, or does it match a string starting with a or a string ending with b?\nThe answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school. You know that a + b * c is equivalent to a + (b * c), not (a + b) * c, because * has higher precedence and + has lower precedence: you compute * before +.\nSimilarly, regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence, which means that ab+ is equivalent to a(b+), and ^a|b$ is equivalent to (^a)|(b$). Just like with algebra, you can use parentheses to override the usual order. 
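For example, here is a quick sketch contrasting ab+ (the quantifier binds only to b) with (ab)+ (the quantifier applies to the whole group):

str_view(c(\"a\", \"ab\", \"abb\", \"abab\"), \"ab+\")
#> [2] │ <ab>
#> [3] │ <abb>
#> [4] │ <ab><ab>
str_view(c(\"a\", \"ab\", \"abb\", \"abab\"), \"(ab)+\")
#> [2] │ <ab>
#> [3] │ <ab>b
#> [4] │ <abab>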
But unlike algebra, you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.\n\n15.4.6 Grouping and capturing\nAs well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.\nThe first way to use a capturing group is to refer back to it within a match with a back reference: \\1 refers to the match contained in the first parenthesis, \\2 in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:\n\nstr_view(fruit, \"(..)\\\\1\")\n#> [4] │ b<anan>a\n#> [20] │ <coco>nut\n#> [22] │ <cucu>mber\n#> [41] │ <juju>be\n#> [56] │ <papa>ya\n#> [73] │ s<alal> berry\n\nAnd this one finds all words that start and end with the same pair of letters:\n\nstr_view(words, \"^(..).*\\\\1$\")\n#> [152] │ <church>\n#> [217] │ <decide>\n#> [617] │ <photograph>\n#> [699] │ <require>\n#> [739] │ <sense>\n\nYou can also use back references in str_replace(). For example, this code switches the order of the second and third words in sentences:\n\nsentences |> \n str_replace(\"(\\\\w+) (\\\\w+) (\\\\w+)\", \"\\\\1 \\\\3 \\\\2\") |> \n str_view()\n#> [1] │ The canoe birch slid on the smooth planks.\n#> [2] │ Glue sheet the to the dark blue background.\n#> [3] │ It's to easy tell the depth of a well.\n#> [4] │ These a days chicken leg is a rare dish.\n#> [5] │ Rice often is served in round bowls.\n#> [6] │ The of juice lemons makes fine punch.\n#> ... and 714 more\n\nIf you want to extract the matches for each group, you can use str_match(). But str_match() returns a matrix, so it’s not particularly easy to work with8:\n\nsentences |> \n str_match(\"the (\\\\w+) (\\\\w+)\") |> \n head()\n#> [,1] [,2] [,3] \n#> [1,] \"the smooth planks\" \"smooth\" \"planks\"\n#> [2,] \"the sheet to\" \"sheet\" \"to\" \n#> [3,] \"the depth of\" \"depth\" \"of\" \n#> [4,] NA NA NA \n#> [5,] NA NA NA \n#> [6,] NA NA NA\n\nYou could convert to a tibble and name the columns:\n\nsentences |> \n str_match(\"the (\\\\w+) (\\\\w+)\") |> \n as_tibble(.name_repair = \"minimal\") |> \n set_names(\"match\", \"word1\", \"word2\")\n#> # A tibble: 720 × 3\n#> match word1 word2 \n#> <chr> <chr> <chr> \n#> 1 the smooth planks smooth planks\n#> 2 the sheet to sheet to \n#> 3 the depth of depth of \n#> 4 <NA> <NA> <NA> \n#> 5 <NA> <NA> <NA> \n#> 6 <NA> <NA> <NA> \n#> # ℹ 714 more rows\n\nBut then you’ve basically recreated your own version of separate_wider_regex(). Indeed, behind the scenes, separate_wider_regex() converts your vector of patterns to a single regex that uses grouping to capture the named components.\nOccasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with (?:).\n\nx <- c(\"a gray cat\", \"a grey dog\")\nstr_match(x, \"gr(e|a)y\")\n#> [,1] [,2]\n#> [1,] \"gray\" \"a\" \n#> [2,] \"grey\" \"e\"\nstr_match(x, \"gr(?:e|a)y\")\n#> [,1] \n#> [1,] \"gray\"\n#> [2,] \"grey\"\n\n\n15.4.7 Exercises\n\nHow would you match the literal string \"'\\? How about \"$^$\"?\nExplain why each of these patterns doesn’t match a \\: \"\\\", \"\\\\\", \"\\\\\\\".\n\nGiven the corpus of common words in stringr::words, create regular expressions that find all words that:\n\nStart with “y”.\nDon’t start with “y”.\nEnd with “x”.\nAre exactly three letters long. 
(Don’t cheat by using str_length()!)\nHave seven letters or more.\nContain a vowel-consonant pair.\nContain at least two vowel-consonant pairs in a row.\nOnly consist of repeated vowel-consonant pairs.\n\n\nCreate 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!\nSwitch the first and last letters in words. Which of those strings are still words?\n\nDescribe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)\n\n^.*$\n\"\\\\{.+\\\\}\"\n\\d{4}-\\d{2}-\\d{2}\n\"\\\\\\\\{4}\"\n\\..\\..\\..\n(.)\\1\\1\n\"(..)\\\\1\"\n\n\nSolve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner." + }, + { + "objectID": "regexps.html#pattern-control", + "href": "regexps.html#pattern-control", + "title": "15  Regular expressions", + "section": "\n15.5 Pattern control", + "text": "15.5 Pattern control\nIt’s possible to exercise extra control over the details of the match by using a pattern object instead of just a string. This allows you to control the so-called regex flags and match various types of fixed strings, as described below.\n\n15.5.1 Regex flags\nThere are a number of settings that can be used to control the details of the regexp. These settings are often called flags in other programming languages. In stringr, you can use these by wrapping the pattern in a call to regex(). The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms:\n\nbananas <- c(\"banana\", \"Banana\", \"BANANA\")\nstr_view(bananas, \"banana\")\n#> [1] │ <banana>\nstr_view(bananas, regex(\"banana\", ignore_case = TRUE))\n#> [1] │ <banana>\n#> [2] │ <Banana>\n#> [3] │ <BANANA>\n\nIf you’re doing a lot of work with multiline strings (i.e. strings that contain \\n), dotall and multiline may also be useful:\n\n\ndotall = TRUE lets . match everything, including \\n:\n\nx <- \"Line 1\\nLine 2\\nLine 3\"\nstr_view(x, \".Line\")\nstr_view(x, regex(\".Line\", dotall = TRUE))\n#> [1] │ Line 1<\n#> │ Line> 2<\n#> │ Line> 3\n\n\n\nmultiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the complete string:\n\nx <- \"Line 1\\nLine 2\\nLine 3\"\nstr_view(x, \"^Line\")\n#> [1] │ <Line> 1\n#> │ Line 2\n#> │ Line 3\nstr_view(x, regex(\"^Line\", multiline = TRUE))\n#> [1] │ <Line> 1\n#> │ <Line> 2\n#> │ <Line> 3\n\n\n\nFinally, if you’re writing a complicated regular expression and you’re worried you might not understand it in the future, you might try comments = TRUE. It tweaks the pattern language to ignore spaces and new lines, as well as everything after #. This allows you to use comments and whitespace to make complex regular expressions more understandable9, as in the following example:\n\nphone <- regex(\n r\"(\n \\(? # optional opening parens\n (\\d{3}) # area code\n [)\\-]? # optional closing parens or dash\n \\ ? # optional space\n (\\d{3}) # another three numbers\n [\\ -]? 
# optional space or dash\n (\\d{4}) # four more numbers\n )\", \n comments = TRUE\n)\n\nstr_extract(c(\"514-791-8141\", \"(123) 456 7890\", \"123456\"), phone)\n#> [1] \"514-791-8141\" \"(123) 456 7890\" NA\n\nIf you’re using comments and want to match a space, newline, or #, you’ll need to escape it with \\.\n\n15.5.2 Fixed matches\nYou can opt-out of the regular expression rules by using fixed():\n\nstr_view(c(\"\", \"a\", \".\"), fixed(\".\"))\n#> [3] │ <.>\n\nfixed() also gives you the ability to ignore case:\n\nstr_view(\"x X\", \"X\")\n#> [1] │ x <X>\nstr_view(\"x X\", fixed(\"X\", ignore_case = TRUE))\n#> [1] │ <x> <X>\n\nIf you’re working with non-English text, you will probably want coll() instead of fixed(), as it implements the full rules for capitalization as used by the locale you specify. See Seção 14.6 for more details on locales.\n\nstr_view(\"i İ ı I\", fixed(\"İ\", ignore_case = TRUE))\n#> [1] │ i <İ> ı I\nstr_view(\"i İ ı I\", coll(\"İ\", ignore_case = TRUE, locale = \"tr\"))\n#> [1] │ <i> <İ> ı I" + }, + { + "objectID": "regexps.html#practice", + "href": "regexps.html#practice", + "title": "15  Regular expressions", + "section": "\n15.6 Practice", + "text": "15.6 Practice\nTo put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:\n\nchecking your work by creating simple positive and negative controls\ncombining regular expressions with Boolean algebra\ncreating complex patterns using string manipulation\n\n\n15.6.1 Check your work\nFirst, let’s find all sentences that start with “The”. Using the ^ anchor alone is not enough:\n\nstr_view(sentences, \"^The\")\n#> [1] │ <The> birch canoe slid on the smooth planks.\n#> [4] │ <The>se days a chicken leg is a rare dish.\n#> [6] │ <The> juice of lemons makes fine punch.\n#> [7] │ <The> box was thrown beside the parked truck.\n#> [8] │ <The> hogs were fed chopped corn and garbage.\n#> [11] │ <The> boy was there when the sun rose.\n#> ... and 271 more\n\nBecause that pattern also matches sentences starting with words like They or These. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:\n\nstr_view(sentences, \"^The\\\\b\")\n#> [1] │ <The> birch canoe slid on the smooth planks.\n#> [6] │ <The> juice of lemons makes fine punch.\n#> [7] │ <The> box was thrown beside the parked truck.\n#> [8] │ <The> hogs were fed chopped corn and garbage.\n#> [11] │ <The> boy was there when the sun rose.\n#> [13] │ <The> source of the huge river is the clear spring.\n#> ... and 250 more\n\nWhat about finding all sentences that begin with a pronoun?\n\nstr_view(sentences, \"^She|He|It|They\\\\b\")\n#> [3] │ <It>'s easy to tell the depth of a well.\n#> [15] │ <He>lp the woman get back to her feet.\n#> [27] │ <He>r purse was full of useless trash.\n#> [29] │ <It> snowed, rained, and hailed the same morning.\n#> [63] │ <He> ran half way to the hardware store.\n#> [90] │ <He> lay prone and hardly moved a limb.\n#> ... and 57 more\n\nA quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:\n\nstr_view(sentences, \"^(She|He|It|They)\\\\b\")\n#> [3] │ <It>'s easy to tell the depth of a well.\n#> [29] │ <It> snowed, rained, and hailed the same morning.\n#> [63] │ <He> ran half way to the hardware store.\n#> [90] │ <He> lay prone and hardly moved a limb.\n#> [116] │ <He> ordered peach pie with ice cream.\n#> [127] │ <It> caught its hind paw in a rusty trap.\n#> ... 
and 51 more\n\nYou might wonder how you might spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:\n\npos <- c(\"He is a boy\", \"She had a good time\")\nneg <- c(\"Shells come from the sea\", \"Hadley said 'It's a great day'\")\n\npattern <- \"^(She|He|It|They)\\\\b\"\nstr_detect(pos, pattern)\n#> [1] TRUE TRUE\nstr_detect(neg, pattern)\n#> [1] FALSE FALSE\n\nIt’s typically much easier to come up with good positive examples than negative examples, because it takes a while before you’re good enough with regular expressions to predict where your weaknesses are. Nevertheless, they’re still useful: as you work on the problem you can slowly accumulate a collection of your mistakes, ensuring that you never make the same mistake twice.\n\n15.6.2 Boolean operations\nImagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels ([^aeiou]), then allow that to match any number of letters ([^aeiou]+), then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$):\n\nstr_view(words, \"^[^aeiou]+$\")\n#> [123] │ <by>\n#> [249] │ <dry>\n#> [328] │ <fly>\n#> [538] │ <mrs>\n#> [895] │ <try>\n#> [952] │ <why>\n\nBut you can make this problem a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels:\n\nstr_view(words[!str_detect(words, \"[aeiou]\")])\n#> [1] │ by\n#> [2] │ dry\n#> [3] │ fly\n#> [4] │ mrs\n#> [5] │ try\n#> [6] │ why\n\nThis is a useful technique whenever you’re dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine if you want to find all words that contain “a” and “b”. There’s no “and” operator built in to regular expressions so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”:\n\nstr_view(words, \"a.*b|b.*a\")\n#> [2] │ <ab>le\n#> [3] │ <ab>out\n#> [4] │ <ab>solute\n#> [62] │ <availab>le\n#> [66] │ <ba>by\n#> [67] │ <ba>ck\n#> ... and 24 more\n\nIt’s simpler to combine the results of two calls to str_detect():\n\nwords[str_detect(words, \"a\") & str_detect(words, \"b\")]\n#> [1] \"able\" \"about\" \"absolute\" \"available\" \"baby\" \"back\" \n#> [7] \"bad\" \"bag\" \"balance\" \"ball\" \"bank\" \"bar\" \n#> [13] \"base\" \"basis\" \"bear\" \"beat\" \"beauty\" \"because\" \n#> [19] \"black\" \"board\" \"boat\" \"break\" \"brilliant\" \"britain\" \n#> [25] \"debate\" \"husband\" \"labour\" \"maybe\" \"probable\" \"table\"\n\nWhat if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:\n\nwords[str_detect(words, \"a.*e.*i.*o.*u\")]\n# ...\nwords[str_detect(words, \"u.*o.*i.*e.*a\")]\n\nIt’s much simpler to combine five calls to str_detect():\n\nwords[\n str_detect(words, \"a\") &\n str_detect(words, \"e\") &\n str_detect(words, \"i\") &\n str_detect(words, \"o\") &\n str_detect(words, \"u\")\n]\n#> character(0)\n\nIn general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.\n\n15.6.3 Creating a pattern with code\nWhat if we wanted to find all sentences that mention a color? 
The basic idea is simple: we just combine alternation with word boundaries.\n\nstr_view(sentences, \"\\\\b(red|green|blue)\\\\b\")\n#> [2] │ Glue the sheet to the dark <blue> background.\n#> [26] │ Two <blue> fish swam in the tank.\n#> [92] │ A wisp of cloud hung in the <blue> air.\n#> [148] │ The spot on the blotter was made by <green> ink.\n#> [160] │ The sofa cushion is <red> and of light weight.\n#> [174] │ The sky that morning was clear and bright <blue>.\n#> ... and 20 more\n\nBut as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?\n\nrgb <- c(\"red\", \"green\", \"blue\")\n\nWell, we can! We’d just need to create the pattern from the vector using str_c() and str_flatten():\n\nstr_c(\"\\\\b(\", str_flatten(rgb, \"|\"), \")\\\\b\")\n#> [1] \"\\\\b(red|green|blue)\\\\b\"\n\nWe could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:\n\nstr_view(colors())\n#> [1] │ white\n#> [2] │ aliceblue\n#> [3] │ antiquewhite\n#> [4] │ antiquewhite1\n#> [5] │ antiquewhite2\n#> [6] │ antiquewhite3\n#> ... and 651 more\n\nBut let’s first eliminate the numbered variants:\n\ncols <- colors()\ncols <- cols[!str_detect(cols, \"\\\\d\")]\nstr_view(cols)\n#> [1] │ white\n#> [2] │ aliceblue\n#> [3] │ antiquewhite\n#> [4] │ aquamarine\n#> [5] │ azure\n#> [6] │ beige\n#> ... and 137 more\n\nThen we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:\n\npattern <- str_c(\"\\\\b(\", str_flatten(cols, \"|\"), \")\\\\b\")\nstr_view(sentences, pattern)\n#> [2] │ Glue the sheet to the dark <blue> background.\n#> [12] │ A rod is used to catch <pink> <salmon>.\n#> [26] │ Two <blue> fish swam in the tank.\n#> [66] │ Cars and busses stalled in <snow> drifts.\n#> [92] │ A wisp of cloud hung in the <blue> air.\n#> [112] │ Leaves turn <brown> and <yellow> in the fall.\n#> ... and 57 more\n\nIn this example, cols only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through str_escape() to ensure they match literally (a short sketch follows the exercises below).\n\n15.6.4 Exercises\n\n\nFor each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.\n\nFind all words that start or end with x.\nFind all words that start with a vowel and end with a consonant.\nAre there any words that contain at least one of each different vowel?\n\n\nConstruct patterns to find evidence for and against the rule “i before e except after c”.\ncolors() contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).\nCreate a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = \"datasets\")$results[, \"Item\"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off."
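Here is the short sketch promised above of why that escaping step matters (the strings are invented for illustration):

vals <- c("2.5", "a+b")
# Without escaping, "." and "+" keep their regex meanings:
str_detect("205", str_flatten(vals, "|"))
#> [1] TRUE
# str_escape() makes the metacharacters literal, so "2.5" no longer matches "205":
str_detect("205", str_flatten(str_escape(vals), "|"))
#> [1] FALSE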
+ }, + { + "objectID": "regexps.html#regular-expressions-in-other-places", + "href": "regexps.html#regular-expressions-in-other-places", + "title": "15  Regular expressions", + "section": "\n15.7 Regular expressions in other places", + "text": "15.7 Regular expressions in other places\nJust like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions. The following sections describe some other useful functions in the wider tidyverse and base R.\n\n15.7.1 tidyverse\nThere are three other particularly useful places where you might want to use regular expressions:\n\nmatches(pattern) will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use anywhere in any tidyverse function that selects variables (e.g., select(), rename_with() and across()).\npivot_longer()'s names_pattern argument takes a vector of regular expressions, just like separate_wider_regex(). It’s useful when extracting data out of variable names with a complex structure.\nThe delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. regex(\", ?\").\n\n15.7.2 Base R\napropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:\n\napropos(\"replace\")\n#> [1] \"%+replace%\" \"replace\" \"replace_na\" \n#> [4] \"setReplaceMethod\" \"str_replace\" \"str_replace_all\" \n#> [7] \"str_replace_na\" \"theme_replace\"\n\nlist.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the R Markdown files in the current directory with:\n\nhead(list.files(pattern = \"\\\\.Rmd$\"))\n#> character(0)\n\nIt’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the stringi package, which is in turn built on top of the ICU engine, whereas base R functions use either the TRE engine or the PCRE engine, depending on whether or not you’ve set perl = TRUE. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the (?…) syntax." + }, + { + "objectID": "regexps.html#summary", + "href": "regexps.html#summary", + "title": "15  Regular expressions", + "section": "\n15.8 Summary", + "text": "15.8 Summary\nWith every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.\nIn this chapter, you’ve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.\nA good place to start is vignette(\"regular-expressions\", package = \"stringr\"): it documents the full set of syntax supported by stringr. 
Another useful reference is https://www.regular-expressions.info/. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.\nIt’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the same conventions as stringr.\nIn the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings." + }, + { + "objectID": "regexps.html#footnotes", + "href": "regexps.html#footnotes", + "title": "15  Regular expressions", + "section": "", + "text": "You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).↩︎\nYou’ll learn how to escape these special meanings in Seção 15.4.1.↩︎\nWell, any character apart from \\n.↩︎\nThis gives us the proportion of names that contain an “x”; if you wanted the proportion of babies with a name containing an x, you’d need to perform a weighted mean.↩︎\nWe wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!↩︎\nThe complete set of metacharacters is .^$\\|*+?{}[]()↩︎\nRemember, to create a regular expression containing \\d or \\s, you’ll need to escape the \\ for the string, so you’ll type \"\\\\d\" or \"\\\\s\".↩︎\nMostly because we never discuss matrices in this book!↩︎\ncomments = TRUE is particularly effective in combination with a raw string, as we use here.↩︎" + }, + { + "objectID": "factors.html#introduction", + "href": "factors.html#introduction", + "title": "16  Factors", + "section": "\n16.1 Introduction", + "text": "16.1 Introduction\nFactors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.\nWe’ll start by motivating why factors are needed for data analysis1 and how you can create them with factor(). We’ll then introduce you to the gss_cat dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.\n\n16.1.1 Prerequisites\nBase R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse. It provides tools for dealing with categorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.\n\nlibrary(tidyverse)" + }, + { + "objectID": "factors.html#factor-basics", + "href": "factors.html#factor-basics", + "title": "16  Factors", + "section": "\n16.2 Factor basics", + "text": "16.2 Factor basics\nImagine that you have a variable that records month:\n\nx1 <- c(\"Dec\", \"Apr\", \"Jan\", \"Mar\")\n\nUsing a string to record this variable has two problems:\n\n\nThere are only twelve possible months, and there’s nothing saving you from typos:\n\nx2 <- c(\"Dec\", \"Apr\", \"Jam\", \"Mar\")\n\n\n\nIt doesn’t sort in a useful way:\n\nsort(x1)\n#> [1] \"Apr\" \"Dec\" \"Jan\" \"Mar\"\n\n\n\nYou can fix both of these problems with a factor. 
To create a factor you must start by creating a list of the valid levels:\n\nmonth_levels <- c(\n \"Jan\", \"Feb\", \"Mar\", \"Apr\", \"May\", \"Jun\", \n \"Jul\", \"Aug\", \"Sep\", \"Oct\", \"Nov\", \"Dec\"\n)\n\nNow you can create a factor:\n\ny1 <- factor(x1, levels = month_levels)\ny1\n#> [1] Dec Apr Jan Mar\n#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec\n\nsort(y1)\n#> [1] Jan Mar Apr Dec\n#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec\n\nAnd any values not in the level will be silently converted to NA:\n\ny2 <- factor(x2, levels = month_levels)\ny2\n#> [1] Dec Apr <NA> Mar \n#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec\n\nThis seems risky, so you might want to use forcats::fct() instead:\n\ny2 <- fct(x2, levels = month_levels)\n#> Error in `fct()`:\n#> ! All values of `x` must appear in `levels` or `na`\n#> ℹ Missing level: \"Jam\"\n\nIf you omit the levels, they’ll be taken from the data in alphabetical order:\n\nfactor(x1)\n#> [1] Dec Apr Jan Mar\n#> Levels: Apr Dec Jan Mar\n\nSorting alphabetically is slightly risky because not every computer will sort strings in the same way. So forcats::fct() orders by first appearance:\n\nfct(x1)\n#> [1] Dec Apr Jan Mar\n#> Levels: Dec Apr Jan Mar\n\nIf you ever need to access the set of valid levels directly, you can do so with levels():\n\nlevels(y2)\n#> [1] \"Jan\" \"Feb\" \"Mar\" \"Apr\" \"May\" \"Jun\" \"Jul\" \"Aug\" \"Sep\" \"Oct\" \"Nov\" \"Dec\"\n\nYou can also create a factor when reading your data with readr with col_factor():\n\ncsv <- \"\nmonth,value\nJan,12\nFeb,56\nMar,12\"\n\ndf <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))\ndf$month\n#> [1] Jan Feb Mar\n#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec" + }, + { + "objectID": "factors.html#general-social-survey", + "href": "factors.html#general-social-survey", + "title": "16  Factors", + "section": "\n16.3 General Social Survey", + "text": "16.3 General Social Survey\nFor the rest of this chapter, we’re going to use forcats::gss_cat. It’s a sample of data from the General Social Survey, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.\n\ngss_cat\n#> # A tibble: 21,483 × 9\n#> year marital age race rincome partyid \n#> <int> <fct> <int> <fct> <fct> <fct> \n#> 1 2000 Never married 26 White $8000 to 9999 Ind,near rep \n#> 2 2000 Divorced 48 White $8000 to 9999 Not str republican\n#> 3 2000 Widowed 67 White Not applicable Independent \n#> 4 2000 Never married 39 White Not applicable Ind,near rep \n#> 5 2000 Divorced 25 White Not applicable Not str democrat \n#> 6 2000 Married 25 White $20000 - 24999 Strong democrat \n#> # ℹ 21,477 more rows\n#> # ℹ 3 more variables: relig <fct>, denom <fct>, tvhours <int>\n\n(Remember, since this dataset is provided by a package, you can get more information about the variables with ?gss_cat.)\nWhen factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with count():\n\ngss_cat |>\n count(race)\n#> # A tibble: 3 × 2\n#> race n\n#> <fct> <int>\n#> 1 Other 1959\n#> 2 Black 3129\n#> 3 White 16395\n\nWhen working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. 
Those operations are described in the sections below.\n\n16.3.1 Exercise\n\nExplore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?\nWhat is the most common relig in this survey? What’s the most common partyid?\nWhich relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?" + }, + { + "objectID": "factors.html#sec-modifying-factor-order", + "href": "factors.html#sec-modifying-factor-order", + "title": "16  Factors", + "section": "\n16.4 Modifying factor order", + "text": "16.4 Modifying factor order\nIt’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:\n\nrelig_summary <- gss_cat |>\n group_by(relig) |>\n summarize(\n tvhours = mean(tvhours, na.rm = TRUE),\n n = n()\n )\n\nggplot(relig_summary, aes(x = tvhours, y = relig)) + \n geom_point()\n\n\n\n\nIt is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:\n\n\nf, the factor whose levels you want to modify.\n\nx, a numeric vector that you want to use to reorder the levels.\nOptionally, fun, a function that’s used if there are multiple values of x for each value of f. The default value is median.\n\n\nggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +\n geom_point()\n\n\n\n\nReordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.\nAs you start making more complicated transformations, we recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:\n\nrelig_summary |>\n mutate(\n relig = fct_reorder(relig, tvhours)\n ) |>\n ggplot(aes(x = tvhours, y = relig)) +\n geom_point()\n\nWhat if we create a similar plot looking at how average age varies across reported income level?\n\nrincome_summary <- gss_cat |>\n group_by(rincome) |>\n summarize(\n age = mean(age, na.rm = TRUE),\n n = n()\n )\n\nggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) + \n geom_point()\n\n\n\n\nHere, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.\nHowever, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel(). It takes a factor, f, and then any number of levels that you want to move to the front of the line.\n\nggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, \"Not applicable\"))) +\n geom_point()\n\n\n\n\nWhy do you think the average age for “Not applicable” is so high?\nAnother type of reordering is useful when you are coloring the lines on a plot. fct_reorder2(f, x, y) reorders the factor f by the y values associated with the largest x values. 
This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.\n\nby_age <- gss_cat |>\n filter(!is.na(age)) |> \n count(age, marital) |>\n group_by(age) |>\n mutate(\n prop = n / sum(n)\n )\n\nggplot(by_age, aes(x = age, y = prop, color = marital)) +\n geom_line(linewidth = 1) + \n scale_color_brewer(palette = \"Set1\")\n\nggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +\n geom_line(linewidth = 1) +\n scale_color_brewer(palette = \"Set1\") + \n labs(color = \"marital\") \n\n\n\n\n\n\n\n\n\n\n\nFinally, for bar plots, you can use fct_infreq() to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with fct_rev() if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.\n\ngss_cat |>\n mutate(marital = marital |> fct_infreq() |> fct_rev()) |>\n ggplot(aes(x = marital)) +\n geom_bar()\n\n\n\n\n\n16.4.1 Exercises\n\nThere are some suspiciously high numbers in tvhours. Is the mean a good summary?\nFor each factor in gss_cat identify whether the order of the levels is arbitrary or principled.\nWhy did moving “Not applicable” to the front of the levels move it to the bottom of the plot?" + }, + { + "objectID": "factors.html#modifying-factor-levels", + "href": "factors.html#modifying-factor-levels", + "title": "16  Factors", + "section": "\n16.5 Modifying factor levels", + "text": "16.5 Modifying factor levels\nMore powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level. For example, take the partyid variable from the gss_cat data frame:\n\ngss_cat |> count(partyid)\n#> # A tibble: 10 × 2\n#> partyid n\n#> <fct> <int>\n#> 1 No answer 154\n#> 2 Don't know 1\n#> 3 Other party 393\n#> 4 Strong republican 2314\n#> 5 Not str republican 3032\n#> 6 Ind,near rep 1791\n#> # ℹ 4 more rows\n\nThe levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. 
Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:\n\ngss_cat |>\n mutate(\n partyid = fct_recode(partyid,\n \"Republican, strong\" = \"Strong republican\",\n \"Republican, weak\" = \"Not str republican\",\n \"Independent, near rep\" = \"Ind,near rep\",\n \"Independent, near dem\" = \"Ind,near dem\",\n \"Democrat, weak\" = \"Not str democrat\",\n \"Democrat, strong\" = \"Strong democrat\"\n )\n ) |>\n count(partyid)\n#> # A tibble: 10 × 2\n#> partyid n\n#> <fct> <int>\n#> 1 No answer 154\n#> 2 Don't know 1\n#> 3 Other party 393\n#> 4 Republican, strong 2314\n#> 5 Republican, weak 3032\n#> 6 Independent, near rep 1791\n#> # ℹ 4 more rows\n\nfct_recode() will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.\nTo combine groups, you can assign multiple old levels to the same new level:\n\ngss_cat |>\n mutate(\n partyid = fct_recode(partyid,\n \"Republican, strong\" = \"Strong republican\",\n \"Republican, weak\" = \"Not str republican\",\n \"Independent, near rep\" = \"Ind,near rep\",\n \"Independent, near dem\" = \"Ind,near dem\",\n \"Democrat, weak\" = \"Not str democrat\",\n \"Democrat, strong\" = \"Strong democrat\",\n \"Other\" = \"No answer\",\n \"Other\" = \"Don't know\",\n \"Other\" = \"Other party\"\n )\n )\n\nUse this technique with care: if you group together categories that are truly different you will end up with misleading results.\nIf you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels:\n\ngss_cat |>\n mutate(\n partyid = fct_collapse(partyid,\n \"other\" = c(\"No answer\", \"Don't know\", \"Other party\"),\n \"rep\" = c(\"Strong republican\", \"Not str republican\"),\n \"ind\" = c(\"Ind,near rep\", \"Independent\", \"Ind,near dem\"),\n \"dem\" = c(\"Not str democrat\", \"Strong democrat\")\n )\n ) |>\n count(partyid)\n#> # A tibble: 4 × 2\n#> partyid n\n#> <fct> <int>\n#> 1 other 548\n#> 2 rep 5346\n#> 3 ind 8409\n#> 4 dem 7180\n\nSometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the fct_lump_*() family of functions. fct_lump_lowfreq() is a simple starting point that progressively lumps the smallest categories into “Other”, always keeping “Other” as the smallest category.\n\ngss_cat |>\n mutate(relig = fct_lump_lowfreq(relig)) |>\n count(relig)\n#> # A tibble: 2 × 2\n#> relig n\n#> <fct> <int>\n#> 1 Protestant 10846\n#> 2 Other 10637\n\nIn this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use fct_lump_n() to specify that we want exactly 10 groups:\n\ngss_cat |>\n mutate(relig = fct_lump_n(relig, n = 10)) |>\n count(relig, sort = TRUE)\n#> # A tibble: 10 × 2\n#> relig n\n#> <fct> <int>\n#> 1 Protestant 10846\n#> 2 Catholic 5124\n#> 3 None 3523\n#> 4 Christian 689\n#> 5 Other 458\n#> 6 Jewish 388\n#> # ℹ 4 more rows\n\nRead the documentation to learn about fct_lump_min() and fct_lump_prop(), which are useful in other cases.\n\n16.5.1 Exercises\n\nHow have the proportions of people identifying as Democrat, Republican, and Independent changed over time?\nHow could you collapse rincome into a small set of categories?\nNotice there are 9 groups (excluding other) in the fct_lump example above. Why not 10? 
(Hint: type ?fct_lump, and find the default for the argument other_level is “Other”.)" + }, + { + "objectID": "factors.html#sec-ordered-factors", + "href": "factors.html#sec-ordered-factors", + "title": "16  Factors", + "section": "\n16.6 Ordered factors", + "text": "16.6 Ordered factors\nBefore we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with ordered(), imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use < between the factor levels:\n\nordered(c(\"a\", \"b\", \"c\"))\n#> [1] a b c\n#> Levels: a < b < c\n\nIn practice, ordered() factors behave very similarly to regular factors. There are only two places where you might notice different behavior:\n\nIf you map an ordered factor to color or fill in ggplot2, it will default to scale_color_viridis()/scale_fill_viridis(), a color scale that implies a ranking.\nIf you use an ordered factor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend vignette(\"contrasts\", package = \"faux\") by Lisa DeBruine.\n\nGiven the arguable utility of these differences, we don’t generally recommend using ordered factors." + }, + { + "objectID": "factors.html#summary", + "href": "factors.html#summary", + "title": "16  Factors", + "section": "\n16.7 Summary", + "text": "16.7 Summary\nThis chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, we highly recommend skimming the reference index to see if there’s a canned function that can help solve your problem.\nIf you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, Wrangling categorical data in R. This paper lays out some of the history discussed in stringsAsFactors: An unauthorized biography and stringsAsFactors = <sigh>, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!\nIn the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!" + }, + { + "objectID": "factors.html#footnotes", + "href": "factors.html#footnotes", + "title": "16  Factors", + "section": "", + "text": "They’re also really important for modelling.↩︎" + }, + { + "objectID": "datetimes.html#introduction", + "href": "datetimes.html#introduction", + "title": "17  Dates and times", + "section": "\n17.1 Introduction", + "text": "17.1 Introduction\nThis chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. 
However, the more you learn about dates and times, the more complicated they seem to get!\nTo warm up, think about how many days there are in a year, and how many hours there are in a day. You probably remembered that most years have 365 days, but leap years have 366. Do you know the full rule for determining if a year is a leap year1? The number of hours in a day is a little less obvious: most days have 24 hours, but in places that use daylight saving time (DST), one day each year has 23 hours and another has 25.\nDates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won’t teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.\nWe’ll begin by showing you how to create date-times from various inputs, and then once you’ve got a date-time, how you can extract components like year, month, and day. We’ll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you’re trying to do. We’ll conclude with a brief discussion of the additional challenges posed by time zones.\n\n17.1.1 Prerequisites\nThis chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. As of the latest tidyverse release, lubridate is part of core tidyverse. We will also need nycflights13 for practice data.\n\nlibrary(tidyverse)\nlibrary(nycflights13)" + }, + { + "objectID": "datetimes.html#sec-creating-datetimes", + "href": "datetimes.html#sec-creating-datetimes", + "title": "17  Dates and times", + "section": "\n17.2 Creating date/times", + "text": "17.2 Creating date/times\nThere are three types of date/time data that refer to an instant in time:\n\nA date. Tibbles print this as <date>.\nA time within a day. Tibbles print this as <time>.\nA date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but that name doesn’t exactly trip off the tongue.\n\nIn this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.\nYou should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. 
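For instance, a quick sketch of the difference between the two types (lubridate is attached with the tidyverse; the date is arbitrary):

d <- ymd("2022-05-03")                # a date
dt <- ymd_hms("2022-05-03 16:26:00")  # a date-time (lubridate defaults to UTC)
class(d)
#> [1] "Date"
class(dt)
#> [1] "POSIXct" "POSIXt"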
Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.\nTo get the current date or date-time you can use today() or now():\n\ntoday()\n#> [1] \"2023-11-17\"\nnow()\n#> [1] \"2023-11-17 17:43:56 UTC\"\n\nOtherwise, the following sections describe the four ways you’re likely to create a date/time:\n\nWhile reading a file with readr.\nFrom a string.\nFrom individual date-time components.\nFrom an existing date/time object.\n\n\n17.2.1 During import\nIf your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:\n\ncsv <- \"\n date,datetime\n 2022-01-02,2022-01-02 05:12\n\"\nread_csv(csv)\n#> # A tibble: 1 × 2\n#> date datetime \n#> <date> <dttm> \n#> 1 2022-01-02 2022-01-02 05:12:00\n\nIf you haven’t heard of ISO8601 before, it’s an international standard2 for writing dates where the components of a date are organized from biggest to smallest separated by -. For example, in ISO8601 May 3 2022 is 2022-05-03. ISO8601 dates can also include times, where hour, minute, and second are separated by :, and the date and time components are separated by either a T or a space. For example, you could write 4:26pm on May 3 2022 as either 2022-05-03 16:26 or 2022-05-03T16:26.\nFor other date-time formats, you’ll need to use col_types plus col_date() or col_datetime() along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a % followed by a single character. For example, %Y-%m-%d specifies a date that’s a year, -, month (as number) -, day. Tabela 17.1 lists all the options.\n\n\nTabela 17.1: All date formats understood by readr\n\nType | Code | Meaning | Example\nYear | %Y | 4 digit year | 2021\n | %y | 2 digit year | 21\nMonth | %m | Number | 2\n | %b | Abbreviated name | Feb\n | %B | Full name | February\nDay | %d | One or two digits | 2\n | %e | Two digits | 02\nTime | %H | 24-hour hour | 13\n | %I | 12-hour hour | 1\n | %p | AM/PM | pm\n | %M | Minutes | 35\n | %S | Seconds | 45\n | %OS | Seconds with decimal component | 45.35\n | %Z | Time zone name | America/Chicago\n | %z | Offset from UTC | +0800\nOther | %. | Skip one non-digit | :\n | %* | Skip any number of non-digits | \n\nAnd this code shows a few options applied to a very ambiguous date:\n\ncsv <- \"\n date\n 01/02/15\n\"\n\nread_csv(csv, col_types = cols(date = col_date(\"%m/%d/%y\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2015-01-02\n\nread_csv(csv, col_types = cols(date = col_date(\"%d/%m/%y\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2015-02-01\n\nread_csv(csv, col_types = cols(date = col_date(\"%y/%m/%d\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2001-02-15\n\nNote that no matter how you specify the date format, it’s always displayed the same way once you get it into R.\nIf you’re using %b or %B and working with non-English dates, you’ll also need to provide a locale(). See the list of built-in languages in date_names_langs(), or create your own with date_names().\n\n17.2.2 From strings\nThe date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers which attempt to automatically determine the format once you specify the order of the component. 
To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:\n\nymd(\"2017-01-31\")\n#> [1] \"2017-01-31\"\nmdy(\"January 31st, 2017\")\n#> [1] \"2017-01-31\"\ndmy(\"31-Jan-2017\")\n#> [1] \"2017-01-31\"\n\nymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:\n\nymd_hms(\"2017-01-31 20:11:59\")\n#> [1] \"2017-01-31 20:11:59 UTC\"\nmdy_hm(\"01/31/2017 08:01\")\n#> [1] \"2017-01-31 08:01:00 UTC\"\n\nYou can also force the creation of a date-time from a date by supplying a timezone:\n\nymd(\"2017-01-31\", tz = \"UTC\")\n#> [1] \"2017-01-31 UTC\"\n\nHere I use the UTC3 timezone which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude4. It doesn’t use daylight saving time, making it a bit easier to compute with.\n\n17.2.3 From individual components\nInstead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:\n\nflights |> \n select(year, month, day, hour, minute)\n#> # A tibble: 336,776 × 5\n#> year month day hour minute\n#> <int> <int> <int> <dbl> <dbl>\n#> 1 2013 1 1 5 15\n#> 2 2013 1 1 5 29\n#> 3 2013 1 1 5 40\n#> 4 2013 1 1 5 45\n#> 5 2013 1 1 6 0\n#> 6 2013 1 1 5 58\n#> # ℹ 336,770 more rows\n\nTo create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:\n\nflights |> \n select(year, month, day, hour, minute) |> \n mutate(departure = make_datetime(year, month, day, hour, minute))\n#> # A tibble: 336,776 × 6\n#> year month day hour minute departure \n#> <int> <int> <int> <dbl> <dbl> <dttm> \n#> 1 2013 1 1 5 15 2013-01-01 05:15:00\n#> 2 2013 1 1 5 29 2013-01-01 05:29:00\n#> 3 2013 1 1 5 40 2013-01-01 05:40:00\n#> 4 2013 1 1 5 45 2013-01-01 05:45:00\n#> 5 2013 1 1 6 0 2013-01-01 06:00:00\n#> 6 2013 1 1 5 58 2013-01-01 05:58:00\n#> # ℹ 336,770 more rows\n\nLet’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. 
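To see what that modulus arithmetic does, consider a time stored as 517, i.e., 5:17am:

517 %/% 100  # integer division pulls out the hour
#> [1] 5
517 %% 100   # the remainder pulls out the minute
#> [1] 17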
Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.\n\nmake_datetime_100 <- function(year, month, day, time) {\n make_datetime(year, month, day, time %/% 100, time %% 100)\n}\n\nflights_dt <- flights |> \n filter(!is.na(dep_time), !is.na(arr_time)) |> \n mutate(\n dep_time = make_datetime_100(year, month, day, dep_time),\n arr_time = make_datetime_100(year, month, day, arr_time),\n sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),\n sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)\n ) |> \n select(origin, dest, ends_with(\"delay\"), ends_with(\"time\"))\n\nflights_dt\n#> # A tibble: 328,063 × 9\n#> origin dest dep_delay arr_delay dep_time sched_dep_time \n#> <chr> <chr> <dbl> <dbl> <dttm> <dttm> \n#> 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00\n#> 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00\n#> 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00\n#> 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00\n#> 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00\n#> 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00\n#> # ℹ 328,057 more rows\n#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …\n\nWith this data, we can visualize the distribution of departure times across the year:\n\nflights_dt |> \n ggplot(aes(x = dep_time)) + \n geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day\n\n\n\n\nOr within a single day:\n\nflights_dt |> \n filter(dep_time < ymd(20130102)) |> \n ggplot(aes(x = dep_time)) + \n geom_freqpoly(binwidth = 600) # 600 s = 10 minutes\n\n\n\n\nNote that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.\n\n17.2.4 From other types\nYou may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():\n\nas_datetime(today())\n#> [1] \"2023-11-17 UTC\"\nas_date(now())\n#> [1] \"2023-11-17\"\n\nSometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().\n\nas_datetime(60 * 60 * 10)\n#> [1] \"1970-01-01 10:00:00 UTC\"\nas_date(365 * 10 + 2)\n#> [1] \"1980-01-01\"\n\n\n17.2.5 Exercises\n\n\nWhat happens if you parse a string that contains invalid dates?\n\nymd(c(\"2010-10-10\", \"bananas\"))\n\n\nWhat does the tzone argument to today() do? Why is it important?\n\nFor each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.\n\nd1 <- \"January 1, 2010\"\nd2 <- \"2015-Mar-07\"\nd3 <- \"06-Jun-2017\"\nd4 <- c(\"August 19 (2015)\", \"July 1 (2015)\")\nd5 <- \"12/30/14\" # Dec 30, 2014\nt1 <- \"1705\"\nt2 <- \"11:15:10.12 PM\"" + }, + { + "objectID": "datetimes.html#date-time-components", + "href": "datetimes.html#date-time-components", + "title": "17  Dates and times", + "section": "\n17.3 Date-time components", + "text": "17.3 Date-time components\nNow that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. 
The next section will look at how arithmetic works with date-times.\n\n17.3.1 Getting components\nYou can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second(). These are effectively the opposites of make_datetime().\n\ndatetime <- ymd_hms(\"2026-07-08 12:34:56\")\n\nyear(datetime)\n#> [1] 2026\nmonth(datetime)\n#> [1] 7\nmday(datetime)\n#> [1] 8\n\nyday(datetime)\n#> [1] 189\nwday(datetime)\n#> [1] 4\n\nFor month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.\n\nmonth(datetime, label = TRUE)\n#> [1] Jul\n#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec\nwday(datetime, label = TRUE, abbr = FALSE)\n#> [1] Wednesday\n#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday\n\nWe can use wday() to see that more flights depart during the week than on the weekend:\n\nflights_dt |> \n mutate(wday = wday(dep_time, label = TRUE)) |> \n ggplot(aes(x = wday)) +\n geom_bar()\n\n\n\n\nWe can also look at the average departure delay by minute within the hour. There’s an interesting pattern: flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!\n\nflights_dt |> \n mutate(minute = minute(dep_time)) |> \n group_by(minute) |> \n summarize(\n avg_delay = mean(dep_delay, na.rm = TRUE),\n n = n()\n ) |> \n ggplot(aes(x = minute, y = avg_delay)) +\n geom_line()\n\n\n\n\nInterestingly, if we look at the scheduled departure time we don’t see such a strong pattern:\n\nsched_dep <- flights_dt |> \n mutate(minute = minute(sched_dep_time)) |> \n group_by(minute) |> \n summarize(\n avg_delay = mean(arr_delay, na.rm = TRUE),\n n = n()\n )\n\nggplot(sched_dep, aes(x = minute, y = avg_delay)) +\n geom_line()\n\n\n\n\nSo why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times, as Figura 17.1 shows. Always be alert for this sort of pattern whenever you work with data that involves human judgement!\n\n\n\n\nFigura 17.1: A frequency polygon showing the number of flights scheduled to depart each hour. You can see a strong preference for round numbers like 0 and 30 and generally for numbers that are a multiple of five.\n\n\n\n\n17.3.2 Rounding\nAn alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:\n\nflights_dt |> \n count(week = floor_date(dep_time, \"week\")) |> \n ggplot(aes(x = week, y = n)) +\n geom_line() + \n geom_point()\n\n\n\n\nYou can use rounding to show the distribution of flights across the course of a day by computing the difference between dep_time and the earliest instant of that day:\n\nflights_dt |> \n mutate(dep_hour = dep_time - floor_date(dep_time, \"day\")) |> \n ggplot(aes(x = dep_hour)) +\n geom_freqpoly(binwidth = 60 * 30)\n#> Don't know how to automatically pick scale for object of type <difftime>.\n#> Defaulting to continuous.\n\n\n\n\nComputing the difference between a pair of date-times yields a difftime (more on that in Seção 17.4.3). 
We can convert that to an hms object to get a more useful x-axis:\n\nflights_dt |> \n mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, \"day\"))) |> \n ggplot(aes(x = dep_hour)) +\n geom_freqpoly(binwidth = 60 * 30)\n\n\n\n\n\n17.3.3 Modifying components\nYou can also use each accessor function to modify the components of a date/time. This doesn’t come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.\n\n(datetime <- ymd_hms(\"2026-07-08 12:34:56\"))\n#> [1] \"2026-07-08 12:34:56 UTC\"\n\nyear(datetime) <- 2030\ndatetime\n#> [1] \"2030-07-08 12:34:56 UTC\"\nmonth(datetime) <- 01\ndatetime\n#> [1] \"2030-01-08 12:34:56 UTC\"\nhour(datetime) <- hour(datetime) + 1\ndatetime\n#> [1] \"2030-01-08 13:34:56 UTC\"\n\nAlternatively, rather than modifying an existing variable, you can create a new date-time with update(). This also allows you to set multiple values in one step:\n\nupdate(datetime, year = 2030, month = 2, mday = 2, hour = 2)\n#> [1] \"2030-02-02 02:34:56 UTC\"\n\nIf values are too big, they will roll-over:\n\nupdate(ymd(\"2023-02-01\"), mday = 30)\n#> [1] \"2023-03-02\"\nupdate(ymd(\"2023-02-01\"), hour = 400)\n#> [1] \"2023-02-17 16:00:00 UTC\"\n\n\n17.3.4 Exercises\n\nHow does the distribution of flight times within a day change over the course of the year?\nCompare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.\nCompare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)\nHow does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?\nOn what day of the week should you leave if you want to minimise the chance of a delay?\nWhat makes the distribution of diamonds$carat and flights$sched_dep_time similar?\nConfirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed." + }, + { + "objectID": "datetimes.html#time-spans", + "href": "datetimes.html#time-spans", + "title": "17  Dates and times", + "section": "\n17.4 Time spans", + "text": "17.4 Time spans\nNext you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:\n\n\nDurations, which represent an exact number of seconds.\n\nPeriods, which represent human units like weeks and months.\n\nIntervals, which represent a starting and ending point.\n\nHow do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.\n\n17.4.1 Durations\nIn R, when you subtract two dates, you get a difftime object:\n\n# How old is Hadley?\nh_age <- today() - ymd(\"1979-10-14\")\nh_age\n#> Time difference of 16105 days\n\nA difftime class object records a time span of seconds, minutes, hours, days, or weeks. 
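For example, a small illustration of that unit-dependence (the dates are arbitrary):

ymd("2024-01-02") - ymd("2024-01-01")
#> Time difference of 1 days
as.difftime(60, units = "mins")
#> Time difference of 60 mins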
This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.\n\nas.duration(h_age)\n#> [1] \"1391472000s (~44.09 years)\"\n\nDurations come with a bunch of convenient constructors:\n\ndseconds(15)\n#> [1] \"15s\"\ndminutes(10)\n#> [1] \"600s (~10 minutes)\"\ndhours(c(12, 24))\n#> [1] \"43200s (~12 hours)\" \"86400s (~1 days)\"\nddays(0:5)\n#> [1] \"0s\" \"86400s (~1 days)\" \"172800s (~2 days)\"\n#> [4] \"259200s (~3 days)\" \"345600s (~4 days)\" \"432000s (~5 days)\"\ndweeks(3)\n#> [1] \"1814400s (~3 weeks)\"\ndyears(1)\n#> [1] \"31557600s (~1 years)\"\n\nDurations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.\nYou can add and multiply durations:\n\n2 * dyears(1)\n#> [1] \"63115200s (~2 years)\"\ndyears(1) + dweeks(12) + dhours(15)\n#> [1] \"38869200s (~1.23 years)\"\n\nYou can add and subtract durations to and from days:\n\ntomorrow <- today() + ddays(1)\nlast_year <- today() - dyears(1)\n\nHowever, because durations represent an exact number of seconds, sometimes you might get an unexpected result:\n\none_am <- ymd_hms(\"2026-03-08 01:00:00\", tz = \"America/New_York\")\n\none_am\n#> [1] \"2026-03-08 01:00:00 EST\"\none_am + ddays(1)\n#> [1] \"2026-03-09 02:00:00 EDT\"\n\nWhy is one day after 1am March 8, 2am March 9? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.\n\n17.4.2 Periods\nTo solve this problem, lubridate provides periods. Periods are time spans but don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:\n\none_am\n#> [1] \"2026-03-08 01:00:00 EST\"\none_am + days(1)\n#> [1] \"2026-03-09 01:00:00 EDT\"\n\nLike durations, periods can be created with a number of friendly constructor functions.\n\nhours(c(12, 24))\n#> [1] \"12H 0M 0S\" \"24H 0M 0S\"\ndays(7)\n#> [1] \"7d 0H 0M 0S\"\nmonths(1:6)\n#> [1] \"1m 0d 0H 0M 0S\" \"2m 0d 0H 0M 0S\" \"3m 0d 0H 0M 0S\" \"4m 0d 0H 0M 0S\"\n#> [5] \"5m 0d 0H 0M 0S\" \"6m 0d 0H 0M 0S\"\n\nYou can add and multiply periods:\n\n10 * (months(6) + days(1))\n#> [1] \"60m 10d 0H 0M 0S\"\ndays(50) + hours(25) + minutes(2)\n#> [1] \"50d 25H 2M 0S\"\n\nAnd of course, add them to dates. Compared to durations, periods are more likely to do what you expect:\n\n# A leap year\nymd(\"2024-01-01\") + dyears(1)\n#> [1] \"2024-12-31 06:00:00 UTC\"\nymd(\"2024-01-01\") + years(1)\n#> [1] \"2025-01-01\"\n\n# Daylight saving time\none_am + ddays(1)\n#> [1] \"2026-03-09 02:00:00 EDT\"\none_am + days(1)\n#> [1] \"2026-03-09 01:00:00 EDT\"\n\nLet’s use periods to fix an oddity related to our flight dates. 
Some planes appear to have arrived at their destination before they departed from New York City.\n\nflights_dt |> \n filter(arr_time < dep_time) \n#> # A tibble: 10,633 × 9\n#> origin dest dep_delay arr_delay dep_time sched_dep_time \n#> <chr> <chr> <dbl> <dbl> <dttm> <dttm> \n#> 1 EWR BQN 9 -4 2013-01-01 19:29:00 2013-01-01 19:20:00\n#> 2 JFK DFW 59 NA 2013-01-01 19:39:00 2013-01-01 18:40:00\n#> 3 EWR TPA -2 9 2013-01-01 20:58:00 2013-01-01 21:00:00\n#> 4 EWR SJU -6 -12 2013-01-01 21:02:00 2013-01-01 21:08:00\n#> 5 EWR SFO 11 -14 2013-01-01 21:08:00 2013-01-01 20:57:00\n#> 6 LGA FLL -10 -2 2013-01-01 21:20:00 2013-01-01 21:30:00\n#> # ℹ 10,627 more rows\n#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …\n\nThese are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.\n\nflights_dt <- flights_dt |> \n mutate(\n overnight = arr_time < dep_time,\n arr_time = arr_time + days(overnight),\n sched_arr_time = sched_arr_time + days(overnight)\n )\n\nNow all of our flights obey the laws of physics.\n\nflights_dt |> \n filter(arr_time < dep_time) \n#> # A tibble: 0 × 10\n#> # ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,\n#> # arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>, …\n\n\n17.4.3 Intervals\nWhat does dyears(1) / ddays(365) return? It’s not quite one, because dyears() is defined as the number of seconds per average year, which is 365.25 days.\nWhat does years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:\n\nyears(1) / days(1)\n#> [1] 365.25\n\nIf you want a more accurate measurement, you’ll have to use an interval. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.\nYou can create an interval by writing start %--% end:\n\ny2023 <- ymd(\"2023-01-01\") %--% ymd(\"2024-01-01\")\ny2024 <- ymd(\"2024-01-01\") %--% ymd(\"2025-01-01\")\n\ny2023\n#> [1] 2023-01-01 UTC--2024-01-01 UTC\ny2024\n#> [1] 2024-01-01 UTC--2025-01-01 UTC\n\nYou could then divide it by days() to find out how many days fit in the year:\n\ny2023 / days(1)\n#> [1] 365\ny2024 / days(1)\n#> [1] 366\n\n\n17.4.4 Exercises\n\nExplain days(!overnight) and days(overnight) to someone who has just started learning R. What is the key fact you need to know?\nCreate a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.\nWrite a function that given your birthday (as a date), returns how old you are in years.\nWhy can’t (today() %--% (today() + years(1))) / months(1) work?" + }, + { + "objectID": "datetimes.html#time-zones", + "href": "datetimes.html#time-zones", + "title": "17  Dates and times", + "section": "\n17.5 Time zones", + "text": "17.5 Time zones\nTime zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don’t need to dig into all the details as they’re not all important for data analysis, but there are a few challenges we’ll need to tackle head on.\n\nThe first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. 
However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme {area}/{location}, typically in the form {continent}/{city} or {ocean}/{city}. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.\nYou might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades’ worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at https://www.iana.org/time-zones) just to read some of these stories!\nYou can find out what R thinks your current time zone is with Sys.timezone():\n\nSys.timezone()\n#> [1] \"UTC\"\n\n(If R doesn’t know, you’ll get an NA.)\nAnd see the complete list of all time zone names with OlsonNames():\n\nlength(OlsonNames())\n#> [1] 597\nhead(OlsonNames())\n#> [1] \"Africa/Abidjan\" \"Africa/Accra\" \"Africa/Addis_Ababa\"\n#> [4] \"Africa/Algiers\" \"Africa/Asmara\" \"Africa/Asmera\"\n\nIn R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:\n\nx1 <- ymd_hms(\"2024-06-01 12:00:00\", tz = \"America/New_York\")\nx1\n#> [1] \"2024-06-01 12:00:00 EDT\"\n\nx2 <- ymd_hms(\"2024-06-01 18:00:00\", tz = \"Europe/Copenhagen\")\nx2\n#> [1] \"2024-06-01 18:00:00 CEST\"\n\nx3 <- ymd_hms(\"2024-06-02 04:00:00\", tz = \"Pacific/Auckland\")\nx3\n#> [1] \"2024-06-02 04:00:00 NZST\"\n\nYou can verify that they’re the same time using subtraction:\n\nx1 - x2\n#> Time difference of 0 secs\nx1 - x3\n#> Time difference of 0 secs\n\nUnless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like c(), will often drop the time zone. In that case, the date-times will display in the time zone of the first element:\n\nx4 <- c(x1, x2, x3)\nx4\n#> [1] \"2024-06-01 12:00:00 EDT\" \"2024-06-01 12:00:00 EDT\"\n#> [3] \"2024-06-01 12:00:00 EDT\"\n\nYou can change the time zone in two ways:\n\n\nKeep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.\n\nx4a <- with_tz(x4, tzone = \"Australia/Lord_Howe\")\nx4a\n#> [1] \"2024-06-02 02:30:00 +1030\" \"2024-06-02 02:30:00 +1030\"\n#> [3] \"2024-06-02 02:30:00 +1030\"\nx4a - x4\n#> Time differences in secs\n#> [1] 0 0 0\n\n(This also illustrates another challenge of time zones: they’re not all integer hour offsets!)\n\n\nChange the underlying instant in time. 
Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.\n\nx4b <- force_tz(x4, tzone = \"Australia/Lord_Howe\")\nx4b\n#> [1] \"2024-06-01 12:00:00 +1030\" \"2024-06-01 12:00:00 +1030\"\n#> [3] \"2024-06-01 12:00:00 +1030\"\nx4b - x4\n#> Time differences in hours\n#> [1] -14.5 -14.5 -14.5" + }, + { + "objectID": "datetimes.html#summary", + "href": "datetimes.html#summary", + "title": "17  Dates and times", + "section": "\n17.6 Summary", + "text": "17.6 Summary\nThis chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why — date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a daylight saving time boundary or involves a leap year, the functions need to be able to handle it.\nThe next chapter gives a round-up of missing values. You’ve seen them in a few places and have no doubt encountered them in your own analysis, and it’s now time to provide a grab bag of useful techniques for dealing with them." + }, + { + "objectID": "datetimes.html#footnotes", + "href": "datetimes.html#footnotes", + "title": "17  Dates and times", + "section": "", + "text": "A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there are 97 leap years.↩︎\nhttps://xkcd.com/1179/↩︎\nYou might wonder what UTC stands for. It’s a compromise between the English “Coordinated Universal Time” and French “Temps Universel Coordonné”.↩︎\nNo prizes for guessing which country came up with the longitude system.↩︎" + }, + { + "objectID": "missing-values.html#introduction", + "href": "missing-values.html#introduction", + "title": "18  Missing values", + "section": "\n18.1 Introduction", + "text": "18.1 Introduction\nYou’ve already learned the basics of missing values earlier in the book. You first saw them in Capítulo 1, where they resulted in a warning when making a plot, as well as in Seção 3.5.2, where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Seção 12.2.2. Now we’ll come back to them in more depth, so you can learn more of the details.\nWe’ll start by discussing some general tools for working with missing values recorded as NAs. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.\n\n18.1.1 Prerequisites\nThe functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.\n\nlibrary(tidyverse)" + }, + { + "objectID": "missing-values.html#explicit-missing-values", + "href": "missing-values.html#explicit-missing-values", + "title": "18  Missing values", + "section": "\n18.2 Explicit missing values", + "text": "18.2 Explicit missing values\nTo begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an NA.\n\n18.2.1 Last observation carried forward\nA common use for missing values is as a data entry convenience. 
When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):\n\ntreatment <- tribble(\n ~person, ~treatment, ~response,\n \"Derrick Whitmore\", 1, 7,\n NA, 2, 10,\n NA, 3, NA,\n \"Katherine Burke\", 1, 4\n)\n\nYou can fill in these missing values with tidyr::fill(). It works like select(), taking a set of columns:\n\ntreatment |>\n fill(everything())\n#> # A tibble: 4 × 3\n#> person treatment response\n#> <chr> <dbl> <dbl>\n#> 1 Derrick Whitmore 1 7\n#> 2 Derrick Whitmore 2 10\n#> 3 Derrick Whitmore 3 10\n#> 4 Katherine Burke 1 4\n\nThis treatment is sometimes called “last observation carried forward”, or locf for short. You can use the .direction argument to fill in missing values that have been generated in more exotic ways.\n\n18.2.2 Fixed values\nSometimes missing values represent some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them:\n\nx <- c(1, 4, 5, 7, NA)\ncoalesce(x, 0)\n#> [1] 1 4 5 7 0\n\nSometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.\nIf possible, handle this when reading in the data, for example, by using the na argument to readr::read_csv(), e.g., read_csv(path, na = \"99\"). If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if():\n\nx <- c(1, 4, 5, 7, -99)\nna_if(x, -99)\n#> [1] 1 4 5 7 NA\n\n\n18.2.3 NaN\nBefore we continue, there’s one special type of missing value that you’ll encounter from time to time: a NaN (pronounced “nan”), or not a number. It’s not that important to know about because it generally behaves just like NA:\n\nx <- c(NA, NaN)\nx * 10\n#> [1] NA NaN\nx == 1\n#> [1] NA NA\nis.na(x)\n#> [1] TRUE TRUE\n\nIn the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).\nYou’ll generally encounter a NaN when you perform a mathematical operation that has an indeterminate result:\n\n0 / 0 \n#> [1] NaN\n0 * Inf\n#> [1] NaN\nInf - Inf\n#> [1] NaN\nsqrt(-1)\n#> Warning in sqrt(-1): NaNs produced\n#> [1] NaN" + }, + { + "objectID": "missing-values.html#sec-missing-implicit", + "href": "missing-values.html#sec-missing-implicit", + "title": "18  Missing values", + "section": "\n18.3 Implicit missing values", + "text": "18.3 Implicit missing values\nSo far we’ve talked about missing values that are explicitly missing, i.e. you can see an NA in your data. But missing values can also be implicitly missing, if an entire row of data is simply absent from the data. 
Let’s illustrate the difference with a simple dataset that records the price of some stock each quarter:\n\nstocks <- tibble(\n year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),\n qtr = c( 1, 2, 3, 4, 2, 3, 4),\n price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)\n)\n\nThis dataset has two missing observations:\n\nThe price in the fourth quarter of 2020 is explicitly missing, because its value is NA.\nThe price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.\n\nOne way to think about the difference is with this Zen-like koan:\n\nAn explicit missing value is the presence of an absence.\nAn implicit missing value is the absence of a presence.\n\nSometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.\n\n18.3.1 Pivoting\nYou’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot stocks to put the quarter in the columns, both missing values become explicit:\n\nstocks |>\n pivot_wider(\n names_from = qtr, \n values_from = price\n )\n#> # A tibble: 2 × 5\n#> year `1` `2` `3` `4`\n#> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 2020 1.88 0.59 0.35 NA \n#> 2 2021 NA 0.92 0.17 2.66\n\nBy default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting values_drop_na = TRUE. See the examples in Seção 5.2 for more details.\n\n18.3.2 Complete\ntidyr::complete() allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of year and qtr should exist in the stocks data:\n\nstocks |>\n complete(year, qtr)\n#> # A tibble: 8 × 3\n#> year qtr price\n#> <dbl> <dbl> <dbl>\n#> 1 2020 1 1.88\n#> 2 2020 2 0.59\n#> 3 2020 3 0.35\n#> 4 2020 4 NA \n#> 5 2021 1 NA \n#> 6 2021 2 0.92\n#> # ℹ 2 more rows\n\nTypically, you’ll call complete() with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the stocks dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for year:\n\nstocks |>\n complete(year = 2019:2021, qtr)\n#> # A tibble: 12 × 3\n#> year qtr price\n#> <dbl> <dbl> <dbl>\n#> 1 2019 1 NA \n#> 2 2019 2 NA \n#> 3 2019 3 NA \n#> 4 2019 4 NA \n#> 5 2020 1 1.88\n#> 6 2020 2 0.59\n#> # ℹ 6 more rows\n\nIf the range of a variable is correct, but not all values are present, you could use full_seq(x, 1) to generate all values from min(x) to max(x) spaced out by 1.\nIn some cases, the complete set of observations can’t be generated by a simple combination of variables. 
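For instance (a minimal sketch with hypothetical data, not an example from the book), business-day data should only contain weekdays, so the expected rows are a filtered sequence rather than a crossing of existing variables:\n\nsales <- tibble(\n date = ymd(c(\"2024-01-01\", \"2024-01-03\", \"2024-01-05\")),\n amount = c(100, 250, 80)\n)\n\n# All the rows that should exist: every weekday in the observed range\nexpected <- tibble(date = seq(min(sales$date), max(sales$date), by = \"day\")) |> \n filter(!wday(date) %in% c(1, 7)) # drop Sundays (1) and Saturdays (7)\n\nexpected |> \n full_join(sales, join_by(date))\n#> # A tibble: 5 × 2\n#> date amount\n#> <date> <dbl>\n#> 1 2024-01-01 100\n#> 2 2024-01-02 NA\n#> 3 2024-01-03 250\n#> 4 2024-01-04 NA\n#> 5 2024-01-05 80\n\ncomplete() alone couldn’t know to skip weekends; the hand-built frame of expected rows carries that rule.\n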
In that case, you can do manually what complete() does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with dplyr::full_join().\n\n18.3.3 Joins\nThis brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in Capítulo 19, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.\ndplyr::anti_join(x, y) is a particularly useful tool here because it selects only the rows in x that don’t have a match in y. For example, we can use two anti_join()s to reveal that we’re missing information for four airports and 722 planes mentioned in flights:\n\nlibrary(nycflights13)\n\nflights |> \n distinct(faa = dest) |> \n anti_join(airports)\n#> Joining with `by = join_by(faa)`\n#> # A tibble: 4 × 1\n#> faa \n#> <chr>\n#> 1 BQN \n#> 2 SJU \n#> 3 STT \n#> 4 PSE\n\nflights |> \n distinct(tailnum) |> \n anti_join(planes)\n#> Joining with `by = join_by(tailnum)`\n#> # A tibble: 722 × 1\n#> tailnum\n#> <chr> \n#> 1 N3ALAA \n#> 2 N3DUAA \n#> 3 N542MQ \n#> 4 N730MQ \n#> 5 N9EAMQ \n#> 6 N532UA \n#> # ℹ 716 more rows\n\n\n18.3.4 Exercises\n\nCan you find any relationship between the carrier and the rows that appear to be missing from planes?" + }, + { + "objectID": "missing-values.html#factors-and-empty-groups", + "href": "missing-values.html#factors-and-empty-groups", + "title": "18  Missing values", + "section": "\n18.4 Factors and empty groups", + "text": "18.4 Factors and empty groups\nA final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:\n\nhealth <- tibble(\n name = c(\"Ikaia\", \"Oletta\", \"Leriah\", \"Dashay\", \"Tresaun\"),\n smoker = factor(c(\"no\", \"no\", \"no\", \"no\", \"no\"), levels = c(\"yes\", \"no\")),\n age = c(34, 88, 75, 47, 56),\n)\n\nAnd we want to count the number of smokers with dplyr::count():\n\nhealth |> count(smoker)\n#> # A tibble: 1 × 2\n#> smoker n\n#> <fct> <int>\n#> 1 no 5\n\nThis dataset only contains non-smokers, but we know that smokers exist; the group of non-smokers is empty. We can request count() to keep all the groups, even those not seen in the data by using .drop = FALSE:\n\nhealth |> count(smoker, .drop = FALSE)\n#> # A tibble: 2 × 2\n#> smoker n\n#> <fct> <int>\n#> 1 yes 0\n#> 2 no 5\n\nThe same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying drop = FALSE to the appropriate discrete axis:\n\nggplot(health, aes(x = smoker)) +\n geom_bar() +\n scale_x_discrete()\n\nggplot(health, aes(x = smoker)) +\n geom_bar() +\n scale_x_discrete(drop = FALSE)\n\n\n\n\n\n\n\n\n\n\n\nThe same problem comes up more generally with dplyr::group_by(). 
And again you can use .drop = FALSE to preserve all factor levels:\n\nhealth |> \n group_by(smoker, .drop = FALSE) |> \n summarize(\n n = n(),\n mean_age = mean(age),\n min_age = min(age),\n max_age = max(age),\n sd_age = sd(age)\n )\n#> # A tibble: 2 × 6\n#> smoker n mean_age min_age max_age sd_age\n#> <fct> <int> <dbl> <dbl> <dbl> <dbl>\n#> 1 yes 0 NaN Inf -Inf NA \n#> 2 no 5 60 34 88 21.6\n\nWe get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.\n\n# A vector containing two missing values\nx1 <- c(NA, NA)\nlength(x1)\n#> [1] 2\n\n# A vector containing nothing\nx2 <- numeric()\nlength(x2)\n#> [1] 0\n\nAll summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see mean(age) returning NaN because mean(age) = sum(age)/length(age) which here is 0/0. max() and min() return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you’ll get the minimum or maximum of the new data1.\nSometimes a simpler approach is to perform the summary and then make the implicit missings explicit with complete().\n\nhealth |> \n group_by(smoker) |> \n summarize(\n n = n(),\n mean_age = mean(age),\n min_age = min(age),\n max_age = max(age),\n sd_age = sd(age)\n ) |> \n complete(smoker)\n#> # A tibble: 2 × 6\n#> smoker n mean_age min_age max_age sd_age\n#> <fct> <int> <dbl> <dbl> <dbl> <dbl>\n#> 1 yes NA NA NA NA NA \n#> 2 no 5 60 34 88 21.6\n\nThe main drawback of this approach is that you get an NA for the count, even though you know that it should be zero." + }, + { + "objectID": "missing-values.html#summary", + "href": "missing-values.html#summary", + "title": "18  Missing values", + "section": "\n18.5 Summary", + "text": "18.5 Summary\nMissing values are weird! Sometimes they’re recorded as an explicit NA but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit can become explicit and vice versa.\nIn the next chapter, we tackle the final chapter in this part of the book: joins. This is a bit of a change from the chapters so far because we’re going to discuss tools that work with data frames as a whole, not something that you put inside a data frame." + }, + { + "objectID": "missing-values.html#footnotes", + "href": "missing-values.html#footnotes", + "title": "18  Missing values", + "section": "", + "text": "In other words, min(c(x, y)) is always equal to min(min(x), min(y)).↩︎" + }, + { + "objectID": "joins.html#introduction", + "href": "joins.html#introduction", + "title": "19  Joins", + "section": "\n19.1 Introduction", + "text": "19.1 Introduction\nIt’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:\n\nMutating joins, which add new variables to one data frame from matching observations in another.\nFiltering joins, which filter observations from one data frame based on whether or not they match an observation in another.\n\nWe’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. 
We cement the theory with an examination of the keys in the datasets from the nycflights13 package, then use that knowledge to start joining data frames together. Next we’ll discuss how joins work, focusing on their action on the rows. We’ll finish up with a discussion of non-equi joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.\n\n19.1.1 Prerequisites\nIn this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.\n\nlibrary(tidyverse)\nlibrary(nycflights13)" + }, + { + "objectID": "joins.html#keys", + "href": "joins.html#keys", + "title": "19  Joins", + "section": "\n19.2 Keys", + "text": "19.2 Keys\nTo understand joins, you need to first understand how two tables can be connected through a pair of keys, within each table. In this section, you’ll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.\n\n19.2.1 Primary and foreign keys\nEvery join involves a pair of keys: a primary key and a foreign key. A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key. For example, in nycflights13:\n\n\nairlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making carrier the primary key.\n\nairlines\n#> # A tibble: 16 × 2\n#> carrier name \n#> <chr> <chr> \n#> 1 9E Endeavor Air Inc. \n#> 2 AA American Airlines Inc. \n#> 3 AS Alaska Airlines Inc. \n#> 4 B6 JetBlue Airways \n#> 5 DL Delta Air Lines Inc. \n#> 6 EV ExpressJet Airlines Inc.\n#> # ℹ 10 more rows\n\n\n\nairports records data about each airport. You can identify each airport by its three letter airport code, making faa the primary key.\n\nairports\n#> # A tibble: 1,458 × 8\n#> faa name lat lon alt tz dst \n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>\n#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A \n#> 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A \n#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A \n#> 4 06N Randall Airport 41.4 -74.4 523 -5 A \n#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A \n#> 6 0A9 Elizabethton Municipal Airpo… 36.4 -82.2 1593 -5 A \n#> # ℹ 1,452 more rows\n#> # ℹ 1 more variable: tzone <chr>\n\n\n\nplanes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.\n\nplanes\n#> # A tibble: 3,322 × 9\n#> tailnum year type manufacturer model engines\n#> <chr> <int> <chr> <chr> <chr> <int>\n#> 1 N10156 2004 Fixed wing multi… EMBRAER EMB-145XR 2\n#> 2 N102UW 1998 Fixed wing multi… AIRBUS INDUSTR… A320-214 2\n#> 3 N103US 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2\n#> 4 N104UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2\n#> 5 N10575 2002 Fixed wing multi… EMBRAER EMB-145LR 2\n#> 6 N105UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2\n#> # ℹ 3,316 more rows\n#> # ℹ 3 more variables: seats <int>, speed <int>, engine <chr>\n\n\n\nweather records data about the weather at the origin airports. 
You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.\n\nweather\n#> # A tibble: 26,115 × 15\n#> origin year month day hour temp dewp humid wind_dir\n#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>\n#> 1 EWR 2013 1 1 1 39.0 26.1 59.4 270\n#> 2 EWR 2013 1 1 2 39.0 27.0 61.6 250\n#> 3 EWR 2013 1 1 3 39.0 28.0 64.4 240\n#> 4 EWR 2013 1 1 4 39.9 28.0 62.2 250\n#> 5 EWR 2013 1 1 5 39.0 28.0 64.4 260\n#> 6 EWR 2013 1 1 6 37.9 28.0 67.2 240\n#> # ℹ 26,109 more rows\n#> # ℹ 6 more variables: wind_speed <dbl>, wind_gust <dbl>, …\n\n\n\nA foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:\n\n\nflights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.\n\nflights$carrier is a foreign key that corresponds to the primary key airlines$carrier.\n\nflights$origin is a foreign key that corresponds to the primary key airports$faa.\n\nflights$dest is a foreign key that corresponds to the primary key airports$faa.\n\nflights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.\n\nThese relationships are summarized visually in Figura 19.1.\n\n\n\n\nFigura 19.1: Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.\n\n\n\nYou’ll notice a nice feature in the design of these keys: the primary and foreign keys almost always have the same names, which, as you’ll see shortly, will make your joining life much easier. It’s also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. There’s only one exception: year means year of departure in flights and year of manufacture in planes. This will become important when we start actually joining tables together.\n\n19.2.2 Checking primary keys\nNow that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one. This reveals that planes and weather both look good:\n\nplanes |> \n count(tailnum) |> \n filter(n > 1)\n#> # A tibble: 0 × 2\n#> # ℹ 2 variables: tailnum <chr>, n <int>\n\nweather |> \n count(time_hour, origin) |> \n filter(n > 1)\n#> # A tibble: 0 × 3\n#> # ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>\n\nYou should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation!\n\nplanes |> \n filter(is.na(tailnum))\n#> # A tibble: 0 × 9\n#> # ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,\n#> # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>\n\nweather |> \n filter(is.na(time_hour) | is.na(origin))\n#> # A tibble: 0 × 15\n#> # ℹ 15 variables: origin <chr>, year <int>, month <int>, day <int>,\n#> # hour <int>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, …\n
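An equivalent spot check (a quick sketch of ours, not code from the book) is to compare the number of distinct key values against the number of rows; if they match, every key value appears exactly once:\n\nplanes |> \n summarize(all_distinct = n_distinct(tailnum) == n())\n#> # A tibble: 1 × 1\n#> all_distinct\n#> <lgl>\n#> 1 TRUE\n\n\n19.2.3 Surrogate keys\nSo far we haven’t talked about the primary key for flights. 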
It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.\nAfter a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:\n\nflights |> \n count(time_hour, carrier, flight) |> \n filter(n > 1)\n#> # A tibble: 0 × 4\n#> # ℹ 4 variables: time_hour <dttm>, carrier <chr>, flight <int>, n <int>\n\nDoes the absence of duplicates automatically make time_hour-carrier-flight a primary key? It’s certainly a good start, but it doesn’t guarantee it. For example, are altitude and latitude a good primary key for airports?\n\nairports |>\n count(alt, lat) |> \n filter(n > 1)\n#> # A tibble: 1 × 3\n#> alt lat n\n#> <dbl> <dbl> <int>\n#> 1 13 40.6 2\n\nIdentifying an airport by its altitude and latitude is clearly a bad idea, and in general it’s not possible to know from the data alone whether or not a combination of variables makes a good primary key. But for flights, the combination of time_hour, carrier, and flight seems reasonable because it would be really confusing for an airline and its customers if there were multiple flights with the same flight number in the air at the same time.\nThat said, we might be better off introducing a simple numeric surrogate key using the row number:\n\nflights2 <- flights |> \n mutate(id = row_number(), .before = 1)\nflights2\n#> # A tibble: 336,776 × 20\n#> id year month day dep_time sched_dep_time dep_delay arr_time\n#> <int> <int> <int> <int> <int> <int> <dbl> <int>\n#> 1 1 2013 1 1 517 515 2 830\n#> 2 2 2013 1 1 533 529 4 850\n#> 3 3 2013 1 1 542 540 2 923\n#> 4 4 2013 1 1 544 545 -1 1004\n#> 5 5 2013 1 1 554 600 -6 812\n#> 6 6 2013 1 1 554 558 -4 740\n#> # ℹ 336,770 more rows\n#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …\n\nSurrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say to look at UA430, which departed at 9am on 2013-01-03.\n\n19.2.4 Exercises\n\nWe forgot to draw the relationship between weather and airports in Figura 19.1. What is the relationship and how should it appear in the diagram?\nweather only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to flights?\nThe year, month, day, hour, and origin variables almost form a compound key for weather, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?\nWe know that some days of the year are special and fewer people than usual fly on them (e.g., Christmas Eve and Christmas Day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?\nDraw a diagram illustrating the connections between the Batting, People, and Salaries data frames in the Lahman package. Draw another diagram that shows the relationship between People, Managers, and AwardsManagers. How would you characterize the relationship between the Batting, Pitching, and Fielding data frames?"
+ }, + { + "objectID": "joins.html#sec-mutating-joins", + "href": "joins.html#sec-mutating-joins", + "title": "19  Joins", + "section": "\n19.3 Basic joins", + "text": "19.3 Basic joins\nNow that you understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions: left_join(), inner_join(), right_join(), full_join(), semi_join(), and anti_join(). They all have the same interface: they take a pair of data frames (x and y) and return a data frame. The order of the rows and columns in the output is primarily determined by x.\nIn this section, you’ll learn how to use one mutating join, left_join(), and two filtering joins, semi_join() and anti_join(). In the next section, you’ll learn exactly how these functions work, and about the remaining inner_join(), right_join() and full_join().\n\n19.3.1 Mutating joins\nA mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables1:\n\nflights2 <- flights |> \n select(year, time_hour, origin, dest, tailnum, carrier)\nflights2\n#> # A tibble: 336,776 × 6\n#> year time_hour origin dest tailnum carrier\n#> <int> <dttm> <chr> <chr> <chr> <chr> \n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA \n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA \n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA \n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 \n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL \n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA \n#> # ℹ 336,770 more rows\n\nThere are four types of mutating join, but there’s one that you’ll use almost all of the time: left_join(). It’s special because the output will always have the same rows as x, the data frame you’re joining to2. The primary use of left_join() is to add in additional metadata. 
For example, we can use left_join() to add the full airline name to the flights2 data:\n\nflights2 |>\n left_join(airlines)\n#> Joining with `by = join_by(carrier)`\n#> # A tibble: 336,776 × 7\n#> year time_hour origin dest tailnum carrier name \n#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> \n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines In…\n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines In…\n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines I…\n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways \n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.\n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines In…\n#> # ℹ 336,770 more rows\n\nOr we could find out the temperature and wind speed when each plane departed:\n\nflights2 |> \n left_join(weather |> select(origin, time_hour, temp, wind_speed))\n#> Joining with `by = join_by(time_hour, origin)`\n#> # A tibble: 336,776 × 8\n#> year time_hour origin dest tailnum carrier temp wind_speed\n#> <int> <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl>\n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 39.0 12.7\n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 39.9 15.0\n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 39.0 15.0\n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 39.0 15.0\n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 39.9 16.1\n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 39.0 12.7\n#> # ℹ 336,770 more rows\n\nOr what size of plane was flying:\n\nflights2 |> \n left_join(planes |> select(tailnum, type, engines, seats))\n#> Joining with `by = join_by(tailnum)`\n#> # A tibble: 336,776 × 9\n#> year time_hour origin dest tailnum carrier type \n#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> \n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wing multi en…\n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wing multi en…\n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wing multi en…\n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wing multi en…\n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wing multi en…\n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wing multi en…\n#> # ℹ 336,770 more rows\n#> # ℹ 2 more variables: engines <int>, seats <int>\n\nWhen left_join() fails to find a match for a row in x, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number N3ALAA so the type, engines, and seats will be missing:\n\nflights2 |> \n filter(tailnum == \"N3ALAA\") |> \n left_join(planes |> select(tailnum, type, engines, seats))\n#> Joining with `by = join_by(tailnum)`\n#> # A tibble: 63 × 9\n#> year time_hour origin dest tailnum carrier type engines seats\n#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <int> <int>\n#> 1 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA <NA> NA NA\n#> 2 2013 2013-01-02 18:00:00 LGA ORD N3ALAA AA <NA> NA NA\n#> 3 2013 2013-01-03 06:00:00 LGA ORD N3ALAA AA <NA> NA NA\n#> 4 2013 2013-01-07 19:00:00 LGA ORD N3ALAA AA <NA> NA NA\n#> 5 2013 2013-01-08 17:00:00 JFK ORD N3ALAA AA <NA> NA NA\n#> 6 2013 2013-01-16 06:00:00 LGA ORD N3ALAA AA <NA> NA NA\n#> # ℹ 57 more rows\n\nWe’ll come back to this problem a few times in the rest of the chapter.\n\n19.3.2 Specifying join keys\nBy default, left_join() will use all variables that appear in both data frames as the join key, the so called natural join. This is a useful heuristic, but it doesn’t always work. 
For example, what happens if we try to join flights2 with the complete planes dataset?\n\nflights2 |> \n left_join(planes)\n#> Joining with `by = join_by(year, tailnum)`\n#> # A tibble: 336,776 × 13\n#> year time_hour origin dest tailnum carrier type manufacturer\n#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> \n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA> \n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA> \n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA> \n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA> \n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA> \n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA> \n#> # ℹ 336,770 more rows\n#> # ℹ 5 more variables: model <chr>, engines <int>, seats <int>, …\n\nWe get a lot of missing matches because our join is trying to use tailnum and year as a compound key. Both flights and planes have a year column but they mean different things: flights$year is the year the flight occurred and planes$year is the year the plane was built. We only want to join on tailnum so we need to provide an explicit specification with join_by():\n\nflights2 |> \n left_join(planes, join_by(tailnum))\n#> # A tibble: 336,776 × 14\n#> year.x time_hour origin dest tailnum carrier year.y\n#> <int> <dttm> <chr> <chr> <chr> <chr> <int>\n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999\n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998\n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990\n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012\n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991\n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012\n#> # ℹ 336,770 more rows\n#> # ℹ 7 more variables: type <chr>, manufacturer <chr>, model <chr>, …\n\nNote that the year variables are disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x or y argument. You can override the default suffixes with the suffix argument.\njoin_by(tailnum) is short for join_by(tailnum == tailnum). It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an equi join. You’ll learn about non-equi joins in Seção 19.5.\nSecondly, it’s how you specify different join keys in each table. 
For example, there are two ways to join the flights2 and airports tables: either by dest or by origin:\n\nflights2 |> \n left_join(airports, join_by(dest == faa))\n#> # A tibble: 336,776 × 13\n#> year time_hour origin dest tailnum carrier name \n#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> \n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George Bush Interco…\n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George Bush Interco…\n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami Intl \n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> \n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfield Jackson …\n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago Ohare Intl \n#> # ℹ 336,770 more rows\n#> # ℹ 6 more variables: lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, …\n\nflights2 |> \n left_join(airports, join_by(origin == faa))\n#> # A tibble: 336,776 × 13\n#> year time_hour origin dest tailnum carrier name \n#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> \n#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark Liberty Intl\n#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guardia \n#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F Kennedy Intl\n#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F Kennedy Intl\n#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guardia \n#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark Liberty Intl\n#> # ℹ 336,770 more rows\n#> # ℹ 6 more variables: lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, …\n\nIn older code you might see a different way of specifying the join keys, using a character vector:\n\n\nby = \"x\" corresponds to join_by(x).\n\nby = c(\"a\" = \"x\") corresponds to join_by(a == x).\n\nNow that it exists, we prefer join_by() since it provides a clearer and more flexible specification.\ninner_join(), right_join(), and full_join() have the same interface as left_join(). The difference is which rows they keep: the left join keeps all the rows in x, the right join keeps all rows in y, the full join keeps all rows in either x or y, and the inner join only keeps rows that occur in both x and y. We’ll come back to these in more detail later.\n\n19.3.3 Filtering joins\nAs you might guess, the primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins. Semi-joins keep all rows in x that have a match in y. For example, we could use a semi-join to filter the airports dataset to show just the origin airports:\n\nairports |> \n semi_join(flights2, join_by(faa == origin))\n#> # A tibble: 3 × 8\n#> faa name lat lon alt tz dst tzone \n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> \n#> 1 EWR Newark Liberty Intl 40.7 -74.2 18 -5 A America/New_York\n#> 2 JFK John F Kennedy Intl 40.6 -73.8 13 -5 A America/New_York\n#> 3 LGA La Guardia 40.8 -73.9 22 -5 A America/New_York\n\nOr just the destinations:\n\nairports |> \n semi_join(flights2, join_by(faa == dest))\n#> # A tibble: 101 × 8\n#> faa name lat lon alt tz dst tzone \n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> \n#> 1 ABQ Albuquerque Internati… 35.0 -107. 5355 -7 A America/Denver \n#> 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A America/New_Yo…\n#> 3 ALB Albany Intl 42.7 -73.8 285 -5 A America/New_Yo…\n#> 4 ANC Ted Stevens Anchorage… 61.2 -150. 152 -9 A America/Anchor…\n#> 5 ATL Hartsfield Jackson At… 33.6 -84.4 1026 -5 A America/New_Yo…\n#> 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A America/Chicago\n#> # ℹ 95 more rows\n\nAnti-joins are the opposite: they return all rows in x that don’t have a match in y. 
They’re useful for finding missing values that are implicit in the data, the topic of Seção 18.3. Implicitly missing values don’t show up as NAs but instead only exist as an absence. For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:\n\nflights2 |> \n anti_join(airports, join_by(dest == faa)) |> \n distinct(dest)\n#> # A tibble: 4 × 1\n#> dest \n#> <chr>\n#> 1 BQN \n#> 2 SJU \n#> 3 STT \n#> 4 PSE\n\nOr we can find which tailnums are missing from planes:\n\nflights2 |>\n anti_join(planes, join_by(tailnum)) |> \n distinct(tailnum)\n#> # A tibble: 722 × 1\n#> tailnum\n#> <chr> \n#> 1 N3ALAA \n#> 2 N3DUAA \n#> 3 N542MQ \n#> 4 N730MQ \n#> 5 N9EAMQ \n#> 6 N532UA \n#> # ℹ 716 more rows\n\n\n19.3.4 Exercises\n\nFind the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?\n\nImagine you’ve found the top 10 most popular destinations using this code:\n\ntop_dest <- flights2 |>\n count(dest, sort = TRUE) |>\n head(10)\n\nHow can you find all flights to those destinations?\n\nDoes every departing flight have corresponding weather data for that hour?\nWhat do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)\nAdd a column to planes that lists every carrier that has flown that plane. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned in previous chapters.\nAdd the latitude and the longitude of the origin and destination airport to flights. Is it easier to rename the columns before or after the join?\n\nCompute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:\n\nairports |>\n semi_join(flights, join_by(faa == dest)) |>\n ggplot(aes(x = lon, y = lat)) +\n borders(\"state\") +\n geom_point() +\n coord_quickmap()\n\nYou might want to use the size or color of the points to display the average delay for each airport.\n\nWhat happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather." + }, + { + "objectID": "joins.html#how-do-joins-work", + "href": "joins.html#how-do-joins-work", + "title": "19  Joins", + "section": "\n19.4 How do joins work?", + "text": "19.4 How do joins work?\nNow that you’ve used joins a few times it’s time to learn more about how they work, focusing on how each row in x matches rows in y. We’ll begin by introducing a visual representation of joins, using the simple tibbles defined below and shown in Figura 19.2. In these examples we’ll use a single key called key and a single value column (val_x and val_y), but the ideas all generalize to multiple keys and multiple values.\n\nx <- tribble(\n ~key, ~val_x,\n 1, \"x1\",\n 2, \"x2\",\n 3, \"x3\"\n)\ny <- tribble(\n ~key, ~val_y,\n 1, \"y1\",\n 2, \"y2\",\n 4, \"y3\"\n)\n\n\n\n\n\nFigura 19.2: Graphical representation of two simple tables. The colored key columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.\n\n\n\nFigura 19.3 introduces the foundation for our visual representation. It shows all potential matches between x and y as the intersection between lines drawn from each row of x and each row of y. 
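As a concrete companion to these pictures (the code is our quick check, not from the book), the equi inner join of x and y that the next figures illustrate looks like this:\n\nx |> inner_join(y, join_by(key))\n#> # A tibble: 2 × 3\n#> key val_x val_y\n#> <dbl> <chr> <chr>\n#> 1 1 x1 y1 \n#> 2 2 x2 y2\n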
The rows and columns in the output are primarily determined by x, so the x table is horizontal and lines up with the output.\n\n\n\n\nFigura 19.3: To understand how joins work, it’s useful to think of every possible match. Here we show that with a grid of connecting lines.\n\n\n\nTo describe a specific type of join, we indicate matches with dots. The matches determine the rows in the output, a new data frame that contains the key, the x values, and the y values. For example, Figura 19.4 shows an inner join, where rows are retained if and only if the keys are equal.\n\n\n\n\nFigura 19.4: An inner join matches each row in x to the row in y that has the same value of key. Each match becomes a row in the output.\n\n\n\nWe can apply the same principles to explain the outer joins, which keep observations that appear in at least one of the data frames. These joins work by adding an additional “virtual” observation to each data frame. This observation has a key that matches if no other key matches, and values filled with NA. There are three types of outer joins:\n\n\nA left join keeps all observations in x, Figura 19.5. Every row of x is preserved in the output because it can fall back to matching a row of NAs in y.\n\n\n\n\nFigura 19.5: A visual representation of the left join where every row in x appears in the output.\n\n\n\n\n\nA right join keeps all observations in y, Figura 19.6. Every row of y is preserved in the output because it can fall back to matching a row of NAs in x. The output still matches x as much as possible; any extra rows from y are added to the end.\n\n\n\n\nFigura 19.6: A visual representation of the right join where every row of y appears in the output.\n\n\n\n\n\nA full join keeps all observations that appear in x or y, Figura 19.7. Every row of x and y is included in the output because both x and y have a fallback row of NAs. Again, the output starts with all rows from x, followed by the remaining unmatched y rows.\n\n\n\n\nFigura 19.7: A visual representation of the full join where every row in x and y appears in the output.\n\n\n\n\n\nAnother way to show how the types of outer join differ is with a Venn diagram, as in Figura 19.8. However, this is not a great representation because while it might jog your memory about which rows are preserved, it fails to illustrate what’s happening with the columns.\n\n\n\n\nFigura 19.8: Venn diagrams showing the difference between inner, left, right, and full joins.\n\n\n\nThe joins shown here are the so-called equi joins, where rows match if the keys are equal. Equi joins are the most common type of join, so we’ll typically omit the equi prefix, and just say “inner join” rather than “equi inner join”. We’ll come back to non-equi joins in Seção 19.5.\n\n19.4.1 Row matching\nSo far we’ve explored what happens if a row in x matches zero or one row in y. What happens if it matches more than one row? To understand what’s going on, let’s first narrow our focus to the inner_join() and then draw a picture, Figura 19.9.\n\n\n\n\nFigura 19.9: The three ways a row in x can match. x1 matches one row in y, x2 matches two rows in y, x3 matches zero rows in y. 
Note that while there are three rows in x and three rows in the output, there isn’t a direct correspondence between the rows.\n\n\n\nThere are three possible outcomes for a row in x:\n\nIf it doesn’t match anything, it’s dropped.\nIf it matches 1 row in y, it’s preserved.\nIf it matches more than 1 row in y, it’s duplicated once for each match.\n\nIn principle, this means that there’s no guaranteed correspondence between the rows in the output and the rows in x, but in practice, this rarely causes problems. There is, however, one particularly dangerous case which can cause a combinatorial explosion of rows. Imagine joining the following two tables:\n\ndf1 <- tibble(key = c(1, 2, 2), val_x = c(\"x1\", \"x2\", \"x3\"))\ndf2 <- tibble(key = c(1, 2, 2), val_y = c(\"y1\", \"y2\", \"y3\"))\n\nWhile the first row in df1 only matches one row in df2, the second and third rows both match two rows. This is sometimes called a many-to-many join, and will cause dplyr to emit a warning:\n\ndf1 |> \n inner_join(df2, join_by(key))\n#> Warning in inner_join(df1, df2, join_by(key)): Detected an unexpected many-to-many relationship between `x` and `y`.\n#> ℹ Row 2 of `x` matches multiple rows in `y`.\n#> ℹ Row 2 of `y` matches multiple rows in `x`.\n#> ℹ If a many-to-many relationship is expected, set `relationship =\n#> \"many-to-many\"` to silence this warning.\n#> # A tibble: 5 × 3\n#> key val_x val_y\n#> <dbl> <chr> <chr>\n#> 1 1 x1 y1 \n#> 2 2 x2 y2 \n#> 3 2 x2 y3 \n#> 4 2 x3 y2 \n#> 5 2 x3 y3\n\nIf you are doing this deliberately, you can set relationship = \"many-to-many\", as the warning suggests.\n\n19.4.2 Filtering joins\nThe number of matches also determines the behavior of the filtering joins. The semi-join keeps rows in x that have one or more matches in y, as in Figura 19.10. The anti-join keeps rows in x that match zero rows in y, as in Figura 19.11. In both cases, only the existence of a match is important; it doesn’t matter how many times it matches. This means that filtering joins never duplicate rows like mutating joins do.\n\n\n\n\nFigura 19.10: In a semi-join it only matters that there is a match; otherwise values in y don’t affect the output.\n\n\n\n\n\n\n\nFigura 19.11: An anti-join is the inverse of a semi-join, dropping rows from x that have a match in y." + }, + { + "objectID": "joins.html#sec-non-equi-joins", + "href": "joins.html#sec-non-equi-joins", + "title": "19  Joins", + "section": "\n19.5 Non-equi joins", + "text": "19.5 Non-equi joins\nSo far you’ve only seen equi joins, joins where the rows match if the x key equals the y key. Now we’re going to relax that restriction and discuss other ways of determining if a pair of rows match.\nBut before we can do that, we need to revisit a simplification we made above. In equi joins the x and y keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with keep = TRUE, leading to the code below and the re-drawn inner_join() in Figura 19.12.\n\nx |> inner_join(y, join_by(key == key), keep = TRUE)\n#> # A tibble: 2 × 4\n#> key.x val_x key.y val_y\n#> <dbl> <chr> <dbl> <chr>\n#> 1 1 x1 1 y1 \n#> 2 2 x2 2 y2\n\n\n\n\n\nFigura 19.12: An inner join showing both x and y keys in the output.\n\n\n\nWhen we move away from equi joins we’ll always show the keys, because the key values will often be different. For example, instead of matching only when the x$key and y$key are equal, we could match whenever the x$key is greater than or equal to the y$key, leading to Figura 19.13. 
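For example (our sketch, using the x and y tibbles from above), dplyr expresses this with an inequality inside join_by():\n\nx |> inner_join(y, join_by(key >= key))\n#> # A tibble: 5 × 4\n#> key.x val_x key.y val_y\n#> <dbl> <chr> <dbl> <chr>\n#> 1 1 x1 1 y1 \n#> 2 2 x2 1 y1 \n#> 3 2 x2 2 y2 \n#> 4 3 x3 1 y1 \n#> 5 3 x3 2 y2\n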
dplyr’s join functions understand the distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.\n\n\n\n\nFigura 19.13: A non-equi join where the x key must be greater than or equal to the y key. Many rows generate multiple matches.\n\n\n\nNon-equi join isn’t a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi join:\n\n\nCross joins match every pair of rows.\n\nInequality joins use <, <=, >, and >= instead of ==.\n\nRolling joins are similar to inequality joins but only find the closest match.\n\nOverlap joins are a special type of inequality join designed to work with ranges.\n\nEach of these is described in more detail in the following sections.\n\n19.5.1 Cross joins\nA cross join matches everything, as in Figura 19.14, generating the Cartesian product of rows. This means the output will have nrow(x) * nrow(y) rows.\n\n\n\n\nFigura 19.14: A cross join matches each row in x with every row in y.\n\n\n\nCross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining df to itself, this is sometimes called a self-join. Cross joins use a different join function because there’s no distinction between inner/left/right/full when you’re matching every row.\n\ndf <- tibble(name = c(\"John\", \"Simon\", \"Tracy\", \"Max\"))\ndf |> cross_join(df)\n#> # A tibble: 16 × 2\n#> name.x name.y\n#> <chr> <chr> \n#> 1 John John \n#> 2 John Simon \n#> 3 John Tracy \n#> 4 John Max \n#> 5 Simon John \n#> 6 Simon Simon \n#> # ℹ 10 more rows\n\n\n19.5.2 Inequality joins\nInequality joins use <, <=, >=, or > to restrict the set of possible matches, as in Figura 19.13 and Figura 19.15.\n\n\n\n\nFigura 19.15: An inequality join where x is joined to y on rows where the key of x is less than the key of y. This makes a triangular shape in the top-left corner.\n\n\n\nInequality joins are extremely general, so general that it’s hard to come up with meaningful specific use cases. One small useful technique is to use them to restrict the cross join so that instead of generating all permutations, we generate all combinations:\n\ndf <- tibble(id = 1:4, name = c(\"John\", \"Simon\", \"Tracy\", \"Max\"))\n\ndf |> inner_join(df, join_by(id < id))\n#> # A tibble: 6 × 4\n#> id.x name.x id.y name.y\n#> <int> <chr> <int> <chr> \n#> 1 1 John 2 Simon \n#> 2 1 John 3 Tracy \n#> 3 1 John 4 Max \n#> 4 2 Simon 3 Tracy \n#> 5 2 Simon 4 Max \n#> 6 3 Tracy 4 Max\n\n\n19.5.3 Rolling joins\nRolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, you get just the closest row, as in Figura 19.16. You can turn any inequality join into a rolling join by adding closest(). For example, join_by(closest(x <= y)) matches the smallest y that’s greater than or equal to x, and join_by(closest(x > y)) matches the biggest y that’s less than x.\n\n\n\n\nFigura 19.16: A rolling join is similar to a greater-than-or-equal inequality join but only matches the first value.\n\n\n\nRolling joins are particularly useful when you have two tables of dates that don’t perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.\n
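A tiny sketch of ours makes the “closest” behavior concrete:\n\ndf1 <- tibble(x = c(1, 5))\ndf2 <- tibble(y = c(0, 2, 4))\n\ndf1 |> left_join(df2, join_by(closest(x >= y)))\n#> # A tibble: 2 × 2\n#> x y\n#> <dbl> <dbl>\n#> 1 1 0\n#> 2 5 4\n\nEach x gets the largest y at or below it: 1 pairs with 0, and 5 pairs with 4.\nFor a fuller example, imagine that you’re in charge of the party planning commission for your office. 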
Your company is rather cheap so instead of having individual parties, you only have a party once each quarter. The rules for determining when a party will be held are a little complex: parties are always on a Monday, you skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that has to be pushed back a week. That leads to the following party days:\n\nparties <- tibble(\n q = 1:4,\n party = ymd(c(\"2022-01-10\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\"))\n)\n\nNow imagine that you have a table of employee birthdays:\n\nset.seed(123)\nemployees <- tibble(\n name = sample(babynames::babynames$name, 100),\n birthday = ymd(\"2022-01-01\") + (sample(365, 100, replace = TRUE) - 1)\n)\nemployees\n#> # A tibble: 100 × 2\n#> name birthday \n#> <chr> <date> \n#> 1 Kemba 2022-01-22\n#> 2 Orean 2022-06-26\n#> 3 Kirstyn 2022-02-11\n#> 4 Amparo 2022-11-11\n#> 5 Belen 2022-03-25\n#> 6 Rayshaun 2022-01-11\n#> # ℹ 94 more rows\n\nAnd for each employee we want to find the last party date that comes before (or on) their birthday. We can express that with a rolling join:\n\nemployees |> \n left_join(parties, join_by(closest(birthday >= party)))\n#> # A tibble: 100 × 4\n#> name birthday q party \n#> <chr> <date> <int> <date> \n#> 1 Kemba 2022-01-22 1 2022-01-10\n#> 2 Orean 2022-06-26 2 2022-04-04\n#> 3 Kirstyn 2022-02-11 1 2022-01-10\n#> 4 Amparo 2022-11-11 4 2022-10-03\n#> 5 Belen 2022-03-25 1 2022-01-10\n#> 6 Rayshaun 2022-01-11 1 2022-01-10\n#> # ℹ 94 more rows\n\nThere is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:\n\nemployees |> \n anti_join(parties, join_by(closest(birthday >= party)))\n#> # A tibble: 2 × 2\n#> name birthday \n#> <chr> <date> \n#> 1 Maks 2022-01-07\n#> 2 Nalani 2022-01-04\n\nTo resolve that issue we’ll need to tackle the problem a different way, with overlap joins.\n\n19.5.4 Overlap joins\nOverlap joins provide three helpers that use inequality joins to make it easier to work with intervals:\n\n\nbetween(x, y_lower, y_upper) is short for x >= y_lower, x <= y_upper.\n\nwithin(x_lower, x_upper, y_lower, y_upper) is short for x_lower >= y_lower, x_upper <= y_upper.\n\noverlaps(x_lower, x_upper, y_lower, y_upper) is short for x_lower <= y_upper, x_upper >= y_lower.\n\nLet’s continue the birthday example to see how you might use them. There’s one problem with the strategy we used above: there’s no party preceding the birthdays Jan 1-9. So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:\n\nparties <- tibble(\n q = 1:4,\n party = ymd(c(\"2022-01-10\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n start = ymd(c(\"2022-01-01\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n end = ymd(c(\"2022-04-03\", \"2022-07-11\", \"2022-10-02\", \"2022-12-31\"))\n)\nparties\n#> # A tibble: 4 × 4\n#> q party start end \n#> <int> <date> <date> <date> \n#> 1 1 2022-01-10 2022-01-01 2022-04-03\n#> 2 2 2022-04-04 2022-04-04 2022-07-11\n#> 3 3 2022-07-11 2022-07-11 2022-10-02\n#> 4 4 2022-10-03 2022-10-03 2022-12-31\n\nHadley is hopelessly bad at data entry so he also wanted to check that the party periods don’t overlap. 
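(Per the helper definitions above, join_by(overlaps(start, end, start, end)) expands to the pair of inequalities start.x <= end.y and end.x >= start.y: two ranges overlap exactly when each starts no later than the other ends.)\n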
One way to do this is by using a self-join to check if any start-end interval overlaps with another:\n\nparties |> \n inner_join(parties, join_by(overlaps(start, end, start, end), q < q)) |> \n select(start.x, end.x, start.y, end.y)\n#> # A tibble: 1 × 4\n#> start.x end.x start.y end.y \n#> <date> <date> <date> <date> \n#> 1 2022-04-04 2022-07-11 2022-07-11 2022-10-02\n\nOops, there is an overlap, so let’s fix that problem and continue:\n\nparties <- tibble(\n q = 1:4,\n party = ymd(c(\"2022-01-10\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n start = ymd(c(\"2022-01-01\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n end = ymd(c(\"2022-04-03\", \"2022-07-10\", \"2022-10-02\", \"2022-12-31\"))\n)\n\nNow we can match each employee to their party. This is a good place to use unmatched = \"error\" because we want to quickly find out if any employees didn’t get assigned a party.\n\nemployees |> \n inner_join(parties, join_by(between(birthday, start, end)), unmatched = \"error\")\n#> # A tibble: 100 × 6\n#> name birthday q party start end \n#> <chr> <date> <int> <date> <date> <date> \n#> 1 Kemba 2022-01-22 1 2022-01-10 2022-01-01 2022-04-03\n#> 2 Orean 2022-06-26 2 2022-04-04 2022-04-04 2022-07-10\n#> 3 Kirstyn 2022-02-11 1 2022-01-10 2022-01-01 2022-04-03\n#> 4 Amparo 2022-11-11 4 2022-10-03 2022-10-03 2022-12-31\n#> 5 Belen 2022-03-25 1 2022-01-10 2022-01-01 2022-04-03\n#> 6 Rayshaun 2022-01-11 1 2022-01-10 2022-01-01 2022-04-03\n#> # ℹ 94 more rows\n\n\n19.5.5 Exercises\n\n\nCan you explain what’s happening with the keys in this equi join? Why are they different?\n\nx |> full_join(y, join_by(key == key))\n#> # A tibble: 4 × 3\n#> key val_x val_y\n#> <dbl> <chr> <chr>\n#> 1 1 x1 y1 \n#> 2 2 x2 y2 \n#> 3 3 x3 <NA> \n#> 4 4 <NA> y3\n\nx |> full_join(y, join_by(key == key), keep = TRUE)\n#> # A tibble: 4 × 4\n#> key.x val_x key.y val_y\n#> <dbl> <chr> <dbl> <chr>\n#> 1 1 x1 1 y1 \n#> 2 2 x2 2 y2 \n#> 3 3 x3 NA <NA> \n#> 4 NA <NA> 4 y3\n\n\nWhen finding if any party period overlapped with another party period, we used q < q in the join_by(). Why? What happens if you remove this inequality?"
+ }, + { + "objectID": "joins.html#footnotes", + "href": "joins.html#footnotes", + "title": "19  Joins", + "section": "", + "text": "Remember that in RStudio you can also use View() to avoid this problem.↩︎\nThat’s not 100% true, but you’ll get a warning whenever it isn’t.↩︎" }, { "objectID": "import.html", "href": "import.html", "title": "Import", "section": "", - "text": "In this part of the book, you’ll learn how to import a wider range of data into R, as well as how to get it into a form useful for analysis. Sometimes this is just a matter of calling a function from the appropriate data import package. But in more complex cases it might require both tidying and transformation in order to get to the tidy rectangle that you’d prefer to work with.\n\n\n\n\nFigura 1: Data import is the beginning of the data science process; without data you can’t do data science!\n\n\n\nIn this part of the book you’ll learn how to access data stored in the following ways:\n\nIn ?sec-import-spreadsheets, you’ll learn how to import data from Excel spreadsheets and Google Sheets.\nIn ?sec-import-databases, you’ll learn about getting data out of a database and into R (and you’ll also learn a little about how to get data out of R and into a database).\nIn ?sec-arrow, you’ll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it’s stored in the parquet format.\nIn ?sec-rectangling, you’ll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.\nIn ?sec-scraping, you’ll learn web “scraping”, the art and science of extracting data from web pages.\n\nThere are two important tidyverse packages that we don’t discuss here: haven and xml2. If you’re working with data from SPSS, Stata, and SAS files, check out the haven package, https://haven.tidyverse.org. If you’re working with XML data, check out the xml2 package, https://xml2.r-lib.org. Otherwise, you’ll need to do some research to figure which package you’ll need to use; google is your friend here 😃." + "text": "In this part of the book, you’ll learn how to import a wider range of data into R, as well as how to get it into a form useful for analysis. Sometimes this is just a matter of calling a function from the appropriate data import package. But in more complex cases it might require both tidying and transformation in order to get to the tidy rectangle that you’d prefer to work with.\n\n\n\n\nFigura 1: Data import is the beginning of the data science process; without data you can’t do data science!\n\n\n\nIn this part of the book you’ll learn how to access data stored in the following ways:\n\nIn Capítulo 20, you’ll learn how to import data from Excel spreadsheets and Google Sheets.\nIn Capítulo 21, you’ll learn about getting data out of a database and into R (and you’ll also learn a little about how to get data out of R and into a database).\nIn Capítulo 22, you’ll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it’s stored in the parquet format.\nIn Capítulo 23, you’ll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.\nIn Capítulo 24, you’ll learn web “scraping”, the art and science of extracting data from web pages.\n\nThere are two important tidyverse packages that we don’t discuss here: haven and xml2. If you’re working with data from SPSS, Stata, and SAS files, check out the haven package, https://haven.tidyverse.org. 
If you’re working with XML data, check out the xml2 package, https://xml2.r-lib.org. Otherwise, you’ll need to do some research to figure out which package you’ll need to use; Google is your friend here 😃." }, { "objectID": "spreadsheets.html#introduction", "href": "spreadsheets.html#introduction", "title": "20  Spreadsheets", "section": "\n20.1 Introduction", "text": "20.1 Introduction\nIn Capítulo 7 you learned about importing data from plain text files like .csv and .tsv. Now it’s time to learn how to get data out of a spreadsheet, either an Excel spreadsheet or a Google Sheet. This will build on much of what you’ve learned in Capítulo 7, but we will also discuss additional considerations and complexities when working with data from spreadsheets.\nIf you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper “Data Organization in Spreadsheets” by Karl Broman and Kara Woo: https://doi.org/10.1080/00031305.2017.1375989. The best practices presented in this paper will save you much headache when you import data from a spreadsheet into R to analyze and visualize." }, { "objectID": "spreadsheets.html#excel", "href": "spreadsheets.html#excel", "title": "20  Spreadsheets", "section": "\n20.2 Excel", "text": "20.2 Excel\nMicrosoft Excel is a widely used spreadsheet software program where data are organized in worksheets inside of spreadsheet files.\n\n20.2.1 Prerequisites\nIn this section, you’ll learn how to load data from Excel spreadsheets in R with the readxl package. This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package. Later, we’ll also use the writexl package, which allows us to create Excel spreadsheets.\n\nlibrary(readxl)\nlibrary(tidyverse)\nlibrary(writexl)\n\n\n20.2.2 Getting started\nMost of readxl’s functions allow you to load Excel spreadsheets into R:\n\n\nread_xls() reads Excel files with xls format.\n\nread_xlsx() reads Excel files with xlsx format.\n\nread_excel() can read files with both xls and xlsx format. It guesses the file type based on the input.\n\nThese functions all have similar syntax just like other functions we have previously introduced for reading other types of files, e.g., read_csv(), read_table(), etc. For the rest of the chapter we will focus on using read_excel().\n\n20.2.3 Reading Excel spreadsheets\nFigura 20.1 shows what the spreadsheet we’re going to read into R looks like in Excel. This spreadsheet can be downloaded as an Excel file from https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/.\n\n\n\n\nFigura 20.1: Spreadsheet called students.xlsx in Excel.\n\n\n\nThe first argument to read_excel() is the path to the file to read.\n\nstudents <- read_excel(\"data/students.xlsx\")\n\nread_excel() will read the file in as a tibble.\n\nstudents\n#> # A tibble: 6 × 5\n#> `Student ID` `Full Name` favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne N/A Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nWe have six students in the data and five variables on each student. However, there are a few things we might want to address in this dataset:\n\n\nThe column names are all over the place. 
You can provide column names that follow a consistent format; we recommend snake_case using the col_names argument.\n\nread_excel(\n \"data/students.xlsx\",\n col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\")\n)\n#> # A tibble: 7 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <chr> <chr> <chr> <chr> <chr>\n#> 1 Student ID Full Name favourite.food mealPlan AGE \n#> 2 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 3 2 Barclay Lynn French fries Lunch only 5 \n#> 4 3 Jayendra Lyne N/A Breakfast and lunch 7 \n#> 5 4 Leon Rossini Anchovies Lunch only <NA> \n#> 6 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 7 6 Güvenç Attila Ice cream Lunch only 6\n\nUnfortunately, this didn’t quite do the trick. We now have the variable names we want, but what was previously the header row now shows up as the first observation in the data. You can explicitly skip that row using the skip argument.\n\nread_excel(\n \"data/students.xlsx\",\n col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n skip = 1\n)\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne N/A Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\n\n\nIn the favourite_food column, one of the observations is N/A, which stands for “not available” but it’s currently not recognized as an NA (note the contrast between this N/A and the age of the fourth student in the list). You can specify which character strings should be recognized as NAs with the na argument. By default, only \"\" (empty string, or, in the case of reading from a spreadsheet, an empty cell or a cell with the formula =NA()) is recognized as an NA.\n\nread_excel(\n \"data/students.xlsx\",\n col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n skip = 1,\n na = c(\"\", \"N/A\")\n)\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\n\n\nOne other remaining issue is that age is read in as a character variable, but it really should be numeric. Just like with read_csv() and friends for reading data from flat files, you can supply a col_types argument to read_excel() and specify the column types for the variables you read in. The syntax is a bit different, though. 
Your options are \"skip\", \"guess\", \"logical\", \"numeric\", \"date\", \"text\" or \"list\".\n\nread_excel(\n \"data/students.xlsx\",\n col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n skip = 1,\n na = c(\"\", \"N/A\"),\n col_types = c(\"numeric\", \"text\", \"text\", \"text\", \"numeric\")\n)\n#> Warning: Expecting numeric in E6 / R6C5: got 'five'\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <chr> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch NA\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nHowever, this didn’t quite produce the desired result either. By specifying that age should be numeric, we have turned the one cell with the non-numeric entry (which had the value five) into an NA. In this case, we should read age in as \"text\" and then make the change once the data is loaded in R.\n\nstudents <- read_excel(\n \"data/students.xlsx\",\n col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n skip = 1,\n na = c(\"\", \"N/A\"),\n col_types = c(\"numeric\", \"text\", \"text\", \"text\", \"text\")\n)\n\nstudents <- students |>\n mutate(\n age = if_else(age == \"five\", \"5\", age),\n age = parse_number(age)\n )\n\nstudents\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <chr> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\n\n\nIt took us multiple steps and trial-and-error to load the data in exactly the format we want, and this is not unexpected. Data science is an iterative process, and the process of iteration can be even more tedious when reading data in from spreadsheets compared to other plain text, rectangular data files because humans tend to input data into spreadsheets and use them not just for data storage but also for sharing and communication.\nThere is no way to know exactly what the data will look like until you load it and take a look at it. Well, there is one way, actually. You can open the file in Excel and take a peek. If you’re going to do so, we recommend making a copy of the Excel file to open and browse interactively while leaving the original data file untouched and reading into R from the untouched file. This will ensure you don’t accidentally overwrite anything in the spreadsheet while inspecting it. You should also not be afraid of doing what we did here: load the data, take a peek, make adjustments to your code, load it again, and repeat until you’re happy with the result.\n\n20.2.4 Reading worksheets\nAn important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets, called worksheets. Figura 20.2 shows an Excel spreadsheet with multiple worksheets. The data come from the palmerpenguins package, and you can download this spreadsheet as an Excel file from https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY/. 
Each worksheet contains information on penguins from a different island where data were collected.\n\n\n\n\nFigura 20.2: Spreadsheet called penguins.xlsx in Excel containing three worksheets.\n\n\n\nYou can read a single worksheet from a spreadsheet with the sheet argument in read_excel(). The default, which we’ve been relying on up until now, is the first sheet.\n\nread_excel(\"data/penguins.xlsx\", sheet = \"Torgersen Island\")\n#> # A tibble: 52 × 8\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> <chr> <chr> <chr> <chr> <chr> \n#> 1 Adelie Torgersen 39.1 18.7 181 \n#> 2 Adelie Torgersen 39.5 17.399999999999999 186 \n#> 3 Adelie Torgersen 40.299999999999997 18 195 \n#> 4 Adelie Torgersen NA NA NA \n#> 5 Adelie Torgersen 36.700000000000003 19.3 193 \n#> 6 Adelie Torgersen 39.299999999999997 20.6 190 \n#> # ℹ 46 more rows\n#> # ℹ 3 more variables: body_mass_g <chr>, sex <chr>, year <dbl>\n\nSome variables that appear to contain numerical data are read in as characters due to the character string \"NA\" not being recognized as a true NA.\n\npenguins_torgersen <- read_excel(\"data/penguins.xlsx\", sheet = \"Torgersen Island\", na = \"NA\")\n\npenguins_torgersen\n#> # A tibble: 52 × 8\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen NA NA NA\n#> 5 Adelie Torgersen 36.7 19.3 193\n#> 6 Adelie Torgersen 39.3 20.6 190\n#> # ℹ 46 more rows\n#> # ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>\n\nAlternatively, you can use excel_sheets() to get information on all worksheets in an Excel spreadsheet, and then read the one(s) you’re interested in.\n\nexcel_sheets(\"data/penguins.xlsx\")\n#> [1] \"Torgersen Island\" \"Biscoe Island\" \"Dream Island\"\n\nOnce you know the names of the worksheets, you can read them in individually with read_excel().\n\npenguins_biscoe <- read_excel(\"data/penguins.xlsx\", sheet = \"Biscoe Island\", na = \"NA\")\npenguins_dream <- read_excel(\"data/penguins.xlsx\", sheet = \"Dream Island\", na = \"NA\")\n\nIn this case the full penguins dataset is spread across three worksheets in the spreadsheet. Each worksheet has the same number of columns but different numbers of rows.\n\ndim(penguins_torgersen)\n#> [1] 52 8\ndim(penguins_biscoe)\n#> [1] 168 8\ndim(penguins_dream)\n#> [1] 124 8\n\nWe can put them together with bind_rows().\n\npenguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)\npenguins\n#> # A tibble: 344 × 8\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen NA NA NA\n#> 5 Adelie Torgersen 36.7 19.3 193\n#> 6 Adelie Torgersen 39.3 20.6 190\n#> # ℹ 338 more rows\n#> # ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>\n\nIn Capítulo 26 we’ll talk about ways of doing this sort of task without repetitive code.\n\n20.2.5 Reading part of a sheet\nSince many use Excel spreadsheets for presentation as well as for data storage, it’s quite common to find cell entries in a spreadsheet that are not part of the data you want to read into R. 
Figura 20.3 shows such a spreadsheet: in the middle of the sheet is what looks like a data frame but there is extraneous text in cells above and below the data.\n\n\n\n\nFigura 20.3: Spreadsheet called deaths.xlsx in Excel.\n\n\n\nThis spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the readxl_example() function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in read_excel() as usual.\n\ndeaths_path <- readxl_example(\"deaths.xlsx\")\ndeaths <- read_excel(deaths_path)\n#> New names:\n#> • `` -> `...2`\n#> • `` -> `...3`\n#> • `` -> `...4`\n#> • `` -> `...5`\n#> • `` -> `...6`\ndeaths\n#> # A tibble: 18 × 6\n#> `Lots of people` ...2 ...3 ...4 ...5 ...6 \n#> <chr> <chr> <chr> <chr> <chr> <chr> \n#> 1 simply cannot resi… <NA> <NA> <NA> <NA> some notes \n#> 2 at the top <NA> of their spreadsh…\n#> 3 or merging <NA> <NA> <NA> cells \n#> 4 Name Profession Age Has kids Date of birth Date of death \n#> 5 David Bowie musician 69 TRUE 17175 42379 \n#> 6 Carrie Fisher actor 60 TRUE 20749 42731 \n#> # ℹ 12 more rows\n\nThe top three rows and the bottom four rows are not part of the data frame. It’s possible to eliminate these extraneous rows using the skip and n_max arguments, but we recommend using cell ranges. In Excel, the top left cell is A1. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1, C1, etc. And as you move down a column, the number in the cell label increases, i.e. A2, A3, etc.\nHere the data we want to read in starts in cell A5 and ends in cell F15. In spreadsheet notation, this is A5:F15, which we supply to the range argument:\n\nread_excel(deaths_path, range = \"A5:F15\")\n#> # A tibble: 10 × 6\n#> Name Profession Age `Has kids` `Date of birth` \n#> <chr> <chr> <dbl> <lgl> <dttm> \n#> 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00\n#> 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00\n#> 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00\n#> 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00\n#> 5 Prince musician 57 TRUE 1958-06-07 00:00:00\n#> 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00\n#> # ℹ 4 more rows\n#> # ℹ 1 more variable: `Date of death` <dttm>\n\n\n20.2.6 Data types\nIn CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.\nThe underlying data in Excel spreadsheets is more complex. A cell can be one of four things:\n\nA boolean, like TRUE, FALSE, or NA.\nA number, like “10” or “10.5”.\nA datetime, which can also include time like “11/1/21” or “11/1/21 3:00 PM”.\nA text string, like “ten”.\n\nWhen working with spreadsheet data, it’s important to keep in mind that the underlying data can be very different than what you see in the cell. For example, Excel has no notion of an integer. All numbers are stored as floating points, but you can choose to display the data with a customizable number of decimal points. Similarly, dates are actually stored as numbers, specifically the number of days since an origin (December 30, 1899 in modern Excel for Windows). You can customize how you display the date by applying formatting in Excel. Confusingly, it’s also possible to have something that looks like a number but is actually a string (e.g., type '10 into a cell in Excel).\nThese differences between how the underlying data are stored vs. how they’re displayed can cause surprises when the data are loaded into R. 
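(A quick sketch of that day-number encoding, using a value from the deaths data above; the origin below is the one used by modern Excel for Windows:\n\nas.Date(17175, origin = \"1899-12-30\")\n#> [1] \"1947-01-08\"\n\nwhich is indeed David Bowie’s date of birth.)\n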
By default, readxl will guess the data type in a given column. A recommended workflow is to let readxl guess the column types, confirm that you’re happy with the guessed column types, and if not, go back and re-import specifying col_types as shown in Seção 20.2.3.\nAnother challenge is when you have a column in your Excel spreadsheet that has a mix of these types, e.g., some cells are numeric, others text, others dates. When importing the data into R, readxl has to make some decisions. In these cases you can set the type for this column to \"list\", which will load the column as a list of length 1 vectors, where the type of each element of the vector is guessed.\n\n\n\n\n\n\nSometimes data is stored in more exotic ways, like the color of the cell background, or whether or not the text is bold. In such cases, you might find the tidyxl package useful. See https://nacnudus.github.io/spreadsheet-munging-strategies/ for more on strategies for working with non-tabular data from Excel.\n\n\n\n\n20.2.7 Writing to Excel\nLet’s create a small data frame that we can then write out. Note that item is a factor and quantity is a double.\n\nbake_sale <- tibble(\n item = factor(c(\"brownie\", \"cupcake\", \"cookie\")),\n quantity = c(10, 5, 8)\n)\n\nbake_sale\n#> # A tibble: 3 × 2\n#> item quantity\n#> <fct> <dbl>\n#> 1 brownie 10\n#> 2 cupcake 5\n#> 3 cookie 8\n\nYou can write data back to disk as an Excel file using the write_xlsx() function from the writexl package:\n\nwrite_xlsx(bake_sale, path = \"data/bake-sale.xlsx\")\n\nFigura 20.4 shows what the data looks like in Excel. Note that column names are included and bolded. These can be turned off by setting the col_names and format_headers arguments to FALSE.\n\n\n\n\nFigura 20.4: Spreadsheet called bake_sale.xlsx in Excel.\n\n\n\nJust like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see Seção 7.5.\n\nread_excel(\"data/bake-sale.xlsx\")\n#> # A tibble: 3 × 2\n#> item quantity\n#> <chr> <dbl>\n#> 1 brownie 10\n#> 2 cupcake 5\n#> 3 cookie 8\n\n\n20.2.8 Formatted output\nThe writexl package is a lightweight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the openxlsx package. We won’t go into the details of using this package here, but we recommend reading https://ycphs.github.io/openxlsx/articles/Formatting.html for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.\nNote that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.\n\n20.2.9 Exercises\n\n\nIn an Excel file, create the following dataset and save it as survey.xlsx. 
Alternatively, you can download it as an Excel file from here.\n\n\n\n\n\nThen, read it into R, with survey_id as a character variable and n_pets as a numerical variable.\n\n#> # A tibble: 6 × 2\n#> survey_id n_pets\n#> <chr> <dbl>\n#> 1 1 0\n#> 2 2 1\n#> 3 3 NA\n#> 4 4 2\n#> 5 5 2\n#> 6 6 NA\n\n\n\nIn another Excel file, create the following dataset and save it as roster.xlsx. Alternatively, you can download it as an Excel file from here.\n\n\n\n\n\nThen, read it into R. The resulting data frame should be called roster and should look like the following.\n\n#> # A tibble: 12 × 3\n#> group subgroup id\n#> <dbl> <chr> <dbl>\n#> 1 1 A 1\n#> 2 1 A 2\n#> 3 1 A 3\n#> 4 1 B 4\n#> 5 1 B 5\n#> 6 1 B 6\n#> 7 1 B 7\n#> 8 2 A 8\n#> 9 2 A 9\n#> 10 2 B 10\n#> 11 2 B 11\n#> 12 2 B 12\n\n\n\nIn a new Excel file, create the following dataset and save it as sales.xlsx. Alternatively, you can download it as an Excel file from here.\n\n\n\n\n\na. Read sales.xlsx in and save as sales. The data frame should look like the following, with id and n as column names and with 9 rows.\n\n#> # A tibble: 9 × 2\n#> id n \n#> <chr> <chr>\n#> 1 Brand 1 n \n#> 2 1234 8 \n#> 3 8721 2 \n#> 4 1822 3 \n#> 5 Brand 2 n \n#> 6 3333 1 \n#> 7 2156 3 \n#> 8 3987 6 \n#> 9 3216 5\n\nb. Modify sales further to get it into the following tidy format with three columns (brand, id, and n) and 7 rows of data. Note that id and n are numeric, brand is a character variable.\n\n#> # A tibble: 7 × 3\n#> brand id n\n#> <chr> <dbl> <dbl>\n#> 1 Brand 1 1234 8\n#> 2 Brand 1 8721 2\n#> 3 Brand 1 1822 3\n#> 4 Brand 2 3333 1\n#> 5 Brand 2 2156 3\n#> 6 Brand 2 3987 6\n#> 7 Brand 2 3216 5\n\n\nRecreate the bake_sale data frame, and write it out to an Excel file using the write.xlsx() function from the openxlsx package.\nIn Capítulo 7 you learned about the janitor::clean_names() function to turn column names into snake case. Read the students.xlsx file that we introduced earlier in this section and use this function to “clean” the column names.\nWhat happens if you try to read in a file with .xlsx extension with read_xls()?" }, { "objectID": "spreadsheets.html#google-sheets", "href": "spreadsheets.html#google-sheets", "title": "20  Spreadsheets", "section": "\n20.3 Google Sheets", "text": "20.3 Google Sheets\nGoogle Sheets is another widely used spreadsheet program. It’s free and web-based. Just like with Excel, in Google Sheets data are organized in worksheets (also called sheets) inside of spreadsheet files.\n\n20.3.1 Prerequisites\nThis section will also focus on spreadsheets, but this time you’ll be loading data from a Google Sheet with the googlesheets4 package. This package is non-core tidyverse as well, so you need to load it explicitly.\n\nlibrary(googlesheets4)\nlibrary(tidyverse)\n\nA quick note about the name of the package: googlesheets4 uses v4 of the Sheets API to provide an R interface to Google Sheets, hence the name.\n\n20.3.2 Getting started\nThe main function of the googlesheets4 package is read_sheet(), which reads a Google Sheet from a URL or a file id. This function also goes by the name range_read().\nYou can also create a brand new sheet with gs4_create() or write to an existing sheet with sheet_write() and friends.\nIn this section we’ll work with the same datasets as the ones in the Excel section to highlight similarities and differences between workflows for reading data from Excel and Google Sheets. 
The readxl and googlesheets4 packages are both designed to mimic the functionality of the readr package, which provides the read_csv() function you’ve seen in Capítulo 7. Therefore, many of the tasks can be accomplished by simply swapping out read_excel() for read_sheet(). However, you’ll also see that Excel and Google Sheets don’t behave in exactly the same way, so other tasks may require further updates to the function calls.\n\n20.3.3 Reading Google Sheets\nFigura 20.5 shows what the spreadsheet we’re going to read into R looks like in Google Sheets. This is the same dataset as in Figura 20.1, except it’s stored in a Google Sheet instead of Excel.\n\n\n\n\nFigura 20.5: Google Sheet called students in a browser window.\n\n\n\nThe first argument to read_sheet() is the URL of the file to read, and it returns a tibble: https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w. These URLs are not pleasant to work with, so you’ll often want to identify a sheet by its ID.\n\ngs4_deauth()\n\n\nstudents_sheet_id <- \"1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w\"\nstudents <- read_sheet(students_sheet_id)\n#> ✔ Reading from students.\n#> ✔ Range Sheet1.\nstudents\n#> # A tibble: 6 × 5\n#> `Student ID` `Full Name` favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <list>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only <dbl> \n#> 2 2 Barclay Lynn French fries Lunch only <dbl> \n#> 3 3 Jayendra Lyne N/A Breakfast and lunch <dbl> \n#> 4 4 Leon Rossini Anchovies Lunch only <NULL>\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch <chr> \n#> 6 6 Güvenç Attila Ice cream Lunch only <dbl>\n\nJust like we did with read_excel(), we can supply column names, NA strings, and column types to read_sheet().\n\nstudents <- read_sheet(\n students_sheet_id,\n col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n skip = 1,\n na = c(\"\", \"N/A\"),\n col_types = \"dcccc\"\n)\n#> ✔ Reading from students.\n#> ✔ Range 2:10000000.\n\nstudents\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nNote that we defined column types a bit differently here, using short codes. For example, “dcccc” stands for “double, character, character, character, character”.\nIt’s also possible to read individual sheets from Google Sheets. 
Let’s read the “Torgersen Island” sheet from the penguins Google Sheet:\n\npenguins_sheet_id <- \"1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY\"\nread_sheet(penguins_sheet_id, sheet = \"Torgersen Island\")\n#> ✔ Reading from penguins.\n#> ✔ Range ''Torgersen Island''.\n#> # A tibble: 52 × 8\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> <chr> <chr> <list> <list> <list> \n#> 1 Adelie Torgersen <dbl [1]> <dbl [1]> <dbl [1]> \n#> 2 Adelie Torgersen <dbl [1]> <dbl [1]> <dbl [1]> \n#> 3 Adelie Torgersen <dbl [1]> <dbl [1]> <dbl [1]> \n#> 4 Adelie Torgersen <chr [1]> <chr [1]> <chr [1]> \n#> 5 Adelie Torgersen <dbl [1]> <dbl [1]> <dbl [1]> \n#> 6 Adelie Torgersen <dbl [1]> <dbl [1]> <dbl [1]> \n#> # ℹ 46 more rows\n#> # ℹ 3 more variables: body_mass_g <list>, sex <chr>, year <dbl>\n\nYou can obtain a list of all sheets within a Google Sheet with sheet_names():\n\nsheet_names(penguins_sheet_id)\n#> [1] \"Torgersen Island\" \"Biscoe Island\" \"Dream Island\"\n\nFinally, just like with read_excel(), we can read in a portion of a Google Sheet by defining a range in read_sheet(). Note that we’re also using the gs4_example() function below to locate an example Google Sheet that comes with the googlesheets4 package.\n\ndeaths_url <- gs4_example(\"deaths\")\ndeaths <- read_sheet(deaths_url, range = \"A5:F15\")\n#> ✔ Reading from deaths.\n#> ✔ Range A5:F15.\ndeaths\n#> # A tibble: 10 × 6\n#> Name Profession Age `Has kids` `Date of birth` \n#> <chr> <chr> <dbl> <lgl> <dttm> \n#> 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00\n#> 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00\n#> 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00\n#> 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00\n#> 5 Prince musician 57 TRUE 1958-06-07 00:00:00\n#> 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00\n#> # ℹ 4 more rows\n#> # ℹ 1 more variable: `Date of death` <dttm>\n\n\n20.3.4 Writing to Google Sheets\nYou can write from R to Google Sheets with write_sheet(). The first argument is the data frame to write, and the second argument is the name (or other identifier) of the Google Sheet to write to:\n\nwrite_sheet(bake_sale, ss = \"bake-sale\")\n\nIf you’d like to write your data to a specific (work)sheet inside a Google Sheet, you can specify that with the sheet argument as well.\n\nwrite_sheet(bake_sale, ss = \"bake-sale\", sheet = \"Sales\")\n\n\n20.3.5 Authentication\nWhile you can read from a public Google Sheet without authenticating with your Google account and with gs4_deauth(), reading a private sheet or writing to a sheet requires authentication so that googlesheets4 can view and manage your Google Sheets.\nWhen you attempt to read in a sheet that requires authentication, googlesheets4 will direct you to a web browser with a prompt to sign in to your Google account and grant permission to operate on your behalf with Google Sheets. However, if you want to specify a specific Google account, authentication scope, etc. you can do so with gs4_auth(), e.g., gs4_auth(email = \"mine@example.com\"), which will force the use of a token associated with a specific email. For further authentication details, we recommend reading the documentation googlesheets4 auth vignette: https://googlesheets4.tidyverse.org/articles/auth.html.\n\n20.3.6 Exercises\n\nRead the students dataset from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the read_excel() and read_sheet() functions. Are the resulting data frames in R exactly the same? 
If not, how are they different?\nRead the Google Sheet titled survey from https://pos.it/r4ds-survey, with survey_id as a character variable and n_pets as a numerical variable.\n\nRead the Google Sheet titled roster from https://pos.it/r4ds-roster. The resulting data frame should be called roster and should look like the following.\n\n#> # A tibble: 12 × 3\n#> group subgroup id\n#> <dbl> <chr> <dbl>\n#> 1 1 A 1\n#> 2 1 A 2\n#> 3 1 A 3\n#> 4 1 B 4\n#> 5 1 B 5\n#> 6 1 B 6\n#> 7 1 B 7\n#> 8 2 A 8\n#> 9 2 A 9\n#> 10 2 B 10\n#> 11 2 B 11\n#> 12 2 B 12" }, { "objectID": "spreadsheets.html#summary", "href": "spreadsheets.html#summary", "title": "20  Spreadsheets", "section": "\n20.4 Summary", "text": "20.4 Summary\nMicrosoft Excel and Google Sheets are two of the most popular spreadsheet systems. Being able to interact with data stored in Excel and Google Sheets files directly from R is a superpower! In this chapter you learned how to read data into R from spreadsheets from Excel with read_excel() from the readxl package and from Google Sheets with read_sheet() from the googlesheets4 package. These functions work very similarly to each other and have similar arguments for specifying column names, NA strings, rows to skip on top of the file you’re reading in, etc. Additionally, both functions make it possible to read a single sheet from a spreadsheet as well.\nOn the other hand, writing to an Excel file requires a different package and function (writexl::write_xlsx()) while you can write to a Google Sheet with the googlesheets4 package, with write_sheet().\nIn the next chapter, you’ll learn about a different data source and how to read data from that source into R: databases." }, { "objectID": "databases.html#introduction", "href": "databases.html#introduction", "title": "21  Databases", "section": "\n21.1 Introduction", "text": "21.1 Introduction\nA huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.\nIn this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL query. SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.\n\n21.1.1 Prerequisites\nIn this chapter, we’ll introduce DBI and dbplyr. 
DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries and then executes them with DBI.\n\nlibrary(DBI)\nlibrary(dbplyr)\nlibrary(tidyverse)" }, { "objectID": "databases.html#database-basics", "href": "databases.html#database-basics", "title": "21  Databases", "section": "\n21.2 Database basics", "text": "21.2 Database basics\nAt the simplest level, you can think about a database as a collection of data frames, called tables in database terminology. Like a data frame, a database table is a collection of named columns, where every value in the column is the same type. There are three high level differences between data frames and database tables:\n\nDatabase tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).\nDatabase tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.\nMost classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of column-oriented databases that make analyzing the existing data much faster.\n\nDatabases are run by database management systems (DBMS’s for short), which come in three basic forms:\n\n\nClient-server DBMS’s run on a powerful central server, to which you connect from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS’s include PostgreSQL, MariaDB, SQL Server, and Oracle.\n\nCloud DBMS’s, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client-server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.\n\nIn-process DBMS’s, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user." }, { "objectID": "databases.html#connecting-to-a-database", "href": "databases.html#connecting-to-a-database", "title": "21  Databases", "section": "\n21.3 Connecting to a database", "text": "21.3 Connecting to a database\nTo connect to the database from R, you’ll use a pair of packages:\n\nYou’ll always use DBI (database interface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.\nYou’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for PostgreSQL and RMariaDB for MySQL.\n\nIf you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMS. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.\nConcretely, you create a database connection using DBI::dbConnect(). 
The first argument selects the DBMS, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:\n\ncon <- DBI::dbConnect(\n RMariaDB::MariaDB(), \n username = \"foo\"\n)\ncon <- DBI::dbConnect(\n RPostgres::Postgres(), \n hostname = \"databases.mycompany.com\", \n port = 1234\n)\n\nThe precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (database administrator). The initial setup will often take a little fiddling (and maybe some googling) to get it right, but you’ll generally only need to do it once.\n\n21.3.1 In this book\nSetting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.\nConnecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:\n\ncon <- DBI::dbConnect(duckdb::duckdb())\n\nduckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the dbdir argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (Capítulo 6), it’s reasonable to store it in the duckdb directory of the current project:\n\ncon <- DBI::dbConnect(duckdb::duckdb(), dbdir = \"duckdb\")\n\n\n21.3.2 Load some data\nSince this is a new database, we need to start by adding some data. Here we’ll add mpg and diamonds datasets from ggplot2 using DBI::dbWriteTable(). The simplest usage of dbWriteTable() needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.\n\ndbWriteTable(con, \"mpg\", ggplot2::mpg)\ndbWriteTable(con, \"diamonds\", ggplot2::diamonds)\n\nIf you’re using duckdb in a real project, we highly recommend learning about duckdb_read_csv() and duckdb_register_arrow(). These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R. 
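For example, a minimal sketch, assuming a local flights.csv file exists in your working directory (the file name is purely illustrative):\n\nduckdb::duckdb_read_csv(con, \"flights\", \"flights.csv\")\n\n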
We’ll also show off a useful technique for loading multiple files into a database in Seção 26.4.1.\n\n21.3.3 DBI basics\nYou can check that the data is loaded correctly by using a couple of other DBI functions: dbListTables() lists all tables in the database and dbReadTable() retrieves the contents of a table.\n\ndbListTables(con)\n#> [1] \"diamonds\" \"mpg\"\n\ncon |> \n dbReadTable(\"diamonds\") |> \n as_tibble()\n#> # A tibble: 53,940 × 10\n#> carat cut color clarity depth table price x y z\n#> <dbl> <fct> <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31\n#> 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n#> # ℹ 53,934 more rows\n\ndbReadTable() returns a data.frame so we use as_tibble() to convert it into a tibble so that it prints nicely.\nIf you already know SQL, you can use dbGetQuery() to get the results of running a query on the database:\n\nsql <- \"\n SELECT carat, cut, clarity, color, price \n FROM diamonds \n WHERE price > 15000\n\"\nas_tibble(dbGetQuery(con, sql))\n#> # A tibble: 1,655 × 5\n#> carat cut clarity color price\n#> <dbl> <fct> <fct> <fct> <int>\n#> 1 1.54 Premium VS2 E 15002\n#> 2 1.19 Ideal VVS1 F 15005\n#> 3 2.1 Premium SI1 I 15007\n#> 4 1.69 Ideal SI1 D 15011\n#> 5 1.5 Very Good VVS2 G 15013\n#> 6 1.73 Very Good VS1 G 15014\n#> # ℹ 1,649 more rows\n\nIf you’ve never seen SQL before, don’t worry! You’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where price is greater than 15,000." }, { "objectID": "databases.html#dbplyr-basics", "href": "databases.html#dbplyr-basics", "title": "21  Databases", "section": "\n21.4 dbplyr basics", "text": "21.4 dbplyr basics\nNow that we’ve connected to a database and loaded up some data, we can start to learn about dbplyr. dbplyr is a dplyr backend, which means that you keep writing dplyr code but the backend executes it differently. In this, dbplyr translates to SQL; other backends include dtplyr which translates to data.table, and multidplyr which executes your code on multiple cores.\nTo use dbplyr, you must first use tbl() to create an object that represents a database table:\n\ndiamonds_db <- tbl(con, \"diamonds\")\ndiamonds_db\n#> # Source: table<diamonds> [?? x 10]\n#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]\n#> carat cut color clarity depth table price x y z\n#> <dbl> <fct> <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31\n#> 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n#> # ℹ more rows\n\n\n\n\n\n\n\nThere are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organized. 
In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:\n\ndiamonds_db <- tbl(con, in_schema(\"sales\", \"diamonds\"))\ndiamonds_db <- tbl(con, in_catalog(\"north_america\", \"sales\", \"diamonds\"))\n\nOther times you might want to use your own SQL query as a starting point:\n\ndiamonds_db <- tbl(con, sql(\"SELECT * FROM diamonds\"))\n\n\n\n\nThis object is lazy; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:\n\nbig_diamonds_db <- diamonds_db |> \n filter(price > 15000) |> \n select(carat:clarity, price)\n\nbig_diamonds_db\n#> # Source: SQL [?? x 5]\n#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]\n#> carat cut color clarity price\n#> <dbl> <fct> <fct> <fct> <int>\n#> 1 1.54 Premium E VS2 15002\n#> 2 1.19 Ideal F VVS1 15005\n#> 3 2.1 Premium I SI1 15007\n#> 4 1.69 Ideal D SI1 15011\n#> 5 1.5 Very Good G VVS2 15013\n#> 6 1.73 Very Good G VS1 15014\n#> # ℹ more rows\n\nYou can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.\nYou can see the SQL code generated by the dplyr function show_query(). If you know dplyr, this is a great way to learn SQL! Write some dplyr code, get dbplyr to translate it to SQL, and then try to figure out how the two languages match up.\n\nbig_diamonds_db |>\n show_query()\n#> <SQL>\n#> SELECT carat, cut, color, clarity, price\n#> FROM diamonds\n#> WHERE (price > 15000.0)\n\nTo get all the data back into R, you call collect(). Behind the scenes, this generates the SQL, calls dbGetQuery() to get the data, then turns the result into a tibble:\n\nbig_diamonds <- big_diamonds_db |> \n collect()\nbig_diamonds\n#> # A tibble: 1,655 × 5\n#> carat cut color clarity price\n#> <dbl> <fct> <fct> <fct> <int>\n#> 1 1.54 Premium E VS2 15002\n#> 2 1.19 Ideal F VVS1 15005\n#> 3 2.1 Premium I SI1 15007\n#> 4 1.69 Ideal D SI1 15011\n#> 5 1.5 Very Good G VVS2 15013\n#> 6 1.73 Very Good G VS1 15014\n#> # ℹ 1,649 more rows\n\nTypically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll collect() the data to get an in-memory tibble, and continue your work with pure R code." + }, + { + "objectID": "databases.html#sql", + "href": "databases.html#sql", + "title": "21  Databases", + "section": "\n21.5 SQL", + "text": "21.5 SQL\nThe rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.\nWe’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: flights and planes. 
These datasets are easy to get into our learning database because dbplyr comes with a function that copies the tables from nycflights13 to our database:\n\ndbplyr::copy_nycflights13(con)\n#> Creating table: airlines\n#> Creating table: airports\n#> Creating table: flights\n#> Creating table: planes\n#> Creating table: weather\nflights <- tbl(con, \"flights\")\nplanes <- tbl(con, \"planes\")\n\n\n21.5.1 SQL basics\nThe top-level components of SQL are called statements. Common statements include CREATE for defining new tables, INSERT for adding data, and SELECT for retrieving data. We will focus on SELECT statements, also called queries, because they are almost exclusively what you’ll use as a data scientist.\nA query is made up of clauses. There are five important clauses: SELECT, FROM, WHERE, ORDER BY, and GROUP BY. Every query must have the SELECT and FROM clauses and the simplest query is SELECT * FROM table, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:\n\nflights |> show_query()\n#> <SQL>\n#> SELECT *\n#> FROM flights\nplanes |> show_query()\n#> <SQL>\n#> SELECT *\n#> FROM planes\n\nWHERE and ORDER BY control which rows are included and how they are ordered:\n\nflights |> \n filter(dest == \"IAH\") |> \n arrange(dep_delay) |>\n show_query()\n#> <SQL>\n#> SELECT flights.*\n#> FROM flights\n#> WHERE (dest = 'IAH')\n#> ORDER BY dep_delay\n\nGROUP BY converts the query to a summary, causing aggregation to happen:\n\nflights |> \n group_by(dest) |> \n summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |> \n show_query()\n#> <SQL>\n#> SELECT dest, AVG(dep_delay) AS dep_delay\n#> FROM flights\n#> GROUP BY dest\n\nThere are two important differences between dplyr verbs and SELECT clauses:\n\nIn SQL, case doesn’t matter: you can write select, SELECT, or even SeLeCt. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.\nIn SQL, order matters: you must always write the clauses in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first FROM, then WHERE, GROUP BY, SELECT, and ORDER BY.\n\nThe following sections explore each clause in more detail.\n\n\n\n\n\n\nNote that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMS’s, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. 
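For example, here is a minimal sketch of one verb translated for two simulated backends; lazy_frame() and the simulate_*() connections are dbplyr helpers designed for exactly this kind of experiment:\n\nlazy_frame(x = 1, con = simulate_mssql()) |> head(5) |> show_query()\nlazy_frame(x = 1, con = simulate_postgres()) |> head(5) |> show_query()\n\nSQL Server spells the row limit as TOP, while PostgreSQL uses LIMIT. 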
\n\n\n\n\n\nNote that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMS’s, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue on GitHub to help us do better.\n\n\n\n\n21.5.2 SELECT\nThe SELECT clause is the workhorse of queries and performs the same job as select(), mutate(), rename(), relocate(), and, as you’ll learn in the next section, summarize().\nselect(), rename(), and relocate() have very direct translations to SELECT as they just affect where a column appears (if at all) along with its name:\n\nplanes |> \n select(tailnum, type, manufacturer, model, year) |> \n show_query()\n#> <SQL>\n#> SELECT tailnum, \"type\", manufacturer, model, \"year\"\n#> FROM planes\n\nplanes |> \n select(tailnum, type, manufacturer, model, year) |> \n rename(year_built = year) |> \n show_query()\n#> <SQL>\n#> SELECT tailnum, \"type\", manufacturer, model, \"year\" AS year_built\n#> FROM planes\n\nplanes |> \n select(tailnum, type, manufacturer, model, year) |> \n relocate(manufacturer, model, .before = type) |> \n show_query()\n#> <SQL>\n#> SELECT tailnum, manufacturer, model, \"type\", \"year\"\n#> FROM planes\n\nThis example also shows you how SQL does renaming. In SQL terminology renaming is called aliasing and is done with AS. Note that unlike mutate(), the old name is on the left and the new name is on the right.\n\n\n\n\n\n\nIn the examples above note that \"year\" and \"type\" are wrapped in double quotes. That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.\nWhen working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.\nSELECT \"tailnum\", \"type\", \"manufacturer\", \"model\", \"year\"\nFROM \"planes\"\nSome other database systems use backticks instead of quotes:\nSELECT `tailnum`, `type`, `manufacturer`, `model`, `year`\nFROM `planes`\n\n\n\nThe translations for mutate() are similarly straightforward: each variable becomes a new expression in SELECT:\n\nflights |> \n mutate(\n speed = distance / (air_time / 60)\n ) |> \n show_query()\n#> <SQL>\n#> SELECT flights.*, distance / (air_time / 60.0) AS speed\n#> FROM flights\n\nWe’ll come back to the translation of individual components (like /) in Section 21.6.\n\n21.5.3 FROM\nThe FROM clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.\n\n21.5.4 GROUP BY\ngroup_by() is translated to the GROUP BY clause and summarize() is translated to the SELECT clause:\n\ndiamonds_db |> \n group_by(cut) |> \n summarize(\n n = n(),\n avg_price = mean(price, na.rm = TRUE)\n ) |> \n show_query()\n#> <SQL>\n#> SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price\n#> FROM diamonds\n#> GROUP BY cut\n\nWe’ll come back to what’s happening with the translation of n() and mean() in Section 21.6.\n\n21.5.5 WHERE\nfilter() is translated to the WHERE clause:\n\nflights |> \n filter(dest == \"IAH\" | dest == \"HOU\") |> \n show_query()\n#> <SQL>\n#> SELECT flights.*\n#> FROM flights\n#> WHERE (dest = 'IAH' OR dest = 'HOU')\n\nflights |> \n filter(arr_delay > 0 & arr_delay < 20) |> \n show_query()\n#> <SQL>\n#> SELECT flights.*\n#> FROM flights\n#> WHERE (arr_delay > 0.0 AND arr_delay < 20.0)\n\nThere are a few important details to note here:\n\n\n| becomes OR and & becomes AND.\nSQL uses = for comparison, not ==. 
SQL doesn’t have assignment, so there’s no potential for confusion there.\nSQL uses only '' for strings, not \"\". In SQL, \"\" is used to identify variables, like R’s ``.\n\nAnother useful SQL operator is IN, which is very close to R’s %in%:\n\nflights |> \n filter(dest %in% c(\"IAH\", \"HOU\")) |> \n show_query()\n#> <SQL>\n#> SELECT flights.*\n#> FROM flights\n#> WHERE (dest IN ('IAH', 'HOU'))\n\nSQL uses NULL instead of NA. NULLs behave similarly to NAs. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:\n\nflights |> \n group_by(dest) |> \n summarize(delay = mean(arr_delay))\n#> Warning: Missing values are always removed in SQL aggregation functions.\n#> Use `na.rm = TRUE` to silence this warning\n#> This warning is displayed once every 8 hours.\n#> # Source: SQL [?? x 2]\n#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]\n#> dest delay\n#> <chr> <dbl>\n#> 1 SFO 2.67\n#> 2 SJU 2.52\n#> 3 SNA -7.87\n#> 4 SRQ 3.08\n#> 5 CHS 10.6 \n#> 6 SAN 3.14\n#> # ℹ more rows\n\nIf you want to learn more about how NULLs work, you might enjoy “Three valued logic” by Markus Winand.\nIn general, you can work with NULLs using the functions you’d use for NAs in R:\n\nflights |> \n filter(!is.na(dep_delay)) |> \n show_query()\n#> <SQL>\n#> SELECT flights.*\n#> FROM flights\n#> WHERE (NOT((dep_delay IS NULL)))\n\nThis SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:\nWHERE \"dep_delay\" IS NOT NULL\nNote that if you filter() a variable that you created using a summarize, dbplyr will generate a HAVING clause, rather than a WHERE clause. This is one of the idiosyncrasies of SQL: WHERE is evaluated before SELECT and GROUP BY, so SQL needs another clause that’s evaluated afterwards.\n\ndiamonds_db |> \n group_by(cut) |> \n summarize(n = n()) |> \n filter(n > 100) |> \n show_query()\n#> <SQL>\n#> SELECT cut, COUNT(*) AS n\n#> FROM diamonds\n#> GROUP BY cut\n#> HAVING (COUNT(*) > 100.0)\n\n\n21.5.6 ORDER BY\nOrdering rows involves a straightforward translation from arrange() to the ORDER BY clause:\n\nflights |> \n arrange(year, month, day, desc(dep_delay)) |> \n show_query()\n#> <SQL>\n#> SELECT flights.*\n#> FROM flights\n#> ORDER BY \"year\", \"month\", \"day\", dep_delay DESC\n\nNotice how desc() is translated to DESC: this is one of the many dplyr functions whose name was directly inspired by SQL.\n\n21.5.7 Subqueries\nSometimes it’s not possible to translate a dplyr pipeline into a single SELECT statement and you need to use a subquery. A subquery is just a query used as a data source in the FROM clause, instead of the usual table.\ndbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the SELECT clause can’t refer to columns that were just created. 
That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes year1 and then the second (outer) query can compute year2.\n\nflights |> \n mutate(\n year1 = year + 1,\n year2 = year1 + 1\n ) |> \n show_query()\n#> <SQL>\n#> SELECT q01.*, year1 + 1.0 AS year2\n#> FROM (\n#> SELECT flights.*, \"year\" + 1.0 AS year1\n#> FROM flights\n#> ) q01\n\nYou’ll also see this if you attempt to filter() a variable that you just created. Remember, even though WHERE is written after SELECT, it’s evaluated before it, so we need a subquery in this (silly) example:\n\nflights |> \n mutate(year1 = year + 1) |> \n filter(year1 == 2014) |> \n show_query()\n#> <SQL>\n#> SELECT q01.*\n#> FROM (\n#> SELECT flights.*, \"year\" + 1.0 AS year1\n#> FROM flights\n#> ) q01\n#> WHERE (year1 = 2014.0)\n\nSometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.\n\n21.5.8 Joins\nIf you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:\n\nflights |> \n left_join(planes |> rename(year_built = year), by = \"tailnum\") |> \n show_query()\n#> <SQL>\n#> SELECT\n#> flights.*,\n#> planes.\"year\" AS year_built,\n#> \"type\",\n#> manufacturer,\n#> model,\n#> engines,\n#> seats,\n#> speed,\n#> engine\n#> FROM flights\n#> LEFT JOIN planes\n#> ON (flights.tailnum = planes.tailnum)\n\nThe main thing to notice here is the syntax: SQL joins use sub-clauses of the FROM clause to bring in additional tables, using ON to define how the tables are related.\ndplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for inner_join(), right_join(), and full_join():\nSELECT flights.*, \"type\", manufacturer, model, engines, seats, speed\nFROM flights\nINNER JOIN planes ON (flights.tailnum = planes.tailnum)\n\nSELECT flights.*, \"type\", manufacturer, model, engines, seats, speed\nFROM flights\nRIGHT JOIN planes ON (flights.tailnum = planes.tailnum)\n\nSELECT flights.*, \"type\", manufacturer, model, engines, seats, speed\nFROM flights\nFULL JOIN planes ON (flights.tailnum = planes.tailnum)\nYou’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the dm package, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.
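To give you a feel for what dm offers, here’s a minimal sketch that uses the example dm object bundled with the package (connecting dm to your own database is covered in its documentation; flights_dm is just a name we made up):\n\nlibrary(dm)\n\n# a dm object for nycflights13 with primary and foreign keys predefined\nflights_dm <- dm_nycflights13()\n\n# draw a diagram of the tables and how their keys connect them\nflights_dm |> dm_draw()\n\nOnce the keys are defined (or learned from the database’s constraints), dm can generate the join code needed to move between tables.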
\n\n21.5.9 Other verbs\ndbplyr also translates other verbs like distinct(), slice_*(), and intersect(), and a growing selection of tidyr functions like pivot_longer() and pivot_wider(). The easiest way to see the full set of what’s currently available is to visit the dbplyr website: https://dbplyr.tidyverse.org/reference/.\n\n21.5.10 Exercises\n\nWhat is distinct() translated to? How about head()?\n\nExplain what each of the following SQL queries does and try to recreate it using dbplyr.\nSELECT * \nFROM flights\nWHERE dep_delay < arr_delay\n\nSELECT *, distance / (air_time / 60) AS speed\nFROM flights" }, { "objectID": "databases.html#sec-sql-expressions", "href": "databases.html#sec-sql-expressions", "title": "21  Databases", "section": "\n21.6 Function translations", "text": "21.6 Function translations\nSo far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g., what happens when you use mean(x) in a summarize()?\nTo help see what’s going on, we’ll use a couple of little helper functions that run a summarize() or mutate() and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.\n\nsummarize_query <- function(df, ...) {\n df |> \n summarize(...) |> \n show_query()\n}\nmutate_query <- function(df, ...) {\n df |> \n mutate(..., .keep = \"none\") |> \n show_query()\n}\n\nLet’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like mean(), have a relatively simple translation while others, like median(), are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.\n\nflights |> \n group_by(year, month, day) |> \n summarize_query(\n mean = mean(arr_delay, na.rm = TRUE),\n median = median(arr_delay, na.rm = TRUE)\n )\n#> `summarise()` has grouped output by \"year\" and \"month\". You can override\n#> using the `.groups` argument.\n#> <SQL>\n#> SELECT\n#> \"year\",\n#> \"month\",\n#> \"day\",\n#> AVG(arr_delay) AS mean,\n#> PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY arr_delay) AS median\n#> FROM flights\n#> GROUP BY \"year\", \"month\", \"day\"\n\nThe translation of summary functions becomes more complicated when you use them inside a mutate() because they have to turn into so-called window functions. In SQL, you turn an ordinary aggregation function into a window function by adding OVER after it:\n\nflights |> \n group_by(year, month, day) |> \n mutate_query(\n mean = mean(arr_delay, na.rm = TRUE),\n )\n#> <SQL>\n#> SELECT\n#> \"year\",\n#> \"month\",\n#> \"day\",\n#> AVG(arr_delay) OVER (PARTITION BY \"year\", \"month\", \"day\") AS mean\n#> FROM flights\n\nIn SQL, the GROUP BY clause is used exclusively for summaries, so here you can see that the grouping has moved from the GROUP BY clause to the PARTITION BY argument of OVER.\nWindow functions include all functions that look forward or backwards, like lead() and lag() which look at the “previous” or “next” value respectively:\n\nflights |> \n group_by(dest) |> \n arrange(time_hour) |> \n mutate_query(\n lead = lead(arr_delay),\n lag = lag(arr_delay)\n )\n#> <SQL>\n#> SELECT\n#> dest,\n#> LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,\n#> LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag\n#> FROM flights\n#> ORDER BY time_hour\n\nHere it’s important to arrange() the data, because SQL tables have no intrinsic order. In fact, if you don’t use arrange() you might get the rows back in a different order every time! Notice that for window functions, the ordering information is repeated: the ORDER BY clause of the main query doesn’t automatically apply to window functions.\nAnother important SQL function is CASE WHEN. 
It’s used as the translation of if_else() and case_when(), the dplyr function that it directly inspired. Here are a couple of simple examples:\n\nflights |> \n mutate_query(\n description = if_else(arr_delay > 0, \"delayed\", \"on-time\")\n )\n#> <SQL>\n#> SELECT CASE WHEN (arr_delay > 0.0) THEN 'delayed' WHEN NOT (arr_delay > 0.0) THEN 'on-time' END AS description\n#> FROM flights\nflights |> \n mutate_query(\n description = \n case_when(\n arr_delay < -5 ~ \"early\", \n arr_delay < 5 ~ \"on-time\",\n arr_delay >= 5 ~ \"late\"\n )\n )\n#> <SQL>\n#> SELECT CASE\n#> WHEN (arr_delay < -5.0) THEN 'early'\n#> WHEN (arr_delay < 5.0) THEN 'on-time'\n#> WHEN (arr_delay >= 5.0) THEN 'late'\n#> END AS description\n#> FROM flights\n\nCASE WHEN is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is cut():\n\nflights |> \n mutate_query(\n description = cut(\n arr_delay, \n breaks = c(-Inf, -5, 5, Inf), \n labels = c(\"early\", \"on-time\", \"late\")\n )\n )\n#> <SQL>\n#> SELECT CASE\n#> WHEN (arr_delay <= -5.0) THEN 'early'\n#> WHEN (arr_delay <= 5.0) THEN 'on-time'\n#> WHEN (arr_delay > 5.0) THEN 'late'\n#> END AS description\n#> FROM flights\n\ndbplyr also translates common string and date-time manipulation functions, which you can learn about in vignette(\"translation-function\", package = \"dbplyr\"). dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time." }, { "objectID": "databases.html#summary", "href": "databases.html#summary", "title": "21  Databases", "section": "\n21.7 Summary", "text": "21.7 Summary\nIn this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s the most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:\n\n\nSQL for Data Scientists by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organizations.\n\nPractical SQL by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.\n\nIn the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases." }, { "objectID": "databases.html#footnotes", "href": "databases.html#footnotes", "title": "21  Databases", "section": "", "text": "SQL is either pronounced “s”-“q”-“l” or “sequel”.↩︎\nTypically, this is the only function you’ll use from the client package, so we recommend using :: to pull out that one function, rather than loading the complete package with library().↩︎\nAt least, all the tables that you have permission to see.↩︎\nConfusingly, depending on the context, SELECT is either a statement or a clause. 
To avoid this confusion, we’ll generally use SELECT query instead of SELECT statement.↩︎\nOk, technically, only the SELECT is required, since you can write queries like SELECT 1+1 to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a FROM clause.↩︎\nThis is no coincidence: the dplyr function name was inspired by the SQL clause.↩︎" + }, + { + "objectID": "arrow.html#introduction", + "href": "arrow.html#introduction", + "title": "22  Arrow", + "section": "\n22.1 Introduction", + "text": "22.1 Introduction\nCSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the parquet format, an open standards-based format widely used by big data systems.\nWe’ll pair parquet files with Apache Arrow, a multi-language toolbox designed for efficient analysis and transport of large datasets. We’ll use Apache Arrow via the arrow package, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.\nBoth arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works the best for you.\n(A big thanks to Danielle Navarro who contributed the initial version of this chapter.)\n\n22.1.1 Prerequisites\nIn this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.\n\nlibrary(tidyverse)\nlibrary(arrow)\n\nLater in the chapter, we’ll also see some connections between arrow and duckdb, so we’ll also need dbplyr and duckdb.\n\nlibrary(dbplyr, warn.conflicts = FALSE)\nlibrary(duckdb)\n#> Loading required package: DBI" + }, + { + "objectID": "arrow.html#getting-the-data", + "href": "arrow.html#getting-the-data", + "title": "22  Arrow", + "section": "\n22.2 Getting the data", + "text": "22.2 Getting the data\nWe begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6. This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.\nThe following code will get you a cached copy of the data. The data is a 9GB CSV file, so it will take some time to download. 
I highly recommend using curl::multi_download() to get very large files as it’s built for exactly this purpose: it gives you a progress bar and it can resume the download if it’s interrupted.\n\ndir.create(\"data\", showWarnings = FALSE)\n\ncurl::multi_download(\n \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\",\n \"data/seattle-library-checkouts.csv\",\n resume = TRUE\n)\n#> # A tibble: 1 × 10\n#> success status_code resumefrom url destfile error\n#> <lgl> <int> <dbl> <chr> <chr> <chr>\n#> 1 TRUE 200 0 https://r4ds.s3.us-we… data/seattle-l… <NA> \n#> # ℹ 4 more variables: type <chr>, modified <dttm>, time <dbl>,\n#> # headers <list>" }, { "objectID": "arrow.html#opening-a-dataset", "href": "arrow.html#opening-a-dataset", "title": "22  Arrow", "section": "\n22.3 Opening a dataset", "text": "22.3 Opening a dataset\nLet’s start by taking a look at the data. At 9 GB, this file is large enough that we probably don’t want to load the whole thing into memory. A good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB. This means we want to avoid read_csv() and instead use arrow::open_dataset():\n\nseattle_csv <- open_dataset(\n sources = \"data/seattle-library-checkouts.csv\", \n col_types = schema(ISBN = string()),\n format = \"csv\"\n)\n\nWhat happens when this code is run? open_dataset() will scan a few thousand rows to figure out the structure of the dataset. The ISBN column contains blank values for the first 80,000 rows, so we have to specify the column type to help arrow work out the data structure. Once the data has been scanned by open_dataset(), it records what it’s found and stops; it will only read further rows as you specifically request them. This metadata is what we see if we print seattle_csv:\n\nseattle_csv\n#> FileSystemDataset with 1 csv file\n#> UsageClass: string\n#> CheckoutType: string\n#> MaterialType: string\n#> CheckoutYear: int64\n#> CheckoutMonth: int64\n#> Checkouts: int64\n#> Title: string\n#> ISBN: string\n#> Creator: string\n#> Subjects: string\n#> Publisher: string\n#> PublicationYear: string\n\nThe first line in the output tells you that seattle_csv is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed. The remainder of the output tells you the column type that arrow has imputed for each column.\nWe can see what’s actually in it with glimpse(). 
This reveals that there are ~41 million rows and 12 columns, and shows us a few values.\n\nseattle_csv |> glimpse()\n#> FileSystemDataset with 1 csv file\n#> 41,389,465 rows x 12 columns\n#> $ UsageClass <string> \"Physical\", \"Physical\", \"Digital\", \"Physical\", \"Ph…\n#> $ CheckoutType <string> \"Horizon\", \"Horizon\", \"OverDrive\", \"Horizon\", \"Hor…\n#> $ MaterialType <string> \"BOOK\", \"BOOK\", \"EBOOK\", \"BOOK\", \"SOUNDDISC\", \"BOO…\n#> $ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…\n#> $ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…\n#> $ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…\n#> $ Title <string> \"Super rich : a guide to having it all / Russell S…\n#> $ ISBN <string> \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\"…\n#> $ Creator <string> \"Simmons, Russell\", \"Barclay, James, 1965-\", \"Tim …\n#> $ Subjects <string> \"Self realization, Conduct of life, Attitude Psych…\n#> $ Publisher <string> \"Gotham Books,\", \"Pyr,\", \"Random House, Inc.\", \"Di…\n#> $ PublicationYear <string> \"c2011.\", \"2010.\", \"2015\", \"2005.\", \"c2004.\", \"c20…\n\nWe can start to use this dataset with dplyr verbs, using collect() to force arrow to perform the computation and return some data. For example, this code tells us the total number of checkouts per year:\n\nseattle_csv |> \n group_by(CheckoutYear) |> \n summarise(Checkouts = sum(Checkouts)) |> \n arrange(CheckoutYear) |> \n collect()\n#> # A tibble: 18 × 2\n#> CheckoutYear Checkouts\n#> <int> <int>\n#> 1 2005 3798685\n#> 2 2006 6599318\n#> 3 2007 7126627\n#> 4 2008 8438486\n#> 5 2009 9135167\n#> 6 2010 8608966\n#> # ℹ 12 more rows\n\nThanks to arrow, this code will work regardless of how large the underlying dataset is. But it’s currently rather slow: on Hadley’s computer, it took ~10s to run. That’s not terrible given how much data we have, but we can make it much faster by switching to a better format." + }, + { + "objectID": "arrow.html#sec-parquet", + "href": "arrow.html#sec-parquet", + "title": "22  Arrow", + "section": "\n22.4 The parquet format", + "text": "22.4 The parquet format\nTo make this data easier to work with, let’s switch to the parquet file format and split it up into multiple files. The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.\n\n22.4.1 Advantages of parquet\nLike CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it’s a custom binary format designed specifically for the needs of big data. This means that:\n\nParquet files are usually smaller than the equivalent CSV file. Parquet relies on efficient encodings to keep file size down, and supports file compression. This helps make parquet files fast because there’s less data to move from disk to memory.\nParquet files have a rich type system. As we talked about in Seção 7.3, a CSV file does not provide any information about column types. For example, a CSV reader has to guess whether \"08-10-2022\" should be parsed as a string or a date. In contrast, parquet files store data in a way that records the type along with the data.\nParquet files are “column-oriented”. This means that they’re organized column-by-column, much like R’s data frame. 
This typically leads to better performance for data analysis tasks compared to CSV files, which are organized row-by-row.\nParquet files are “chunked”, which makes it possible to work on different parts of the file at the same time, and, if you’re lucky, to skip some chunks altogether.\n\nThere’s one primary disadvantage to parquet files: they are no longer “human readable”, i.e. if you look at a parquet file using readr::read_file(), you’ll just see a bunch of gibberish.\n\n22.4.2 Partitioning\nAs datasets get larger and larger, storing all the data in a single file gets increasingly painful and it’s often useful to split large datasets across many files. When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.\nThere are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data. You’re likely to need to do some experimentation before you find the ideal partitioning for your situation. As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files. You should also try to partition by variables that you filter by; as you’ll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.\n\n22.4.3 Rewriting the Seattle library data\nLet’s apply these ideas to the Seattle library data to see how they play out in practice. We’re going to partition by CheckoutYear, since it’s likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.\nTo rewrite the data we define the partition using dplyr::group_by() and then save the partitions to a directory with arrow::write_dataset(). write_dataset() has two important arguments: a directory where we’ll create the files and the format we’ll use.\n\npq_path <- \"data/seattle-library-checkouts\"\n\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = pq_path, format = \"parquet\")\n\nThis takes about a minute to run; as we’ll see shortly this is an initial investment that pays off by making future operations much much faster.\nLet’s take a look at what we just produced:\n\ntibble(\n files = list.files(pq_path, recursive = TRUE),\n size_MB = file.size(file.path(pq_path, files)) / 1024^2\n)\n#> # A tibble: 18 × 2\n#> files size_MB\n#> <chr> <dbl>\n#> 1 CheckoutYear=2005/part-0.parquet 109.\n#> 2 CheckoutYear=2006/part-0.parquet 164.\n#> 3 CheckoutYear=2007/part-0.parquet 178.\n#> 4 CheckoutYear=2008/part-0.parquet 195.\n#> 5 CheckoutYear=2009/part-0.parquet 214.\n#> 6 CheckoutYear=2010/part-0.parquet 222.\n#> # ℹ 12 more rows\n\nOur single 9GB CSV file has been rewritten into 18 parquet files. The file names use a “self-describing” convention used by the Apache Hive project. Hive-style partitions name folders with a “key=value” convention, so as you might guess, the CheckoutYear=2005 directory contains all the data where CheckoutYear is 2005. Each file is between 100 and 300 MB and the total size is now around 4 GB, a little over half the size of the original CSV file. This is as we expect since parquet is a much more efficient format." 
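A closing note on partitioning: instead of defining the partitions with group_by(), you can pass the partition variables directly to write_dataset() via its partitioning argument. A minimal sketch, equivalent in spirit to the code above:\n\nseattle_csv |>\n write_dataset(path = pq_path, format = \"parquet\", partitioning = \"CheckoutYear\")\n\nThis should produce the same Hive-style CheckoutYear=... directories; use whichever spelling reads more naturally in your pipeline.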
}, { "objectID": "arrow.html#using-dplyr-with-arrow", "href": "arrow.html#using-dplyr-with-arrow", "title": "22  Arrow", "section": "\n22.5 Using dplyr with arrow", "text": "\n22.5 Using dplyr with arrow\nNow that we’ve created these parquet files, we’ll need to read them in again. We use open_dataset() again, but this time we give it a directory:\n\nseattle_pq <- open_dataset(pq_path)\n\nNow we can write our dplyr pipeline. For example, we could count the total number of books checked out in each month for the last five years:\n\nquery <- seattle_pq |> \n filter(CheckoutYear >= 2018, MaterialType == \"BOOK\") |>\n group_by(CheckoutYear, CheckoutMonth) |>\n summarize(TotalCheckouts = sum(Checkouts)) |>\n arrange(CheckoutYear, CheckoutMonth)\n\nWriting dplyr code for arrow data is conceptually similar to dbplyr (Chapter 21): you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call collect(). If we print out the query object we can see a little information about what we expect Arrow to return when the execution takes place:\n\nquery\n#> FileSystemDataset (query)\n#> CheckoutYear: int32\n#> CheckoutMonth: int64\n#> TotalCheckouts: int64\n#> \n#> * Grouped by CheckoutYear\n#> * Sorted by CheckoutYear [asc], CheckoutMonth [asc]\n#> See $.data for the source Arrow object\n\nAnd we can get the results by calling collect():\n\nquery |> collect()\n#> # A tibble: 58 × 3\n#> # Groups: CheckoutYear [5]\n#> CheckoutYear CheckoutMonth TotalCheckouts\n#> <int> <int> <int>\n#> 1 2018 1 355101\n#> 2 2018 2 309813\n#> 3 2018 3 344487\n#> 4 2018 4 330988\n#> 5 2018 5 318049\n#> 6 2018 6 341825\n#> # ℹ 52 more rows\n\nLike dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would. However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in ?acero.\n\n22.5.1 Performance\nLet’s take a quick look at the performance impact of switching from CSV to parquet. First, let’s time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large CSV:\n\nseattle_csv |> \n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarize(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |> \n system.time()\n#> user system elapsed \n#> 11.951 1.297 11.387\n\nNow let’s use our new version of the dataset in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:\n\nseattle_pq |> \n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarize(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |> \n system.time()\n#> user system elapsed \n#> 0.263 0.058 0.063\n\nThe ~100x speedup in performance is attributable to two factors: the multi-file partitioning, and the format of individual files:\n\nPartitioning improves performance because this query uses CheckoutYear == 2021 to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.\nThe parquet format improves performance by storing data in a binary format that can be read more directly into memory. 
The column-wise format and rich metadata mean that arrow only needs to read the four columns actually used in the query (CheckoutYear, MaterialType, CheckoutMonth, and Checkouts).\n\nThis massive difference in performance is why it pays off to convert large CSVs to parquet!\n\n22.5.2 Using duckdb with arrow\nThere’s one last advantage of parquet and arrow: it’s very easy to turn an arrow dataset into a DuckDB database (Chapter 21) by calling arrow::to_duckdb():\n\nseattle_pq |> \n to_duckdb() |>\n filter(CheckoutYear >= 2018, MaterialType == \"BOOK\") |>\n group_by(CheckoutYear) |>\n summarize(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutYear)) |>\n collect()\n#> Warning: Missing values are always removed in SQL aggregation functions.\n#> Use `na.rm = TRUE` to silence this warning\n#> This warning is displayed once every 8 hours.\n#> # A tibble: 5 × 2\n#> CheckoutYear TotalCheckouts\n#> <int> <dbl>\n#> 1 2022 2431502\n#> 2 2021 2266438\n#> 3 2020 1241999\n#> 4 2019 3931688\n#> 5 2018 3987569\n\nThe neat thing about to_duckdb() is that the transfer doesn’t involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.\n\n22.5.3 Exercises\n\nFigure out the most popular book each year.\nWhich author has the most books in the Seattle library system?\nHow have checkouts of books vs ebooks changed over the last 10 years?" }, { "objectID": "arrow.html#summary", "href": "arrow.html#summary", "title": "22  Arrow", "section": "\n22.6 Summary", "text": "22.6 Summary\nIn this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. It can work with CSV files, and it’s much much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.\nNext up you’ll learn about your first non-rectangular data source, which you’ll handle using tools provided by the tidyr package. We’ll focus on data that comes from JSON files, but the general principles apply to tree-like data regardless of its source." }, { "objectID": "rectangling.html#introduction", "href": "rectangling.html#introduction", "title": "23  Hierarchical data", "section": "\n23.1 Introduction", "text": "23.1 Introduction\nIn this chapter, you’ll learn the art of data rectangling: taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.\nTo learn about rectangling, you’ll need to first learn about lists, the data structure that makes hierarchical data possible. Then you’ll learn about two crucial tidyr functions: tidyr::unnest_longer() and tidyr::unnest_wider(). We’ll then show you a few case studies, applying these simple functions again and again to solve real problems. We’ll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.\n\n23.1.1 Prerequisites\nIn this chapter, we’ll use many functions from tidyr, a core member of the tidyverse. 
We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.\n\nlibrary(tidyverse)\nlibrary(repurrrsive)\nlibrary(jsonlite)" }, { "objectID": "rectangling.html#lists", "href": "rectangling.html#lists", "title": "23  Hierarchical data", "section": "\n23.2 Lists", "text": "23.2 Lists\nSo far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you’ll need a list, which you create with list():\n\nx1 <- list(1:4, \"a\", TRUE)\nx1\n#> [[1]]\n#> [1] 1 2 3 4\n#> \n#> [[2]]\n#> [1] \"a\"\n#> \n#> [[3]]\n#> [1] TRUE\n\nIt’s often convenient to name the components, or children, of a list, which you can do in the same way as naming the columns of a tibble:\n\nx2 <- list(a = 1:2, b = 1:3, c = 1:4)\nx2\n#> $a\n#> [1] 1 2\n#> \n#> $b\n#> [1] 1 2 3\n#> \n#> $c\n#> [1] 1 2 3 4\n\nEven for these very simple lists, printing takes up quite a lot of space. A useful alternative is str(), which generates a compact display of the structure, de-emphasizing the contents:\n\nstr(x1)\n#> List of 3\n#> $ : int [1:4] 1 2 3 4\n#> $ : chr \"a\"\n#> $ : logi TRUE\nstr(x2)\n#> List of 3\n#> $ a: int [1:2] 1 2\n#> $ b: int [1:3] 1 2 3\n#> $ c: int [1:4] 1 2 3 4\n\nAs you can see, str() displays each child of the list on its own line. It displays the name, if present, then an abbreviation of the type, then the first few values.\n\n23.2.1 Hierarchy\nLists can contain any type of object, including other lists. This makes them suitable for representing hierarchical (tree-like) structures:\n\nx3 <- list(list(1, 2), list(3, 4))\nstr(x3)\n#> List of 2\n#> $ :List of 2\n#> ..$ : num 1\n#> ..$ : num 2\n#> $ :List of 2\n#> ..$ : num 3\n#> ..$ : num 4\n\nThis is notably different to c(), which generates a flat vector:\n\nc(c(1, 2), c(3, 4))\n#> [1] 1 2 3 4\n\nx4 <- c(list(1, 2), list(3, 4))\nstr(x4)\n#> List of 4\n#> $ : num 1\n#> $ : num 2\n#> $ : num 3\n#> $ : num 4\n\nAs lists get more complex, str() gets more useful, as it lets you see the hierarchy at a glance:\n\nx5 <- list(1, list(2, list(3, list(4, list(5)))))\nstr(x5)\n#> List of 2\n#> $ : num 1\n#> $ :List of 2\n#> ..$ : num 2\n#> ..$ :List of 2\n#> .. ..$ : num 3\n#> .. ..$ :List of 2\n#> .. .. ..$ : num 4\n#> .. .. ..$ :List of 1\n#> .. .. .. ..$ : num 5\n\nAs lists get even larger and more complex, str() eventually starts to fail, and you’ll need to switch to View(). Figure 23.1 shows the result of calling View(x5). The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in Figure 23.2. RStudio will also show you the code you need to access that element, as in Figure 23.3. We’ll come back to how this code works in Section 27.3.\n\n\n\n\nFigure 23.1: The RStudio view lets you interactively explore a complex list. The viewer opens showing only the top level of the list.\n\n\n\n\n\n\n\nFigure 23.2: Clicking on the rightward facing triangle expands that component of the list so that you can also see its children.\n\n\n\n\n\n\n\nFigure 23.3: You can repeat this operation as many times as needed to get to the data you’re interested in. 
Note the bottom-left corner: if you click an element of the list, RStudio will give you the subsetting code needed to access it, in this case x5[[2]][[2]][[2]].\n\n\n\n\n23.2.2 List-columns\nLists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to place objects in a tibble that wouldn’t usually belong in there. In particular, list-columns are used a lot in the tidymodels ecosystem, because they allow you to store things like model outputs or resamples in a data frame.\nHere’s a simple example of a list-column:\n\ndf <- tibble(\n x = 1:2, \n y = c(\"a\", \"b\"),\n z = list(list(1, 2), list(3, 4, 5))\n)\ndf\n#> # A tibble: 2 × 3\n#> x y z \n#> <int> <chr> <list> \n#> 1 1 a <list [2]>\n#> 2 2 b <list [3]>\n\nThere’s nothing special about lists in a tibble; they behave like any other column:\n\ndf |> \n filter(x == 1)\n#> # A tibble: 1 × 3\n#> x y z \n#> <int> <chr> <list> \n#> 1 1 a <list [2]>\n\nComputing with list-columns is harder, but that’s because computing with lists is harder in general; we’ll come back to that in Chapter 26. In this chapter, we’ll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.\nThe default print method just displays a rough summary of the contents. The list-column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull out just the one list-column and apply one of the techniques that you’ve learned above, like df |> pull(z) |> str() or df |> pull(z) |> View().\n\n\n\n\n\n\nBase R\n\n\n\nIt’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:\n\ndata.frame(x = list(1:3, 3:5))\n#> x.1.3 x.3.5\n#> 1 1 3\n#> 2 2 4\n#> 3 3 5\n\nYou can force data.frame() to treat a list as a list of rows by wrapping it in I(), but the result doesn’t print particularly well:\n\ndata.frame(\n x = I(list(1:2, 3:5)), \n y = c(\"1, 2\", \"3, 4, 5\")\n)\n#> x y\n#> 1 1, 2 1, 2\n#> 2 3, 4, 5 3, 4, 5\n\nIt’s easier to use list-columns with tibbles because tibble() treats lists like vectors and the print method has been designed with lists in mind." }, { "objectID": "rectangling.html#unnesting", "href": "rectangling.html#unnesting", "title": "23  Hierarchical data", "section": "\n23.3 Unnesting", "text": "23.3 Unnesting\nNow that you’ve learned the basics of lists and list-columns, let’s explore how you can turn them back into regular rows and columns. Here we’ll use very simple sample data so you can get the basic idea; in the next section we’ll switch to real data.\nList-columns tend to come in two basic forms: named and unnamed. When the children are named, they tend to have the same names in every row. For example, in df1, every element of list-column y has two elements named a and b. Named list-columns naturally unnest into columns: each named element becomes a new named column.\n\ndf1 <- tribble(\n ~x, ~y,\n 1, list(a = 11, b = 12),\n 2, list(a = 21, b = 22),\n 3, list(a = 31, b = 32),\n)\n\nWhen the children are unnamed, the number of elements tends to vary from row to row. For example, in df2, the elements of list-column y are unnamed and vary in length from one to three. 
Unnamed list-columns naturally unnest into rows: you’ll get one row for each child.\n\n\ndf2 <- tribble(\n ~x, ~y,\n 1, list(11, 12, 13),\n 2, list(21),\n 3, list(31, 32),\n)\n\ntidyr provides two functions for these two cases: unnest_wider() and unnest_longer(). The following sections explain how they work.\n\n23.3.1 unnest_wider()\n\nWhen each row has the same number of elements with the same names, like df1, it’s natural to put each component into its own column with unnest_wider():\n\ndf1 |> \n unnest_wider(y)\n#> # A tibble: 3 × 3\n#> x a b\n#> <dbl> <dbl> <dbl>\n#> 1 1 11 12\n#> 2 2 21 22\n#> 3 3 31 32\n\nBy default, the names of the new columns come exclusively from the names of the list elements, but you can use the names_sep argument to request that they combine the column name and the element name. This is useful for disambiguating repeated names.\n\ndf1 |> \n unnest_wider(y, names_sep = \"_\")\n#> # A tibble: 3 × 3\n#> x y_a y_b\n#> <dbl> <dbl> <dbl>\n#> 1 1 11 12\n#> 2 2 21 22\n#> 3 3 31 32\n\n\n23.3.2 unnest_longer()\n\nWhen each row contains an unnamed list, it’s most natural to put each element into its own row with unnest_longer():\n\ndf2 |> \n unnest_longer(y)\n#> # A tibble: 6 × 2\n#> x y\n#> <dbl> <dbl>\n#> 1 1 11\n#> 2 1 12\n#> 3 1 13\n#> 4 2 21\n#> 5 3 31\n#> 6 3 32\n\nNote how x is duplicated for each element inside of y: we get one row of output for each element inside the list-column. But what happens if one of the elements is empty, as in the following example?\n\ndf6 <- tribble(\n ~x, ~y,\n \"a\", list(1, 2),\n \"b\", list(3),\n \"c\", list()\n)\ndf6 |> unnest_longer(y)\n#> # A tibble: 3 × 2\n#> x y\n#> <chr> <dbl>\n#> 1 a 1\n#> 2 a 2\n#> 3 b 3\n\nThe empty list in the “c” row produces zero rows in the output, so that row effectively disappears. If you want to preserve that row, with an NA in y, set keep_empty = TRUE, as shown below.
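For example (output shown as we’d expect the tibble print method to display it):\n\ndf6 |> unnest_longer(y, keep_empty = TRUE)\n#> # A tibble: 4 × 2\n#> x y\n#> <chr> <dbl>\n#> 1 a 1\n#> 2 a 2\n#> 3 b 3\n#> 4 c NA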
\n\n23.3.3 Inconsistent types\nWhat happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column y contains two numbers, a character, and a logical, which can’t normally be mixed in a single column.\n\ndf4 <- tribble(\n ~x, ~y,\n \"a\", list(1),\n \"b\", list(\"a\", TRUE, 5)\n)\n\nunnest_longer() always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does unnest_longer() produce four rows while keeping everything in y?\n\ndf4 |> \n unnest_longer(y)\n#> # A tibble: 4 × 2\n#> x y \n#> <chr> <list> \n#> 1 a <dbl [1]>\n#> 2 b <chr [1]>\n#> 3 b <lgl [1]>\n#> 4 b <dbl [1]>\n\nAs you can see, the output contains a list-column, but every element of the list-column contains a single element. Because unnest_longer() can’t find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. It doesn’t: every element is a list, even though the contents are of different types.\nDealing with inconsistent types is challenging and the details depend on the precise nature of the problem and your goals, but you’ll most likely need tools from Chapter 26.\n\n23.3.4 Other functions\ntidyr has a few other useful rectangling functions that we’re not going to cover in this book:\n\n\nunnest_auto() automatically picks between unnest_longer() and unnest_wider() based on the structure of the list-column. It’s great for rapid exploration, but ultimately it’s a bad idea because it doesn’t force you to understand how your data is structured, and makes your code harder to understand.\n\nunnest() expands both rows and columns. It’s useful when you have a list-column that contains a 2d structure like a data frame, which you don’t see in this book, but you might encounter if you use the tidymodels ecosystem (there’s a small sketch just after this list).\n\nThese functions are good to know about as you might encounter them when reading other people’s code or tackling rarer rectangling challenges yourself.
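Here’s a small sketch of unnest() in action with a data-frame list-column; df_nested is our own made-up example, not one of the book’s datasets:\n\ndf_nested <- tibble(\n g = c(\"a\", \"b\"),\n data = list(\n tibble(x = 1:2, y = 3:4),\n tibble(x = 3L, y = 5L)\n )\n)\ndf_nested |> unnest(data)\n#> # A tibble: 3 × 3\n#> g x y\n#> <chr> <int> <int>\n#> 1 a 1 3\n#> 2 a 2 4\n#> 3 b 3 5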
\n\n23.3.5 Exercises\n\nWhat happens when you use unnest_wider() with unnamed list-columns like df2? What argument is now necessary? What happens to missing values?\nWhat happens when you use unnest_longer() with named list-columns like df1? What additional information do you get in the output? How can you suppress that extra detail?\n\nFrom time to time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of y and z are aligned (i.e. y and z will always have the same length within a row, and the first value of y corresponds to the first value of z). What happens if you apply two unnest_longer() calls to this data frame? How can you preserve the relationship between y and z? (Hint: carefully read the docs).\n\ndf4 <- tribble(\n ~x, ~y, ~z,\n \"a\", list(\"y-a-1\", \"y-a-2\"), list(\"z-a-1\", \"z-a-2\"),\n \"b\", list(\"y-b-1\", \"y-b-2\", \"y-b-3\"), list(\"z-b-1\", \"z-b-2\", \"z-b-3\")\n)" }, { "objectID": "rectangling.html#case-studies", "href": "rectangling.html#case-studies", "title": "23  Hierarchical data", "section": "\n23.4 Case studies", "text": "23.4 Case studies\nThe main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to unnest_longer() and/or unnest_wider(). To show that in action, this section works through three real rectangling challenges using datasets from the repurrrsive package.\n\n23.4.1 Very wide data\nWe’ll start with gh_repos. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; we recommend exploring a little on your own with View(gh_repos) before we continue.\ngh_repos is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We call this column json for reasons we’ll get to later.\n\nrepos <- tibble(json = gh_repos)\nrepos\n#> # A tibble: 6 × 1\n#> json \n#> <list> \n#> 1 <list [30]>\n#> 2 <list [30]>\n#> 3 <list [30]>\n#> 4 <list [26]>\n#> 5 <list [30]>\n#> 6 <list [30]>\n\nThis tibble contains 6 rows, one row for each child of gh_repos. Each row contains an unnamed list with either 26 or 30 elements. Since these are unnamed, we’ll start with unnest_longer() to put each child in its own row:\n\nrepos |> \n unnest_longer(json)\n#> # A tibble: 176 × 1\n#> json \n#> <list> \n#> 1 <named list [68]>\n#> 2 <named list [68]>\n#> 3 <named list [68]>\n#> 4 <named list [68]>\n#> 5 <named list [68]>\n#> 6 <named list [68]>\n#> # ℹ 170 more rows\n\nAt first glance, it might seem like we haven’t improved the situation: while we have more rows (176 instead of 6), each element of json is still a list. However, there’s an important difference: now each element is a named list, so we can use unnest_wider() to put each element into its own column:\n\nrepos |> \n unnest_longer(json) |> \n unnest_wider(json) \n#> # A tibble: 176 × 68\n#> id name full_name owner private html_url \n#> <int> <chr> <chr> <list> <lgl> <chr> \n#> 1 61160198 after gaborcsardi/after <named list> FALSE https://github…\n#> 2 40500181 argufy gaborcsardi/argu… <named list> FALSE https://github…\n#> 3 36442442 ask gaborcsardi/ask <named list> FALSE https://github…\n#> 4 34924886 baseimports gaborcsardi/base… <named list> FALSE https://github…\n#> 5 61620661 citest gaborcsardi/cite… <named list> FALSE https://github…\n#> 6 33907457 clisymbols gaborcsardi/clis… <named list> FALSE https://github…\n#> # ℹ 170 more rows\n#> # ℹ 62 more variables: description <chr>, fork <lgl>, url <chr>, …\n\nThis has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with names(); here we look at the first 10:\n\nrepos |> \n unnest_longer(json) |> \n unnest_wider(json) |> \n names() |> \n head(10)\n#> [1] \"id\" \"name\" \"full_name\" \"owner\" \"private\" \n#> [6] \"html_url\" \"description\" \"fork\" \"url\" \"forks_url\"\n\nLet’s pull out a few that look interesting:\n\nrepos |> \n unnest_longer(json) |> \n unnest_wider(json) |> \n select(id, full_name, owner, description)\n#> # A tibble: 176 × 4\n#> id full_name owner description \n#> <int> <chr> <list> <chr> \n#> 1 61160198 gaborcsardi/after <named list [17]> Run Code in the Backgro…\n#> 2 40500181 gaborcsardi/argufy <named list [17]> Declarative function ar…\n#> 3 36442442 gaborcsardi/ask <named list [17]> Friendly CLI interactio…\n#> 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for …\n#> 5 61620661 gaborcsardi/citest <named list [17]> Test R package and repo…\n#> 6 33907457 gaborcsardi/clisymbols <named list [17]> Unicode symbols for CLI…\n#> # ℹ 170 more rows\n\nYou can use this to work back to understand how gh_repos was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.\nowner is another list-column, and since it contains a named list, we can use unnest_wider() to get at the values:\n\nrepos |> \n unnest_longer(json) |> \n unnest_wider(json) |> \n select(id, full_name, owner, description) |> \n unnest_wider(owner)\n#> Error in `unnest_wider()`:\n#> ! Can't duplicate names between the affected columns and the original\n#> data.\n#> ✖ These names are duplicated:\n#> ℹ `id`, from `owner`.\n#> ℹ Use `names_sep` to disambiguate using the column name.\n#> ℹ Or use `names_repair` to specify a repair strategy.\n\nUh oh, this list-column also contains an id column and we can’t have two id columns in the same data frame. 
As suggested, let’s use names_sep to resolve the problem:\n\nrepos |> \n unnest_longer(json) |> \n unnest_wider(json) |> \n select(id, full_name, owner, description) |> \n unnest_wider(owner, names_sep = \"_\")\n#> # A tibble: 176 × 20\n#> id full_name owner_login owner_id owner_avatar_url \n#> <int> <chr> <chr> <int> <chr> \n#> 1 61160198 gaborcsardi/after gaborcsardi 660288 https://avatars.gith…\n#> 2 40500181 gaborcsardi/argufy gaborcsardi 660288 https://avatars.gith…\n#> 3 36442442 gaborcsardi/ask gaborcsardi 660288 https://avatars.gith…\n#> 4 34924886 gaborcsardi/baseimports gaborcsardi 660288 https://avatars.gith…\n#> 5 61620661 gaborcsardi/citest gaborcsardi 660288 https://avatars.gith…\n#> 6 33907457 gaborcsardi/clisymbols gaborcsardi 660288 https://avatars.gith…\n#> # ℹ 170 more rows\n#> # ℹ 15 more variables: owner_gravatar_id <chr>, owner_url <chr>, …\n\nThis gives another wide dataset, but you can get the sense that owner appears to contain a lot of additional data about the person who “owns” the repository.\n\n23.4.2 Relational data\nNested data is sometimes used to represent data that we’d usually spread across multiple data frames. For example, take got_chars which contains data about characters that appear in the Game of Thrones books and TV series. Like gh_repos it’s a list, so we start by turning it into a list-column of a tibble:\n\nchars <- tibble(json = got_chars)\nchars\n#> # A tibble: 30 × 1\n#> json \n#> <list> \n#> 1 <named list [18]>\n#> 2 <named list [18]>\n#> 3 <named list [18]>\n#> 4 <named list [18]>\n#> 5 <named list [18]>\n#> 6 <named list [18]>\n#> # ℹ 24 more rows\n\nThe json column contains named elements, so we’ll start by widening it:\n\nchars |> \n unnest_wider(json)\n#> # A tibble: 30 × 18\n#> url id name gender culture born \n#> <chr> <int> <chr> <chr> <chr> <chr> \n#> 1 https://www.anapio… 1022 Theon Greyjoy Male \"Ironborn\" \"In 278 AC or …\n#> 2 https://www.anapio… 1052 Tyrion Lannist… Male \"\" \"In 273 AC, at…\n#> 3 https://www.anapio… 1074 Victarion Grey… Male \"Ironborn\" \"In 268 AC or …\n#> 4 https://www.anapio… 1109 Will Male \"\" \"\" \n#> 5 https://www.anapio… 1166 Areo Hotah Male \"Norvoshi\" \"In 257 AC or …\n#> 6 https://www.anapio… 1267 Chett Male \"\" \"At Hag's Mire\"\n#> # ℹ 24 more rows\n#> # ℹ 12 more variables: died <chr>, alive <lgl>, titles <list>, …\n\nAnd selecting a few columns to make it easier to read:\n\ncharacters <- chars |> \n unnest_wider(json) |> \n select(id, name, gender, culture, born, died, alive)\ncharacters\n#> # A tibble: 30 × 7\n#> id name gender culture born died \n#> <int> <chr> <chr> <chr> <chr> <chr> \n#> 1 1022 Theon Greyjoy Male \"Ironborn\" \"In 278 AC or 27… \"\" \n#> 2 1052 Tyrion Lannister Male \"\" \"In 273 AC, at C… \"\" \n#> 3 1074 Victarion Greyjoy Male \"Ironborn\" \"In 268 AC or be… \"\" \n#> 4 1109 Will Male \"\" \"\" \"In 297 AC, at…\n#> 5 1166 Areo Hotah Male \"Norvoshi\" \"In 257 AC or be… \"\" \n#> 6 1267 Chett Male \"\" \"At Hag's Mire\" \"In 299 AC, at…\n#> # ℹ 24 more rows\n#> # ℹ 1 more variable: alive <lgl>\n\nThis dataset also contains many list-columns:\n\nchars |> \n unnest_wider(json) |> \n select(id, where(is.list))\n#> # A tibble: 30 × 8\n#> id titles aliases allegiances books povBooks tvSeries playedBy\n#> <int> <list> <list> <list> <list> <list> <list> <list> \n#> 1 1022 <chr [2]> <chr [4]> <chr [1]> <chr [3]> <chr> <chr> <chr> \n#> 2 1052 <chr [2]> <chr [11]> <chr [1]> <chr [2]> <chr> <chr> <chr> \n#> 3 1074 <chr [2]> <chr [1]> <chr [1]> <chr [3]> <chr> <chr> <chr> \n#> 4 
1109 <chr [1]> <chr [1]> <NULL> <chr [1]> <chr> <chr> <chr> \n#> 5 1166 <chr [1]> <chr [1]> <chr [1]> <chr [3]> <chr> <chr> <chr> \n#> 6 1267 <chr [1]> <chr [1]> <NULL> <chr [2]> <chr> <chr> <chr> \n#> # ℹ 24 more rows\n\nLet’s explore the titles column. It’s an unnamed list-column, so we’ll unnest it into rows:\n\nchars |> \n unnest_wider(json) |> \n select(id, titles) |> \n unnest_longer(titles)\n#> # A tibble: 59 × 2\n#> id titles \n#> <int> <chr> \n#> 1 1022 Prince of Winterfell \n#> 2 1022 Lord of the Iron Islands (by law of the green lands)\n#> 3 1052 Acting Hand of the King (former) \n#> 4 1052 Master of Coin (former) \n#> 5 1074 Lord Captain of the Iron Fleet \n#> 6 1074 Master of the Iron Victory \n#> # ℹ 53 more rows\n\nYou might expect to see this data in its own table because it would be easy to join to the characters data as needed. Let’s do that, which requires a little cleaning: removing the rows containing empty strings and renaming titles to title since each row now only contains a single title.\n\ntitles <- chars |> \n unnest_wider(json) |> \n select(id, titles) |> \n unnest_longer(titles) |> \n filter(titles != \"\") |> \n rename(title = titles)\ntitles\n#> # A tibble: 52 × 2\n#> id title \n#> <int> <chr> \n#> 1 1022 Prince of Winterfell \n#> 2 1022 Lord of the Iron Islands (by law of the green lands)\n#> 3 1052 Acting Hand of the King (former) \n#> 4 1052 Master of Coin (former) \n#> 5 1074 Lord Captain of the Iron Fleet \n#> 6 1074 Master of the Iron Victory \n#> # ℹ 46 more rows\n\nYou could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
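For example, here’s a minimal sketch of that join, using the characters and titles tables we just made (the code is ours, not part of the original analysis):\n\ncharacters |> \n left_join(titles, by = \"id\")\n\nEach character now appears once per title, and characters with no recorded title get a single row with an NA title; that’s the usual one-to-many join at work.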
results is an unnamed list, with either one or two elements (we’ll see why shortly) so we’ll unnest it into rows:\n\ngmaps_cities |> \n unnest_wider(json) |> \n select(-status) |> \n unnest_longer(results)\n#> # A tibble: 7 × 2\n#> city results \n#> <chr> <list> \n#> 1 Houston <named list [5]>\n#> 2 Washington <named list [5]>\n#> 3 Washington <named list [5]>\n#> 4 New York <named list [5]>\n#> 5 Chicago <named list [5]>\n#> 6 Arlington <named list [5]>\n#> # ℹ 1 more row\n\nNow results is a named list, so we’ll use unnest_wider():\n\nlocations <- gmaps_cities |> \n unnest_wider(json) |> \n select(-status) |> \n unnest_longer(results) |> \n unnest_wider(results)\nlocations\n#> # A tibble: 7 × 6\n#> city address_components formatted_address geometry \n#> <chr> <list> <chr> <list> \n#> 1 Houston <list [4]> Houston, TX, USA <named list [4]>\n#> 2 Washington <list [2]> Washington, USA <named list [4]>\n#> 3 Washington <list [4]> Washington, DC, USA <named list [4]>\n#> 4 New York <list [3]> New York, NY, USA <named list [4]>\n#> 5 Chicago <list [4]> Chicago, IL, USA <named list [4]>\n#> 6 Arlington <list [4]> Arlington, TX, USA <named list [4]>\n#> # ℹ 1 more row\n#> # ℹ 2 more variables: place_id <chr>, types <list>\n\nNow we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.\nThere are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the geometry list-column:\n\nlocations |> \n select(city, formatted_address, geometry) |> \n unnest_wider(geometry)\n#> # A tibble: 7 × 6\n#> city formatted_address bounds location location_type\n#> <chr> <chr> <list> <list> <chr> \n#> 1 Houston Houston, TX, USA <named list [2]> <named list> APPROXIMATE \n#> 2 Washington Washington, USA <named list [2]> <named list> APPROXIMATE \n#> 3 Washington Washington, DC, USA <named list [2]> <named list> APPROXIMATE \n#> 4 New York New York, NY, USA <named list [2]> <named list> APPROXIMATE \n#> 5 Chicago Chicago, IL, USA <named list [2]> <named list> APPROXIMATE \n#> 6 Arlington Arlington, TX, USA <named list [2]> <named list> APPROXIMATE \n#> # ℹ 1 more row\n#> # ℹ 1 more variable: viewport <list>\n\nThat gives us new bounds (a rectangular region) and location (a point). We can unnest location to see the latitude (lat) and longitude (lng):\n\nlocations |> \n select(city, formatted_address, geometry) |> \n unnest_wider(geometry) |> \n unnest_wider(location)\n#> # A tibble: 7 × 7\n#> city formatted_address bounds lat lng location_type\n#> <chr> <chr> <list> <dbl> <dbl> <chr> \n#> 1 Houston Houston, TX, USA <named list [2]> 29.8 -95.4 APPROXIMATE \n#> 2 Washington Washington, USA <named list [2]> 47.8 -121. 
APPROXIMATE \n#> 3 Washington Washington, DC, USA <named list [2]> 38.9 -77.0 APPROXIMATE \n#> 4 New York New York, NY, USA <named list [2]> 40.7 -74.0 APPROXIMATE \n#> 5 Chicago Chicago, IL, USA <named list [2]> 41.9 -87.6 APPROXIMATE \n#> 6 Arlington Arlington, TX, USA <named list [2]> 32.7 -97.1 APPROXIMATE \n#> # ℹ 1 more row\n#> # ℹ 1 more variable: viewport <list>\n\nExtracting the bounds requires a few more steps:\n\nlocations |> \n select(city, formatted_address, geometry) |> \n unnest_wider(geometry) |> \n # focus on the variables of interest\n select(!location:viewport) |>\n unnest_wider(bounds)\n#> # A tibble: 7 × 4\n#> city formatted_address northeast southwest \n#> <chr> <chr> <list> <list> \n#> 1 Houston Houston, TX, USA <named list [2]> <named list [2]>\n#> 2 Washington Washington, USA <named list [2]> <named list [2]>\n#> 3 Washington Washington, DC, USA <named list [2]> <named list [2]>\n#> 4 New York New York, NY, USA <named list [2]> <named list [2]>\n#> 5 Chicago Chicago, IL, USA <named list [2]> <named list [2]>\n#> 6 Arlington Arlington, TX, USA <named list [2]> <named list [2]>\n#> # ℹ 1 more row\n\nWe then rename southwest and northeast (the corners of the rectangle) so we can use names_sep to create short but evocative names:\n\nlocations |> \n select(city, formatted_address, geometry) |> \n unnest_wider(geometry) |> \n select(!location:viewport) |>\n unnest_wider(bounds) |> \n rename(ne = northeast, sw = southwest) |> \n unnest_wider(c(ne, sw), names_sep = \"_\") \n#> # A tibble: 7 × 6\n#> city formatted_address ne_lat ne_lng sw_lat sw_lng\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 Houston Houston, TX, USA 30.1 -95.0 29.5 -95.8\n#> 2 Washington Washington, USA 49.0 -117. 45.5 -125. \n#> 3 Washington Washington, DC, USA 39.0 -76.9 38.8 -77.1\n#> 4 New York New York, NY, USA 40.9 -73.7 40.5 -74.3\n#> 5 Chicago Chicago, IL, USA 42.0 -87.5 41.6 -87.9\n#> 6 Arlington Arlington, TX, USA 32.8 -97.0 32.6 -97.2\n#> # ℹ 1 more row\n\nNote how we unnest two columns simultaneously by supplying a vector of variable names to unnest_wider().\nOnce you’ve discovered the path to get to the components you’re interested in, you can extract them directly using another tidyr function, hoist():\n\nlocations |> \n select(city, formatted_address, geometry) |> \n hoist(\n geometry,\n ne_lat = c(\"bounds\", \"northeast\", \"lat\"),\n sw_lat = c(\"bounds\", \"southwest\", \"lat\"),\n ne_lng = c(\"bounds\", \"northeast\", \"lng\"),\n sw_lng = c(\"bounds\", \"southwest\", \"lng\"),\n )\n\nIf these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in vignette(\"rectangling\", package = \"tidyr\").\n\n23.4.4 Exercises\n\nRoughly estimate when gh_repos was created. Why can you only roughly estimate the date?\nThe owner column of gh_repos contains a lot of duplicated information because each owner can have many repos. Can you construct an owners data frame that contains one row for each owner? (Hint: does distinct() work with list-cols?)\nFollow the steps used for titles to create similar tables for the aliases, allegiances, books, and TV series for the Game of Thrones characters.\n\nExplain the following code line-by-line. Why is it interesting? 
Why does it work for got_chars but might not work in general?\n\ntibble(json = got_chars) |> \n unnest_wider(json) |> \n select(id, where(is.list)) |> \n pivot_longer(\n where(is.list), \n names_to = \"name\", \n values_to = \"value\"\n ) |> \n unnest_longer(value)\n\n\nIn gmaps_cities, what does address_components contain? Why does the length vary between rows? Unnest it appropriately to figure it out. (Hint: types always appears to contain two elements. Does unnest_wider() make it easier to work with than unnest_longer()?) ." + }, + { + "objectID": "rectangling.html#json", + "href": "rectangling.html#json", + "title": "23  Hierarchical data", + "section": "\n23.5 JSON", + "text": "23.5 JSON\nAll of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for javascript object notation and is the way that most web APIs return data. It’s important to understand it because while JSON and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so it’s good to understand a bit about JSON if things go wrong.\n\n23.5.1 Data types\nJSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:\n\nThe simplest type is a null (null) which plays the same role as NA in R. It represents the absence of data.\nA string is much like a string in R, but must always use double quotes.\nA number is similar to R’s numbers: they can use integer (e.g., 123), decimal (e.g., 123.45), or scientific (e.g., 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.\nA boolean is similar to R’s TRUE and FALSE, but uses lowercase true and false.\n\nJSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.\nBoth arrays and objects are similar to lists in R; the difference is whether or not they’re named. An array is like an unnamed list, and is written with []. For example [1, 2, 3] is an array containing 3 numbers, and [null, 1, \"string\", false] is an array that contains a null, a number, a string, and a boolean. An object is like a named list, and is written with {}. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, {\"x\": 1, \"y\": 2} is an object that maps x to 1 and y to 2.\nNote that JSON doesn’t have any native way to represent dates or date-times, so they’re often stored as strings, and you’ll need to use readr::parse_date() or readr::parse_datetime() to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply readr::parse_double() as needed to get the correct variable type.\n\n23.5.2 jsonlite\nTo convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We’ll use only two jsonlite functions: read_json() and parse_json(). In real life, you’ll use read_json() to read a JSON file from disk. 
For example, the repurrrsive package also provides the source for gh_users as a JSON file and you can read it with read_json():\n\n# A path to a json file inside the package:\ngh_users_json()\n#> [1] \"/home/runner/work/_temp/Library/repurrrsive/extdata/gh_users.json\"\n\n# Read it with read_json()\ngh_users2 <- read_json(gh_users_json())\n\n# Check it's the same as the data we were using previously\nidentical(gh_users, gh_users2)\n#> [1] TRUE\n\nIn this book, we’ll also use parse_json(), since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:\n\nstr(parse_json('1'))\n#> int 1\nstr(parse_json('[1, 2, 3]'))\n#> List of 3\n#> $ : int 1\n#> $ : int 2\n#> $ : int 3\nstr(parse_json('{\"x\": [1, 2, 3]}'))\n#> List of 1\n#> $ x:List of 3\n#> ..$ : int 1\n#> ..$ : int 2\n#> ..$ : int 3\n\njsonlite has another important function called fromJSON(). We don’t use it here because it performs automatic simplification (simplifyVector = TRUE). This often works well, particularly in simple cases, but we think you’re better off doing the rectangling yourself so you know exactly what’s happening and can more easily handle the most complicated nested structures.\n\n23.5.3 Starting the rectangling process\nIn most cases, JSON files contain a single top-level array, because they’re designed to provide data about multiple “things”, e.g., multiple pages, or multiple records, or multiple results. In this case, you’ll start your rectangling with tibble(json) so that each element becomes a row:\n\njson <- '[\n {\"name\": \"John\", \"age\": 34},\n {\"name\": \"Susan\", \"age\": 27}\n]'\ndf <- tibble(json = parse_json(json))\ndf\n#> # A tibble: 2 × 1\n#> json \n#> <list> \n#> 1 <named list [2]>\n#> 2 <named list [2]>\n\ndf |> \n unnest_wider(json)\n#> # A tibble: 2 × 2\n#> name age\n#> <chr> <int>\n#> 1 John 34\n#> 2 Susan 27\n\nIn rarer cases, the JSON file consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.\n\njson <- '{\n \"status\": \"OK\", \n \"results\": [\n {\"name\": \"John\", \"age\": 34},\n {\"name\": \"Susan\", \"age\": 27}\n ]\n}\n'\ndf <- tibble(json = list(parse_json(json)))\ndf\n#> # A tibble: 1 × 1\n#> json \n#> <list> \n#> 1 <named list [2]>\n\ndf |> \n unnest_wider(json) |> \n unnest_longer(results) |> \n unnest_wider(results)\n#> # A tibble: 2 × 3\n#> status name age\n#> <chr> <chr> <int>\n#> 1 OK John 34\n#> 2 OK Susan 27\n\nAlternatively, you can reach inside the parsed JSON and start with the bit that you actually care about:\n\ndf <- tibble(results = parse_json(json)$results)\ndf |> \n unnest_wider(results)\n#> # A tibble: 2 × 2\n#> name age\n#> <chr> <int>\n#> 1 John 34\n#> 2 Susan 27\n\n\n23.5.4 Exercises\n\n\nRectangle the df_col and df_row below. 
They represent the two ways of encoding a data frame in JSON.\n\njson_col <- parse_json('\n {\n \"x\": [\"a\", \"x\", \"z\"],\n \"y\": [10, null, 3]\n }\n')\njson_row <- parse_json('\n [\n {\"x\": \"a\", \"y\": 10},\n {\"x\": \"x\", \"y\": null},\n {\"x\": \"z\", \"y\": 3}\n ]\n')\n\ndf_col <- tibble(json = list(json_col)) \ndf_row <- tibble(json = json_row)" + }, + { + "objectID": "rectangling.html#summary", + "href": "rectangling.html#summary", + "title": "23  Hierarchical data", + "section": "\n23.6 Summary", + "text": "23.6 Summary\nIn this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: unnest_longer() to put list elements into rows and unnest_wider() to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.\nJSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages." + }, + { + "objectID": "rectangling.html#footnotes", + "href": "rectangling.html#footnotes", + "title": "23  Hierarchical data", + "section": "", + "text": "This is an RStudio feature.↩︎" + }, + { + "objectID": "webscraping.html#introduction", + "href": "webscraping.html#introduction", + "title": "24  Web scraping", + "section": "\n24.1 Introduction", + "text": "24.1 Introduction\nThis chapter introduces you to the basics of web scraping with rvest. Web scraping is a very useful tool for extracting data from web pages. Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from Capítulo 23. Where possible, you should use the API1, because typically it will give you more reliable data. Unfortunately, however, programming with web APIs is out of scope for this book. Instead, we are teaching scraping, a technique that works whether or not a site provides an API.\nIn this chapter, we’ll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. You’ll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R. We’ll then discuss some techniques to figure out what CSS selector you need for the page you’re scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.\n\n24.1.1 Prerequisites\nIn this chapter, we’ll focus on tools provided by rvest. rvest is a member of the tidyverse, but is not a core member so you’ll need to load it explicitly. We’ll also load the full tidyverse since we’ll find it generally useful working with the data we’ve scraped.\n\nlibrary(tidyverse)\nlibrary(rvest)" + }, + { + "objectID": "webscraping.html#scraping-ethics-and-legalities", + "href": "webscraping.html#scraping-ethics-and-legalities", + "title": "24  Web scraping", + "section": "\n24.2 Scraping ethics and legalities", + "text": "24.2 Scraping ethics and legalities\nBefore we get started discussing the code you’ll need to perform web scraping, we need to talk about whether it’s legal and ethical for you to do so. Overall, the situation is complicated with regard to both of these.\nLegalities depend a lot on where you live. 
However, as a general principle, if the data is public, non-personal, and factual, you’re likely to be ok2. These three factors are important because they’re connected to the site’s terms and conditions, personally identifiable information, and copyright, as we’ll discuss below.\nIf the data isn’t public, non-personal, or factual or you’re scraping the data specifically to make money with it, you’ll need to talk to a lawyer. In any case, you should be respectful of the resources of the server hosting the pages you are scraping. Most importantly, this means that if you’re scraping many pages, you should make sure to wait a little between each request. One easy way to do so is to use the polite package by Dmytro Perepolkin. It will automatically pause between requests and cache the results so you never ask for the same page twice.\n\n24.2.1 Terms of service\nIf you look closely, you’ll find many websites include a “terms and conditions” or “terms of service” link somewhere on the page, and if you read that page closely you’ll often discover that the site specifically prohibits web scraping. These pages tend to be a legal land grab where companies make very broad claims. It’s polite to respect these terms of service where possible, but take any claims with a grain of salt.\nUS courts have generally found that simply putting the terms of service in the footer of the website isn’t sufficient for you to be bound by them, e.g., HiQ Labs v. LinkedIn. Generally, to be bound to the terms of service, you must have taken some explicit action like creating an account or checking a box. This is why whether or not the data is public is important; if you don’t need an account to access them, it is unlikely that you are bound to the terms of service. Note, however, the situation is rather different in Europe where courts have found that terms of service are enforceable even if you don’t explicitly agree to them.\n\n24.2.2 Personally identifiable information\nEven if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc. Europe has particularly strict laws about the collection or storage of such data (GDPR), and regardless of where you live you’re likely to be entering an ethical quagmire. For example, in 2016, a group of researchers scraped public profile information (e.g., usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempt at anonymization. While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset. If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study3 as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.\n\n24.2.3 Copyright\nFinally, you also need to worry about copyright law. Copyright law is complicated, but it’s worth taking a look at the US law which describes exactly what’s protected: “[…] original works of authorship fixed in any tangible medium of expression, […]”. It then goes on to describe specific categories that it applies to, like literary works, musical works, motion pictures and more. Notably absent from copyright protection are data. 
This means that as long as you limit your scraping to facts, copyright protection does not apply. (But note that Europe has a separate “sui generis” right that protects databases.)\nAs a brief example, in the US, lists of ingredients and instructions are not copyrightable, so copyright can not be used to protect a recipe. But if that list of recipes is accompanied by substantial novel literary content, that is copyrightable. This is why when you’re looking for a recipe on the internet there’s always so much content beforehand.\nIf you do need to scrape original content (like text or images), you may still be protected under the doctrine of fair use. Fair use is not a hard and fast rule, but weighs up a number of factors. It’s more likely to apply if you are collecting the data for research or non-commercial purposes and if you limit what you scrape to just what you need." + }, + { + "objectID": "webscraping.html#html-basics", + "href": "webscraping.html#html-basics", + "title": "24  Web scraping", + "section": "\n24.3 HTML basics", + "text": "24.3 HTML basics\nTo scrape webpages, you need to first understand a little bit about HTML, the language that describes web pages. HTML stands for HyperText Markup Language and looks something like this:\n<html>\n<head>\n <title>Page title</title>\n</head>\n<body>\n <h1 id='first'>A heading</h1>\n <p>Some text &amp; <b>some bold text.</b></p>\n <img src='myimg.png' width='100' height='100'>\n</body>\nHTML has a hierarchical structure formed by elements which consist of a start tag (e.g., <tag>), optional attributes (id='first'), an end tag4 (like </tag>), and contents (everything in between the start and end tag).\nSince < and > are used for start and end tags, you can’t write them directly. Instead you have to use the HTML escapes &gt; (greater than) and &lt; (less than). And since those escapes use &, if you want a literal ampersand you have to escape it as &amp;. There are a wide range of possible HTML escapes but you don’t need to worry about them too much because rvest automatically handles them for you.\nWeb scraping is possible because most pages that contain data that you want to scrape generally have a consistent structure.\n\n24.3.1 Elements\nThere are over 100 HTML elements. Some of the most important are:\n\nEvery HTML page must be in an <html> element, and it must have two children: <head>, which contains document metadata like the page title, and <body>, which contains the content you see in the browser.\nBlock tags like <h1> (heading 1), <section> (section), <p> (paragraph), and <ol> (ordered list) form the overall structure of the page.\nInline tags like <b> (bold), <i> (italics), and <a> (link) format text inside block tags.\n\nIf you encounter a tag that you’ve never seen before, you can find out what it does with a little googling. Another good place to start is the MDN Web Docs which describe just about every aspect of web programming.\nMost elements can have content in between their start and end tags. This content can either be text or more elements. For example, the following HTML contains a paragraph of text, with one word in bold.\n<p>\n Hi! My <b>name</b> is Hadley.\n</p>\nThe children are the elements it contains, so the <p> element above has one child, the <b> element. The <b> element has no children, but it does have contents (the text “name”).\n\n24.3.2 Attributes\nTags can have named attributes which look like name1='value1' name2='value2'. 
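For example, here is a small hypothetical element carrying two attributes (the attribute values are made up for illustration):\n<p id='intro' class='lead'>Welcome!</p>\n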
Two of the most important attributes are id and class, which are used in conjunction with CSS (Cascading Style Sheets) to control the visual appearance of the page. These are often useful when scraping data off a page. Attributes are also used to record the destination of links (the href attribute of <a> elements) and the source of images (the src attribute of the <img> element)." + }, + { + "objectID": "webscraping.html#extracting-data", + "href": "webscraping.html#extracting-data", + "title": "24  Web scraping", + "section": "\n24.4 Extracting data", + "text": "24.4 Extracting data\nTo get started scraping, you’ll need the URL of the page you want to scrape, which you can usually copy from your web browser. You’ll then need to read the HTML for that page into R with read_html(). This returns an xml_document5 object which you’ll then manipulate using rvest functions:\n\nhtml <- read_html(\"http://rvest.tidyverse.org/\")\nhtml\n#> {html_document}\n#> <html lang=\"en\">\n#> [1] <head>\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UT ...\n#> [2] <body>\\n <a href=\"#container\" class=\"visually-hidden-focusable\">Ski ...\n\nrvest also includes a function that lets you write HTML inline. We’ll use this a bunch in this chapter as we teach how the various rvest functions work with simple examples.\n\nhtml <- minimal_html(\"\n <p>This is a paragraph</p>\n <ul>\n <li>This is a bulleted list</li>\n </ul>\n\")\nhtml\n#> {html_document}\n#> <html>\n#> [1] <head>\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UT ...\n#> [2] <body>\\n<p>This is a paragraph</p>\\n <ul>\\n<li>This is a bulleted lis ...\n\nNow that you have the HTML in R, it’s time to extract the data of interest. You’ll first learn about the CSS selectors that allow you to identify the elements of interest and the rvest functions that you can use to extract data from them. Then we’ll briefly cover HTML tables, which have some special tools.\n\n24.4.1 Find elements\nCSS is short for cascading style sheets, and is a tool for defining the visual styling of HTML documents. CSS includes a miniature language for selecting elements on a page called CSS selectors. CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.\nWe’ll come back to CSS selectors in more detail in Seção 24.5, but luckily you can get a long way with just three:\n\np selects all <p> elements.\n.title selects all elements with class “title”.\n#title selects the element with the id attribute that equals “title”. Id attributes must be unique within a document, so this will only ever select a single element.\n\nLet’s try out these selectors with a simple example:\n\nhtml <- minimal_html(\"\n <h1>This is a heading</h1>\n <p id='first'>This is a paragraph</p>\n <p class='important'>This is an important paragraph</p>\n\")\n\nUse html_elements() to find all elements that match the selector:\n\nhtml |> html_elements(\"p\")\n#> {xml_nodeset (2)}\n#> [1] <p id=\"first\">This is a paragraph</p>\n#> [2] <p class=\"important\">This is an important paragraph</p>\nhtml |> html_elements(\".important\")\n#> {xml_nodeset (1)}\n#> [1] <p class=\"important\">This is an important paragraph</p>\nhtml |> html_elements(\"#first\")\n#> {xml_nodeset (1)}\n#> [1] <p id=\"first\">This is a paragraph</p>\n\nAnother important function is html_element() which always returns the same number of outputs as inputs. 
If you apply it to a whole document it’ll give you the first match:\n\nhtml |> html_element(\"p\")\n#> {html_node}\n#> <p id=\"first\">\n\nThere’s an important difference between html_element() and html_elements() when you use a selector that doesn’t match any elements. html_elements() returns a vector of length 0, where html_element() returns a missing value. This will be important shortly.\n\nhtml |> html_elements(\"b\")\n#> {xml_nodeset (0)}\nhtml |> html_element(\"b\")\n#> {xml_missing}\n#> <NA>\n\n\n24.4.2 Nesting selections\nIn most cases, you’ll use html_elements() and html_element() together, typically using html_elements() to identify elements that will become observations then using html_element() to find elements that will become variables. Let’s see this in action using a simple example. Here we have an unordered list (<ul>) where each list item (<li>) contains some information about four characters from StarWars:\n\nhtml <- minimal_html(\"\n <ul>\n <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>\n <li><b>R4-P17</b> is a <i>droid</i></li>\n <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>\n <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>\n </ul>\n \")\n\nWe can use html_elements() to make a vector where each element corresponds to a different character:\n\ncharacters <- html |> html_elements(\"li\")\ncharacters\n#> {xml_nodeset (4)}\n#> [1] <li>\\n<b>C-3PO</b> is a <i>droid</i> that weighs <span class=\"weight\"> ...\n#> [2] <li>\\n<b>R4-P17</b> is a <i>droid</i>\\n</li>\n#> [3] <li>\\n<b>R2-D2</b> is a <i>droid</i> that weighs <span class=\"weight\"> ...\n#> [4] <li>\\n<b>Yoda</b> weighs <span class=\"weight\">66 kg</span>\\n</li>\n\nTo extract the name of each character, we use html_element(), because when applied to the output of html_elements() it’s guaranteed to return one response per element:\n\ncharacters |> html_element(\"b\")\n#> {xml_nodeset (4)}\n#> [1] <b>C-3PO</b>\n#> [2] <b>R4-P17</b>\n#> [3] <b>R2-D2</b>\n#> [4] <b>Yoda</b>\n\nThe distinction between html_element() and html_elements() isn’t important for name, but it is important for weight. We want to get one weight for each character, even if there’s no weight <span>. That’s what html_element() does:\n\ncharacters |> html_element(\".weight\")\n#> {xml_nodeset (4)}\n#> [1] <span class=\"weight\">167 kg</span>\n#> [2] <NA>\n#> [3] <span class=\"weight\">96 kg</span>\n#> [4] <span class=\"weight\">66 kg</span>\n\nhtml_elements() finds all weight <span>s that are children of characters. 
There are only three of these, so we lose the connection between names and weights:\n\ncharacters |> html_elements(\".weight\")\n#> {xml_nodeset (3)}\n#> [1] <span class=\"weight\">167 kg</span>\n#> [2] <span class=\"weight\">96 kg</span>\n#> [3] <span class=\"weight\">66 kg</span>\n\nNow that you’ve selected the elements of interest, you’ll need to extract the data, either from the text contents or some attributes.\n\n24.4.3 Text and attributes\nhtml_text2()6 extracts the plain text contents of an HTML element:\n\ncharacters |> \n html_element(\"b\") |> \n html_text2()\n#> [1] \"C-3PO\" \"R4-P17\" \"R2-D2\" \"Yoda\"\n\ncharacters |> \n html_element(\".weight\") |> \n html_text2()\n#> [1] \"167 kg\" NA \"96 kg\" \"66 kg\"\n\nNote that any escapes will be automatically handled; you’ll only ever see HTML escapes in the source HTML, not in the data returned by rvest.\nhtml_attr() extracts data from attributes:\n\nhtml <- minimal_html(\"\n <p><a href='https://en.wikipedia.org/wiki/Cat'>cats</a></p>\n <p><a href='https://en.wikipedia.org/wiki/Dog'>dogs</a></p>\n\")\n\nhtml |> \n html_elements(\"p\") |> \n html_element(\"a\") |> \n html_attr(\"href\")\n#> [1] \"https://en.wikipedia.org/wiki/Cat\" \"https://en.wikipedia.org/wiki/Dog\"\n\nhtml_attr() always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.\n\n24.4.4 Tables\nIf you’re lucky, your data will already be stored in an HTML table, and it’ll be a matter of just reading it from that table. It’s usually straightforward to recognize a table in your browser: it’ll have a rectangular structure of rows and columns, and you can copy and paste it into a tool like Excel.\nHTML tables are built up from four main elements: <table>, <tr> (table row), <th> (table heading), and <td> (table data). Here’s a simple HTML table with two columns and three rows:\n\nhtml <- minimal_html(\"\n <table class='mytable'>\n <tr><th>x</th> <th>y</th></tr>\n <tr><td>1.5</td> <td>2.7</td></tr>\n <tr><td>4.9</td> <td>1.3</td></tr>\n <tr><td>7.2</td> <td>8.1</td></tr>\n </table>\n \")\n\nrvest provides a function that knows how to read this sort of data: html_table(). It returns a list containing one tibble for each table found on the page. Use html_element() to identify the table you want to extract:\n\nhtml |> \n html_element(\".mytable\") |> \n html_table()\n#> # A tibble: 3 × 2\n#> x y\n#> <dbl> <dbl>\n#> 1 1.5 2.7\n#> 2 4.9 1.3\n#> 3 7.2 8.1\n\nNote that x and y have automatically been converted to numbers. This automatic conversion doesn’t always work, so in more complex scenarios you may want to turn it off with convert = FALSE and then do your own conversion." + }, + { + "objectID": "webscraping.html#sec-css-selectors", + "href": "webscraping.html#sec-css-selectors", + "title": "24  Web scraping", + "section": "\n24.5 Finding the right selectors", + "text": "24.5 Finding the right selectors\nFiguring out the selector you need for your data is typically the hardest part of the problem. You’ll often need to do some experimenting to find a selector that is both specific (i.e. it doesn’t select things you don’t care about) and sensitive (i.e. it does select everything you care about). Lots of trial and error is a normal part of the process! There are two main tools that are available to help you with this process: SelectorGadget and your browser’s developer tools.\nSelectorGadget is a javascript bookmarklet that automatically generates CSS selectors based on the positive and negative examples that you provide. 
It doesn’t always work, but when it does, it’s magic! You can learn how to install and use SelectorGadget either by reading https://rvest.tidyverse.org/articles/selectorgadget.html or watching Mine’s video at https://www.youtube.com/watch?v=PetWV5g1Xsc.\nEvery modern browser comes with some toolkit for developers, but we recommend Chrome, even if it isn’t your regular browser: its web developer tools are some of the best and they’re immediately available. Right click on an element on the page and click Inspect. This will open an expandable view of the complete HTML page, centered on the element that you just clicked. You can use this to explore the page and get a sense of what selectors might work. Pay particular attention to the class and id attributes, since these are often used to form the visual structure of the page, and hence make for good tools to extract the data that you’re looking for.\nInside the Elements view, you can also right click on an element and choose Copy as Selector to generate a selector that will uniquely identify the element of interest.\nIf either SelectorGadget or Chrome DevTools have generated a CSS selector that you don’t understand, try Selectors Explained which translates CSS selectors into plain English. If you find yourself doing this a lot, you might want to learn more about CSS selectors generally. We recommend starting with the fun CSS dinner tutorial and then referring to the MDN web docs." + }, + { + "objectID": "webscraping.html#putting-it-all-together", + "href": "webscraping.html#putting-it-all-together", + "title": "24  Web scraping", + "section": "\n24.6 Putting it all together", + "text": "24.6 Putting it all together\nLet’s put this all together to scrape some websites. There’s some risk that these examples may no longer work when you run them — that’s the fundamental challenge of web scraping; if the structure of the site changes, then you’ll have to change your scraping code.\n\n24.6.1 StarWars\nrvest includes a very simple example in vignette(\"starwars\"). This is a simple page with minimal HTML so it’s a good place to start. I’d encourage you to navigate to that page now and use “Inspect Element” to inspect one of the headings that’s the title of a Star Wars movie. Use the keyboard or mouse to explore the hierarchy of the HTML and see if you can get a sense of the shared structure used by each movie.\nYou should be able to see that each movie has a shared structure that looks like this:\n<section>\n <h2 data-id=\"1\">The Phantom Menace</h2>\n <p>Released: 1999-05-19</p>\n <p>Director: <span class=\"director\">George Lucas</span></p>\n \n <div class=\"crawl\">\n <p>...</p>\n <p>...</p>\n <p>...</p>\n </div>\n</section>\nOur goal is to turn this data into a 7 row data frame with variables title, year, director, and intro. 
We’ll start by reading the HTML and extracting all the <section> elements:\n\nurl <- \"https://rvest.tidyverse.org/articles/starwars.html\"\nhtml <- read_html(url)\n\nsection <- html |> html_elements(\"section\")\nsection\n#> {xml_nodeset (7)}\n#> [1] <section><h2 data-id=\"1\">\\nThe Phantom Menace\\n</h2>\\n<p>\\nReleased: 1 ...\n#> [2] <section><h2 data-id=\"2\">\\nAttack of the Clones\\n</h2>\\n<p>\\nReleased: ...\n#> [3] <section><h2 data-id=\"3\">\\nRevenge of the Sith\\n</h2>\\n<p>\\nReleased: ...\n#> [4] <section><h2 data-id=\"4\">\\nA New Hope\\n</h2>\\n<p>\\nReleased: 1977-05-2 ...\n#> [5] <section><h2 data-id=\"5\">\\nThe Empire Strikes Back\\n</h2>\\n<p>\\nReleas ...\n#> [6] <section><h2 data-id=\"6\">\\nReturn of the Jedi\\n</h2>\\n<p>\\nReleased: 1 ...\n#> [7] <section><h2 data-id=\"7\">\\nThe Force Awakens\\n</h2>\\n<p>\\nReleased: 20 ...\n\nThis retrieves seven elements matching the seven movies found on that page, suggesting that using section as a selector is good. Extracting the individual elements is straightforward since the data is always found in the text. It’s just a matter of finding the right selector:\n\nsection |> html_element(\"h2\") |> html_text2()\n#> [1] \"The Phantom Menace\" \"Attack of the Clones\" \n#> [3] \"Revenge of the Sith\" \"A New Hope\" \n#> [5] \"The Empire Strikes Back\" \"Return of the Jedi\" \n#> [7] \"The Force Awakens\"\n\nsection |> html_element(\".director\") |> html_text2()\n#> [1] \"George Lucas\" \"George Lucas\" \"George Lucas\" \n#> [4] \"George Lucas\" \"Irvin Kershner\" \"Richard Marquand\"\n#> [7] \"J. J. Abrams\"\n\nOnce we’ve done that for each component, we can wrap all the results up into a tibble:\n\ntibble(\n title = section |> \n html_element(\"h2\") |> \n html_text2(),\n released = section |> \n html_element(\"p\") |> \n html_text2() |> \n str_remove(\"Released: \") |> \n parse_date(),\n director = section |> \n html_element(\".director\") |> \n html_text2(),\n intro = section |> \n html_element(\".crawl\") |> \n html_text2()\n)\n#> # A tibble: 7 × 4\n#> title released director intro \n#> <chr> <date> <chr> <chr> \n#> 1 The Phantom Menace 1999-05-19 George Lucas \"Turmoil has engulfed …\n#> 2 Attack of the Clones 2002-05-16 George Lucas \"There is unrest in th…\n#> 3 Revenge of the Sith 2005-05-19 George Lucas \"War! The Republic is …\n#> 4 A New Hope 1977-05-25 George Lucas \"It is a period of civ…\n#> 5 The Empire Strikes Back 1980-05-17 Irvin Kershner \"It is a dark time for…\n#> 6 Return of the Jedi 1983-05-25 Richard Marquand \"Luke Skywalker has re…\n#> # ℹ 1 more row\n\nWe did a little more processing of released to get a variable that will be easy to use later in our analysis.\n\n24.6.2 IMDB top films\nFor our next task we’ll tackle something a little trickier, extracting the top 250 movies from the internet movie database (IMDb). 
At the time we wrote this chapter, the page looked like Figura 24.1.\n\n\n\n\nFigura 24.1: Screenshot of the IMDb top movies web page taken on 2022-12-05.\n\n\n\nThis data has a clear tabular structure so it’s worth starting with html_table():\n\nurl <- \"https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/\"\nhtml <- read_html(url)\n\ntable <- html |> \n html_element(\"table\") |> \n html_table()\ntable\n#> # A tibble: 250 × 5\n#> `` `Rank & Title` `IMDb Rating` `Your Rating` `` \n#> <lgl> <chr> <dbl> <chr> <lgl>\n#> 1 NA \"1.\\n The Shawshank Redempt… 9.2 \"12345678910\\n… NA \n#> 2 NA \"2.\\n The Godfather\\n … 9.1 \"12345678910\\n… NA \n#> 3 NA \"3.\\n The Godfather: Part I… 9 \"12345678910\\n… NA \n#> 4 NA \"4.\\n The Dark Knight\\n … 9 \"12345678910\\n… NA \n#> 5 NA \"5.\\n 12 Angry Men\\n … 8.9 \"12345678910\\n… NA \n#> 6 NA \"6.\\n Schindler's List\\n … 8.9 \"12345678910\\n… NA \n#> # ℹ 244 more rows\n\nThis includes a few empty columns, but overall does a good job of capturing the information from the table. However, we need to do some more processing to make it easier to use. First, we’ll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title. We will do this with select() (instead of rename()) to do the renaming and selecting of just these two columns in one step. Then we’ll remove the new lines and extra spaces, and then apply separate_wider_regex() (from Seção 15.3.4) to pull out the title, year, and rank into their own variables.\n\nratings <- table |>\n select(\n rank_title_year = `Rank & Title`,\n rating = `IMDb Rating`\n ) |> \n mutate(\n rank_title_year = str_replace_all(rank_title_year, \"\\n +\", \" \")\n ) |> \n separate_wider_regex(\n rank_title_year,\n patterns = c(\n rank = \"\\\\d+\", \"\\\\. \",\n title = \".+\", \" +\\\\(\",\n year = \"\\\\d+\", \"\\\\)\"\n )\n )\nratings\n#> # A tibble: 250 × 4\n#> rank title year rating\n#> <chr> <chr> <chr> <dbl>\n#> 1 1 The Shawshank Redemption 1994 9.2\n#> 2 2 The Godfather 1972 9.1\n#> 3 3 The Godfather: Part II 1974 9 \n#> 4 4 The Dark Knight 2008 9 \n#> 5 5 12 Angry Men 1957 8.9\n#> 6 6 Schindler's List 1993 8.9\n#> # ℹ 244 more rows\n\nEven in this case where most of the data comes from table cells, it’s still worth looking at the raw HTML. If you do so, you’ll discover that we can add a little extra data by using one of the attributes. 
This is one of the reasons it’s worth spending a little time spelunking the source of the page; you might find extra data, or might find a parsing route that’s slightly easier.\n\nhtml |> \n html_elements(\"td strong\") |> \n head() |> \n html_attr(\"title\")\n#> [1] \"9.2 based on 2,536,415 user ratings\"\n#> [2] \"9.1 based on 1,745,675 user ratings\"\n#> [3] \"9.0 based on 1,211,032 user ratings\"\n#> [4] \"9.0 based on 2,486,931 user ratings\"\n#> [5] \"8.9 based on 749,563 user ratings\" \n#> [6] \"8.9 based on 1,295,705 user ratings\"\n\nWe can combine this with the tabular data and again apply separate_wider_regex() to extract out the bit of data we care about:\n\nratings |>\n mutate(\n rating_n = html |> html_elements(\"td strong\") |> html_attr(\"title\")\n ) |> \n separate_wider_regex(\n rating_n,\n patterns = c(\n \"[0-9.]+ based on \",\n number = \"[0-9,]+\",\n \" user ratings\"\n )\n ) |> \n mutate(\n number = parse_number(number)\n )\n#> # A tibble: 250 × 5\n#> rank title year rating number\n#> <chr> <chr> <chr> <dbl> <dbl>\n#> 1 1 The Shawshank Redemption 1994 9.2 2536415\n#> 2 2 The Godfather 1972 9.1 1745675\n#> 3 3 The Godfather: Part II 1974 9 1211032\n#> 4 4 The Dark Knight 2008 9 2486931\n#> 5 5 12 Angry Men 1957 8.9 749563\n#> 6 6 Schindler's List 1993 8.9 1295705\n#> # ℹ 244 more rows" + }, + { + "objectID": "webscraping.html#dynamic-sites", + "href": "webscraping.html#dynamic-sites", + "title": "24  Web scraping", + "section": "\n24.7 Dynamic sites", + "text": "24.7 Dynamic sites\nSo far we have focused on websites where html_elements() returns what you see in the browser and discussed how to parse what it returns and how to organize that information in tidy data frames. From time-to-time, however, you’ll hit a site where html_elements() and friends don’t return anything like what you see in the browser. In many cases, that’s because you’re trying to scrape a website that dynamically generates the content of the page with javascript. This doesn’t currently work with rvest, because rvest downloads the raw HTML and doesn’t run any javascript.\nIt’s still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all javascript. This functionality is not available at the time of writing, but it’s something we’re actively working on and might be available by the time you read this. It uses the chromote package which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons. Check out the rvest website for more details." + }, + { + "objectID": "webscraping.html#summary", + "href": "webscraping.html#summary", + "title": "24  Web scraping", + "section": "\n24.8 Summary", + "text": "24.8 Summary\nIn this chapter, you’ve learned about the why, the why not, and the how of scraping data from web pages. First, you’ve learned about the basics of HTML and using CSS selectors to refer to specific elements, then you’ve learned about using the rvest package to get data out of HTML into R. We then demonstrated web scraping with two case studies: a simpler scenario on scraping data on StarWars films from the rvest package website and a more complex scenario on scraping the top 250 films from IMDb.\nTechnical details of scraping data off the web can be complex, particularly when dealing with sites; however, legal and ethical considerations can be even more complex. 
It’s important for you to educate yourself about both of these before setting out to scrape data.\nThis brings us to the end of the import part of the book where you’ve learned techniques to get data from where it lives (spreadsheets, databases, JSON files, and web sites) into a tidy form in R. Now it’s time to turn our sights to a new topic: making the most of R as a programming language." + }, + { + "objectID": "webscraping.html#footnotes", + "href": "webscraping.html#footnotes", + "title": "24  Web scraping", + "section": "", + "text": "And many popular APIs already have CRAN packages that wrap them, so start with a little research first!↩︎\nObviously we’re not lawyers, and this is not legal advice. But this is the best summary we can give having read a bunch about this topic.↩︎\nOne example of an article on the OkCupid study was published by Wired, https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science.↩︎\nA number of tags (including <p> and <li>) don’t require end tags, but we think it’s best to include them because it makes seeing the structure of the HTML a little easier.↩︎\nThis class comes from the xml2 package. xml2 is a low-level package that rvest builds on top of.↩︎\nrvest also provides html_text() but you should almost always use html_text2() since it does a better job of converting nested HTML to text.↩︎" }, { "objectID": "program.html", "href": "program.html", "title": "Program", "section": "", "text": "In this part of the book, you’ll improve your programming skills. Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.\n\n\n\n\nFigura 1: Programming is the water in which all the other components swim.\n\n\n\nProgramming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.\nIn the following three chapters, you’ll learn skills to improve your programming skills:\n\nCopy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. Instead, in Capítulo 25, you’ll learn how to write functions which let you extract out repeated tidyverse code so that it can be easily reused.\nFunctions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for iteration that let you do similar things again and again. These tools include for loops and functional programming, which you’ll learn about in Capítulo 26.\nAs you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In Capítulo 27, you’ll learn some of the most important base R functions that you’ll see in the wild.\n\nThe goal of these chapters is to teach you the minimum about programming that you need for data science. 
Once you have mastered the material here, we strongly recommend that you continue to invest in your programming skills. We’ve written two books that you might find helpful. Hands on Programming with R, by Garrett Grolemund, is an introduction to R as a programming language and is a great place to start if R is your first programming language. Advanced R by Hadley Wickham dives into the details of R the programming language; it’s a great place to start if you have existing programming experience and a great next step once you’ve internalized the ideas in these chapters." + }, + { + "objectID": "functions.html#introduction", + "href": "functions.html#introduction", + "title": "25  Functions", + "section": "\n25.1 Introduction", + "text": "25.1 Introduction\nOne of the best ways to improve your reach as a data scientist is to write functions. 
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has four big advantages over using copy-and-paste:\n\nYou can give a function an evocative name that makes your code easier to understand.\nAs requirements change, you only need to update code in one place, instead of many.\nYou eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).\nIt makes it easier to reuse work from project-to-project, increasing your productivity over time.\n\nA good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). In this chapter, you’ll learn about three useful types of functions:\n\nVector functions take one or more vectors as input and return a vector as output.\nData frame functions take a data frame as input and return a data frame as output.\nPlot functions take a data frame as input and return a plot as output.\n\nEach of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for general functions and plotting functions to see even more functions.\n\n25.1.1 Prerequisites\nWe’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.\n\nlibrary(tidyverse)\nlibrary(nycflights13)" + }, + { + "objectID": "functions.html#vector-functions", + "href": "functions.html#vector-functions", + "title": "25  Functions", + "section": "\n25.2 Vector functions", + "text": "25.2 Vector functions\nWe’ll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?\n\ndf <- tibble(\n a = rnorm(5),\n b = rnorm(5),\n c = rnorm(5),\n d = rnorm(5),\n)\n\ndf |> mutate(\n a = (a - min(a, na.rm = TRUE)) / \n (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),\n b = (b - min(b, na.rm = TRUE)) / \n (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),\n c = (c - min(c, na.rm = TRUE)) / \n (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),\n d = (d - min(d, na.rm = TRUE)) / \n (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),\n)\n#> # A tibble: 5 × 4\n#> a b c d\n#> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.339 2.59 0.291 0 \n#> 2 0.880 0 0.611 0.557\n#> 3 0 1.37 1 0.752\n#> 4 0.795 1.37 0 1 \n#> 5 1 1.34 0.580 0.394\n\nYou might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an a to a b. Preventing this type of mistake is one very good reason to learn how to write functions.\n\n25.2.1 Writing a function\nTo write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary. 
If we take the code above and pull it outside of mutate(), it’s a little easier to see the pattern because each repetition is now one line:\n\n(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))\n(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))\n(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))\n(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)) \n\nTo make this a bit clearer we can replace the bit that varies with █:\n\n(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))\n\nTo turn this into a function you need three things:\n\nA name. Here we’ll use rescale01 because this function rescales a vector to lie between 0 and 1.\nThe arguments. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it x because this is the conventional name for a numeric vector.\nThe body. The body is the code that’s repeated across all the calls.\n\nThen you create a function by following the template:\n\nname <- function(arguments) {\n body\n}\n\nFor this case that leads to:\n\nrescale01 <- function(x) {\n (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))\n}\n\nAt this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:\n\nrescale01(c(-10, 0, 10))\n#> [1] 0.0 0.5 1.0\nrescale01(c(1, 2, 3, NA, 5))\n#> [1] 0.00 0.25 0.50 NA 1.00\n\nThen you can rewrite the call to mutate() as:\n\ndf |> mutate(\n a = rescale01(a),\n b = rescale01(b),\n c = rescale01(c),\n d = rescale01(d),\n)\n#> # A tibble: 5 × 4\n#> a b c d\n#> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.339 1 0.291 0 \n#> 2 0.880 0 0.611 0.557\n#> 3 0 0.530 1 0.752\n#> 4 0.795 0.531 0 1 \n#> 5 1 0.518 0.580 0.394\n\n(In Capítulo 26, you’ll learn how to use across() to reduce the duplication even further so all you need is df |> mutate(across(a:d, rescale01))).\n\n25.2.2 Improving our function\nYou might notice that the rescale01() function does some unnecessary work — instead of computing min() twice and max() once we could instead compute both the minimum and maximum in one step with range():\n\nrescale01 <- function(x) {\n rng <- range(x, na.rm = TRUE)\n (x - rng[1]) / (rng[2] - rng[1])\n}\n\nOr you might try this function on a vector that includes an infinite value:\n\nx <- c(1:10, Inf)\nrescale01(x)\n#> [1] 0 0 0 0 0 0 0 0 0 0 NaN\n\nThat result is not particularly useful so we could ask range() to ignore infinite values:\n\nrescale01 <- function(x) {\n rng <- range(x, na.rm = TRUE, finite = TRUE)\n (x - rng[1]) / (rng[2] - rng[1])\n}\n\nrescale01(x)\n#> [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667\n#> [8] 0.7777778 0.8888889 1.0000000 Inf\n\nThese changes illustrate an important benefit of functions: because we’ve moved the repeated code into a function, we only need to make the change in one place.\n\n25.2.3 Mutate functions\nNow you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, i.e. functions that work well inside of mutate() and filter() because they return an output of the same length as the input.\nLet’s start with a simple variation of rescale01(). 
Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:\n\nz_score <- function(x) {\n (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)\n}\n\nOr maybe you want to wrap up a straightforward case_when() and give it a useful name. For example, this clamp() function ensures all values of a vector lie between a minimum and a maximum:\n\nclamp <- function(x, min, max) {\n case_when(\n x < min ~ min,\n x > max ~ max,\n .default = x\n )\n}\n\nclamp(1:10, min = 3, max = 7)\n#> [1] 3 3 3 4 5 6 7 7 7 7\n\nOf course, functions don’t just need to work with numeric variables. You might want to do some repeated string manipulation. Maybe you need to make the first character upper case:\n\nfirst_upper <- function(x) {\n str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))\n x\n}\n\nfirst_upper(\"hello\")\n#> [1] \"Hello\"\n\nOr maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:\n\n# https://twitter.com/NVlabormarket/status/1571939851922198530\nclean_number <- function(x) {\n is_pct <- str_detect(x, \"%\")\n num <- x |> \n str_remove_all(\"%\") |> \n str_remove_all(\",\") |> \n str_remove_all(fixed(\"$\")) |> \n as.numeric()\n if_else(is_pct, num / 100, num)\n}\n\nclean_number(\"$12,300\")\n#> [1] 12300\nclean_number(\"45%\")\n#> [1] 0.45\n\nSometimes your functions will be highly specialized for one data analysis step. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with NA:\n\nfix_na <- function(x) {\n if_else(x %in% c(997, 998, 999), NA, x)\n}\n\nWe’ve focused on examples that take a single vector because we think they’re the most common. But there’s no reason that your function can’t take multiple vector inputs.\n\n25.2.4 Summary functions\nAnother important family of vector functions is summary functions, functions that return a single value for use in summarize(). Sometimes this can just be a matter of setting a default argument or two:\n\ncommas <- function(x) {\n str_flatten(x, collapse = \", \", last = \" and \")\n}\n\ncommas(c(\"cat\", \"dog\", \"pigeon\"))\n#> [1] \"cat, dog and pigeon\"\n\nOr you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:\n\ncv <- function(x, na.rm = FALSE) {\n sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)\n}\n\ncv(runif(100, min = 0, max = 50))\n#> [1] 0.5196276\ncv(runif(100, min = 0, max = 500))\n#> [1] 0.5652554\n\nOr maybe you just want to make a common pattern easier to remember by giving it a memorable name:\n\n# https://twitter.com/gbganalyst/status/1571619641390252033\nn_missing <- function(x) {\n sum(is.na(x))\n} \n\nYou can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:\n\n# https://twitter.com/neilgcurrie/status/1571607727255834625\nmape <- function(actual, predicted) {\n sum(abs((actual - predicted) / actual)) / length(actual)\n}\n\n\n\n\n\n\n\nRStudio\n\n\n\nOnce you start writing functions, there are two RStudio shortcuts that are super useful:\n\nTo find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.\nTo quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. 
You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.\n\n\n\n\n25.2.5 Exercises\n\n\nPractice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?\n\nmean(is.na(x))\nmean(is.na(y))\nmean(is.na(z))\n\nx / sum(x, na.rm = TRUE)\ny / sum(y, na.rm = TRUE)\nz / sum(z, na.rm = TRUE)\n\nround(x / sum(x, na.rm = TRUE) * 100, 1)\nround(y / sum(y, na.rm = TRUE) * 100, 1)\nround(z / sum(z, na.rm = TRUE) * 100, 1)\n\n\nIn the second variant of rescale01(), infinite values are left unchanged. Can you rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1?\nGiven a vector of birthdates, write a function to compute the age in years.\nWrite your own functions to compute the variance and skewness of a numeric vector. You can look up the definitions on Wikipedia or elsewhere.\nWrite both_na(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.\n\nRead the documentation to figure out what the following functions do. Why are they useful even though they are so short?\n\nis_directory <- function(x) {\n file.info(x)$isdir\n}\nis_readable <- function(x) {\n file.access(x, 4) == 0\n}" + }, + { + "objectID": "functions.html#data-frame-functions", + "href": "functions.html#data-frame-functions", + "title": "25  Functions", + "section": "\n25.3 Data frame functions", + "text": "25.3 Data frame functions\nVector functions are useful for pulling out code that’s repeated within a dplyr verb. But you’ll often also repeat the verbs themselves, particularly within a large pipeline. When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function. Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.\nTo let you write a function that uses dplyr verbs, we’ll first introduce you to the challenge of indirection and how you can overcome it with embracing, {{ }}. With this theory under your belt, we’ll then show you a bunch of examples to illustrate what you might do with it.\n\n25.3.1 Indirection and tidy evaluation\nWhen you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: grouped_mean(). The goal of this function is to compute the mean of mean_var grouped by group_var:\n\ngrouped_mean <- function(df, group_var, mean_var) {\n df |> \n group_by(group_var) |> \n summarize(mean(mean_var))\n}\n\nIf we try and use it, we get an error:\n\ndiamonds |> grouped_mean(cut, carat)\n#> Error in `group_by()`:\n#> ! Must group by variables found in `.data`.\n#> ✖ Column `group_var` is not found.\n\nTo make the problem a bit more clear, we can use a made up data frame:\n\ndf <- tibble(\n mean_var = 1,\n group_var = \"g\",\n group = 1,\n x = 10,\n y = 100\n)\n\ndf |> grouped_mean(group, x)\n#> # A tibble: 1 × 2\n#> group_var `mean(mean_var)`\n#> <chr> <dbl>\n#> 1 g 1\ndf |> grouped_mean(group, y)\n#> # A tibble: 1 × 2\n#> group_var `mean(mean_var)`\n#> <chr> <dbl>\n#> 1 g 1\n\nRegardless of how we call grouped_mean() it always does df |> group_by(group_var) |> summarize(mean(mean_var)), instead of df |> group_by(group) |> summarize(mean(x)) or df |> group_by(group) |> summarize(mean(y)). 
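To make the expansion concrete, here’s a minimal sketch (ours, not from the original text) that runs by hand exactly what grouped_mean() runs, using the made up df from above:\n\ndf |> \n group_by(group_var) |> # groups by the literal column named group_var\n summarize(mean(mean_var)) # takes the mean of the literal column mean_var\n#> # A tibble: 1 × 2\n#> group_var `mean(mean_var)`\n#> <chr> <dbl>\n#> 1 g 1\n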
This is a problem of indirection, and it arises because dplyr uses tidy evaluation to allow you to refer to the names of variables inside your data frame without any special treatment.\nTidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell group_by() and summarize() not to treat group_var and mean_var as the names of variables, but instead to look inside them for the variables we actually want to use.\nTidy evaluation includes a solution to this problem called embracing 🤗. Embracing a variable means wrapping it in braces so (e.g.) var becomes {{ var }}. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of {{ }} as looking down a tunnel — {{ var }} will make a dplyr function look inside of var rather than looking for a variable called var.\nSo to make grouped_mean() work, we need to surround group_var and mean_var with {{ }}:\n\ngrouped_mean <- function(df, group_var, mean_var) {\n df |> \n group_by({{ group_var }}) |> \n summarize(mean({{ mean_var }}))\n}\n\ndf |> grouped_mean(group, x)\n#> # A tibble: 1 × 2\n#> group `mean(x)`\n#> <dbl> <dbl>\n#> 1 1 10\n\nSuccess!\n\n25.3.2 When to embrace?\nSo the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately, this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:\n\nData-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.\nTidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.\n\nYour intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g., x + 1) or select (e.g., a:x).\nIn the following sections, we’ll explore the sorts of handy functions you might write once you understand embracing.\n\n25.3.3 Common use cases\nIf you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:\n\nsummary6 <- function(data, var) {\n data |> summarize(\n min = min({{ var }}, na.rm = TRUE),\n mean = mean({{ var }}, na.rm = TRUE),\n median = median({{ var }}, na.rm = TRUE),\n max = max({{ var }}, na.rm = TRUE),\n n = n(),\n n_miss = sum(is.na({{ var }})),\n .groups = \"drop\"\n )\n}\n\ndiamonds |> summary6(carat)\n#> # A tibble: 1 × 6\n#> min mean median max n n_miss\n#> <dbl> <dbl> <dbl> <dbl> <int> <int>\n#> 1 0.2 0.798 0.7 5.01 53940 0\n\n(Whenever you wrap summarize() in a helper, we think it’s good practice to set .groups = \"drop\" to both avoid the message and leave the data in an ungrouped state.)\nThe nice thing about this function is that, because it wraps summarize(), you can use it on grouped data:\n\ndiamonds |> \n group_by(cut) |> \n summary6(carat)\n#> # A tibble: 5 × 7\n#> cut min mean median max n n_miss\n#> <ord> <dbl> <dbl> <dbl> <dbl> <int> <int>\n#> 1 Fair 0.22 1.05 1 5.01 1610 0\n#> 2 Good 0.23 0.849 0.82 3.01 4906 0\n#> 3 Very Good 0.2 0.806 0.71 4 12082 0\n#> 4 Premium 0.2 0.892 0.86 4.01 13791 0\n#> 5 Ideal 0.2 0.703 0.54 
3.5 21551 0\n\nFurthermore, since the arguments to summarize() are data-masking, the var argument to summary6() is data-masking too. That means you can also summarize computed variables:\n\ndiamonds |> \n group_by(cut) |> \n summary6(log10(carat))\n#> # A tibble: 5 × 7\n#> cut min mean median max n n_miss\n#> <ord> <dbl> <dbl> <dbl> <dbl> <int> <int>\n#> 1 Fair -0.658 -0.0273 0 0.700 1610 0\n#> 2 Good -0.638 -0.133 -0.0862 0.479 4906 0\n#> 3 Very Good -0.699 -0.164 -0.149 0.602 12082 0\n#> 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0\n#> 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0\n\nTo summarize multiple variables, you’ll need to wait until Section 26.2, where you’ll learn how to use across().\nAnother popular summarize() helper function is a version of count() that also computes proportions:\n\n# https://twitter.com/Diabb6/status/1571635146658402309\ncount_prop <- function(df, var, sort = FALSE) {\n df |>\n count({{ var }}, sort = sort) |>\n mutate(prop = n / sum(n))\n}\n\ndiamonds |> count_prop(clarity)\n#> # A tibble: 8 × 3\n#> clarity n prop\n#> <ord> <int> <dbl>\n#> 1 I1 741 0.0137\n#> 2 SI2 9194 0.170 \n#> 3 SI1 13065 0.242 \n#> 4 VS2 12258 0.227 \n#> 5 VS1 8171 0.151 \n#> 6 VVS2 5066 0.0939\n#> # ℹ 2 more rows\n\nThis function has three arguments: df, var, and sort, and only var needs to be embraced because it’s passed to count(), which uses data-masking for all variables. Note that we use a default value for sort so that if the user doesn’t supply their own value it will default to FALSE.\nOr maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we’ll allow the user to supply a condition:\n\nunique_where <- function(df, condition, var) {\n df |> \n filter({{ condition }}) |> \n distinct({{ var }}) |> \n arrange({{ var }})\n}\n\n# Find all the destinations in December\nflights |> unique_where(month == 12, dest)\n#> # A tibble: 96 × 1\n#> dest \n#> <chr>\n#> 1 ABQ \n#> 2 ALB \n#> 3 ATL \n#> 4 AUS \n#> 5 AVL \n#> 6 BDL \n#> # ℹ 90 more rows\n\nHere we embrace condition because it’s passed to filter() and var because it’s passed to distinct() and arrange().\nWe’ve written all these examples to take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects time_hour, carrier, and flight since they form the compound primary key that allows you to identify a row.\n\nsubset_flights <- function(rows, cols) {\n flights |> \n filter({{ rows }}) |> \n select(time_hour, carrier, flight, {{ cols }})\n}\n\n\n25.3.4 Data-masking vs. tidy-selection\nSometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a count_missing() that counts the number of missing observations in rows. You might try writing something like:\n\ncount_missing <- function(df, group_vars, x_var) {\n df |> \n group_by({{ group_vars }}) |> \n summarize(\n n_miss = sum(is.na({{ x_var }})),\n .groups = \"drop\"\n )\n}\n\nflights |> \n count_missing(c(year, month, day), dep_time)\n#> Error in `group_by()`:\n#> ℹ In argument: `c(year, month, day)`.\n#> Caused by error:\n#> ! `c(year, month, day)` must be size 336776 or 1, not 1010328.\n\nThis doesn’t work because group_by() uses data-masking, not tidy-selection. 
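You can check the distinction interactively; a quick sketch of ours, not from the original: select() uses tidy-selection, so it reads c(year, month, day) as “these three columns”, while data-masking evaluates c(year, month, day) as an ordinary expression, concatenating the three columns into a single vector of length 3 * 336776 = 1010328, which is exactly the size reported in the error above.\n\n# tidy-selection: c(year, month, day) names a set of columns\nflights |> select(c(year, month, day))\n\n# data-masking: c(year, month, day) builds one long vector, which is\n# why group_by() complains about size 1010328\n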
We can work around that problem by using the handy pick() function, which allows you to use tidy-selection inside data-masking functions:\n\ncount_missing <- function(df, group_vars, x_var) {\n df |> \n group_by(pick({{ group_vars }})) |> \n summarize(\n n_miss = sum(is.na({{ x_var }})),\n .groups = \"drop\"\n )\n}\n\nflights |> \n count_missing(c(year, month, day), dep_time)\n#> # A tibble: 365 × 4\n#> year month day n_miss\n#> <int> <int> <int> <int>\n#> 1 2013 1 1 4\n#> 2 2013 1 2 8\n#> 3 2013 1 3 10\n#> 4 2013 1 4 6\n#> 5 2013 1 5 3\n#> 6 2013 1 6 1\n#> # ℹ 359 more rows\n\nAnother convenient use of pick() is to make a 2d table of counts. Here we count using all the variables in the rows and columns, then use pivot_wider() to rearrange the counts into a grid:\n\n# https://twitter.com/pollicipes/status/1571606508944719876\ncount_wide <- function(data, rows, cols) {\n data |> \n count(pick(c({{ rows }}, {{ cols }}))) |> \n pivot_wider(\n names_from = {{ cols }}, \n values_from = n,\n names_sort = TRUE,\n values_fill = 0\n )\n}\n\ndiamonds |> count_wide(c(clarity, color), cut)\n#> # A tibble: 56 × 7\n#> clarity color Fair Good `Very Good` Premium Ideal\n#> <ord> <ord> <int> <int> <int> <int> <int>\n#> 1 I1 D 4 8 5 12 13\n#> 2 I1 E 9 23 22 30 18\n#> 3 I1 F 35 19 13 34 42\n#> 4 I1 G 53 19 16 46 16\n#> 5 I1 H 52 14 12 46 38\n#> 6 I1 I 34 9 8 24 17\n#> # ℹ 50 more rows\n\nWhile our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the pivot_wider() docs you can see that names_from uses tidy-selection.\n\n25.3.5 Exercises\n\n\nUsing the datasets from nycflights13, write a function that:\n\n\nFinds all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour.\n\nflights |> filter_severe()\n\n\n\nCounts the number of cancelled flights and the number of flights delayed by more than an hour.\n\nflights |> group_by(dest) |> summarize_severe()\n\n\n\nFinds all flights that were cancelled or delayed by more than a user-supplied number of hours:\n\nflights |> filter_severe(hours = 2)\n\n\n\nSummarizes the weather to compute the minimum, mean, and maximum of a user-supplied variable:\n\nweather |> summarize_weather(temp)\n\n\n\nConverts the user-supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e. hours + (minutes / 60)).\n\nflights |> standardize_time(sched_dep_time)\n\n\n\n\nFor each of the following functions, list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: distinct(), count(), group_by(), rename_with(), slice_min(), slice_sample().\n\nGeneralize the following function so that you can supply any number of variables to count.\n\ncount_prop <- function(df, var, sort = FALSE) {\n df |>\n count({{ var }}, sort = sort) |>\n mutate(prop = n / sum(n))\n}" + }, + { + "objectID": "functions.html#plot-functions", + "href": "functions.html#plot-functions", + "title": "25  Functions", + "section": "\n25.4 Plot functions", + "text": "25.4 Plot functions\nInstead of returning a data frame, you might want to return a plot. Fortunately, you can use the same techniques with ggplot2, because aes() is a data-masking function. For example, imagine that you’re making a lot of histograms:\n\ndiamonds |> \n ggplot(aes(x = carat)) +\n geom_histogram(binwidth = 0.1)\n\ndiamonds |> \n ggplot(aes(x = carat)) +\n geom_histogram(binwidth = 0.05)\n\nWouldn’t it be nice if you could wrap this up into a histogram function? 
This is easy as pie once you know that aes() is a data-masking function and that you need to embrace:\n\nhistogram <- function(df, var, binwidth = NULL) {\n df |> \n ggplot(aes(x = {{ var }})) + \n geom_histogram(binwidth = binwidth)\n}\n\ndiamonds |> histogram(carat, 0.1)\n\n\n\n\nNote that histogram() returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from |> to +:\n\ndiamonds |> \n histogram(carat, 0.1) +\n labs(x = \"Size (in carats)\", y = \"Number of diamonds\")\n\n\n25.4.1 More variables\nIt’s straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:\n\n# https://twitter.com/tyler_js_smith/status/1574377116988104704\nlinearity_check <- function(df, x, y) {\n df |>\n ggplot(aes(x = {{ x }}, y = {{ y }})) +\n geom_point() +\n geom_smooth(method = \"loess\", formula = y ~ x, color = \"red\", se = FALSE) +\n geom_smooth(method = \"lm\", formula = y ~ x, color = \"blue\", se = FALSE) \n}\n\nstarwars |> \n filter(mass < 1000) |> \n linearity_check(mass, height)\n\n\n\n\nOr maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:\n\n# https://twitter.com/ppaxisa/status/1574398423175921665\nhex_plot <- function(df, x, y, z, bins = 20, fun = \"mean\") {\n df |> \n ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) + \n stat_summary_hex(\n aes(color = after_scale(fill)), # make border same color as fill\n bins = bins, \n fun = fun,\n )\n}\n\ndiamonds |> hex_plot(carat, price, depth)\n\n\n\n\n\n25.4.2 Combining with other tidyverse\nSome of the most useful helpers combine a dash of data manipulation with ggplot2. For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using fct_infreq(). Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:\n\nsorted_bars <- function(df, var) {\n df |> \n mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>\n ggplot(aes(y = {{ var }})) +\n geom_bar()\n}\n\ndiamonds |> sorted_bars(clarity)\n\n\n\n\nWe have to use a new operator here, := (commonly referred to as the “walrus operator”), because we are generating the variable name based on user-supplied data. Variable names go on the left hand side of =, but R’s syntax doesn’t allow anything to the left of = except for a single literal name. To work around this problem, we use the special operator := which tidy evaluation treats in exactly the same way as =.\nOr maybe you want to make it easy to draw a bar plot just for a subset of the data:\n\nconditional_bars <- function(df, condition, var) {\n df |> \n filter({{ condition }}) |> \n ggplot(aes(x = {{ var }})) + \n geom_bar()\n}\n\ndiamonds |> conditional_bars(cut == \"Good\", clarity)\n\n\n\n\nYou can also get creative and display data summaries in other ways. You can find a cool application at https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b; it uses the axis labels to display the highest value. 
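Here’s one more sketch in the same spirit (our own invention, not from the original): a helper that summarizes before plotting, computing a grouped mean and then drawing it as a dot chart. Both variables are embraced because they’re passed on to data-masking functions:\n\nmeans_plot <- function(df, group_var, mean_var) {\n df |> \n group_by({{ group_var }}) |> \n summarize(mean = mean({{ mean_var }}, na.rm = TRUE), .groups = \"drop\") |> \n ggplot(aes(x = mean, y = {{ group_var }})) +\n geom_point()\n}\n\ndiamonds |> means_plot(cut, price)\n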
As you learn more about ggplot2, the power of your functions will continue to increase.\nWe’ll finish with a more complicated case: labeling the plots you create.\n\n25.4.3 Labeling\nRemember the histogram function we showed you earlier?\n\nhistogram <- function(df, var, binwidth = NULL) {\n df |> \n ggplot(aes(x = {{ var }})) + \n geom_histogram(binwidth = binwidth)\n}\n\nWouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from a package we haven’t talked about yet: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).\nTo solve the labeling problem we can use rlang::englue(). This works similarly to str_glue(), so any value wrapped in { } will be inserted into the string. But it also understands {{ }}, which automatically inserts the appropriate variable name:\n\nhistogram <- function(df, var, binwidth) {\n label <- rlang::englue(\"A histogram of {{var}} with binwidth {binwidth}\")\n \n df |> \n ggplot(aes(x = {{ var }})) + \n geom_histogram(binwidth = binwidth) + \n labs(title = label)\n}\n\ndiamonds |> histogram(carat, 0.1)\n\n\n\n\nYou can use the same approach in any other place where you want to supply a string in a ggplot2 plot.\n\n25.4.4 Exercises\nBuild up a rich plotting function by incrementally implementing each of the steps below:\n\nDraw a scatterplot given dataset and x and y variables.\nAdd a line of best fit (i.e. a linear model with no standard errors).\nAdd a title." + }, + { + "objectID": "functions.html#style", + "href": "functions.html#style", + "title": "25  Functions", + "section": "\n25.5 Style", + "text": "25.5 Style\nR doesn’t care what your function or arguments are called, but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.\nGenerally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. mean() is better than compute_mean()), or accesses some property of an object (i.e. coef() is better than get_coefficients()). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.\n\n# Too short\nf()\n\n# Not a verb, or descriptive\nmy_awesome_function()\n\n# Long, but clear\nimpute_missing()\ncollapse_years()\n\nR also doesn’t care about how you use white space in your functions, but future readers will. Continue to follow the rules from Chapter 4. Additionally, function() should always be followed by squiggly brackets ({}), and the contents should be indented by an additional two spaces. 
This makes it easier to see the hierarchy in your code by skimming the left-hand margin.\n\n# Missing extra two spaces\ndensity <- function(color, facets, binwidth = 0.1) {\ndiamonds |> \n ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +\n geom_freqpoly(binwidth = binwidth) +\n facet_wrap(vars({{ facets }}))\n}\n\n# Pipe indented incorrectly\ndensity <- function(color, facets, binwidth = 0.1) {\n diamonds |> \n ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +\n geom_freqpoly(binwidth = binwidth) +\n facet_wrap(vars({{ facets }}))\n}\n\nAs you can see, we recommend putting extra spaces inside of {{ }}. This makes it very obvious that something unusual is happening.\n\n25.5.1 Exercises\n\n\nRead the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.\n\nf1 <- function(string, prefix) {\n str_sub(string, 1, str_length(prefix)) == prefix\n}\n\nf3 <- function(x, y) {\n rep(y, length.out = length(x))\n}\n\n\nTake a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.\nMake a case for why norm_r(), norm_d() etc. would be better than rnorm(), dnorm(). Make a case for the opposite. How could you make the names even clearer?" + }, + { + "objectID": "functions.html#summary", + "href": "functions.html#summary", + "title": "25  Functions", + "section": "\n25.6 Summary", + "text": "25.6 Summary\nIn this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.\nWe have only shown you the bare minimum to get started with functions and there’s much more to learn. A few places to learn more are:\n\nTo learn more about programming with tidy evaluation, see useful recipes in programming with dplyr and programming with tidyr and learn more about the theory in What is data-masking and why do I need {{?.\nTo learn more about reducing duplication in your ggplot2 code, read the Programming with ggplot2 chapter of the ggplot2 book.\nFor more advice on function style, see the tidyverse style guide.\n\nIn the next chapter, we’ll dive into iteration, which gives you further tools for reducing code duplication." + }, + { + "objectID": "iteration.html#introduction", + "href": "iteration.html#introduction", + "title": "26  Iteration", + "section": "\n26.1 Introduction", + "text": "26.1 Introduction\nIn this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector x in R, you can just write 2 * x. 
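A quick sketch of ours showing what that implicit iteration looks like:\n\nx <- c(1, 5, 10)\n2 * x # every element is doubled; no loop required\n#> [1] 2 10 20\n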
In most other languages, you’d need to explicitly double each element of x using some sort of for loop.\nThis book has already given you a small but powerful set of tools that perform the same action for multiple “things”:\n\n\nfacet_wrap() and facet_grid() draw a plot for each subset.\n\ngroup_by() plus summarize() computes summary statistics for each subset.\n\nunnest_wider() and unnest_longer() create new rows and columns for each element of a list-column.\n\nNow it’s time to learn some more general tools, often called functional programming tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we’ll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.\n\n26.1.1 Prerequisites\nIn this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but purrr is new. We’re just going to use a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.\n\nlibrary(tidyverse)" + }, + { + "objectID": "iteration.html#sec-across", + "href": "iteration.html#sec-across", + "title": "26  Iteration", + "section": "\n26.2 Modifying multiple columns", + "text": "26.2 Modifying multiple columns\nImagine you have this simple tibble and you want to count the number of observations and compute the median of every column.\n\ndf <- tibble(\n a = rnorm(10),\n b = rnorm(10),\n c = rnorm(10),\n d = rnorm(10)\n)\n\nYou could do it with copy-and-paste:\n\ndf |> summarize(\n n = n(),\n a = median(a),\n b = median(b),\n c = median(c),\n d = median(d),\n)\n#> # A tibble: 1 × 5\n#> n a b c d\n#> <int> <dbl> <dbl> <dbl> <dbl>\n#> 1 10 -0.246 -0.287 -0.0567 0.144\n\nThat breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead, you can use across():\n\ndf |> summarize(\n n = n(),\n across(a:d, median),\n)\n#> # A tibble: 1 × 5\n#> n a b c d\n#> <int> <dbl> <dbl> <dbl> <dbl>\n#> 1 10 -0.246 -0.287 -0.0567 0.144\n\nacross() has three particularly important arguments, which we’ll discuss in detail in the following sections. You’ll use the first two every time you use across(): the first argument, .cols, specifies which columns you want to iterate over, and the second argument, .fns, specifies what to do with each column. You can use the .names argument when you need additional control over the names of output columns, which is particularly important when you use across() with mutate(). We’ll also discuss two important variations, if_any() and if_all(), which work with filter().\n\n26.2.1 Selecting columns with .cols\n\nThe first argument to across(), .cols, selects the columns to transform. This uses the same specifications as select(), Section 3.3.2, so you can use functions like starts_with() and ends_with() to select columns based on their name.\nThere are two additional selection techniques that are particularly useful for across(): everything() and where(). 
everything() is straightforward: it selects every (non-grouping) column:\n\ndf <- tibble(\n grp = sample(2, 10, replace = TRUE),\n a = rnorm(10),\n b = rnorm(10),\n c = rnorm(10),\n d = rnorm(10)\n)\n\ndf |> \n group_by(grp) |> \n summarize(across(everything(), median))\n#> # A tibble: 2 × 5\n#> grp a b c d\n#> <int> <dbl> <dbl> <dbl> <dbl>\n#> 1 1 -0.0935 -0.0163 0.363 0.364\n#> 2 2 0.312 -0.0576 0.208 0.565\n\nNote that grouping columns (grp here) are not included in across(), because they’re automatically preserved by summarize().\nwhere() allows you to select columns based on their type:\n\n\nwhere(is.numeric) selects all numeric columns.\n\nwhere(is.character) selects all string columns.\n\nwhere(is.Date) selects all date columns.\n\nwhere(is.POSIXct) selects all date-time columns.\n\nwhere(is.logical) selects all logical columns.\n\nJust like other selectors, you can combine these with Boolean algebra. For example, !where(is.numeric) selects all non-numeric columns, and starts_with(\"a\") & where(is.logical) selects all logical columns whose name starts with “a”.\n\n26.2.2 Calling a single function\nThe second argument to across() defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (median, mean, str_flatten, …) to another function (across). This is one of the features that makes R a functional programming language.\nIt’s important to note that we’re passing this function to across(), so across() can call it; we’re not calling it ourselves. That means the function name should never be followed by (). If you forget, you’ll get an error:\n\ndf |> \n group_by(grp) |> \n summarize(across(everything(), median()))\n#> Error in `summarize()`:\n#> ℹ In argument: `across(everything(), median())`.\n#> Caused by error in `median.default()`:\n#> ! argument \"x\" is missing, with no default\n\nThis error arises because you’re calling the function with no input, e.g.:\n\nmedian()\n#> Error in median.default(): argument \"x\" is missing, with no default\n\n\n26.2.3 Calling multiple functions\nIn more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? median() propagates those missing values, giving us a suboptimal output:\n\nrnorm_na <- function(n, n_na, mean = 0, sd = 1) {\n sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))\n}\n\ndf_miss <- tibble(\n a = rnorm_na(5, 1),\n b = rnorm_na(5, 1),\n c = rnorm_na(5, 2),\n d = rnorm(5)\n)\ndf_miss |> \n summarize(\n across(a:d, median),\n n = n()\n )\n#> # A tibble: 1 × 5\n#> a b c d n\n#> <dbl> <dbl> <dbl> <dbl> <int>\n#> 1 NA NA NA 1.15 5\n\nIt would be nice if we could pass along na.rm = TRUE to median() to remove these missing values. 
To do so, instead of calling median() directly, we need to create a new function that calls median() with the desired arguments:\n\ndf_miss |> \n summarize(\n across(a:d, function(x) median(x, na.rm = TRUE)),\n n = n()\n )\n#> # A tibble: 1 × 5\n#> a b c d n\n#> <dbl> <dbl> <dbl> <dbl> <int>\n#> 1 0.139 -1.11 -0.387 1.15 5\n\nThis is a little verbose, so R comes with a handy shortcut: for this sort of throwaway, or anonymous, function you can replace function with \\:\n\ndf_miss |> \n summarize(\n across(a:d, \\(x) median(x, na.rm = TRUE)),\n n = n()\n )\n\nIn either case, across() effectively expands to the following code:\n\ndf_miss |> \n summarize(\n a = median(a, na.rm = TRUE),\n b = median(b, na.rm = TRUE),\n c = median(c, na.rm = TRUE),\n d = median(d, na.rm = TRUE),\n n = n()\n )\n\nWhen we remove the missing values from median(), it would be nice to know just how many values were removed. We can find that out by supplying two functions to across(): one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to .fns:\n\ndf_miss |> \n summarize(\n across(a:d, list(\n median = \\(x) median(x, na.rm = TRUE),\n n_miss = \\(x) sum(is.na(x))\n )),\n n = n()\n )\n#> # A tibble: 1 × 9\n#> a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss\n#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>\n#> 1 0.139 1 -1.11 1 -0.387 2 1.15 0\n#> # ℹ 1 more variable: n <int>\n\nIf you look carefully, you might intuit that the columns are named using a glue specification (Section 14.3.2) like {.col}_{.fn} where .col is the name of the original column and .fn is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the .names argument to supply your own glue spec.\n\n26.2.4 Column names\nThe result of across() is named according to the specification provided in the .names argument. We could specify our own if we wanted the name of the function to come first:\n\ndf_miss |> \n summarize(\n across(\n a:d,\n list(\n median = \\(x) median(x, na.rm = TRUE),\n n_miss = \\(x) sum(is.na(x))\n ),\n .names = \"{.fn}_{.col}\"\n ),\n n = n(),\n )\n#> # A tibble: 1 × 9\n#> median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d\n#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>\n#> 1 0.139 1 -1.11 1 -0.387 2 1.15 0\n#> # ℹ 1 more variable: n <int>\n\nThe .names argument is particularly important when you use across() with mutate(). By default, the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns. 
For example, here we use coalesce() to replace NAs with 0:\n\ndf_miss |> \n mutate(\n across(a:d, \\(x) coalesce(x, 0))\n )\n#> # A tibble: 5 × 4\n#> a b c d\n#> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.434 -1.25 0 1.60 \n#> 2 0 -1.43 -0.297 0.776\n#> 3 -0.156 -0.980 0 1.15 \n#> 4 -2.61 -0.683 -0.785 2.13 \n#> 5 1.11 0 -0.387 0.704\n\nIf you’d like to instead create new columns, you can use the .names argument to give the output new names:\n\ndf_miss |> \n mutate(\n across(a:d, \\(x) coalesce(x, 0), .names = \"{.col}_na_zero\")\n )\n#> # A tibble: 5 × 8\n#> a b c d a_na_zero b_na_zero c_na_zero d_na_zero\n#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.434 -1.25 NA 1.60 0.434 -1.25 0 1.60 \n#> 2 NA -1.43 -0.297 0.776 0 -1.43 -0.297 0.776\n#> 3 -0.156 -0.980 NA 1.15 -0.156 -0.980 0 1.15 \n#> 4 -2.61 -0.683 -0.785 2.13 -2.61 -0.683 -0.785 2.13 \n#> 5 1.11 NA -0.387 0.704 1.11 0 -0.387 0.704\n\n\n26.2.5 Filtering\nacross() is a great match for summarize() and mutate(), but it’s more awkward to use with filter(), because you usually combine multiple conditions with either | or &. It’s clear that across() can help to create multiple logical columns, but then what? So dplyr provides two variants of across() called if_any() and if_all():\n\n# same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))\ndf_miss |> filter(if_any(a:d, is.na))\n#> # A tibble: 4 × 4\n#> a b c d\n#> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.434 -1.25 NA 1.60 \n#> 2 NA -1.43 -0.297 0.776\n#> 3 -0.156 -0.980 NA 1.15 \n#> 4 1.11 NA -0.387 0.704\n\n# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))\ndf_miss |> filter(if_all(a:d, is.na))\n#> # A tibble: 0 × 4\n#> # ℹ 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>\n\n\n26.2.6 across() in functions\nacross() is particularly useful to program with because it allows you to operate on multiple columns. For example, Jacob Scott uses this little helper, which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:\n\nexpand_dates <- function(df) {\n df |> \n mutate(\n across(where(is.Date), list(year = year, month = month, day = mday))\n )\n}\n\ndf_date <- tibble(\n name = c(\"Amy\", \"Bob\"),\n date = ymd(c(\"2009-08-03\", \"2010-01-16\"))\n)\n\ndf_date |> \n expand_dates()\n#> # A tibble: 2 × 5\n#> name date date_year date_month date_day\n#> <chr> <date> <dbl> <dbl> <int>\n#> 1 Amy 2009-08-03 2009 8 3\n#> 2 Bob 2010-01-16 2010 1 16\n\nacross() also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in Section 25.3.2. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:\n\nsummarize_means <- function(df, summary_vars = where(is.numeric)) {\n df |> \n summarize(\n across({{ summary_vars }}, \\(x) mean(x, na.rm = TRUE)),\n n = n(),\n .groups = \"drop\"\n )\n}\ndiamonds |> \n group_by(cut) |> \n summarize_means()\n#> # A tibble: 5 × 9\n#> cut carat depth table price x y z n\n#> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>\n#> 1 Fair 1.05 64.0 59.1 4359. 6.25 6.18 3.98 1610\n#> 2 Good 0.849 62.4 58.7 3929. 5.84 5.85 3.64 4906\n#> 3 Very Good 0.806 61.8 58.0 3982. 5.74 5.77 3.56 12082\n#> 4 Premium 0.892 61.3 58.7 4584. 5.97 5.94 3.65 13791\n#> 5 Ideal 0.703 61.7 56.0 3458. 
5.51 5.52 3.40 21551\n\ndiamonds |> \n group_by(cut) |> \n summarize_means(c(carat, x:z))\n#> # A tibble: 5 × 6\n#> cut carat x y z n\n#> <ord> <dbl> <dbl> <dbl> <dbl> <int>\n#> 1 Fair 1.05 6.25 6.18 3.98 1610\n#> 2 Good 0.849 5.84 5.85 3.64 4906\n#> 3 Very Good 0.806 5.74 5.77 3.56 12082\n#> 4 Premium 0.892 5.97 5.94 3.65 13791\n#> 5 Ideal 0.703 5.51 5.52 3.40 21551\n\n\n26.2.7 Vs pivot_longer()\n\nBefore we go on, it’s worth pointing out an interesting connection between across() and pivot_longer() (Section 5.3). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:\n\ndf |> \n summarize(across(a:d, list(median = median, mean = mean)))\n#> # A tibble: 1 × 8\n#> a_median a_mean b_median b_mean c_median c_mean d_median d_mean\n#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.0380 0.205 -0.0163 0.0910 0.260 0.0716 0.540 0.508\n\nWe could compute the same values by pivoting longer and then summarizing:\n\nlong <- df |> \n pivot_longer(a:d) |> \n group_by(name) |> \n summarize(\n median = median(value),\n mean = mean(value)\n )\nlong\n#> # A tibble: 4 × 3\n#> name median mean\n#> <chr> <dbl> <dbl>\n#> 1 a 0.0380 0.205 \n#> 2 b -0.0163 0.0910\n#> 3 c 0.260 0.0716\n#> 4 d 0.540 0.508\n\nAnd if you wanted the same structure as across() you could pivot again:\n\nlong |> \n pivot_wider(\n names_from = name,\n values_from = c(median, mean),\n names_vary = \"slowest\",\n names_glue = \"{name}_{.value}\"\n )\n#> # A tibble: 1 × 8\n#> a_median a_mean b_median b_mean c_median c_mean d_median d_mean\n#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.0380 0.205 -0.0163 0.0910 0.260 0.0716 0.540 0.508\n\nThis is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with across(): when you have groups of columns that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:\n\ndf_paired <- tibble(\n a_val = rnorm(10),\n a_wts = runif(10),\n b_val = rnorm(10),\n b_wts = runif(10),\n c_val = rnorm(10),\n c_wts = runif(10),\n d_val = rnorm(10),\n d_wts = runif(10)\n)\n\nThere’s currently no way to do this with across(), but it’s relatively straightforward with pivot_longer():\n\ndf_long <- df_paired |> \n pivot_longer(\n everything(), \n names_to = c(\"group\", \".value\"), \n names_sep = \"_\"\n )\ndf_long\n#> # A tibble: 40 × 3\n#> group val wts\n#> <chr> <dbl> <dbl>\n#> 1 a 0.715 0.518\n#> 2 b -0.709 0.691\n#> 3 c 0.718 0.216\n#> 4 d -0.217 0.733\n#> 5 a -1.09 0.979\n#> 6 b -0.209 0.675\n#> # ℹ 34 more rows\n\ndf_long |> \n group_by(group) |> \n summarize(mean = weighted.mean(val, wts))\n#> # A tibble: 4 × 2\n#> group mean\n#> <chr> <dbl>\n#> 1 a 0.126 \n#> 2 b -0.0704\n#> 3 c -0.360 \n#> 4 d -0.248\n\nIf needed, you could pivot_wider() this back to the original form.\n\n26.2.8 Exercises\n\n\nPractice your across() skills by:\n\nComputing the number of unique values in each column of palmerpenguins::penguins.\nComputing the mean of every column in mtcars.\nGrouping diamonds by cut, clarity, and color, then counting the number of observations and computing the mean of each numeric column.\n\n\nWhat happens if you use a list of functions in across(), but don’t name them? How is the output named?\nAdjust expand_dates() to automatically remove the date columns after they’ve been expanded. 
Do you need to embrace any arguments?\n\nExplain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?\n\nshow_missing <- function(df, group_vars, summary_vars = everything()) {\n df |> \n group_by(pick({{ group_vars }})) |> \n summarize(\n across({{ summary_vars }}, \\(x) sum(is.na(x))),\n .groups = \"drop\"\n ) |>\n select(where(\\(x) any(x > 0)))\n}\nnycflights13::flights |> show_missing(c(year, month, day))" + }, + { + "objectID": "iteration.html#reading-multiple-files", + "href": "iteration.html#reading-multiple-files", + "title": "26  Iteration", + "section": "\n26.3 Reading multiple files", + "text": "26.3 Reading multiple files\nIn the previous section, you learned how to use dplyr::across() to repeat a transformation on multiple columns. In this section, you’ll learn how to use purrr::map() to do something to every file in a directory. Let’s start with a little motivation: imagine you have a directory full of excel spreadsheets you want to read. You could do it with copy and paste:\n\ndata2019 <- readxl::read_excel(\"data/y2019.xlsx\")\ndata2020 <- readxl::read_excel(\"data/y2020.xlsx\")\ndata2021 <- readxl::read_excel(\"data/y2021.xlsx\")\ndata2022 <- readxl::read_excel(\"data/y2022.xlsx\")\n\nAnd then use dplyr::bind_rows() to combine them all together:\n\ndata <- bind_rows(data2019, data2020, data2021, data2022)\n\nYou can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four. The following sections show you how to automate this sort of task. There are three basic steps: use list.files() to list all the files in a directory, then use purrr::map() to read each of them into a list, then use purrr::list_rbind() to combine them into a single data frame. We’ll then discuss how you can handle situations of increasing heterogeneity, where you can’t do exactly the same thing to every file.\n\n26.3.1 Listing files in a directory\nAs the name suggests, list.files() lists the files in a directory. You’ll almost always use three arguments:\n\nThe first argument, path, is the directory to look in.\npattern is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$ or [.]csv$ to find all files with a specified extension.\nfull.names determines whether or not the directory name should be included in the output. You almost always want this to be TRUE.\n\nTo make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containing data from the gapminder package. Each file contains one year’s worth of data for 142 countries. 
We can list them all with the appropriate call to list.files():\n\npaths <- list.files(\"data/gapminder\", pattern = \"[.]xlsx$\", full.names = TRUE)\npaths\n#> [1] \"data/gapminder/1952.xlsx\" \"data/gapminder/1957.xlsx\"\n#> [3] \"data/gapminder/1962.xlsx\" \"data/gapminder/1967.xlsx\"\n#> [5] \"data/gapminder/1972.xlsx\" \"data/gapminder/1977.xlsx\"\n#> [7] \"data/gapminder/1982.xlsx\" \"data/gapminder/1987.xlsx\"\n#> [9] \"data/gapminder/1992.xlsx\" \"data/gapminder/1997.xlsx\"\n#> [11] \"data/gapminder/2002.xlsx\" \"data/gapminder/2007.xlsx\"\n\n\n26.3.2 Lists\nNow that we have these 12 paths, we could call read_excel() 12 times to get 12 data frames:\n\ngapminder_1952 <- readxl::read_excel(\"data/gapminder/1952.xlsx\")\ngapminder_1957 <- readxl::read_excel(\"data/gapminder/1957.xlsx\")\ngapminder_1962 <- readxl::read_excel(\"data/gapminder/1962.xlsx\")\n ...,\ngapminder_2007 <- readxl::read_excel(\"data/gapminder/2007.xlsx\")\n\nBut putting each sheet into its own variable is going to make it hard to work with them a few steps down the road. Instead, they’ll be easier to work with if we put them into a single object. A list is the perfect tool for this job:\n\nfiles <- list(\n readxl::read_excel(\"data/gapminder/1952.xlsx\"),\n readxl::read_excel(\"data/gapminder/1957.xlsx\"),\n readxl::read_excel(\"data/gapminder/1962.xlsx\"),\n ...,\n readxl::read_excel(\"data/gapminder/2007.xlsx\")\n)\n\nNow that you have these data frames in a list, how do you get one out? You can use files[[i]] to extract the ith element:\n\nfiles[[3]]\n#> # A tibble: 142 × 5\n#> country continent lifeExp pop gdpPercap\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 32.0 10267083 853.\n#> 2 Albania Europe 64.8 1728137 2313.\n#> 3 Algeria Africa 48.3 11000948 2551.\n#> 4 Angola Africa 34 4826015 4269.\n#> 5 Argentina Americas 65.1 21283783 7133.\n#> 6 Australia Oceania 70.9 10794968 12217.\n#> # ℹ 136 more rows\n\nWe’ll come back to [[ in more detail in Section 27.3.\n\n26.3.3 purrr::map() and list_rbind()\n\nThe code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one-by-one. Happily, we can use purrr::map() to make even better use of our paths vector. 
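If map() is new to you, a toy example may help before we point it at files; this sketch is ours, not from the original:\n\nmap(c(1, 2, 3), \\(x) x * 2) # applies the function to each element, returning a list\n#> [[1]]\n#> [1] 2\n#> \n#> [[2]]\n#> [1] 4\n#> \n#> [[3]]\n#> [1] 6\n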
map() is similar to across(), but instead of doing something to each column in a data frame, it does something to each element of a vector. map(x, f) is shorthand for:\n\nlist(\n f(x[[1]]),\n f(x[[2]]),\n ...,\n f(x[[n]])\n)\n\nSo we can use map() to get a list of 12 data frames:\n\nfiles <- map(paths, readxl::read_excel)\nlength(files)\n#> [1] 12\n\nfiles[[1]]\n#> # A tibble: 142 × 5\n#> country continent lifeExp pop gdpPercap\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 28.8 8425333 779.\n#> 2 Albania Europe 55.2 1282697 1601.\n#> 3 Algeria Africa 43.1 9279525 2449.\n#> 4 Angola Africa 30.0 4232095 3521.\n#> 5 Argentina Americas 62.5 17876956 5911.\n#> 6 Australia Oceania 69.1 8691212 10040.\n#> # ℹ 136 more rows\n\n(This is another data structure that doesn’t display particularly compactly with str(), so you might want to load it into RStudio and inspect it with View()).\nNow we can use purrr::list_rbind() to combine that list of data frames into a single data frame:\n\nlist_rbind(files)\n#> # A tibble: 1,704 × 5\n#> country continent lifeExp pop gdpPercap\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 28.8 8425333 779.\n#> 2 Albania Europe 55.2 1282697 1601.\n#> 3 Algeria Africa 43.1 9279525 2449.\n#> 4 Angola Africa 30.0 4232095 3521.\n#> 5 Argentina Americas 62.5 17876956 5911.\n#> 6 Australia Oceania 69.1 8691212 10040.\n#> # ℹ 1,698 more rows\n\nOr we could do both steps at once in a pipeline:\n\npaths |> \n map(readxl::read_excel) |> \n list_rbind()\n\nWhat if we want to pass in extra arguments to read_excel()? We use the same technique that we used with across(). For example, it’s often useful to peek at the first few rows of the data with n_max = 1:\n\npaths |> \n map(\\(path) readxl::read_excel(path, n_max = 1)) |> \n list_rbind()\n#> # A tibble: 12 × 5\n#> country continent lifeExp pop gdpPercap\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 28.8 8425333 779.\n#> 2 Afghanistan Asia 30.3 9240934 821.\n#> 3 Afghanistan Asia 32.0 10267083 853.\n#> 4 Afghanistan Asia 34.0 11537966 836.\n#> 5 Afghanistan Asia 36.1 13079460 740.\n#> 6 Afghanistan Asia 38.4 14880372 786.\n#> # ℹ 6 more rows\n\nThis makes it clear that something is missing: there’s no year column because that value is recorded in the path, not in the individual files. We’ll tackle that problem next.\n\n26.3.4 Data in the path\nSometimes the name of the file is data itself. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things:\nFirst, we name the vector of paths. The easiest way to do this is with the set_names() function, which can take a function. 
Here we use basename() to extract just the file name from the full path:\n\npaths |> set_names(basename) \n#> 1952.xlsx 1957.xlsx \n#> \"data/gapminder/1952.xlsx\" \"data/gapminder/1957.xlsx\" \n#> 1962.xlsx 1967.xlsx \n#> \"data/gapminder/1962.xlsx\" \"data/gapminder/1967.xlsx\" \n#> 1972.xlsx 1977.xlsx \n#> \"data/gapminder/1972.xlsx\" \"data/gapminder/1977.xlsx\" \n#> 1982.xlsx 1987.xlsx \n#> \"data/gapminder/1982.xlsx\" \"data/gapminder/1987.xlsx\" \n#> 1992.xlsx 1997.xlsx \n#> \"data/gapminder/1992.xlsx\" \"data/gapminder/1997.xlsx\" \n#> 2002.xlsx 2007.xlsx \n#> \"data/gapminder/2002.xlsx\" \"data/gapminder/2007.xlsx\"\n\nThose names are automatically carried along by all the map functions, so the list of data frames will have those same names:\n\nfiles <- paths |> \n set_names(basename) |> \n map(readxl::read_excel)\n\nThat makes this call to map() shorthand for:\n\nfiles <- list(\n \"1952.xlsx\" = readxl::read_excel(\"data/gapminder/1952.xlsx\"),\n \"1957.xlsx\" = readxl::read_excel(\"data/gapminder/1957.xlsx\"),\n \"1962.xlsx\" = readxl::read_excel(\"data/gapminder/1962.xlsx\"),\n ...,\n \"2007.xlsx\" = readxl::read_excel(\"data/gapminder/2007.xlsx\")\n)\n\nYou can also use [[ to extract elements by name:\n\nfiles[[\"1962.xlsx\"]]\n#> # A tibble: 142 × 5\n#> country continent lifeExp pop gdpPercap\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 32.0 10267083 853.\n#> 2 Albania Europe 64.8 1728137 2313.\n#> 3 Algeria Africa 48.3 11000948 2551.\n#> 4 Angola Africa 34 4826015 4269.\n#> 5 Argentina Americas 65.1 21283783 7133.\n#> 6 Australia Oceania 70.9 10794968 12217.\n#> # ℹ 136 more rows\n\nThen we use the names_to argument to list_rbind() to tell it to save the names into a new column called year, then use readr::parse_number() to extract the number from the string.\n\npaths |> \n set_names(basename) |> \n map(readxl::read_excel) |> \n list_rbind(names_to = \"year\") |> \n mutate(year = parse_number(year))\n#> # A tibble: 1,704 × 6\n#> year country continent lifeExp pop gdpPercap\n#> <dbl> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 1952 Afghanistan Asia 28.8 8425333 779.\n#> 2 1952 Albania Europe 55.2 1282697 1601.\n#> 3 1952 Algeria Africa 43.1 9279525 2449.\n#> 4 1952 Angola Africa 30.0 4232095 3521.\n#> 5 1952 Argentina Americas 62.5 17876956 5911.\n#> 6 1952 Australia Oceania 69.1 8691212 10040.\n#> # ℹ 1,698 more rows\n\nIn more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data. 
In that case, use set_names() (without any arguments) to record the full path, and then use tidyr::separate_wider_delim() and friends to turn them into useful columns.\n\npaths |> \n set_names() |> \n map(readxl::read_excel) |> \n list_rbind(names_to = \"year\") |> \n separate_wider_delim(year, delim = \"/\", names = c(NA, \"dir\", \"file\")) |> \n separate_wider_delim(file, delim = \".\", names = c(\"file\", \"ext\"))\n#> # A tibble: 1,704 × 8\n#> dir file ext country continent lifeExp pop gdpPercap\n#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 gapminder 1952 xlsx Afghanistan Asia 28.8 8425333 779.\n#> 2 gapminder 1952 xlsx Albania Europe 55.2 1282697 1601.\n#> 3 gapminder 1952 xlsx Algeria Africa 43.1 9279525 2449.\n#> 4 gapminder 1952 xlsx Angola Africa 30.0 4232095 3521.\n#> 5 gapminder 1952 xlsx Argentina Americas 62.5 17876956 5911.\n#> 6 gapminder 1952 xlsx Australia Oceania 69.1 8691212 10040.\n#> # ℹ 1,698 more rows\n\n\n26.3.5 Save your work\nNow that you’ve done all this hard work to get to a nice tidy data frame, it’s a great time to save your work:\n\ngapminder <- paths |> \n set_names(basename) |> \n map(readxl::read_excel) |> \n list_rbind(names_to = \"year\") |> \n mutate(year = parse_number(year))\n\nwrite_csv(gapminder, \"gapminder.csv\")\n\nNow when you come back to this problem in the future, you can read in a single csv file. For larger and richer datasets, using parquet might be a better choice than .csv, as discussed in Section 22.4.\nIf you’re working in a project, we suggest calling the file that does this sort of data prep work something like 0-cleanup.R. The 0 in the file name suggests that this should be run before anything else.\nIf your input data files change over time, you might consider learning a tool like targets to set up your data cleaning code to automatically re-run whenever one of the input files is modified.\n\n26.3.6 Many simple iterations\nHere we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you’re often better off doing multiple simple iterations.\nFor example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps, then call map() once:\n\nprocess_file <- function(path) {\n df <- read_csv(path)\n \n df |> \n filter(!is.na(id)) |> \n mutate(id = tolower(id)) |> \n pivot_longer(jan:dec, names_to = \"month\")\n}\n\npaths |> \n map(process_file) |> \n list_rbind()\n\nAlternatively, you could apply each step of process_file() to every file:\n\npaths |> \n map(read_csv) |> \n map(\\(df) df |> filter(!is.na(id))) |> \n map(\\(df) df |> mutate(id = tolower(id))) |> \n map(\\(df) df |> pivot_longer(jan:dec, names_to = \"month\")) |> \n list_rbind()\n\nWe recommend this approach because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.\nIn this particular example, there’s another optimization you could make, by binding all the data frames together earlier. 
Then you can rely on regular dplyr behavior:\n\npaths |> \n map(read_csv) |> \n list_rbind() |> \n filter(!is.na(id)) |> \n mutate(id = tolower(id)) |> \n pivot_longer(jan:dec, names_to = \"month\")\n\n\n26.3.7 Heterogeneous data\nUnfortunately, sometimes it’s not possible to go from map() straight to list_rbind() because the data frames are so heterogeneous that list_rbind() either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:\n\nfiles <- paths |> \n map(readxl::read_excel) \n\nThen a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills. One way to do so is with this handy df_types function that returns a tibble with one row for each column:\n\ndf_types <- function(df) {\n tibble(\n col_name = names(df), \n col_type = map_chr(df, vctrs::vec_ptype_full),\n n_miss = map_int(df, \\(x) sum(is.na(x)))\n )\n}\n\ndf_types(gapminder)\n#> # A tibble: 6 × 3\n#> col_name col_type n_miss\n#> <chr> <chr> <int>\n#> 1 year double 0\n#> 2 country character 0\n#> 3 continent character 0\n#> 4 lifeExp double 0\n#> 5 pop double 0\n#> 6 gdpPercap double 0\n\nYou can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:\n\nfiles |> \n map(df_types) |> \n list_rbind(names_to = \"file_name\") |> \n select(-n_miss) |> \n pivot_wider(names_from = col_name, values_from = col_type)\n#> # A tibble: 12 × 6\n#> file_name country continent lifeExp pop gdpPercap\n#> <chr> <chr> <chr> <chr> <chr> <chr> \n#> 1 1952.xlsx character character double double double \n#> 2 1957.xlsx character character double double double \n#> 3 1962.xlsx character character double double double \n#> 4 1967.xlsx character character double double double \n#> 5 1972.xlsx character character double double double \n#> 6 1977.xlsx character character double double double \n#> # ℹ 6 more rows\n\nIf the files have heterogeneous formats, you might need to do more processing before you can successfully merge them. Unfortunately, we’re now going to leave you to figure that out on your own, but you might want to read about map_if() and map_at(). map_if() allows you to selectively modify elements of a list based on their values; map_at() allows you to selectively modify elements based on their names.\n\n26.3.8 Handling failures\nSometimes the structure of your data might be sufficiently wild that you can’t even read all the files with a single command. And then you’ll encounter one of the downsides of map(): it succeeds or fails as a whole. map() will either successfully read all of the files in a directory or fail with an error, reading zero files. This is annoying: why does one failure prevent you from accessing all the other successes?\nLuckily, purrr comes with a helper to tackle this problem: possibly(). possibly() is what’s known as a function operator: it takes a function and returns a function with modified behavior. 
In particular, possibly() changes a function from erroring to returning a value that you specify:\n\nfiles <- paths |> \n map(possibly(\\(path) readxl::read_excel(path), NULL))\n\ndata <- files |> list_rbind()\n\nThis works particularly well here because list_rbind(), like many tidyverse functions, automatically ignores NULLs.\nNow you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:\n\nfailed <- map_vec(files, is.null)\npaths[failed]\n#> character(0)\n\nThen call the import function again for each failure and figure out what went wrong." }, { "objectID": "iteration.html#saving-multiple-outputs", "href": "iteration.html#saving-multiple-outputs", "title": "26  Iteration", "section": "\n26.4 Saving multiple outputs", "text": "26.4 Saving multiple outputs\nIn the last section, you learned about map(), which is useful for reading multiple files into a single object. In this section, we’ll now explore sort of the opposite problem: how can you take one or more R objects and save them to one or more files? We’ll explore this challenge using three examples:\n\nSaving multiple data frames into one database.\nSaving multiple data frames into multiple .csv files.\nSaving multiple plots to multiple .png files.\n\n\n26.4.1 Writing to a database\nSometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do map(files, read_csv). One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.\nIf you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s duckdb_read_csv():\n\ncon <- DBI::dbConnect(duckdb::duckdb())\nduckdb::duckdb_read_csv(con, \"gapminder\", paths)\n\nThis would work well here, but we don’t have csv files; instead, we have Excel spreadsheets. So we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of csvs and the database that you’re working with doesn’t have a single function that will load them all in.\nWe need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it:\n\ntemplate <- readxl::read_excel(paths[[1]])\ntemplate$year <- 1952\ntemplate\n#> # A tibble: 142 × 6\n#> country continent lifeExp pop gdpPercap year\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 28.8 8425333 779. 1952\n#> 2 Albania Europe 55.2 1282697 1601. 1952\n#> 3 Algeria Africa 43.1 9279525 2449. 1952\n#> 4 Angola Africa 30.0 4232095 3521. 1952\n#> 5 Argentina Americas 62.5 17876956 5911. 1952\n#> 6 Australia Oceania 69.1 8691212 10040. 1952\n#> # ℹ 136 more rows\n\nNow we can connect to the database, and use DBI::dbCreateTable() to turn our template into a database table:\n\ncon <- DBI::dbConnect(duckdb::duckdb())\nDBI::dbCreateTable(con, \"gapminder\", template)\n\ndbCreateTable() doesn’t use the data in template, just the variable names and types. 
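If you’re curious exactly what dbCreateTable() asks the database to do, DBI can also show you the SQL it would generate without running it; this is a handy way to see how your R column types get mapped to database types. A minimal sketch (the exact output depends on the backend):\n\n# build, but don’t execute, the CREATE TABLE statement\nDBI::sqlCreateTable(con, \"gapminder\", template)\n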
So if we inspect the gapminder table now you’ll see that it’s empty but it has the variables we need with the types we expect:\n\ncon |> tbl(\"gapminder\")\n#> # Source: table<gapminder> [0 x 6]\n#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]\n#> # ℹ 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>,\n#> # gdpPercap <dbl>, year <dbl>\n\nNext, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. We can do that by combining read_excel() with DBI::dbAppendTable():\n\nappend_file <- function(path) {\n df <- readxl::read_excel(path)\n df$year <- parse_number(basename(path))\n \n DBI::dbAppendTable(con, \"gapminder\", df)\n}\n\nNow we need to call append_file() once for each element of paths. That’s certainly possible with map():\n\npaths |> map(append_file)\n\nBut we don’t care about the output of append_file(), so instead of map() it’s slightly nicer to use walk(). walk() does exactly the same thing as map() but throws the output away:\n\npaths |> walk(append_file)\n\nNow we can see if we have all the data in our table:\n\ncon |> \n tbl(\"gapminder\") |> \n count(year)\n#> # Source: SQL [?? x 2]\n#> # Database: DuckDB v0.9.1 [unknown@Linux 6.2.0-1015-azure:R 4.3.2/:memory:]\n#> year n\n#> <dbl> <dbl>\n#> 1 1967 142\n#> 2 1977 142\n#> 3 1987 142\n#> 4 2007 142\n#> 5 1952 142\n#> 6 1957 142\n#> # ℹ more rows\n\n\n26.4.2 Writing csv files\nThe same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds data and save one csv file for each clarity. First we need to make those individual datasets. There are many ways you could do that, but there’s one way we particularly like: group_nest().\n\nby_clarity <- diamonds |> \n group_nest(clarity)\n\nby_clarity\n#> # A tibble: 8 × 2\n#> clarity data\n#> <ord> <list<tibble[,9]>>\n#> 1 I1 [741 × 9]\n#> 2 SI2 [9,194 × 9]\n#> 3 SI1 [13,065 × 9]\n#> 4 VS2 [12,258 × 9]\n#> 5 VS1 [8,171 × 9]\n#> 6 VVS2 [5,066 × 9]\n#> # ℹ 2 more rows\n\nThis gives us a new tibble with eight rows and two columns. 
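If you’ve used tidyr before, you can get something very similar with nest(); a roughly equivalent sketch (the main practical difference is that group_nest() returns the nested tibbles as a type-stable list_of column):\n\ndiamonds |> \n tidyr::nest(data = -clarity)\n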
clarity is our grouping variable and data is a list-column containing one tibble for each unique value of clarity:\n\nby_clarity$data[[1]]\n#> # A tibble: 741 × 9\n#> carat cut color depth table price x y z\n#> <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#> 1 0.32 Premium E 60.9 58 345 4.38 4.42 2.68\n#> 2 1.17 Very Good J 60.2 61 2774 6.83 6.9 4.13\n#> 3 1.01 Premium F 61.8 60 2781 6.39 6.36 3.94\n#> 4 1.01 Fair E 64.5 58 2788 6.29 6.21 4.03\n#> 5 0.96 Ideal F 60.7 55 2801 6.37 6.41 3.88\n#> 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4 \n#> # ℹ 735 more rows\n\nWhile we’re here, let’s create a column that gives the name of the output file, using mutate() and str_glue():\n\nby_clarity <- by_clarity |> \n mutate(path = str_glue(\"diamonds-{clarity}.csv\"))\n\nby_clarity\n#> # A tibble: 8 × 3\n#> clarity data path \n#> <ord> <list<tibble[,9]>> <glue> \n#> 1 I1 [741 × 9] diamonds-I1.csv \n#> 2 SI2 [9,194 × 9] diamonds-SI2.csv \n#> 3 SI1 [13,065 × 9] diamonds-SI1.csv \n#> 4 VS2 [12,258 × 9] diamonds-VS2.csv \n#> 5 VS1 [8,171 × 9] diamonds-VS1.csv \n#> 6 VVS2 [5,066 × 9] diamonds-VVS2.csv\n#> # ℹ 2 more rows\n\nSo if we were going to save these data frames by hand, we might write something like:\n\nwrite_csv(by_clarity$data[[1]], by_clarity$path[[1]])\nwrite_csv(by_clarity$data[[2]], by_clarity$path[[2]])\nwrite_csv(by_clarity$data[[3]], by_clarity$path[[3]])\n...\nwrite_csv(by_clarity$data[[8]], by_clarity$path[[8]])\n\nThis is a little different to our previous uses of map() because there are two arguments that are changing, not just one. That means we need a new function: map2(), which varies both the first and second arguments. And because we again don’t care about the output, we want walk2() rather than map2(). That gives us:\n\nwalk2(by_clarity$data, by_clarity$path, write_csv)\n\n\n26.4.3 Saving plots\nWe can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:\n\ncarat_histogram <- function(df) {\n ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1) \n}\n\ncarat_histogram(by_clarity$data[[1]])\n\nNow we can use map() to create a list of many plots7 and their eventual file paths:\n\nby_clarity <- by_clarity |> \n mutate(\n plot = map(data, carat_histogram),\n path = str_glue(\"clarity-{clarity}.png\")\n )\n\nThen use walk2() with ggsave() to save each plot:\n\nwalk2(\n by_clarity$path,\n by_clarity$plot,\n \\(path, plot) ggsave(path, plot, width = 6, height = 6)\n)\n\nThis is shorthand for:\n\nggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)\nggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)\nggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)\n...\nggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)"
Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the Functionals chapter of Advanced R and consulting the purrr website.\nIf you know much about iteration in other languages, you might be surprised that we didn’t discuss the for loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group. And when you can’t, you can often use a functional programming tool like map() that does something to each element of a list. However, you will see for loops in wild-caught code, so you’ll learn about them in the next chapter where we’ll discuss some important base R tools." }, { "objectID": "iteration.html#footnotes", "href": "iteration.html#footnotes", "title": "26  Iteration", "section": "", "text": "Anonymous, because we never explicitly gave it a name with <-. Another term programmers use for this is “lambda function”.↩︎\nIn older code you might see syntax that looks like ~ .x + 1. This is another way to write anonymous functions but it only works inside tidyverse functions and always uses the variable name .x. We now recommend the base syntax, \\(x) x + 1.↩︎\nYou can’t currently change the order of the columns, but you could reorder them after the fact using relocate() or similar.↩︎\nMaybe there will be one day, but currently we don’t see how.↩︎\nIf you instead had a directory of csv files with the same format, you can use the technique from Seção 7.4.↩︎\nWe’re not going to explain how it works, but if you look at the docs for the functions used, you should be able to puzzle it out.↩︎\nYou can print by_clarity$plot to get a crude animation — you’ll get one plot for each element of plots.↩︎" }, { "objectID": "base-R.html#introduction", "href": "base-R.html#introduction", "title": "27  A field guide to base R", "section": "\n27.1 Introduction", "text": "27.1 Introduction\nTo finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code you’ll encounter in the wild.\nThis is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, increasing the consistency across functions, and making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.\nAfter you read this book, you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll undoubtedly encounter these other approaches when you start reading R code written by others, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!\nIn this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. 
To finish off, we’ll briefly discuss two essential plotting functions.\n\n27.1.1 Prerequisites\nThis chapter focuses on base R, so it doesn’t have any real prerequisites, but we’ll load the tidyverse in order to explain some of the differences.\n\nlibrary(tidyverse)" }, { "objectID": "base-R.html#sec-subset-many", "href": "base-R.html#sec-subset-many", "title": "27  A field guide to base R", "section": "\n27.2 Selecting multiple elements with [\n", "text": "27.2 Selecting multiple elements with [\n\n[ is used to extract sub-components from vectors and data frames, and is called like x[i] or x[i, j]. In this section, we’ll introduce you to the power of [, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of [.\n\n27.2.1 Subsetting vectors\nThere are five main types of things that you can subset a vector with, i.e., that can be the i in x[i]:\n\n\nA vector of positive integers. Subsetting with positive integers keeps the elements at those positions:\n\nx <- c(\"one\", \"two\", \"three\", \"four\", \"five\")\nx[c(3, 2, 5)]\n#> [1] \"three\" \"two\" \"five\"\n\nBy repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.\n\nx[c(1, 1, 5, 5, 5, 2)]\n#> [1] \"one\" \"one\" \"five\" \"five\" \"five\" \"two\"\n\n\n\nA vector of negative integers. Negative values drop the elements at the specified positions:\n\nx[c(-1, -3, -5)]\n#> [1] \"two\" \"four\"\n\n\n\nA logical vector. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.\n\nx <- c(10, 3, NA, 5, 8, 1, NA)\n\n# All non-missing values of x\nx[!is.na(x)]\n#> [1] 10 3 5 8 1\n\n# All even (or missing!) values of x\nx[x %% 2 == 0]\n#> [1] 10 NA 8 NA\n\nUnlike filter(), NA indices will be included in the output as NAs.\n\n\nA character vector. If you have a named vector, you can subset it with a character vector:\n\nx <- c(abc = 1, def = 2, xyz = 5)\nx[c(\"xyz\", \"def\")]\n#> xyz def \n#> 5 2\n\nAs with subsetting with positive integers, you can use a character vector to duplicate individual entries.\n\nNothing. The final type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but as we’ll see shortly, it is useful when subsetting 2d structures like tibbles.\n\n27.2.2 Subsetting data frames\nThere are quite a few different ways1 that you can use [ with a data frame, but the most important way is to select rows and columns independently with df[rows, cols]. Here rows and cols are vectors as described above. 
For example, df[rows, ] and df[, cols] select just rows or just columns, using the empty subset to preserve the other dimension.\nHere are a couple of examples:\n\ndf <- tibble(\n x = 1:3, \n y = c(\"a\", \"e\", \"f\"), \n z = runif(3)\n)\n\n# Select first row and second column\ndf[1, 2]\n#> # A tibble: 1 × 1\n#> y \n#> <chr>\n#> 1 a\n\n# Select all rows and columns x and y\ndf[, c(\"x\" , \"y\")]\n#> # A tibble: 3 × 2\n#> x y \n#> <int> <chr>\n#> 1 1 a \n#> 2 2 e \n#> 3 3 f\n\n# Select rows where `x` is greater than 1 and all columns\ndf[df$x > 1, ]\n#> # A tibble: 2 × 3\n#> x y z\n#> <int> <chr> <dbl>\n#> 1 2 e 0.834\n#> 2 3 f 0.601\n\nWe’ll come back to $ shortly, but you should be able to guess what df$x does from the context: it extracts the x variable from df. We need to use it here because [ doesn’t use tidy evaluation, so you need to be explicit about the source of the x variable.\nThere’s an important difference between tibbles and data frames when it comes to [. In this book, we’ve mainly used tibbles, which are data frames, but they tweak some behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write data.frame. If df is a data.frame, then df[, cols] will return a vector if col selects a single column and a data frame if it selects more than one column. If df is a tibble, then [ will always return a tibble.\n\ndf1 <- data.frame(x = 1:3)\ndf1[, \"x\"]\n#> [1] 1 2 3\n\ndf2 <- tibble(x = 1:3)\ndf2[, \"x\"]\n#> # A tibble: 3 × 1\n#> x\n#> <int>\n#> 1 1\n#> 2 2\n#> 3 3\n\nOne way to avoid this ambiguity with data.frames is to explicitly specify drop = FALSE:\n\ndf1[, \"x\" , drop = FALSE]\n#> x\n#> 1 1\n#> 2 2\n#> 3 3\n\n\n27.2.3 dplyr equivalents\nSeveral dplyr verbs are special cases of [:\n\n\nfilter() is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:\n\ndf <- tibble(\n x = c(2, 3, 1, 1, NA), \n y = letters[1:5], \n z = runif(5)\n)\ndf |> filter(x > 1)\n\n# same as\ndf[!is.na(df$x) & df$x > 1, ]\n\nAnother common technique in the wild is to use which() for its side-effect of dropping missing values: df[which(df$x > 1), ].\n\n\narrange() is equivalent to subsetting the rows with an integer vector, usually created with order():\n\ndf |> arrange(x, y)\n\n# same as\ndf[order(df$x, df$y), ]\n\nYou can use order(decreasing = TRUE) to sort all columns in descending order or -rank(col) to sort columns in decreasing order individually.\n\n\nBoth select() and relocate() are similar to subsetting the columns with a character vector:\n\ndf |> select(x, z)\n\n# same as\ndf[, c(\"x\", \"z\")]\n\n\n\nBase R also provides a function that combines the features of filter() and select()2 called subset():\n\ndf |> \n filter(x > 1) |> \n select(y, z)\n#> # A tibble: 2 × 2\n#> y z\n#> <chr> <dbl>\n#> 1 a 0.157 \n#> 2 b 0.00740\n\n\n# same as\ndf |> subset(x > 1, c(y, z))\n\nThis function was the inspiration for much of dplyr’s syntax.\n\n27.2.4 Exercises\n\n\nCreate functions that take a vector as input and return:\n\nThe elements at even-numbered positions.\nEvery element except the last value.\nOnly even values (and no missing values).\n\n\nWhy is x[-which(x > 0)] not the same as x[x <= 0]? Read the documentation for which() and do some experiments to figure it out." 
+ }, + { + "objectID": "base-R.html#sec-subset-one", + "href": "base-R.html#sec-subset-one", + "title": "27  A field guide to base R", + "section": "\n27.3 Selecting a single element with $ and [[\n", + "text": "27.3 Selecting a single element with $ and [[\n\n[, which selects many elements, is paired with [[ and $, which extract a single element. In this section, we’ll show you how to use [[ and $ to pull columns out of data frames, discuss a couple more differences between data.frames and tibbles, and emphasize some important differences between [ and [[ when used with lists.\n\n27.3.1 Data frames\n[[ and $ can be used to extract columns out of a data frame. [[ can access by position or by name, and $ is specialized for access by name:\n\ntb <- tibble(\n x = 1:4,\n y = c(10, 4, 1, 21)\n)\n\n# by position\ntb[[1]]\n#> [1] 1 2 3 4\n\n# by name\ntb[[\"x\"]]\n#> [1] 1 2 3 4\ntb$x\n#> [1] 1 2 3 4\n\nThey can also be used to create new columns, the base R equivalent of mutate():\n\ntb$z <- tb$x + tb$y\ntb\n#> # A tibble: 4 × 3\n#> x y z\n#> <int> <dbl> <dbl>\n#> 1 1 10 11\n#> 2 2 4 6\n#> 3 3 1 4\n#> 4 4 21 25\n\nThere are several other base R approaches to creating new columns including with transform(), with(), and within(). Hadley collected a few examples at https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf.\nUsing $ directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of cut, there’s no need to use summarize():\n\nmax(diamonds$carat)\n#> [1] 5.01\n\nlevels(diamonds$cut)\n#> [1] \"Fair\" \"Good\" \"Very Good\" \"Premium\" \"Ideal\"\n\ndplyr also provides an equivalent to [[/$ that we didn’t mention in Capítulo 3: pull(). pull() takes either a variable name or variable position and returns just that column. That means we could rewrite the above code to use the pipe:\n\ndiamonds |> pull(carat) |> max()\n#> [1] 5.01\n\ndiamonds |> pull(cut) |> levels()\n#> [1] \"Fair\" \"Good\" \"Very Good\" \"Premium\" \"Ideal\"\n\n\n27.3.2 Tibbles\nThere are a couple of important differences between tibbles and base data.frames when it comes to $. Data frames match the prefix of any variable names (so-called partial matching) and don’t complain if a column doesn’t exist:\n\ndf <- data.frame(x1 = 1)\ndf$x\n#> [1] 1\ndf$z\n#> NULL\n\nTibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:\n\ntb <- tibble(x1 = 1)\n\ntb$x\n#> Warning: Unknown or uninitialised column: `x`.\n#> NULL\ntb$z\n#> Warning: Unknown or uninitialised column: `z`.\n#> NULL\n\nFor this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.\n\n27.3.3 Lists\n[[ and $ are also really important for working with lists, and it’s important to understand how they differ from [. Let’s illustrate the differences with a list named l:\n\nl <- list(\n a = 1:3, \n b = \"a string\", \n c = pi, \n d = list(-1, -5)\n)\n\n\n\n[ extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.\n\nstr(l[1:2])\n#> List of 2\n#> $ a: int [1:3] 1 2 3\n#> $ b: chr \"a string\"\n\nstr(l[1])\n#> List of 1\n#> $ a: int [1:3] 1 2 3\n\nstr(l[4])\n#> List of 1\n#> $ d:List of 2\n#> ..$ : num -1\n#> ..$ : num -5\n\nLike with vectors, you can subset with a logical, integer, or character vector.\n\n\n[[ and $ extract a single component from a list. 
They remove a level of hierarchy from the list.\n\nstr(l[[1]])\n#> int [1:3] 1 2 3\n\nstr(l[[4]])\n#> List of 2\n#> $ : num -1\n#> $ : num -5\n\nstr(l$a)\n#> int [1:3] 1 2 3\n\n\n\nThe difference between [ and [[ is particularly important for lists because [[ drills down into the list while [ returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in Figura 27.1. If this pepper shaker is your list pepper, then pepper[1] is a pepper shaker containing a single pepper packet. pepper[2] would look the same, but would contain the second packet. pepper[1:2] would be a pepper shaker containing two pepper packets. pepper[[1]] would extract the pepper packet itself.\n\n\n\n\nFigura 27.1: (Left) A pepper shaker that Hadley once found in his hotel room. (Middle) pepper[1]. (Right) pepper[[1]]\n\n\n\nThis same principle applies when you use 1d [ with a data frame: df[\"x\"] returns a one-column data frame and df[[\"x\"]] returns a vector.\n\n27.3.4 Exercises\n\nWhat happens when you use [[ with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?\nWhat would pepper[[1]][1] be? What about pepper[[1]][[1]]?" }, { "objectID": "base-R.html#apply-family", "href": "base-R.html#apply-family", "title": "27  A field guide to base R", "section": "\n27.4 Apply family", "text": "27.4 Apply family\nIn Capítulo 26, you learned tidyverse techniques for iteration like dplyr::across() and the map family of functions. In this section, you’ll learn about their base equivalents, the apply family. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.\nThe most important member of this family is lapply(), which is very similar to purrr::map()3. In fact, because we haven’t used any of map()’s more advanced features, you can replace every map() call in Capítulo 26 with lapply().\nThere’s no exact base R equivalent to across() but you can get close by using [ with lapply(). This works because under the hood, data frames are lists of columns, so calling lapply() on a data frame applies the function to each column.\n\ndf <- tibble(a = 1, b = 2, c = \"a\", d = \"b\", e = 4)\n\n# First find numeric columns\nnum_cols <- sapply(df, is.numeric)\nnum_cols\n#> a b c d e \n#> TRUE TRUE FALSE FALSE TRUE\n\n# Then transform each column with lapply(), then replace the original values\ndf[, num_cols] <- lapply(df[, num_cols, drop = FALSE], \\(x) x * 2)\ndf\n#> # A tibble: 1 × 5\n#> a b c d e\n#> <dbl> <dbl> <chr> <chr> <dbl>\n#> 1 2 4 a b 8\n\nThe code above uses a new function, sapply(). It’s similar to lapply() but it always tries to simplify the result, hence the s in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called map_vec() that we didn’t mention in Capítulo 26.\nBase R provides a stricter version of sapply() called vapply(), short for vector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. 
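To see the failure mode vapply() guards against, here’s a minimal sketch: give sapply() a zero-length input and there is nothing to simplify, so it quietly returns an empty list rather than a logical vector:\n\nsapply(list(1, \"a\"), is.numeric)\n#> [1] TRUE FALSE\nsapply(list(), is.numeric)\n#> list()\n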
For example, we could replace the sapply() call above with this vapply() where we specify that we expect is.numeric() to return a logical vector of length 1:\n\nvapply(df, is.numeric, logical(1))\n#> a b c d e \n#> TRUE TRUE FALSE FALSE TRUE\n\nThe distinction between sapply() and vapply() is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.\nAnother important member of the apply family is tapply() which computes a single grouped summary:\n\ndiamonds |> \n group_by(cut) |> \n summarize(price = mean(price))\n#> # A tibble: 5 × 2\n#> cut price\n#> <ord> <dbl>\n#> 1 Fair 4359.\n#> 2 Good 3929.\n#> 3 Very Good 3982.\n#> 4 Premium 4584.\n#> 5 Ideal 3458.\n\ntapply(diamonds$price, diamonds$cut, mean)\n#> Fair Good Very Good Premium Ideal \n#> 4358.758 3928.864 3981.760 4584.258 3457.542\n\nUnfortunately tapply() returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use tapply() or other base techniques to perform other grouped summaries, Hadley has collected a few techniques in a gist.\nThe final member of the apply family is the titular apply(), which works with matrices and arrays. In particular, watch out for apply(df, 2, something), which is a slow and potentially dangerous way of doing lapply(df, something). This rarely comes up in data science because we usually work with data frames and not matrices." + }, + { + "objectID": "base-R.html#for-loops", + "href": "base-R.html#for-loops", + "title": "27  A field guide to base R", + "section": "\n27.5 for loops", + "text": "27.5 for loops\nfor loops are the fundamental building block of iteration that both the apply and map families use under the hood. for loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:\n\nfor (element in vector) {\n # do something with element\n}\n\nThe most straightforward use of for loops is to achieve the same effect as walk(): call some function with a side-effect on each element of a list. For example, in Seção 26.4.1 instead of using walk():\n\npaths |> walk(append_file)\n\nWe could have used a for loop:\n\nfor (path in paths) {\n append_file(path)\n}\n\nThings get a little trickier if you want to save the output of the for loop, for example reading all of the excel files in a directory like we did in Capítulo 26:\n\npaths <- dir(\"data/gapminder\", pattern = \"\\\\.xlsx$\", full.names = TRUE)\nfiles <- map(paths, readxl::read_excel)\n\nThere are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. 
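The usual recipe has three ingredients: create an output container of the right size up front, iterate over indices, and assign into the container as you go. A minimal skeleton (input and do_something() are placeholders, not real functions):\n\nout <- vector(\"list\", length(input))\nfor (i in seq_along(input)) {\n out[[i]] <- do_something(input[[i]])\n}\n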
In this case, we’re going to want a list the same length as paths, which we can create with vector():\n\nfiles <- vector(\"list\", length(paths))\n\nThen instead of iterating over the elements of paths, we’ll iterate over their indices, using seq_along() to generate one index for each element of paths:\n\nseq_along(paths)\n#> [1] 1 2 3 4 5 6 7 8 9 10 11 12\n\nUsing the indices is important because it allows us to link each position in the input with the corresponding position in the output:\n\nfor (i in seq_along(paths)) {\n files[[i]] <- readxl::read_excel(paths[[i]])\n}\n\nTo combine the list of tibbles into a single tibble, you can use do.call() + rbind():\n\ndo.call(rbind, files)\n#> # A tibble: 1,704 × 5\n#> country continent lifeExp pop gdpPercap\n#> <chr> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 28.8 8425333 779.\n#> 2 Albania Europe 55.2 1282697 1601.\n#> 3 Algeria Africa 43.1 9279525 2449.\n#> 4 Angola Africa 30.0 4232095 3521.\n#> 5 Argentina Americas 62.5 17876956 5911.\n#> 6 Australia Oceania 69.1 8691212 10040.\n#> # ℹ 1,698 more rows\n\nRather than making a list and saving the results as we go, a seemingly simpler approach is to build up the data frame piece-by-piece:\n\nout <- NULL\nfor (path in paths) {\n out <- rbind(out, readxl::read_excel(path))\n}\n\nWe recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that for loops are slow: they’re not, but iteratively growing a vector is." }, { "objectID": "base-R.html#plots", "href": "base-R.html#plots", "title": "27  A field guide to base R", "section": "\n27.6 Plots", "text": "27.6 Plots\nMany R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise — it takes very little typing to do a basic exploratory plot.\nThere are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with plot() and hist(), respectively. Here’s a quick example from the diamonds dataset:\n\n# Left\nhist(diamonds$carat)\n\n# Right\nplot(diamonds$carat, diamonds$price)\n\nNote that base plotting functions work with vectors, so you need to pull columns out of the data frame using $ or some other technique." }, { "objectID": "base-R.html#summary", "href": "base-R.html#summary", "title": "27  A field guide to base R", "section": "\n27.7 Summary", "text": "27.7 Summary\nIn this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.\nThis chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can program in R. We hope these chapters have sparked your interest in programming and that you’re looking forward to learning more outside of this book."
+ }, { "objectID": "base-R.html#footnotes", "href": "base-R.html#footnotes", "title": "27  A field guide to base R", "section": "", "text": "Read https://adv-r.hadley.nz/subsetting.html#subset-multiple to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.↩︎\nBut it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like starts_with().↩︎\nIt just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.↩︎" }, { "objectID": "communicate.html", "href": "communicate.html", "title": "Communicate", "section": "", "text": "So far, you have learned the tools to import your data into R, tidy it into a form convenient for analysis, and then understand it through transformation and visualization. However, it doesn’t matter how good your analysis is if you can’t explain it to other people: you need to communicate your results.\n\n\n\n\nFigura 1: Communication is the final part of the data science process; if you can’t communicate your results to other humans, it doesn’t matter how good your analysis is.\n\n\n\nCommunication is the theme of the following two chapters:\n\nIn Capítulo 28, you will learn about Quarto, a tool for integrating prose, code, and results. You can use Quarto for analyst-to-analyst communication as well as for analyst-to-decision-maker communication. Thanks to the power of Quarto formats, you can even use the same document for both purposes.\nIn Capítulo 29, you will learn a little about the many other varieties of output you can produce using Quarto, including dashboards, websites, and books.\n\nThese chapters focus mostly on the technical mechanics of communication, not the genuinely hard problems of communicating your thoughts to other humans. However, there are many other great books about communication, which we will point you to at the end of each chapter." 
+ }, { "objectID": "quarto.html#introduction", "href": "quarto.html#introduction", "title": "28  Quarto", "section": "\n28.1 Introduction", "text": "28.1 Introduction\nQuarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.\nQuarto files are designed to be used in three ways:\n\nFor communicating to decision-makers, who want to focus on the conclusions, not the code behind the analysis.\nFor collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).\nAs an environment in which to do data science, as a modern-day lab notebook where you can capture not only what you did, but also what you were thinking.\n\nQuarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through ?. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation.\nIf you’re an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. You’re not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system as well as extends it with native support for multiple programming languages like Python and Julia in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.\n\n28.1.1 Prerequisites\nYou need the Quarto command line interface (Quarto CLI), but you don’t need to explicitly install it or load it, as RStudio automatically does both when needed." }, { "objectID": "quarto.html#quarto-basics", "href": "quarto.html#quarto-basics", "title": "28  Quarto", "section": "\n28.2 Quarto basics", "text": "28.2 Quarto basics\nThis is a Quarto file – a plain text file that has the extension .qmd:\n\n---\ntitle: \"Diamond sizes\"\ndate: 2022-09-12\nformat: html\n---\n\n```{r}\n#| label: setup\n#| include: false\n\nlibrary(tidyverse)\n\nsmaller <- diamonds |> \n filter(carat <= 2.5)\n```\n\nWe have data about `r nrow(diamonds)` diamonds.\nOnly `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats.\nThe distribution of the remainder is shown below:\n\n```{r}\n#| label: plot-smaller-diamonds\n#| echo: false\n\nsmaller |> \n ggplot(aes(x = carat)) + \n geom_freqpoly(binwidth = 0.01)\n```\n\nIt contains three important types of content:\n\nAn (optional) YAML header surrounded by ---s.\n\nChunks of R code surrounded by ```.\nText mixed with simple text formatting like # heading and _italics_.\n\nFigura 28.1 shows a .qmd document in RStudio with the notebook interface, where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code.\n\n\n\n\nFigura 28.1: A Quarto document in RStudio. 
Code and output interleaved in the document, with the plot output appearing right underneath the code.\n\n\n\nIf you don’t like seeing your plots and output in your document and would rather make use of RStudio’s Console and Plot panes, you can click on the gear icon next to “Render” and switch to “Chunk Output in Console”, as shown in Figura 28.2.\n\n\n\n\nFigura 28.2: A Quarto document in RStudio with the plot output in the Plots pane.\n\n\n\nTo produce a complete report containing all text, code, and results, click “Render” or press Cmd/Ctrl + Shift + K. You can also do this programmatically with quarto::quarto_render(\"diamond-sizes.qmd\"). This will display the report in the viewer pane as shown in Figura 28.3 and create an HTML file.\n\n\n\n\nFigura 28.3: A Quarto document in RStudio with the rendered document in the Viewer pane.\n\n\n\nWhen you render the document, Quarto sends the .qmd file to knitr, https://yihui.org/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, https://pandoc.org, which is responsible for creating the finished file. This process is shown in Figura 28.4. The advantage of this two-step workflow is that you can create a very wide range of output formats, as you’ll learn about in Capítulo 29.\n\n\n\n\nFigura 28.4: Diagram of Quarto workflow from qmd, to knitr, to md, to pandoc, to output in PDF, MS Word, or HTML formats.\n\n\n\nTo get started with your own .qmd file, select File > New File > Quarto Document… in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.\nThe following sections dive into the three components of a Quarto document in more detail: the markdown text, the code chunks, and the YAML header.\n\n28.2.1 Exercises\n\nCreate a new Quarto document using File > New File > Quarto Document. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard shortcut. Verify that you can modify the code, re-run it, and see modified output.\nCreate one new Quarto document for each of the three built-in formats: HTML, PDF and Word. Render each of the three documents. How do the outputs differ? How do the inputs differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)" }, { "objectID": "quarto.html#visual-editor", "href": "quarto.html#visual-editor", "title": "28  Quarto", "section": "\n28.3 Visual editor", "text": "28.3 Visual editor\nThe Visual editor in RStudio provides a WYSIWYM interface for authoring Quarto documents. Under the hood, prose in Quarto documents (.qmd files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in Seção 28.4, it still requires learning new syntax. 
Therefore, if you’re new to computational documents like .qmd files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.\nIn the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc., or you can use the catch-all ⌘ / shortcut to insert just about anything. If you are at the beginning of a line (as shown in Figura 28.5), you can also enter just / to invoke the shortcut.\n\n\n\n\nFigura 28.5: Quarto visual editor.\n\n\n\nInserting images and customizing how they are displayed is also facilitated with the visual editor. You can either paste an image from your clipboard directly into the visual editor (and RStudio will place a copy of that image in the project directory and link to it) or you can use the visual editor’s Insert > Figure / Image menu to browse to the image you want to insert or paste its URL. In addition, using the same menu you can resize the image as well as add a caption, alternative text, and a link.\nThe visual editor has many more features that we haven’t enumerated here that you might find useful as you gain experience authoring with it.\nMost importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.\n\n28.3.1 Exercises\n\nRe-create the document in Figura 28.5 using the visual editor.\nUsing the visual editor, insert a code chunk using the Insert menu and then the insert anything tool.\nUsing the visual editor, figure out how to:\n\nAdd a footnote.\nAdd a horizontal rule.\nAdd a block quote.\n\n\nIn the visual editor, go to Insert > Citation and insert a citation to the paper titled Welcome to the Tidyverse using its DOI (digital object identifier), which is 10.21105/joss.01686. Render the document and observe how the reference shows up in the document. What change do you observe in the YAML of your document?" }, { "objectID": "quarto.html#sec-source-editor", "href": "quarto.html#sec-source-editor", "title": "28  Quarto", "section": "\n28.4 Source editor", "text": "28.4 Source editor\nYou can also edit Quarto documents using the Source editor in RStudio, without the assistance of the Visual editor. While the Visual editor will feel familiar to those with experience writing in tools like Google docs, the Source editor will feel familiar to those with experience writing R scripts or R Markdown documents. The Source editor can also be useful for debugging any Quarto syntax errors since it’s often easier to catch these in plain text.\nThe guide below shows how to use Pandoc’s Markdown for authoring Quarto documents in the source editor.\n\n## Text formatting\n\n*italic* **bold** ~~strikeout~~ `code`\n\nsuperscript^2^ subscript~2~\n\n[underline]{.underline} [small caps]{.smallcaps}\n\n## Headings\n\n# 1st Level Header\n\n## 2nd Level Header\n\n### 3rd Level Header\n\n## Lists\n\n- Bulleted list item 1\n\n- Item 2\n\n - Item 2a\n\n - Item 2b\n\n1. Numbered list item 1\n\n2. 
Item 2.\n The numbers are incremented automatically in the output.\n\n## Links and images\n\n<http://example.com>\n\n[linked phrase](http://example.com)\n\n![optional caption text](quarto.png){fig-alt=\"Quarto logo and the word quarto spelled in small case letters\"}\n\n## Tables\n\n| First Header | Second Header |\n|--------------|---------------|\n| Content Cell | Content Cell |\n| Content Cell | Content Cell |\n\nThe best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won’t need to think about them. If you forget, you can get to a handy reference sheet with Help > Markdown Quick Reference.\n\n28.4.1 Exercises\n\nPractice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.\n\nUsing the source editor and the Markdown quick reference, figure out how to:\n\nAdd a footnote.\nAdd a horizontal rule.\nAdd a block quote.\n\n\nCopy and paste the contents of diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto into a local R Quarto document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.\nCreate a document in a Google doc or MS Word (or locate a document you have created previously) with some content in it such as headings, hyperlinks, formatted text, etc. Copy the contents of this document and paste it into a Quarto document in the visual editor. Then, switch over to the source editor and inspect the source code." }, { "objectID": "quarto.html#code-chunks", "href": "quarto.html#code-chunks", "title": "28  Quarto", "section": "\n28.5 Code chunks", "text": "28.5 Code chunks\nTo run code inside a Quarto document, you need to insert a chunk. There are three ways to do so:\n\nThe keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.\nThe “Insert” button icon in the editor toolbar.\nBy manually typing the chunk delimiters ```{r} and ```.\n\nWe’d recommend you learn the keyboard shortcut. It will save you a lot of time in the long run!\nYou can continue to run the code using the keyboard shortcut that by now (we hope!) you know and love: Cmd/Ctrl + Enter. However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk. Think of a chunk like a function. A chunk should be relatively self-contained, and focused around a single task.\nThe following sections describe the chunk header, which consists of ```{r}, followed by an optional chunk label and various other chunk options, each on their own line, marked by #|.\n\n28.5.1 Chunk label\nChunks can be given an optional label, e.g.\n\n```{r}\n#| label: simple-addition\n\n1 + 1\n```\n#> [1] 2\n\nThis has three advantages:\n\n\nYou can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor.\n\nGraphics produced by the chunks will have useful names that make them easier to use elsewhere. More on that in Seção 28.6.\nYou can set up networks of cached chunks to avoid re-performing expensive computations on every run. More on that in Seção 28.8.\n\nYour chunk labels should be short but evocative and should not contain spaces. 
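For example, a (hypothetical) label like the one below tells you at a glance what the chunk draws, and will carry through to the code navigator and to the file name of the saved figure:\n\n```{r}\n#| label: plot-carat-distribution\n\nggplot(smaller, aes(x = carat)) + \n geom_freqpoly(binwidth = 0.01)\n```\n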
We recommend using dashes (-) to separate words (instead of underscores, _) and avoiding other special characters in chunk labels.\nYou are generally free to label your chunk however you like, but there is one chunk name that imbues special behavior: setup. When you’re in notebook mode, the chunk named setup will be run automatically once, before any other code is run.\nAdditionally, chunk labels cannot be duplicated. Each chunk label must be unique.\n\n28.5.2 Chunk options\nChunk output can be customized with options, fields supplied to the chunk header. Knitr provides almost 60 options that you can use to customize your code chunks. Here we’ll cover the most important chunk options that you’ll use frequently. You can see the full list at https://yihui.org/knitr/options.\nThe most important set of options controls whether your code block is executed and what results are inserted in the finished report:\n\neval: false prevents code from being evaluated. (And obviously if the code is not run, no results will be generated). This is useful for displaying example code, or for disabling a large block of code without commenting each line.\ninclude: false runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.\necho: false prevents code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.\nmessage: false or warning: false prevents messages or warnings from appearing in the finished file.\nresults: hide hides printed output; fig-show: hide hides plots.\nerror: true causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your .qmd. It’s also useful if you’re teaching R and want to deliberately include an error. The default, error: false causes rendering to fail if there is a single error in the document.\n\nEach of these chunk options gets added to the header of the chunk, following #|, e.g., in the following chunk the result is not printed since eval is set to false.\n\n```{r}\n#| label: simple-multiplication\n#| eval: false\n\n2 * 2\n```\n\nThe following table summarizes which types of output each option suppresses:\n\n| Option | Run code | Show code | Output | Plots | Messages | Warnings |\n|----------------|----------|-----------|--------|-------|----------|----------|\n| eval: false | X | | X | X | X | X |\n| include: false | | X | X | X | X | X |\n| echo: false | | X | | | | |\n| results: hide | | | X | | | |\n| fig-show: hide | | | | X | | |\n| message: false | | | | | X | |\n| warning: false | | | | | | X |\n\n28.5.3 Global options\nAs you work more with knitr, you will discover that some of the default chunk options don’t fit your needs and you want to change them.\nYou can do this by adding the preferred options in the document YAML, under execute. For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set echo: false at the document level. That will hide the code by default, so only showing the chunks you deliberately choose to show (with echo: true). 
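For example, with echo: false set globally, a chunk you do want readers to see just opts back in (a minimal sketch):\n\n```{r}\n#| echo: true\n\nsummary(diamonds$carat)\n```\n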
You might consider setting message: false and warning: false, but that would make it harder to debug problems because you wouldn’t see any messages in the final document.\ntitle: \"My report\"\nexecute:\n echo: false\nSince Quarto is designed to be multi-lingual (works with R as well as other languages like Python, Julia, etc.), not all of the knitr options are available at the document execution level, since some of them only work with knitr and not other engines Quarto uses for running code in other languages (e.g., Jupyter). You can, however, still set these as global options for your document under the knitr field, under opts_chunk. For example, when writing books and tutorials we set:\ntitle: \"Tutorial\"\nknitr:\n opts_chunk:\n comment: \"#>\"\n collapse: true\nThis uses our preferred comment formatting and ensures that the code and output are kept closely entwined.\n\n28.5.4 Inline code\nThere is one other way to embed R code into a Quarto document: directly into the text, with: `r `. This can be very useful if you mention properties of your data in the text. For example, the example document used at the start of the chapter had:\n\nWe have data about `r nrow(diamonds)` diamonds. Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats. The distribution of the remainder is shown below:\n\nWhen the report is rendered, the results of these computations are inserted into the text:\n\nWe have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:\n\nWhen inserting numbers into text, format() is your friend. It allows you to set the number of digits so you don’t print to a ridiculous degree of accuracy, and a big.mark to make numbers easier to read. You might combine these into a helper function:\n\ncomma <- function(x) format(x, digits = 2, big.mark = \",\")\ncomma(3452345)\n#> [1] \"3,452,345\"\ncomma(.12358124331)\n#> [1] \"0.12\"\n\n\n28.5.5 Exercises\n\nAdd a section that explores how diamond sizes vary by cut, color, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting echo: false on each chunk, set a global option.\nDownload diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.\nModify diamond-sizes.qmd to use label_comma() to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats." }, { "objectID": "quarto.html#sec-figures", "href": "quarto.html#sec-figures", "title": "28  Quarto", "section": "\n28.6 Figures", "text": "28.6 Figures\nThe figures in a Quarto document can be embedded (e.g., a PNG or JPEG file) or generated as a result of a code chunk.\nTo embed an image from an external file, you can use the Insert menu in the Visual Editor in RStudio and select Figure / Image. This will pop open a menu where you can browse to the image you want to insert as well as add alternative text or a caption to it and adjust its size. In the visual editor you can also simply paste an image from your clipboard into your document and RStudio will place a copy of that image in your project folder.\nIf you include a code chunk that generates a figure (e.g., includes a ggplot() call), the resulting figure will be automatically included in your Quarto document.\n\n28.6.1 Figure sizing\nThe biggest challenge of graphics in Quarto is getting your figures the right size and shape. 
There are five main options that control figure sizing: fig-width, fig-height, fig-asp, out-width, and out-height. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e. height, width, and aspect ratio: pick two of three).\nWe recommend three of the five options:\n\nPlots tend to be more aesthetically pleasing if they have consistent width. To enforce this, set fig-width: 6 (6”) and fig-asp: 0.618 (the golden ratio) in the defaults. Then in individual chunks, only adjust fig-asp.\n\nControl the output size with out-width and set it to a percentage of the body width of the output document. We suggest out-width: \"70%\" and fig-align: center. That gives plots room to breathe, without taking up too much space.\n\nTo put multiple plots in a single row, set layout-ncol to 2 for two plots, 3 for three plots, etc. This effectively sets out-width to “50%” for each of your plots if layout-ncol is 2, “33%” if layout-ncol is 3, etc. Depending on what you’re trying to illustrate (e.g., show data or show plot variations), you might also tweak fig-width, as discussed below.\n\nIf you find that you’re having to squint to read the text in your plot, you need to tweak fig-width. If fig-width is larger than the size at which the figure is rendered in the final doc, the text will be too small; if fig-width is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the fig-width and the eventual width in your document. To illustrate the principle, the following three plots have fig-width of 4, 6, and 8 respectively:\n\n(The same plot rendered at fig-width 4, 6, and 8.)\n\nIf you want to make sure the font size is consistent across all your figures, whenever you set out-width, you’ll also need to adjust fig-width to maintain the same ratio with your default out-width. For example, if your default fig-width is 6 and out-width is “70%”, when you set out-width: \"50%\" you’ll need to set fig-width to 4.3 (6 * 0.5 / 0.7).\nFigure sizing and scaling is an art and a science, and getting things right can require an iterative trial-and-error approach. You can learn more about figure sizing in the taking control of plot scaling blog post.\n\n28.6.2 Other important options\nWhen mingling code and text, like in this book, you can set fig-show: hold so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.\nTo add a caption to the plot, use fig-cap. In Quarto this will change the figure from inline to “floating”.\nIf you’re producing PDF output, the default graphics type is PDF. This is a good default because PDFs are high quality vector graphics. However, they can produce very large and slow plots if you are displaying thousands of points. In that case, set fig-format: \"png\" to force the use of PNGs. They are slightly lower quality, but will be much more compact.\nIt’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (e.g., if you want to quickly drop a single plot into an email).\n\n28.6.3 Exercises\n\nOpen diamond-sizes.qmd in the visual editor, find an image of a diamond, copy it, and paste it into the document. 
Double click on the image and add a caption. Resize the image and render your document. Observe how the image is saved in your current working directory.\nEdit the label of the code chunk in diamond-sizes.qmd that generates a plot to start with the prefix fig- and add a caption to the figure with the chunk option fig-cap. Then, edit the text above the code chunk to add a cross-reference to the figure with Insert > Cross Reference.\nChange the size of the figure with the following chunk options, one at a time, render your document, and describe how the figure changes.\n\nfig-width: 10\nfig-height: 3\nout-width: \"100%\"\nout-width: \"20%\"" + }, + { + "objectID": "quarto.html#tables", + "href": "quarto.html#tables", + "title": "28  Quarto", + "section": "\n28.7 Tables", + "text": "28.7 Tables\nSimilar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.\nBy default, Quarto prints data frames and matrices as you’d see them in the console:\n\nmtcars[1:5, ]\n#> mpg cyl disp hp drat wt qsec vs am gear carb\n#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4\n#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4\n#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1\n#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1\n#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2\n\nIf you prefer that data be displayed with additional formatting, you can use the knitr::kable() function. The code below generates Table 28.1.\n\nknitr::kable(mtcars[1:5, ])\n\nTable 28.1: A knitr kable.\n\n| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |\n| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |\n| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |\n| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |\n| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |\n\nRead the documentation for ?knitr::kable to see the other ways in which you can customize the table. For even deeper customization, consider the gt, huxtable, reactable, kableExtra, xtable, stargazer, pander, tables, and ascii packages. Each provides a set of tools for returning formatted tables from R code.\n\n28.7.1 Exercises\n\nOpen diamond-sizes.qmd in the visual editor, insert a code chunk, and add a table with knitr::kable() that shows the first 5 rows of the diamonds data frame.\nDisplay the same table with gt::gt() instead.\nAdd a chunk label that starts with the prefix tbl- and add a caption to the table with the chunk option tbl-cap. Then, edit the text above the code chunk to add a cross-reference to the table with Insert > Cross Reference." + }, + { + "objectID": "quarto.html#sec-caching", + "href": "quarto.html#sec-caching", + "title": "28  Quarto", + "section": "\n28.8 Caching", + "text": "28.8 Caching\nNormally, each render of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. 
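For example (an illustrative chunk, not one from the book's example document), a brute-force simulation like this is recomputed from scratch on every render:

```{r}
#| label: slow-simulation

# Hypothetical slow step: one hundred million random draws,
# repeated in full every time the document is rendered.
sims <- replicate(1000, mean(rnorm(1e5)))
```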
The solution is cache: true.\nYou can enable the knitr cache at the document level for caching the results of all computations in a document using standard YAML options:\n---\ntitle: \"My Document\"\nexecute: \n cache: true\n---\nYou can also enable caching at the chunk level for caching the results of computation in a specific chunk:\n\n```{r}\n#| cache: true\n\n# code for lengthy computation...\n```\n\nWhen set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.\nThe caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the processed-data chunk depends on the raw-data chunk:\n```{r}\n#| label: raw-data\n#| cache: true\n\nrawdata <- readr::read_csv(\"a_very_large_file.csv\")\n```\n```{r}\n#| label: processed-data\n#| cache: true\n\nprocessed_data <- rawdata |> \n filter(!is.na(import_var)) |> \n mutate(new_variable = complicated_transformation(x, y, z))\n```\nCaching the processed-data chunk means that it will get re-run if the dplyr pipeline is changed, but it won’t get rerun if the read_csv() call changes. You can avoid that problem with the dependson chunk option:\n```{r}\n#| label: processed-data\n#| cache: true\n#| dependson: \"raw-data\"\n\nprocessed_data <- rawdata |> \n filter(!is.na(import_var)) |> \n mutate(new_variable = complicated_transformation(x, y, z))\n```\ndependson should contain a character vector of every chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.\nNote that the chunks won’t update if a_very_large_file.csv changes, because knitr caching only tracks changes within the .qmd file. If you want to also track changes to that file you can use the cache.extra option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is file.mtime(): it returns the time at which the file was last modified. Then you can write:\n```{r}\n#| label: raw-data\n#| cache: true\n#| cache.extra: !expr file.mtime(\"a_very_large_file.csv\")\n\nrawdata <- readr::read_csv(\"a_very_large_file.csv\")\n```\nWe’ve followed the advice of David Robinson to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the dependson specification.\nAs your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with knitr::clean_cache().\n\n28.8.1 Exercises\n\nSet up a network of chunks where d depends on c and b, and both b and c depend on a. Have each chunk print lubridate::now(), set cache: true, then verify your understanding of caching." + }, + { + "objectID": "quarto.html#troubleshooting", + "href": "quarto.html#troubleshooting", + "title": "28  Quarto", + "section": "\n28.9 Troubleshooting", + "text": "28.9 Troubleshooting\nTroubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.\nOne common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. 
To address this issue, all you need to do is change one of the duplicated labels.\nIf the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu, under Run region, or with the keyboard shortcut Ctrl + Alt + R. If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.\nIf that doesn’t help, there must be something different between your interactive environment and the Quarto environment. You’re going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check that the working directory is what you expect by including getwd() in a chunk.\nNext, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your Quarto session. The easiest way to do that is to set error: true on the chunk causing the problem, then use print() and str() to check that settings are as you expect." + }, + { + "objectID": "quarto.html#yaml-header", + "href": "quarto.html#yaml-header", + "title": "28  Quarto", + "section": "\n28.10 YAML header", + "text": "28.10 YAML header\nYou can control many other “whole document” settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it’s “YAML Ain’t Markup Language”, which is designed for representing hierarchical data in a way that’s easy for humans to read and write. Quarto uses it to control many details of the output. Here we’ll discuss three: self-contained documents, document parameters, and bibliographies.\n\n28.10.1 Self-contained\nHTML documents typically have a number of external dependencies (e.g., images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a _files folder in the same directory as your .qmd file. If you publish the HTML file on a hosting platform (e.g., QuartoPub, https://quartopub.com/), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained, HTML document that embeds all of its dependencies. You can do this by specifying the embed-resources option:\nformat:\n html:\n embed-resources: true\nThe resulting file will be self-contained, such that it will need no external files and no internet access to be displayed properly by a browser.\n\n28.10.2 Parameters\nQuarto documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. 
To declare one or more parameters, use the params field.\nThis example uses a my_class parameter to determine which class of cars to display:\n\n---\nformat: html\nparams:\n my_class: \"suv\"\n---\n\n```{r}\n#| label: setup\n#| include: false\n\nlibrary(tidyverse)\n\nclass <- mpg |> filter(class == params$my_class)\n```\n\n# Fuel economy for `r params$my_class`s\n\n```{r}\n#| message: false\n\nggplot(class, aes(x = displ, y = hwy)) + \n geom_point() + \n geom_smooth(se = FALSE)\n```\n\nAs you can see, parameters are available within the code chunks as a read-only list named params.\nYou can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with !expr. This is a good way to specify date/time parameters.\nparams:\n start: !expr lubridate::ymd(\"2015-01-01\")\n snapshot: !expr lubridate::ymd_hms(\"2015-01-01 12:30:00\")\n\n28.10.3 Bibliographies and Citations\nQuarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.\nTo add a citation using the visual editor, go to Insert > Citation. Citations can be inserted from a variety of sources:\n\nDOI (Digital Object Identifier) references.\nZotero personal or group libraries.\nSearches of Crossref, DataCite, or PubMed.\nYour document bibliography (a .bib file in the directory of your document).\n\nUnder the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g., [@citation]).\nIf you add a citation using one of the first three methods, the visual editor will automatically create a bibliography.bib file for you and add the reference to it. It will also add a bibliography field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats including BibLaTeX, BibTeX, EndNote, and Medline.\nTo create a citation within your .qmd file in the source editor, use a key composed of ‘@’ + the citation identifier from the bibliography file. Then place the citation in square brackets. Here are some examples:\nSeparate multiple citations with a `;`: Blah blah [@smith04; @doe99].\n\nYou can add arbitrary comments inside the square brackets: \nBlah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].\n\nRemove the square brackets to create an in-text citation: @smith04 \nsays blah, or @smith04 [p. 33] says blah.\n\nAdd a `-` before the citation to suppress the author's name: \nSmith says blah [-@smith04].\nWhen Quarto renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as # References or # Bibliography.\nYou can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the csl field:\nbibliography: rmarkdown.bib\ncsl: apa.csl\nAs with the bibliography field, the csl field should contain a path to the file. Here we assume that the csl file is in the same directory as the .qmd file. A good place to find CSL style files for common bibliography styles is https://github.com/citation-style-language/styles." 
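For reference, a bibliography file is plain text. A minimal BibTeX entry for the smith04 key used in the examples above might look like this (the bibliographic details are invented purely for illustration):

```bibtex
@article{smith04,
  author  = {Smith, Jane},
  title   = {An Illustrative Article},
  journal = {Journal of Examples},
  year    = {2004}
}
```

Citing [@smith04] in the document then resolves against this entry when the file is listed in the bibliography field.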
+ }, + { + "objectID": "quarto.html#workflow", + "href": "quarto.html#workflow", + "title": "28  Quarto", + "section": "\n28.11 Workflow", + "text": "28.11 Workflow\nEarlier, we discussed a basic workflow for capturing your R code where you work interactively in the console, then capture what works in the script editor. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.\nQuarto is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:\n\nRecords what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!\nSupports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.\nHelps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.\n\nMuch of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (https://colinpurrington.com/tips/lab-notebooks) to come up with the following tips:\n\nEnsure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.\n\nUse the YAML header date field to record the date you started working on the notebook:\ndate: 2016-08-23\nUse the ISO8601 YYYY-MM-DD format so that there’s no ambiguity. Use it even if you don’t normally write dates that way!\n\nIf you spend a lot of time on an analysis idea and it turns out to be a dead end, don’t delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.\nGenerally, you’re better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using tibble::tribble().\nIf you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.\nBefore you finish for the day, make sure you can render the notebook. If you’re using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.\nIf you want your code to be reproducible in the long run (i.e. so you can come back to run it next month or next year), you’ll need to track the versions of the packages that your code uses. A rigorous approach is to use renv, https://rstudio.github.io/renv/index.html, which stores packages in your project directory (see the sketch just below). 
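The core renv workflow is small; here is a sketch (renv's own documentation covers the full story):

```r
renv::init()      # once per project: create a project-local library and renv.lock
# ...install and use packages as normal...
renv::snapshot()  # record the exact package versions your code currently uses
renv::restore()   # later, or on another machine: reinstall those exact versions
```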
A quick and dirty hack is to include a chunk that runs sessionInfo() — that won’t let you easily recreate your packages as they are today, but at least you’ll know what they were.\nYou are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme." + }, + { + "objectID": "quarto.html#summary", + "href": "quarto.html#summary", + "title": "28  Quarto", + "section": "\n28.12 Summary", + "text": "28.12 Summary\nIn this chapter we introduced you to Quarto for authoring and publishing reproducible computational documents that include your code and your prose in one place. You’ve learned about writing Quarto documents in RStudio with the visual or the source editor, how code chunks work and how to customize options for them, how to include figures and tables in your Quarto documents, and options for caching computations. Additionally, you’ve learned about adjusting YAML header options for creating self-contained or parametrized documents as well as including citations and bibliographies. We have also given you some troubleshooting and workflow tips.\nWhile this introduction should be sufficient to get you started with Quarto, there is still a lot more to learn. Quarto is still relatively young and growing rapidly. The best place to stay on top of innovations is the official Quarto website: https://quarto.org.\nThere are two important topics that we haven’t covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user-friendly introduction to Git and GitHub for R users, by Jenny Bryan. The book is freely available online: https://happygitwithr.com.\nWe have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either Style: Lessons in Clarity and Grace by Joseph M. Williams & Joseph Bizup, or The Sense of Structure: Writing from the Reader’s Perspective by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing clearer. (These books are rather expensive if purchased new, but they’re used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at https://www.georgegopen.com/the-litigation-articles.html. They are aimed at lawyers, but almost everything applies to data scientists too." + }, + { + "objectID": "quarto-formats.html#introduction", + "href": "quarto-formats.html#introduction", + "title": "29  Quarto formats", + "section": "\n29.1 Introduction", + "text": "29.1 Introduction\nSo far, you’ve seen Quarto used to produce HTML documents. 
This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.\nThere are two ways to set the output of a document:\n\n\nPermanently, by modifying the YAML header:\ntitle: \"Diamond sizes\"\nformat: html\n\n\nTransiently, by calling quarto::quarto_render() by hand:\n\nquarto::quarto_render(\"diamond-sizes.qmd\", output_format = \"docx\")\n\nThis is useful if you want to programmatically produce multiple types of output since the output_format argument can also take a list of values.\n\nquarto::quarto_render(\"diamond-sizes.qmd\", output_format = c(\"docx\", \"pdf\"))" + }, + { + "objectID": "quarto-formats.html#output-options", + "href": "quarto-formats.html#output-options", + "title": "29  Quarto formats", + "section": "\n29.2 Output options", + "text": "29.2 Output options\nQuarto offers a wide range of output formats. You can find the complete list at https://quarto.org/docs/output-formats/all-formats.html. Many formats share some output options (e.g., toc: true for including a table of contents), but others have options that are format specific (e.g., code-fold: true collapses code chunks into a <details> tag for HTML output so the user can display it on demand; it’s not applicable in a PDF or Word document).\nTo override the default options, you need to use an expanded format field. For example, if you wanted to render an HTML document with a floating table of contents, you’d use:\nformat:\n html:\n toc: true\nYou can even render to multiple outputs by supplying a list of formats:\nformat:\n html:\n toc: true\n pdf: default\n docx: default\nNote the special syntax (pdf: default) if you don’t want to override any default options.\nTo render to all formats specified in the YAML of a document, you can use output_format = \"all\".\n\nquarto::quarto_render(\"diamond-sizes.qmd\", output_format = \"all\")" + }, + { + "objectID": "quarto-formats.html#documents", + "href": "quarto-formats.html#documents", + "title": "29  Quarto formats", + "section": "\n29.3 Documents", + "text": "29.3 Documents\nThe previous chapter focused on the default html output. There are several basic variations on that theme, generating different types of documents. For example:\n\npdf makes a PDF with LaTeX (an open-source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.\ndocx for Microsoft Word (.docx) documents.\nodt for OpenDocument Text (.odt) documents.\nrtf for Rich Text Format (.rtf) documents.\ngfm for a GitHub Flavored Markdown (.md) document.\nipynb for Jupyter Notebooks (.ipynb).\n\nRemember, when generating a document to share with decision-makers, you can turn off the default display of code by setting global options in the document YAML:\nexecute:\n echo: false\nFor HTML documents, another option is to make the code chunks hidden by default, but visible with a click:\nformat:\n html:\n code-fold: true" + }, + { + "objectID": "quarto-formats.html#presentations", + "href": "quarto-formats.html#presentations", + "title": "29  Quarto formats", + "section": "\n29.4 Presentations", + "text": "29.4 Presentations\nYou can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (##) level header. 
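For example, a minimal presentation source might look like this sketch (the title and bullet content are invented for illustration; the role of the first-level header is explained next):

```
---
title: "Sales update"
format: revealjs
---

# Results

## Quarter in review

- Revenue by region
- Notable wins

## Outlook

- Forecast for next quarter
```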
Additionally, first (#) level headers indicate the beginning of a new section with a section title slide that is, by default, centered on the slide.\nQuarto supports a variety of presentation formats, including:\n\nrevealjs - HTML presentation with revealjs\npptx - PowerPoint presentation\nbeamer - PDF presentation with LaTeX Beamer.\n\nYou can read more about creating presentations with Quarto at https://quarto.org/docs/presentations." + }, + { + "objectID": "quarto-formats.html#interactivity", + "href": "quarto-formats.html#interactivity", + "title": "29  Quarto formats", + "section": "\n29.5 Interactivity", + "text": "29.5 Interactivity\nJust like any HTML document, HTML documents created with Quarto can contain interactive components as well. Here we introduce two options for including interactivity in your Quarto documents: htmlwidgets and Shiny.\n\n29.5.1 htmlwidgets\nHTML is an interactive format, and you can take advantage of that interactivity with htmlwidgets, R functions that produce interactive HTML visualizations. For example, take the leaflet map below. If you’re viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can’t do that in a book, so Quarto automatically inserts a static screenshot for you.\n\nlibrary(leaflet)\nleaflet() |>\n setView(174.764, -36.877, zoom = 16) |> \n addTiles() |>\n addMarkers(174.764, -36.877, popup = \"Maungawhau\")\n\nThe great thing about htmlwidgets is that you don’t need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don’t need to worry about them.\nThere are many packages that provide htmlwidgets, including:\n\ndygraphs for interactive time series visualizations.\nDT for interactive tables.\nthreejs for interactive 3D plots.\nDiagrammeR for diagrams (like flow charts and simple node-link diagrams).\n\nTo learn more about htmlwidgets and see a complete list of packages that provide them, visit https://www.htmlwidgets.org.\n\n29.5.2 Shiny\nhtmlwidgets provide client-side interactivity — all the interactivity happens in the browser, independently of R. On the one hand, that’s great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use shiny, a package that allows you to create interactivity using R code, not JavaScript.\nTo call Shiny code from a Quarto document, add server: shiny to the YAML header:\ntitle: \"Shiny Web App\"\nformat: html\nserver: shiny\nThen you can use the “input” functions to add interactive components to the document:\n\nlibrary(shiny)\n\ntextInput(\"name\", \"What is your name?\")\nnumericInput(\"age\", \"How old are you?\", NA, min = 0, max = 150)\n\nYou also need a code chunk with the chunk option context: server, which contains the code that needs to run in a Shiny server.\nYou can then refer to the values with input$name and input$age, and the code that uses them will be automatically re-run whenever they change.\nWe can’t show you a live shiny app here because shiny interactions occur on the server side. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. 
When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public-facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.\nFor learning more about Shiny, we recommend reading Mastering Shiny by Hadley Wickham, https://mastering-shiny.org." + }, + { + "objectID": "quarto-formats.html#websites-and-books", + "href": "quarto-formats.html#websites-and-books", + "title": "29  Quarto formats", + "section": "\n29.6 Websites and books", + "text": "29.6 Websites and books\nWith a bit of additional infrastructure, you can use Quarto to generate a complete website or book:\n\nPut your .qmd files in a single directory. index.qmd will become the home page.\n\nAdd a YAML file named _quarto.yml that provides the navigation for the site. In this file, set the project type to either book or website, e.g.:\nproject:\n type: book\n\n\nFor example, the following _quarto.yml file creates a website from three source files: index.qmd (the home page), viridis-colors.qmd, and terrain-colors.qmd.\n\nproject:\n type: website\n\nwebsite:\n title: \"A website on color scales\"\n navbar:\n left:\n - href: index.qmd\n text: Home\n - href: viridis-colors.qmd\n text: Viridis colors\n - href: terrain-colors.qmd\n text: Terrain colors\n\nThe _quarto.yml file you need for a book is structured very similarly. The following example shows how you can create a book with four chapters that renders to three different outputs (html, pdf, and epub). Once again, the source files are .qmd files.\n\nproject:\n type: book\n\nbook:\n title: \"A book on color scales\"\n author: \"Jane Coloriste\"\n chapters:\n - index.qmd\n - intro.qmd\n - viridis-colors.qmd\n - terrain-colors.qmd\n\nformat:\n html:\n theme: cosmo\n pdf: default\n epub: default\n\nWe recommend that you use an RStudio project for your websites and books. Based on the _quarto.yml file, RStudio will recognize the type of project you’re working on, and add a Build tab to the IDE that you can use to render and preview your websites and books. Both websites and books can also be rendered using quarto::quarto_render().\nRead more at https://quarto.org/docs/websites about Quarto websites and https://quarto.org/docs/books about books." + }, + { + "objectID": "quarto-formats.html#other-formats", + "href": "quarto-formats.html#other-formats", + "title": "29  Quarto formats", + "section": "\n29.7 Other formats", + "text": "29.7 Other formats\nQuarto offers even more output formats:\n\nYou can write journal articles using Quarto Journal Templates: https://quarto.org/docs/journals/templates.html.\nYou can output Quarto documents to Jupyter Notebooks with format: ipynb: https://quarto.org/docs/reference/formats/ipynb.html.\n\nSee https://quarto.org/docs/output-formats/all-formats.html for a list of even more formats." 
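Many of these output options combine naturally with programmatic rendering. As a sketch, you can render one document per parameter value with quarto::quarto_render() and its execute_params argument (the file name fuel-economy.qmd is hypothetical; my_class mirrors the params example from the previous chapter):

```r
library(quarto)

# Render one HTML report per car class from a single parameterized document.
for (class in c("suv", "compact", "pickup")) {
  quarto_render(
    "fuel-economy.qmd",
    output_file = paste0("fuel-economy-", class, ".html"),
    execute_params = list(my_class = class)
  )
}
```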
+ }, + { + "objectID": "quarto-formats.html#summary", + "href": "quarto-formats.html#summary", + "title": "29  Quarto formats", + "section": "\n29.8 Summary", + "text": "29.8 Summary\nIn this chapter we presented a variety of options for communicating your results with Quarto, from static and interactive documents to presentations to websites and books.\nTo learn more about effective communication in these different formats, we recommend the following resources:\n\nTo improve your presentation skills, try Presentation Patterns by Neal Ford, Matthew McCullough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.\nIf you give academic talks, you might like the Leek group guide to giving talks.\nWe haven’t taken it ourselves, but we’ve heard good things about Matt McGarrity’s online course on public speaking: https://www.coursera.org/learn/public-speaking.\nIf you are creating many dashboards, make sure to read Stephen Few’s Information Dashboard Design: The Effective Visual Communication of Data. It will help you create dashboards that are truly useful, not just pretty to look at.\nEffectively communicating your ideas often benefits from some knowledge of graphic design. Robin Williams’ The Non-Designer’s Design Book is a great place to start." } ] \ No newline at end of file diff --git a/site_libs/Proj4Leaflet-1.0.1/proj4leaflet.js b/site_libs/Proj4Leaflet-1.0.1/proj4leaflet.js new file mode 100644 index 000000000..eaa650c1b --- /dev/null +++ b/site_libs/Proj4Leaflet-1.0.1/proj4leaflet.js @@ -0,0 +1,272 @@ +(function (factory) { + var L, proj4; + if (typeof define === 'function' && define.amd) { + // AMD + define(['leaflet', 'proj4'], factory); + } else if (typeof module === 'object' && typeof module.exports === "object") { + // Node/CommonJS + L = require('leaflet'); + proj4 = require('proj4'); + module.exports = factory(L, proj4); + } else { + // Browser globals + if (typeof window.L === 'undefined' || typeof window.proj4 === 'undefined') + throw 'Leaflet and proj4 must be loaded first'; + factory(window.L, window.proj4); + } +}(function (L, proj4) { + if (proj4.__esModule && proj4.default) { + // If proj4 was bundled as an ES6 module, unwrap it to get + // to the actual main proj4 object. + // See discussion in https://github.com/kartena/Proj4Leaflet/pull/147 + proj4 = proj4.default; + } + + L.Proj = {}; + + L.Proj._isProj4Obj = function(a) { + return (typeof a.inverse !== 'undefined' && + typeof a.forward !== 'undefined'); + }; + + L.Proj.Projection = L.Class.extend({ + initialize: function(code, def, bounds) { + var isP4 = L.Proj._isProj4Obj(code); + this._proj = isP4 ? code : this._projFromCodeDef(code, def); + this.bounds = isP4 ? 
def : bounds; + }, + + project: function (latlng) { + var point = this._proj.forward([latlng.lng, latlng.lat]); + return new L.Point(point[0], point[1]); + }, + + unproject: function (point, unbounded) { + var point2 = this._proj.inverse([point.x, point.y]); + return new L.LatLng(point2[1], point2[0], unbounded); + }, + + _projFromCodeDef: function(code, def) { + if (def) { + proj4.defs(code, def); + } else if (proj4.defs[code] === undefined) { + var urn = code.split(':'); + if (urn.length > 3) { + code = urn[urn.length - 3] + ':' + urn[urn.length - 1]; + } + if (proj4.defs[code] === undefined) { + throw 'No projection definition for code ' + code; + } + } + + return proj4(code); + } + }); + + L.Proj.CRS = L.Class.extend({ + includes: L.CRS, + + options: { + transformation: new L.Transformation(1, 0, -1, 0) + }, + + initialize: function(a, b, c) { + var code, + proj, + def, + options; + + if (L.Proj._isProj4Obj(a)) { + proj = a; + code = proj.srsCode; + options = b || {}; + + this.projection = new L.Proj.Projection(proj, options.bounds); + } else { + code = a; + def = b; + options = c || {}; + this.projection = new L.Proj.Projection(code, def, options.bounds); + } + + L.Util.setOptions(this, options); + this.code = code; + this.transformation = this.options.transformation; + + if (this.options.origin) { + this.transformation = + new L.Transformation(1, -this.options.origin[0], + -1, this.options.origin[1]); + } + + if (this.options.scales) { + this._scales = this.options.scales; + } else if (this.options.resolutions) { + this._scales = []; + for (var i = this.options.resolutions.length - 1; i >= 0; i--) { + if (this.options.resolutions[i]) { + this._scales[i] = 1 / this.options.resolutions[i]; + } + } + } + + this.infinite = !this.options.bounds; + + }, + + scale: function(zoom) { + var iZoom = Math.floor(zoom), + baseScale, + nextScale, + scaleDiff, + zDiff; + if (zoom === iZoom) { + return this._scales[zoom]; + } else { + // Non-integer zoom, interpolate + baseScale = this._scales[iZoom]; + nextScale = this._scales[iZoom + 1]; + scaleDiff = nextScale - baseScale; + zDiff = (zoom - iZoom); + return baseScale + scaleDiff * zDiff; + } + }, + + zoom: function(scale) { + // Find closest number in this._scales, down + var downScale = this._closestElement(this._scales, scale), + downZoom = this._scales.indexOf(downScale), + nextScale, + nextZoom, + scaleDiff; + // Check if scale is downScale => return array index + if (scale === downScale) { + return downZoom; + } + if (downScale === undefined) { + return -Infinity; + } + // Interpolate + nextZoom = downZoom + 1; + nextScale = this._scales[nextZoom]; + if (nextScale === undefined) { + return Infinity; + } + scaleDiff = nextScale - downScale; + return (scale - downScale) / scaleDiff + downZoom; + }, + + distance: L.CRS.Earth.distance, + + R: L.CRS.Earth.R, + + /* Get the closest lowest element in an array */ + _closestElement: function(array, element) { + var low; + for (var i = array.length; i--;) { + if (array[i] <= element && (low === undefined || low < array[i])) { + low = array[i]; + } + } + return low; + } + }); + + L.Proj.GeoJSON = L.GeoJSON.extend({ + initialize: function(geojson, options) { + this._callLevel = 0; + L.GeoJSON.prototype.initialize.call(this, geojson, options); + }, + + addData: function(geojson) { + var crs; + + if (geojson) { + if (geojson.crs && geojson.crs.type === 'name') { + crs = new L.Proj.CRS(geojson.crs.properties.name); + } else if (geojson.crs && geojson.crs.type) { + crs = new L.Proj.CRS(geojson.crs.type + ':' 
+ geojson.crs.properties.code); + } + + if (crs !== undefined) { + this.options.coordsToLatLng = function(coords) { + var point = L.point(coords[0], coords[1]); + return crs.projection.unproject(point); + }; + } + } + + // Base class' addData might call us recursively, but + // CRS shouldn't be cleared in that case, since CRS applies + // to the whole GeoJSON, inluding sub-features. + this._callLevel++; + try { + L.GeoJSON.prototype.addData.call(this, geojson); + } finally { + this._callLevel--; + if (this._callLevel === 0) { + delete this.options.coordsToLatLng; + } + } + } + }); + + L.Proj.geoJson = function(geojson, options) { + return new L.Proj.GeoJSON(geojson, options); + }; + + L.Proj.ImageOverlay = L.ImageOverlay.extend({ + initialize: function (url, bounds, options) { + L.ImageOverlay.prototype.initialize.call(this, url, null, options); + this._projectedBounds = bounds; + }, + + // Danger ahead: Overriding internal methods in Leaflet. + // Decided to do this rather than making a copy of L.ImageOverlay + // and doing very tiny modifications to it. + // Future will tell if this was wise or not. + _animateZoom: function (event) { + var scale = this._map.getZoomScale(event.zoom); + var northWest = L.point(this._projectedBounds.min.x, this._projectedBounds.max.y); + var offset = this._projectedToNewLayerPoint(northWest, event.zoom, event.center); + + L.DomUtil.setTransform(this._image, offset, scale); + }, + + _reset: function () { + var zoom = this._map.getZoom(); + var pixelOrigin = this._map.getPixelOrigin(); + var bounds = L.bounds( + this._transform(this._projectedBounds.min, zoom)._subtract(pixelOrigin), + this._transform(this._projectedBounds.max, zoom)._subtract(pixelOrigin) + ); + var size = bounds.getSize(); + + L.DomUtil.setPosition(this._image, bounds.min); + this._image.style.width = size.x + 'px'; + this._image.style.height = size.y + 'px'; + }, + + _projectedToNewLayerPoint: function (point, zoom, center) { + var viewHalf = this._map.getSize()._divideBy(2); + var newTopLeft = this._map.project(center, zoom)._subtract(viewHalf)._round(); + var topLeft = newTopLeft.add(this._map._getMapPanePos()); + + return this._transform(point, zoom)._subtract(topLeft); + }, + + _transform: function (point, zoom) { + var crs = this._map.options.crs; + var transformation = crs.transformation; + var scale = crs.scale(zoom); + + return transformation.transform(point, scale); + } + }); + + L.Proj.imageOverlay = function (url, bounds, options) { + return new L.Proj.ImageOverlay(url, bounds, options); + }; + + return L.Proj; +})); diff --git a/site_libs/htmlwidgets-1.6.2/htmlwidgets.js b/site_libs/htmlwidgets-1.6.2/htmlwidgets.js new file mode 100644 index 000000000..1067d029f --- /dev/null +++ b/site_libs/htmlwidgets-1.6.2/htmlwidgets.js @@ -0,0 +1,901 @@ +(function() { + // If window.HTMLWidgets is already defined, then use it; otherwise create a + // new object. This allows preceding code to set options that affect the + // initialization process (though none currently exist). + window.HTMLWidgets = window.HTMLWidgets || {}; + + // See if we're running in a viewer pane. If not, we're in a web browser. + var viewerMode = window.HTMLWidgets.viewerMode = + /\bviewer_pane=1\b/.test(window.location); + + // See if we're running in Shiny mode. If not, it's a static document. + // Note that static widgets can appear in both Shiny and static modes, but + // obviously, Shiny widgets can only appear in Shiny apps/documents. 
+ var shinyMode = window.HTMLWidgets.shinyMode = + typeof(window.Shiny) !== "undefined" && !!window.Shiny.outputBindings; + + // We can't count on jQuery being available, so we implement our own + // version if necessary. + function querySelectorAll(scope, selector) { + if (typeof(jQuery) !== "undefined" && scope instanceof jQuery) { + return scope.find(selector); + } + if (scope.querySelectorAll) { + return scope.querySelectorAll(selector); + } + } + + function asArray(value) { + if (value === null) + return []; + if ($.isArray(value)) + return value; + return [value]; + } + + // Implement jQuery's extend + function extend(target /*, ... */) { + if (arguments.length == 1) { + return target; + } + for (var i = 1; i < arguments.length; i++) { + var source = arguments[i]; + for (var prop in source) { + if (source.hasOwnProperty(prop)) { + target[prop] = source[prop]; + } + } + } + return target; + } + + // IE8 doesn't support Array.forEach. + function forEach(values, callback, thisArg) { + if (values.forEach) { + values.forEach(callback, thisArg); + } else { + for (var i = 0; i < values.length; i++) { + callback.call(thisArg, values[i], i, values); + } + } + } + + // Replaces the specified method with the return value of funcSource. + // + // Note that funcSource should not BE the new method, it should be a function + // that RETURNS the new method. funcSource receives a single argument that is + // the overridden method, it can be called from the new method. The overridden + // method can be called like a regular function, it has the target permanently + // bound to it so "this" will work correctly. + function overrideMethod(target, methodName, funcSource) { + var superFunc = target[methodName] || function() {}; + var superFuncBound = function() { + return superFunc.apply(target, arguments); + }; + target[methodName] = funcSource(superFuncBound); + } + + // Add a method to delegator that, when invoked, calls + // delegatee.methodName. If there is no such method on + // the delegatee, but there was one on delegator before + // delegateMethod was called, then the original version + // is invoked instead. + // For example: + // + // var a = { + // method1: function() { console.log('a1'); } + // method2: function() { console.log('a2'); } + // }; + // var b = { + // method1: function() { console.log('b1'); } + // }; + // delegateMethod(a, b, "method1"); + // delegateMethod(a, b, "method2"); + // a.method1(); + // a.method2(); + // + // The output would be "b1", "a2". + function delegateMethod(delegator, delegatee, methodName) { + var inherited = delegator[methodName]; + delegator[methodName] = function() { + var target = delegatee; + var method = delegatee[methodName]; + + // The method doesn't exist on the delegatee. Instead, + // call the method on the delegator, if it exists. 
+ if (!method) { + target = delegator; + method = inherited; + } + + if (method) { + return method.apply(target, arguments); + } + }; + } + + // Implement a vague facsimilie of jQuery's data method + function elementData(el, name, value) { + if (arguments.length == 2) { + return el["htmlwidget_data_" + name]; + } else if (arguments.length == 3) { + el["htmlwidget_data_" + name] = value; + return el; + } else { + throw new Error("Wrong number of arguments for elementData: " + + arguments.length); + } + } + + // http://stackoverflow.com/questions/3446170/escape-string-for-use-in-javascript-regex + function escapeRegExp(str) { + return str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&"); + } + + function hasClass(el, className) { + var re = new RegExp("\\b" + escapeRegExp(className) + "\\b"); + return re.test(el.className); + } + + // elements - array (or array-like object) of HTML elements + // className - class name to test for + // include - if true, only return elements with given className; + // if false, only return elements *without* given className + function filterByClass(elements, className, include) { + var results = []; + for (var i = 0; i < elements.length; i++) { + if (hasClass(elements[i], className) == include) + results.push(elements[i]); + } + return results; + } + + function on(obj, eventName, func) { + if (obj.addEventListener) { + obj.addEventListener(eventName, func, false); + } else if (obj.attachEvent) { + obj.attachEvent(eventName, func); + } + } + + function off(obj, eventName, func) { + if (obj.removeEventListener) + obj.removeEventListener(eventName, func, false); + else if (obj.detachEvent) { + obj.detachEvent(eventName, func); + } + } + + // Translate array of values to top/right/bottom/left, as usual with + // the "padding" CSS property + // https://developer.mozilla.org/en-US/docs/Web/CSS/padding + function unpackPadding(value) { + if (typeof(value) === "number") + value = [value]; + if (value.length === 1) { + return {top: value[0], right: value[0], bottom: value[0], left: value[0]}; + } + if (value.length === 2) { + return {top: value[0], right: value[1], bottom: value[0], left: value[1]}; + } + if (value.length === 3) { + return {top: value[0], right: value[1], bottom: value[2], left: value[1]}; + } + if (value.length === 4) { + return {top: value[0], right: value[1], bottom: value[2], left: value[3]}; + } + } + + // Convert an unpacked padding object to a CSS value + function paddingToCss(paddingObj) { + return paddingObj.top + "px " + paddingObj.right + "px " + paddingObj.bottom + "px " + paddingObj.left + "px"; + } + + // Makes a number suitable for CSS + function px(x) { + if (typeof(x) === "number") + return x + "px"; + else + return x; + } + + // Retrieves runtime widget sizing information for an element. + // The return value is either null, or an object with fill, padding, + // defaultWidth, defaultHeight fields. + function sizingPolicy(el) { + var sizingEl = document.querySelector("script[data-for='" + el.id + "'][type='application/htmlwidget-sizing']"); + if (!sizingEl) + return null; + var sp = JSON.parse(sizingEl.textContent || sizingEl.text || "{}"); + if (viewerMode) { + return sp.viewer; + } else { + return sp.browser; + } + } + + // @param tasks Array of strings (or falsy value, in which case no-op). + // Each element must be a valid JavaScript expression that yields a + // function. 
Or, can be an array of objects with "code" and "data" + // properties; in this case, the "code" property should be a string + // of JS that's an expr that yields a function, and "data" should be + // an object that will be added as an additional argument when that + // function is called. + // @param target The object that will be "this" for each function + // execution. + // @param args Array of arguments to be passed to the functions. (The + // same arguments will be passed to all functions.) + function evalAndRun(tasks, target, args) { + if (tasks) { + forEach(tasks, function(task) { + var theseArgs = args; + if (typeof(task) === "object") { + theseArgs = theseArgs.concat([task.data]); + task = task.code; + } + var taskFunc = tryEval(task); + if (typeof(taskFunc) !== "function") { + throw new Error("Task must be a function! Source:\n" + task); + } + taskFunc.apply(target, theseArgs); + }); + } + } + + // Attempt eval() both with and without enclosing in parentheses. + // Note that enclosing coerces a function declaration into + // an expression that eval() can parse + // (otherwise, a SyntaxError is thrown) + function tryEval(code) { + var result = null; + try { + result = eval("(" + code + ")"); + } catch(error) { + if (!(error instanceof SyntaxError)) { + throw error; + } + try { + result = eval(code); + } catch(e) { + if (e instanceof SyntaxError) { + throw error; + } else { + throw e; + } + } + } + return result; + } + + function initSizing(el) { + var sizing = sizingPolicy(el); + if (!sizing) + return; + + var cel = document.getElementById("htmlwidget_container"); + if (!cel) + return; + + if (typeof(sizing.padding) !== "undefined") { + document.body.style.margin = "0"; + document.body.style.padding = paddingToCss(unpackPadding(sizing.padding)); + } + + if (sizing.fill) { + document.body.style.overflow = "hidden"; + document.body.style.width = "100%"; + document.body.style.height = "100%"; + document.documentElement.style.width = "100%"; + document.documentElement.style.height = "100%"; + cel.style.position = "absolute"; + var pad = unpackPadding(sizing.padding); + cel.style.top = pad.top + "px"; + cel.style.right = pad.right + "px"; + cel.style.bottom = pad.bottom + "px"; + cel.style.left = pad.left + "px"; + el.style.width = "100%"; + el.style.height = "100%"; + + return { + getWidth: function() { return cel.getBoundingClientRect().width; }, + getHeight: function() { return cel.getBoundingClientRect().height; } + }; + + } else { + el.style.width = px(sizing.width); + el.style.height = px(sizing.height); + + return { + getWidth: function() { return cel.getBoundingClientRect().width; }, + getHeight: function() { return cel.getBoundingClientRect().height; } + }; + } + } + + // Default implementations for methods + var defaults = { + find: function(scope) { + return querySelectorAll(scope, "." + this.name); + }, + renderError: function(el, err) { + var $el = $(el); + + this.clearError(el); + + // Add all these error classes, as Shiny does + var errClass = "shiny-output-error"; + if (err.type !== null) { + // use the classes of the error condition as CSS class names + errClass = errClass + " " + $.map(asArray(err.type), function(type) { + return errClass + "-" + type; + }).join(" "); + } + errClass = errClass + " htmlwidgets-error"; + + // Is el inline or block? If inline or inline-block, just display:none it + // and add an inline error. 
+ var display = $el.css("display"); + $el.data("restore-display-mode", display); + + if (display === "inline" || display === "inline-block") { + $el.hide(); + if (err.message !== "") { + var errorSpan = $("").addClass(errClass); + errorSpan.text(err.message); + $el.after(errorSpan); + } + } else if (display === "block") { + // If block, add an error just after the el, set visibility:none on the + // el, and position the error to be on top of the el. + // Mark it with a unique ID and CSS class so we can remove it later. + $el.css("visibility", "hidden"); + if (err.message !== "") { + var errorDiv = $("
        ").addClass(errClass).css("position", "absolute") + .css("top", el.offsetTop) + .css("left", el.offsetLeft) + // setting width can push out the page size, forcing otherwise + // unnecessary scrollbars to appear and making it impossible for + // the element to shrink; so use max-width instead + .css("maxWidth", el.offsetWidth) + .css("height", el.offsetHeight); + errorDiv.text(err.message); + $el.after(errorDiv); + + // Really dumb way to keep the size/position of the error in sync with + // the parent element as the window is resized or whatever. + var intId = setInterval(function() { + if (!errorDiv[0].parentElement) { + clearInterval(intId); + return; + } + errorDiv + .css("top", el.offsetTop) + .css("left", el.offsetLeft) + .css("maxWidth", el.offsetWidth) + .css("height", el.offsetHeight); + }, 500); + } + } + }, + clearError: function(el) { + var $el = $(el); + var display = $el.data("restore-display-mode"); + $el.data("restore-display-mode", null); + + if (display === "inline" || display === "inline-block") { + if (display) + $el.css("display", display); + $(el.nextSibling).filter(".htmlwidgets-error").remove(); + } else if (display === "block"){ + $el.css("visibility", "inherit"); + $(el.nextSibling).filter(".htmlwidgets-error").remove(); + } + }, + sizing: {} + }; + + // Called by widget bindings to register a new type of widget. The definition + // object can contain the following properties: + // - name (required) - A string indicating the binding name, which will be + // used by default as the CSS classname to look for. + // - initialize (optional) - A function(el) that will be called once per + // widget element; if a value is returned, it will be passed as the third + // value to renderValue. + // - renderValue (required) - A function(el, data, initValue) that will be + // called with data. Static contexts will cause this to be called once per + // element; Shiny apps will cause this to be called multiple times per + // element, as the data changes. + window.HTMLWidgets.widget = function(definition) { + if (!definition.name) { + throw new Error("Widget must have a name"); + } + if (!definition.type) { + throw new Error("Widget must have a type"); + } + // Currently we only support output widgets + if (definition.type !== "output") { + throw new Error("Unrecognized widget type '" + definition.type + "'"); + } + // TODO: Verify that .name is a valid CSS classname + + // Support new-style instance-bound definitions. Old-style class-bound + // definitions have one widget "object" per widget per type/class of + // widget; the renderValue and resize methods on such widget objects + // take el and instance arguments, because the widget object can't + // store them. New-style instance-bound definitions have one widget + // object per widget instance; the definition that's passed in doesn't + // provide renderValue or resize methods at all, just the single method + // factory(el, width, height) + // which returns an object that has renderValue(x) and resize(w, h). + // This enables a far more natural programming style for the widget + // author, who can store per-instance state using either OO-style + // instance fields or functional-style closure variables (I guess this + // is in contrast to what can only be called C-style pseudo-OO which is + // what we required before). 
+ if (definition.factory) { + definition = createLegacyDefinitionAdapter(definition); + } + + if (!definition.renderValue) { + throw new Error("Widget must have a renderValue function"); + } + + // For static rendering (non-Shiny), use a simple widget registration + // scheme. We also use this scheme for Shiny apps/documents that also + // contain static widgets. + window.HTMLWidgets.widgets = window.HTMLWidgets.widgets || []; + // Merge defaults into the definition; don't mutate the original definition. + var staticBinding = extend({}, defaults, definition); + overrideMethod(staticBinding, "find", function(superfunc) { + return function(scope) { + var results = superfunc(scope); + // Filter out Shiny outputs, we only want the static kind + return filterByClass(results, "html-widget-output", false); + }; + }); + window.HTMLWidgets.widgets.push(staticBinding); + + if (shinyMode) { + // Shiny is running. Register the definition with an output binding. + // The definition itself will not be the output binding, instead + // we will make an output binding object that delegates to the + // definition. This is because we foolishly used the same method + // name (renderValue) for htmlwidgets definition and Shiny bindings + // but they actually have quite different semantics (the Shiny + // bindings receive data that includes lots of metadata that it + // strips off before calling htmlwidgets renderValue). We can't + // just ignore the difference because in some widgets it's helpful + // to call this.renderValue() from inside of resize(), and if + // we're not delegating, then that call will go to the Shiny + // version instead of the htmlwidgets version. + + // Merge defaults with definition, without mutating either. + var bindingDef = extend({}, defaults, definition); + + // This object will be our actual Shiny binding. + var shinyBinding = new Shiny.OutputBinding(); + + // With a few exceptions, we'll want to simply use the bindingDef's + // version of methods if they are available, otherwise fall back to + // Shiny's defaults. NOTE: If Shiny's output bindings gain additional + // methods in the future, and we want them to be overrideable by + // HTMLWidget binding definitions, then we'll need to add them to this + // list. + delegateMethod(shinyBinding, bindingDef, "getId"); + delegateMethod(shinyBinding, bindingDef, "onValueChange"); + delegateMethod(shinyBinding, bindingDef, "onValueError"); + delegateMethod(shinyBinding, bindingDef, "renderError"); + delegateMethod(shinyBinding, bindingDef, "clearError"); + delegateMethod(shinyBinding, bindingDef, "showProgress"); + + // The find, renderValue, and resize are handled differently, because we + // want to actually decorate the behavior of the bindingDef methods. + + shinyBinding.find = function(scope) { + var results = bindingDef.find(scope); + + // Only return elements that are Shiny outputs, not static ones + var dynamicResults = results.filter(".html-widget-output"); + + // It's possible that whatever caused Shiny to think there might be + // new dynamic outputs, also caused there to be new static outputs. + // Since there might be lots of different htmlwidgets bindings, we + // schedule execution for later--no need to staticRender multiple + // times. + if (results.length !== dynamicResults.length) + scheduleStaticRender(); + + return dynamicResults; + }; + + // Wrap renderValue to handle initialization, which unfortunately isn't + // supported natively by Shiny at the time of this writing. 
+ + shinyBinding.renderValue = function(el, data) { + Shiny.renderDependencies(data.deps); + // Resolve strings marked as javascript literals to objects + if (!(data.evals instanceof Array)) data.evals = [data.evals]; + for (var i = 0; data.evals && i < data.evals.length; i++) { + window.HTMLWidgets.evaluateStringMember(data.x, data.evals[i]); + } + if (!bindingDef.renderOnNullValue) { + if (data.x === null) { + el.style.visibility = "hidden"; + return; + } else { + el.style.visibility = "inherit"; + } + } + if (!elementData(el, "initialized")) { + initSizing(el); + + elementData(el, "initialized", true); + if (bindingDef.initialize) { + var rect = el.getBoundingClientRect(); + var result = bindingDef.initialize(el, rect.width, rect.height); + elementData(el, "init_result", result); + } + } + bindingDef.renderValue(el, data.x, elementData(el, "init_result")); + evalAndRun(data.jsHooks.render, elementData(el, "init_result"), [el, data.x]); + }; + + // Only override resize if bindingDef implements it + if (bindingDef.resize) { + shinyBinding.resize = function(el, width, height) { + // Shiny can call resize before initialize/renderValue have been + // called, which doesn't make sense for widgets. + if (elementData(el, "initialized")) { + bindingDef.resize(el, width, height, elementData(el, "init_result")); + } + }; + } + + Shiny.outputBindings.register(shinyBinding, bindingDef.name); + } + }; + + var scheduleStaticRenderTimerId = null; + function scheduleStaticRender() { + if (!scheduleStaticRenderTimerId) { + scheduleStaticRenderTimerId = setTimeout(function() { + scheduleStaticRenderTimerId = null; + window.HTMLWidgets.staticRender(); + }, 1); + } + } + + // Render static widgets after the document finishes loading + // Statically render all elements that are of this widget's class + window.HTMLWidgets.staticRender = function() { + var bindings = window.HTMLWidgets.widgets || []; + forEach(bindings, function(binding) { + var matches = binding.find(document.documentElement); + forEach(matches, function(el) { + var sizeObj = initSizing(el, binding); + + var getSize = function(el) { + if (sizeObj) { + return {w: sizeObj.getWidth(), h: sizeObj.getHeight()} + } else { + var rect = el.getBoundingClientRect(); + return {w: rect.width, h: rect.height} + } + }; + + if (hasClass(el, "html-widget-static-bound")) + return; + el.className = el.className + " html-widget-static-bound"; + + var initResult; + if (binding.initialize) { + var size = getSize(el); + initResult = binding.initialize(el, size.w, size.h); + elementData(el, "init_result", initResult); + } + + if (binding.resize) { + var lastSize = getSize(el); + var resizeHandler = function(e) { + var size = getSize(el); + if (size.w === 0 && size.h === 0) + return; + if (size.w === lastSize.w && size.h === lastSize.h) + return; + lastSize = size; + binding.resize(el, size.w, size.h, initResult); + }; + + on(window, "resize", resizeHandler); + + // This is needed for cases where we're running in a Shiny + // app, but the widget itself is not a Shiny output, but + // rather a simple static widget. One example of this is + // an rmarkdown document that has runtime:shiny and widget + // that isn't in a render function. Shiny only knows to + // call resize handlers for Shiny outputs, not for static + // widgets, so we do it ourselves. 
+ if (window.jQuery) {
+ window.jQuery(document).on(
+ "shown.htmlwidgets shown.bs.tab.htmlwidgets shown.bs.collapse.htmlwidgets",
+ resizeHandler
+ );
+ window.jQuery(document).on(
+ "hidden.htmlwidgets hidden.bs.tab.htmlwidgets hidden.bs.collapse.htmlwidgets",
+ resizeHandler
+ );
+ }
+
+ // This is needed for the specific case of ioslides, which
+ // flips slides between display:none and display:block.
+ // Ideally we would not have to have ioslide-specific code
+ // here, but rather have ioslides raise a generic event,
+ // but the rmarkdown package just went to CRAN so the
+ // window to getting that fixed may be long.
+ if (window.addEventListener) {
+ // It's OK to limit this to window.addEventListener
+ // browsers because ioslides itself only supports
+ // such browsers.
+ on(document, "slideenter", resizeHandler);
+ on(document, "slideleave", resizeHandler);
+ }
+ }
+
+ var scriptData = document.querySelector("script[data-for='" + el.id + "'][type='application/json']");
+ if (scriptData) {
+ var data = JSON.parse(scriptData.textContent || scriptData.text);
+ // Resolve strings marked as javascript literals to objects
+ if (!(data.evals instanceof Array)) data.evals = [data.evals];
+ for (var k = 0; data.evals && k < data.evals.length; k++) {
+ window.HTMLWidgets.evaluateStringMember(data.x, data.evals[k]);
+ }
+ binding.renderValue(el, data.x, initResult);
+ evalAndRun(data.jsHooks.render, initResult, [el, data.x]);
+ }
+ });
+ });
+
+ invokePostRenderHandlers();
+ }
+
+
+ function has_jQuery3() {
+ if (!window.jQuery) {
+ return false;
+ }
+ var $version = window.jQuery.fn.jquery;
+ var $major_version = parseInt($version.split(".")[0]);
+ return $major_version >= 3;
+ }
+
+ /*
+ / Shiny 1.4 bumped jQuery from 1.x to 3.x which means jQuery's
+ / on-ready handler (i.e., $(fn)) is now asynchronous (i.e., it now
+ / really means $(setTimeout(fn))).
+ / https://jquery.com/upgrade-guide/3.0/#breaking-change-document-ready-handlers-are-now-asynchronous
+ /
+ / Since Shiny uses $() to schedule initShiny, shiny>=1.4 calls initShiny
+ / one tick later than it did before, which means staticRender() (and the
+ / renderValue() calls it makes) now runs earlier, relative to initShiny,
+ / than (advanced) widget authors might be expecting.
+ / https://github.com/rstudio/shiny/issues/2630
+ /
+ / For a concrete example, leaflet has some methods (e.g., updateBounds)
+ / which reference Shiny methods registered in initShiny (e.g., setInputValue).
+ / Since leaflet is privy to this life-cycle, it knows to use setTimeout() to
+ / delay execution of those methods (until Shiny methods are ready)
+ / https://github.com/rstudio/leaflet/blob/18ec981/javascript/src/index.js#L266-L268
+ /
+ / Ideally widget authors wouldn't need to use this setTimeout() hack that
+ / leaflet uses to call Shiny methods on a staticRender(). In the long run,
+ / the logic of initShiny should be broken up so that method registration happens
+ / right away, but binding happens later.
+ */
+ function maybeStaticRenderLater() {
+ if (shinyMode && has_jQuery3()) {
+ window.jQuery(window.HTMLWidgets.staticRender);
+ } else {
+ window.HTMLWidgets.staticRender();
+ }
+ }
+
+ if (document.addEventListener) {
+ document.addEventListener("DOMContentLoaded", function() {
+ document.removeEventListener("DOMContentLoaded", arguments.callee, false);
+ maybeStaticRenderLater();
+ }, false);
+ } else if (document.attachEvent) {
+ document.attachEvent("onreadystatechange", function() {
+ if (document.readyState === "complete") {
+ document.detachEvent("onreadystatechange", arguments.callee);
+ maybeStaticRenderLater();
+ }
+ });
+ }
+
+
+ window.HTMLWidgets.getAttachmentUrl = function(depname, key) {
+ // If no key, default to the first item
+ if (typeof(key) === "undefined")
+ key = 1;
+
+ var link = document.getElementById(depname + "-" + key + "-attachment");
+ if (!link) {
+ throw new Error("Attachment " + depname + "/" + key + " not found in document");
+ }
+ return link.getAttribute("href");
+ };
+
+ window.HTMLWidgets.dataframeToD3 = function(df) {
+ var names = [];
+ var length;
+ for (var name in df) {
+ if (df.hasOwnProperty(name))
+ names.push(name);
+ if (typeof(df[name]) !== "object" || typeof(df[name].length) === "undefined") {
+ throw new Error("All fields must be arrays");
+ } else if (typeof(length) !== "undefined" && length !== df[name].length) {
+ throw new Error("All fields must be arrays of the same length");
+ }
+ length = df[name].length;
+ }
+ var results = [];
+ var item;
+ for (var row = 0; row < length; row++) {
+ item = {};
+ for (var col = 0; col < names.length; col++) {
+ item[names[col]] = df[names[col]][row];
+ }
+ results.push(item);
+ }
+ return results;
+ };
+
+ window.HTMLWidgets.transposeArray2D = function(array) {
+ if (array.length === 0) return array;
+ var newArray = array[0].map(function(col, i) {
+ return array.map(function(row) {
+ return row[i]
+ })
+ });
+ return newArray;
+ };
+ // Split value at splitChar, but allow splitChar to be escaped
+ // using escapeChar. Any other characters escaped by escapeChar
+ // will be included as usual (including escapeChar itself).
+ function splitWithEscape(value, splitChar, escapeChar) {
+ var results = [];
+ var escapeMode = false;
+ var currentResult = "";
+ for (var pos = 0; pos < value.length; pos++) {
+ if (!escapeMode) {
+ if (value[pos] === splitChar) {
+ results.push(currentResult);
+ currentResult = "";
+ } else if (value[pos] === escapeChar) {
+ escapeMode = true;
+ } else {
+ currentResult += value[pos];
+ }
+ } else {
+ currentResult += value[pos];
+ escapeMode = false;
+ }
+ }
+ if (currentResult !== "") {
+ results.push(currentResult);
+ }
+ return results;
+ }
+ // Function authored by Yihui/JJ Allaire
+ window.HTMLWidgets.evaluateStringMember = function(o, member) {
+ var parts = splitWithEscape(member, '.', '\\');
+ for (var i = 0, l = parts.length; i < l; i++) {
+ var part = parts[i];
+ // part may be a character or 'numeric' member name
+ if (o !== null && typeof o === "object" && part in o) {
+ if (i == (l - 1)) { // if we are at the end of the line then evaluate
+ if (typeof o[part] === "string")
+ o[part] = tryEval(o[part]);
+ } else { // otherwise continue to next embedded object
+ o = o[part];
+ }
+ }
+ }
+ };
+
+ // Retrieve the HTMLWidget instance (i.e. the return value of an
+ // HTMLWidget binding's initialize() or factory() function)
+ // associated with an element, or null if none.
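+ // (Illustrative usage of the two data helpers above, with made-up
+ // values. htmlwidgets serializes R data frames column-major, so widget
+ // code commonly reshapes them into row-major records:
+ //
+ //   HTMLWidgets.dataframeToD3({ x: [1, 2], y: ["a", "b"] })
+ //   // => [{ x: 1, y: "a" }, { x: 2, y: "b" }]
+ //
+ //   HTMLWidgets.transposeArray2D([[1, 2, 3], [4, 5, 6]])
+ //   // => [[1, 4], [2, 5], [3, 6]]
+ //
+ // The getInstance() accessor described in the comment above follows.)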
+ window.HTMLWidgets.getInstance = function(el) {
+ return elementData(el, "init_result");
+ };
+
+ // Finds the first element in the scope that matches the selector,
+ // and returns the HTMLWidget instance (i.e. the return value of
+ // an HTMLWidget binding's initialize() or factory() function)
+ // associated with that element, if any. If no element matches the
+ // selector, or the first matching element has no HTMLWidget
+ // instance associated with it, then null is returned.
+ //
+ // The scope argument is optional, and defaults to window.document.
+ window.HTMLWidgets.find = function(scope, selector) {
+ if (arguments.length == 1) {
+ selector = scope;
+ scope = document;
+ }
+
+ var el = scope.querySelector(selector);
+ if (el === null) {
+ return null;
+ } else {
+ return window.HTMLWidgets.getInstance(el);
+ }
+ };
+
+ // Finds all elements in the scope that match the selector, and
+ // returns the HTMLWidget instances (i.e. the return values of
+ // an HTMLWidget binding's initialize() or factory() function)
+ // associated with the elements, in an array. If elements that
+ // match the selector don't have an associated HTMLWidget
+ // instance, the returned array will contain nulls.
+ //
+ // The scope argument is optional, and defaults to window.document.
+ window.HTMLWidgets.findAll = function(scope, selector) {
+ if (arguments.length == 1) {
+ selector = scope;
+ scope = document;
+ }
+
+ var nodes = scope.querySelectorAll(selector);
+ var results = [];
+ for (var i = 0; i < nodes.length; i++) {
+ results.push(window.HTMLWidgets.getInstance(nodes[i]));
+ }
+ return results;
+ };
+
+ var postRenderHandlers = [];
+ function invokePostRenderHandlers() {
+ while (postRenderHandlers.length) {
+ var handler = postRenderHandlers.shift();
+ if (handler) {
+ handler();
+ }
+ }
+ }
+
+ // Register the given callback function to be invoked after the
+ // next time static widgets are rendered.
+ window.HTMLWidgets.addPostRenderHandler = function(callback) {
+ postRenderHandlers.push(callback);
+ };
+
+ // Takes a new-style instance-bound definition, and returns an
+ // old-style class-bound definition. This saves us from having
+ // to rewrite all the logic in this file to accommodate both
+ // types of definitions.
+ function createLegacyDefinitionAdapter(defn) {
+ var result = {
+ name: defn.name,
+ type: defn.type,
+ initialize: function(el, width, height) {
+ return defn.factory(el, width, height);
+ },
+ renderValue: function(el, x, instance) {
+ return instance.renderValue(x);
+ },
+ resize: function(el, width, height, instance) {
+ return instance.resize(width, height);
+ }
+ };
+
+ if (defn.find)
+ result.find = defn.find;
+ if (defn.renderError)
+ result.renderError = defn.renderError;
+ if (defn.clearError)
+ result.clearError = defn.clearError;
+
+ return result;
+ }
+})();
diff --git a/site_libs/jquery-3.6.0/jquery-3.6.0.js b/site_libs/jquery-3.6.0/jquery-3.6.0.js
new file mode 100644
index 000000000..fc6c299b7
--- /dev/null
+++ b/site_libs/jquery-3.6.0/jquery-3.6.0.js
@@ -0,0 +1,10881 @@
+/*!
+ * jQuery JavaScript Library v3.6.0 + * https://jquery.com/ + * + * Includes Sizzle.js + * https://sizzlejs.com/ + * + * Copyright OpenJS Foundation and other contributors + * Released under the MIT license + * https://jquery.org/license + * + * Date: 2021-03-02T17:08Z + */ +( function( global, factory ) { + + "use strict"; + + if ( typeof module === "object" && typeof module.exports === "object" ) { + + // For CommonJS and CommonJS-like environments where a proper `window` + // is present, execute the factory and get jQuery. + // For environments that do not have a `window` with a `document` + // (such as Node.js), expose a factory as module.exports. + // This accentuates the need for the creation of a real `window`. + // e.g. var jQuery = require("jquery")(window); + // See ticket #14549 for more info. + module.exports = global.document ? + factory( global, true ) : + function( w ) { + if ( !w.document ) { + throw new Error( "jQuery requires a window with a document" ); + } + return factory( w ); + }; + } else { + factory( global ); + } + +// Pass this if window is not defined yet +} )( typeof window !== "undefined" ? window : this, function( window, noGlobal ) { + +// Edge <= 12 - 13+, Firefox <=18 - 45+, IE 10 - 11, Safari 5.1 - 9+, iOS 6 - 9.1 +// throw exceptions when non-strict code (e.g., ASP.NET 4.5) accesses strict mode +// arguments.callee.caller (trac-13335). But as of jQuery 3.0 (2016), strict mode should be common +// enough that all such attempts are guarded in a try block. +"use strict"; + +var arr = []; + +var getProto = Object.getPrototypeOf; + +var slice = arr.slice; + +var flat = arr.flat ? function( array ) { + return arr.flat.call( array ); +} : function( array ) { + return arr.concat.apply( [], array ); +}; + + +var push = arr.push; + +var indexOf = arr.indexOf; + +var class2type = {}; + +var toString = class2type.toString; + +var hasOwn = class2type.hasOwnProperty; + +var fnToString = hasOwn.toString; + +var ObjectFunctionString = fnToString.call( Object ); + +var support = {}; + +var isFunction = function isFunction( obj ) { + + // Support: Chrome <=57, Firefox <=52 + // In some browsers, typeof returns "function" for HTML elements + // (i.e., `typeof document.createElement( "object" ) === "function"`). + // We don't want to classify *any* DOM node as a function. + // Support: QtWeb <=3.8.5, WebKit <=534.34, wkhtmltopdf tool <=0.12.5 + // Plus for old WebKit, typeof returns "function" for HTML collections + // (e.g., `typeof document.getElementsByTagName("div") === "function"`). (gh-4756) + return typeof obj === "function" && typeof obj.nodeType !== "number" && + typeof obj.item !== "function"; + }; + + +var isWindow = function isWindow( obj ) { + return obj != null && obj === obj.window; + }; + + +var document = window.document; + + + + var preservedScriptAttributes = { + type: true, + src: true, + nonce: true, + noModule: true + }; + + function DOMEval( code, node, doc ) { + doc = doc || document; + + var i, val, + script = doc.createElement( "script" ); + + script.text = code; + if ( node ) { + for ( i in preservedScriptAttributes ) { + + // Support: Firefox 64+, Edge 18+ + // Some browsers don't support the "nonce" property on scripts. + // On the other hand, just using `getAttribute` is not enough as + // the `nonce` attribute is reset to an empty string whenever it + // becomes browsing-context connected. 
+ // See https://github.com/whatwg/html/issues/2369 + // See https://html.spec.whatwg.org/#nonce-attributes + // The `node.getAttribute` check was added for the sake of + // `jQuery.globalEval` so that it can fake a nonce-containing node + // via an object. + val = node[ i ] || node.getAttribute && node.getAttribute( i ); + if ( val ) { + script.setAttribute( i, val ); + } + } + } + doc.head.appendChild( script ).parentNode.removeChild( script ); + } + + +function toType( obj ) { + if ( obj == null ) { + return obj + ""; + } + + // Support: Android <=2.3 only (functionish RegExp) + return typeof obj === "object" || typeof obj === "function" ? + class2type[ toString.call( obj ) ] || "object" : + typeof obj; +} +/* global Symbol */ +// Defining this global in .eslintrc.json would create a danger of using the global +// unguarded in another place, it seems safer to define global only for this module + + + +var + version = "3.6.0", + + // Define a local copy of jQuery + jQuery = function( selector, context ) { + + // The jQuery object is actually just the init constructor 'enhanced' + // Need init if jQuery is called (just allow error to be thrown if not included) + return new jQuery.fn.init( selector, context ); + }; + +jQuery.fn = jQuery.prototype = { + + // The current version of jQuery being used + jquery: version, + + constructor: jQuery, + + // The default length of a jQuery object is 0 + length: 0, + + toArray: function() { + return slice.call( this ); + }, + + // Get the Nth element in the matched element set OR + // Get the whole matched element set as a clean array + get: function( num ) { + + // Return all the elements in a clean array + if ( num == null ) { + return slice.call( this ); + } + + // Return just the one element from the set + return num < 0 ? this[ num + this.length ] : this[ num ]; + }, + + // Take an array of elements and push it onto the stack + // (returning the new matched element set) + pushStack: function( elems ) { + + // Build a new jQuery matched element set + var ret = jQuery.merge( this.constructor(), elems ); + + // Add the old object onto the stack (as a reference) + ret.prevObject = this; + + // Return the newly-formed element set + return ret; + }, + + // Execute a callback for every element in the matched set. + each: function( callback ) { + return jQuery.each( this, callback ); + }, + + map: function( callback ) { + return this.pushStack( jQuery.map( this, function( elem, i ) { + return callback.call( elem, i, elem ); + } ) ); + }, + + slice: function() { + return this.pushStack( slice.apply( this, arguments ) ); + }, + + first: function() { + return this.eq( 0 ); + }, + + last: function() { + return this.eq( -1 ); + }, + + even: function() { + return this.pushStack( jQuery.grep( this, function( _elem, i ) { + return ( i + 1 ) % 2; + } ) ); + }, + + odd: function() { + return this.pushStack( jQuery.grep( this, function( _elem, i ) { + return i % 2; + } ) ); + }, + + eq: function( i ) { + var len = this.length, + j = +i + ( i < 0 ? len : 0 ); + return this.pushStack( j >= 0 && j < len ? [ this[ j ] ] : [] ); + }, + + end: function() { + return this.prevObject || this.constructor(); + }, + + // For internal use only. + // Behaves like an Array's method, not like a jQuery method. 
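+ // (Illustrative chaining examples for the traversal methods above, in a
+ // hypothetical document containing several <li> elements:
+ //
+ //   jQuery( "li" ).eq( -1 )          // the last <li>, as a jQuery set
+ //   jQuery( "li" ).odd()             // the 2nd, 4th, ... <li>
+ //   jQuery( "li" ).slice( 1 ).end()  // back to the full set via prevObject
+ //
+ // The three Array methods below are borrowed wholesale.)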
+ push: push, + sort: arr.sort, + splice: arr.splice +}; + +jQuery.extend = jQuery.fn.extend = function() { + var options, name, src, copy, copyIsArray, clone, + target = arguments[ 0 ] || {}, + i = 1, + length = arguments.length, + deep = false; + + // Handle a deep copy situation + if ( typeof target === "boolean" ) { + deep = target; + + // Skip the boolean and the target + target = arguments[ i ] || {}; + i++; + } + + // Handle case when target is a string or something (possible in deep copy) + if ( typeof target !== "object" && !isFunction( target ) ) { + target = {}; + } + + // Extend jQuery itself if only one argument is passed + if ( i === length ) { + target = this; + i--; + } + + for ( ; i < length; i++ ) { + + // Only deal with non-null/undefined values + if ( ( options = arguments[ i ] ) != null ) { + + // Extend the base object + for ( name in options ) { + copy = options[ name ]; + + // Prevent Object.prototype pollution + // Prevent never-ending loop + if ( name === "__proto__" || target === copy ) { + continue; + } + + // Recurse if we're merging plain objects or arrays + if ( deep && copy && ( jQuery.isPlainObject( copy ) || + ( copyIsArray = Array.isArray( copy ) ) ) ) { + src = target[ name ]; + + // Ensure proper type for the source value + if ( copyIsArray && !Array.isArray( src ) ) { + clone = []; + } else if ( !copyIsArray && !jQuery.isPlainObject( src ) ) { + clone = {}; + } else { + clone = src; + } + copyIsArray = false; + + // Never move original objects, clone them + target[ name ] = jQuery.extend( deep, clone, copy ); + + // Don't bring in undefined values + } else if ( copy !== undefined ) { + target[ name ] = copy; + } + } + } + } + + // Return the modified object + return target; +}; + +jQuery.extend( { + + // Unique for each copy of jQuery on the page + expando: "jQuery" + ( version + Math.random() ).replace( /\D/g, "" ), + + // Assume jQuery is ready without the ready module + isReady: true, + + error: function( msg ) { + throw new Error( msg ); + }, + + noop: function() {}, + + isPlainObject: function( obj ) { + var proto, Ctor; + + // Detect obvious negatives + // Use toString instead of jQuery.type to catch host objects + if ( !obj || toString.call( obj ) !== "[object Object]" ) { + return false; + } + + proto = getProto( obj ); + + // Objects with no prototype (e.g., `Object.create( null )`) are plain + if ( !proto ) { + return true; + } + + // Objects with prototype are plain iff they were constructed by a global Object function + Ctor = hasOwn.call( proto, "constructor" ) && proto.constructor; + return typeof Ctor === "function" && fnToString.call( Ctor ) === ObjectFunctionString; + }, + + isEmptyObject: function( obj ) { + var name; + + for ( name in obj ) { + return false; + } + return true; + }, + + // Evaluates a script in a provided context; falls back to the global one + // if not specified. 
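+ // (Illustrative note on jQuery.extend above, with made-up objects: a
+ // deep merge (first argument true) recurses into plain objects and
+ // arrays, while a shallow merge replaces them outright:
+ //
+ //   jQuery.extend( true, { a: { b: 1 } }, { a: { c: 2 } } )
+ //   // => { a: { b: 1, c: 2 } }
+ //   jQuery.extend( { a: { b: 1 } }, { a: { c: 2 } } )
+ //   // => { a: { c: 2 } }
+ //
+ // The "__proto__" guard above also keeps a merge from polluting
+ // Object.prototype. globalEval, described in the comment above, follows.)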
+ globalEval: function( code, options, doc ) { + DOMEval( code, { nonce: options && options.nonce }, doc ); + }, + + each: function( obj, callback ) { + var length, i = 0; + + if ( isArrayLike( obj ) ) { + length = obj.length; + for ( ; i < length; i++ ) { + if ( callback.call( obj[ i ], i, obj[ i ] ) === false ) { + break; + } + } + } else { + for ( i in obj ) { + if ( callback.call( obj[ i ], i, obj[ i ] ) === false ) { + break; + } + } + } + + return obj; + }, + + // results is for internal usage only + makeArray: function( arr, results ) { + var ret = results || []; + + if ( arr != null ) { + if ( isArrayLike( Object( arr ) ) ) { + jQuery.merge( ret, + typeof arr === "string" ? + [ arr ] : arr + ); + } else { + push.call( ret, arr ); + } + } + + return ret; + }, + + inArray: function( elem, arr, i ) { + return arr == null ? -1 : indexOf.call( arr, elem, i ); + }, + + // Support: Android <=4.0 only, PhantomJS 1 only + // push.apply(_, arraylike) throws on ancient WebKit + merge: function( first, second ) { + var len = +second.length, + j = 0, + i = first.length; + + for ( ; j < len; j++ ) { + first[ i++ ] = second[ j ]; + } + + first.length = i; + + return first; + }, + + grep: function( elems, callback, invert ) { + var callbackInverse, + matches = [], + i = 0, + length = elems.length, + callbackExpect = !invert; + + // Go through the array, only saving the items + // that pass the validator function + for ( ; i < length; i++ ) { + callbackInverse = !callback( elems[ i ], i ); + if ( callbackInverse !== callbackExpect ) { + matches.push( elems[ i ] ); + } + } + + return matches; + }, + + // arg is for internal usage only + map: function( elems, callback, arg ) { + var length, value, + i = 0, + ret = []; + + // Go through the array, translating each of the items to their new values + if ( isArrayLike( elems ) ) { + length = elems.length; + for ( ; i < length; i++ ) { + value = callback( elems[ i ], i, arg ); + + if ( value != null ) { + ret.push( value ); + } + } + + // Go through every key on the object, + } else { + for ( i in elems ) { + value = callback( elems[ i ], i, arg ); + + if ( value != null ) { + ret.push( value ); + } + } + } + + // Flatten any nested arrays + return flat( ret ); + }, + + // A global GUID counter for objects + guid: 1, + + // jQuery.support is not used in Core but other projects attach their + // properties to it so it needs to exist. + support: support +} ); + +if ( typeof Symbol === "function" ) { + jQuery.fn[ Symbol.iterator ] = arr[ Symbol.iterator ]; +} + +// Populate the class2type map +jQuery.each( "Boolean Number String Function Array Date RegExp Object Error Symbol".split( " " ), + function( _i, name ) { + class2type[ "[object " + name + "]" ] = name.toLowerCase(); + } ); + +function isArrayLike( obj ) { + + // Support: real iOS 8.2 only (not reproducible in simulator) + // `in` check used to prevent JIT error (gh-2145) + // hasOwn isn't used here due to false negatives + // regarding Nodelist length in IE + var length = !!obj && "length" in obj && obj.length, + type = toType( obj ); + + if ( isFunction( obj ) || isWindow( obj ) ) { + return false; + } + + return type === "array" || length === 0 || + typeof length === "number" && length > 0 && ( length - 1 ) in obj; +} +var Sizzle = +/*! 
+ * Sizzle CSS Selector Engine v2.3.6 + * https://sizzlejs.com/ + * + * Copyright JS Foundation and other contributors + * Released under the MIT license + * https://js.foundation/ + * + * Date: 2021-02-16 + */ +( function( window ) { +var i, + support, + Expr, + getText, + isXML, + tokenize, + compile, + select, + outermostContext, + sortInput, + hasDuplicate, + + // Local document vars + setDocument, + document, + docElem, + documentIsHTML, + rbuggyQSA, + rbuggyMatches, + matches, + contains, + + // Instance-specific data + expando = "sizzle" + 1 * new Date(), + preferredDoc = window.document, + dirruns = 0, + done = 0, + classCache = createCache(), + tokenCache = createCache(), + compilerCache = createCache(), + nonnativeSelectorCache = createCache(), + sortOrder = function( a, b ) { + if ( a === b ) { + hasDuplicate = true; + } + return 0; + }, + + // Instance methods + hasOwn = ( {} ).hasOwnProperty, + arr = [], + pop = arr.pop, + pushNative = arr.push, + push = arr.push, + slice = arr.slice, + + // Use a stripped-down indexOf as it's faster than native + // https://jsperf.com/thor-indexof-vs-for/5 + indexOf = function( list, elem ) { + var i = 0, + len = list.length; + for ( ; i < len; i++ ) { + if ( list[ i ] === elem ) { + return i; + } + } + return -1; + }, + + booleans = "checked|selected|async|autofocus|autoplay|controls|defer|disabled|hidden|" + + "ismap|loop|multiple|open|readonly|required|scoped", + + // Regular expressions + + // http://www.w3.org/TR/css3-selectors/#whitespace + whitespace = "[\\x20\\t\\r\\n\\f]", + + // https://www.w3.org/TR/css-syntax-3/#ident-token-diagram + identifier = "(?:\\\\[\\da-fA-F]{1,6}" + whitespace + + "?|\\\\[^\\r\\n\\f]|[\\w-]|[^\0-\\x7f])+", + + // Attribute selectors: http://www.w3.org/TR/selectors/#attribute-selectors + attributes = "\\[" + whitespace + "*(" + identifier + ")(?:" + whitespace + + + // Operator (capture 2) + "*([*^$|!~]?=)" + whitespace + + + // "Attribute values must be CSS identifiers [capture 5] + // or strings [capture 3 or capture 4]" + "*(?:'((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\"|(" + identifier + "))|)" + + whitespace + "*\\]", + + pseudos = ":(" + identifier + ")(?:\\((" + + + // To reduce the number of selectors needing tokenize in the preFilter, prefer arguments: + // 1. quoted (capture 3; capture 4 or capture 5) + "('((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\")|" + + + // 2. simple (capture 6) + "((?:\\\\.|[^\\\\()[\\]]|" + attributes + ")*)|" + + + // 3. 
anything else (capture 2) + ".*" + + ")\\)|)", + + // Leading and non-escaped trailing whitespace, capturing some non-whitespace characters preceding the latter + rwhitespace = new RegExp( whitespace + "+", "g" ), + rtrim = new RegExp( "^" + whitespace + "+|((?:^|[^\\\\])(?:\\\\.)*)" + + whitespace + "+$", "g" ), + + rcomma = new RegExp( "^" + whitespace + "*," + whitespace + "*" ), + rcombinators = new RegExp( "^" + whitespace + "*([>+~]|" + whitespace + ")" + whitespace + + "*" ), + rdescend = new RegExp( whitespace + "|>" ), + + rpseudo = new RegExp( pseudos ), + ridentifier = new RegExp( "^" + identifier + "$" ), + + matchExpr = { + "ID": new RegExp( "^#(" + identifier + ")" ), + "CLASS": new RegExp( "^\\.(" + identifier + ")" ), + "TAG": new RegExp( "^(" + identifier + "|[*])" ), + "ATTR": new RegExp( "^" + attributes ), + "PSEUDO": new RegExp( "^" + pseudos ), + "CHILD": new RegExp( "^:(only|first|last|nth|nth-last)-(child|of-type)(?:\\(" + + whitespace + "*(even|odd|(([+-]|)(\\d*)n|)" + whitespace + "*(?:([+-]|)" + + whitespace + "*(\\d+)|))" + whitespace + "*\\)|)", "i" ), + "bool": new RegExp( "^(?:" + booleans + ")$", "i" ), + + // For use in libraries implementing .is() + // We use this for POS matching in `select` + "needsContext": new RegExp( "^" + whitespace + + "*[>+~]|:(even|odd|eq|gt|lt|nth|first|last)(?:\\(" + whitespace + + "*((?:-\\d)?\\d*)" + whitespace + "*\\)|)(?=[^-]|$)", "i" ) + }, + + rhtml = /HTML$/i, + rinputs = /^(?:input|select|textarea|button)$/i, + rheader = /^h\d$/i, + + rnative = /^[^{]+\{\s*\[native \w/, + + // Easily-parseable/retrievable ID or TAG or CLASS selectors + rquickExpr = /^(?:#([\w-]+)|(\w+)|\.([\w-]+))$/, + + rsibling = /[+~]/, + + // CSS escapes + // http://www.w3.org/TR/CSS21/syndata.html#escaped-characters + runescape = new RegExp( "\\\\[\\da-fA-F]{1,6}" + whitespace + "?|\\\\([^\\r\\n\\f])", "g" ), + funescape = function( escape, nonHex ) { + var high = "0x" + escape.slice( 1 ) - 0x10000; + + return nonHex ? + + // Strip the backslash prefix from a non-hex escape sequence + nonHex : + + // Replace a hexadecimal escape sequence with the encoded Unicode code point + // Support: IE <=11+ + // For values outside the Basic Multilingual Plane (BMP), manually construct a + // surrogate pair + high < 0 ? 
+ String.fromCharCode( high + 0x10000 ) : + String.fromCharCode( high >> 10 | 0xD800, high & 0x3FF | 0xDC00 ); + }, + + // CSS string/identifier serialization + // https://drafts.csswg.org/cssom/#common-serializing-idioms + rcssescape = /([\0-\x1f\x7f]|^-?\d)|^-$|[^\0-\x1f\x7f-\uFFFF\w-]/g, + fcssescape = function( ch, asCodePoint ) { + if ( asCodePoint ) { + + // U+0000 NULL becomes U+FFFD REPLACEMENT CHARACTER + if ( ch === "\0" ) { + return "\uFFFD"; + } + + // Control characters and (dependent upon position) numbers get escaped as code points + return ch.slice( 0, -1 ) + "\\" + + ch.charCodeAt( ch.length - 1 ).toString( 16 ) + " "; + } + + // Other potentially-special ASCII characters get backslash-escaped + return "\\" + ch; + }, + + // Used for iframes + // See setDocument() + // Removing the function wrapper causes a "Permission Denied" + // error in IE + unloadHandler = function() { + setDocument(); + }, + + inDisabledFieldset = addCombinator( + function( elem ) { + return elem.disabled === true && elem.nodeName.toLowerCase() === "fieldset"; + }, + { dir: "parentNode", next: "legend" } + ); + +// Optimize for push.apply( _, NodeList ) +try { + push.apply( + ( arr = slice.call( preferredDoc.childNodes ) ), + preferredDoc.childNodes + ); + + // Support: Android<4.0 + // Detect silently failing push.apply + // eslint-disable-next-line no-unused-expressions + arr[ preferredDoc.childNodes.length ].nodeType; +} catch ( e ) { + push = { apply: arr.length ? + + // Leverage slice if possible + function( target, els ) { + pushNative.apply( target, slice.call( els ) ); + } : + + // Support: IE<9 + // Otherwise append directly + function( target, els ) { + var j = target.length, + i = 0; + + // Can't trust NodeList.length + while ( ( target[ j++ ] = els[ i++ ] ) ) {} + target.length = j - 1; + } + }; +} + +function Sizzle( selector, context, results, seed ) { + var m, i, elem, nid, match, groups, newSelector, + newContext = context && context.ownerDocument, + + // nodeType defaults to 9, since context defaults to document + nodeType = context ? 
context.nodeType : 9; + + results = results || []; + + // Return early from calls with invalid selector or context + if ( typeof selector !== "string" || !selector || + nodeType !== 1 && nodeType !== 9 && nodeType !== 11 ) { + + return results; + } + + // Try to shortcut find operations (as opposed to filters) in HTML documents + if ( !seed ) { + setDocument( context ); + context = context || document; + + if ( documentIsHTML ) { + + // If the selector is sufficiently simple, try using a "get*By*" DOM method + // (excepting DocumentFragment context, where the methods don't exist) + if ( nodeType !== 11 && ( match = rquickExpr.exec( selector ) ) ) { + + // ID selector + if ( ( m = match[ 1 ] ) ) { + + // Document context + if ( nodeType === 9 ) { + if ( ( elem = context.getElementById( m ) ) ) { + + // Support: IE, Opera, Webkit + // TODO: identify versions + // getElementById can match elements by name instead of ID + if ( elem.id === m ) { + results.push( elem ); + return results; + } + } else { + return results; + } + + // Element context + } else { + + // Support: IE, Opera, Webkit + // TODO: identify versions + // getElementById can match elements by name instead of ID + if ( newContext && ( elem = newContext.getElementById( m ) ) && + contains( context, elem ) && + elem.id === m ) { + + results.push( elem ); + return results; + } + } + + // Type selector + } else if ( match[ 2 ] ) { + push.apply( results, context.getElementsByTagName( selector ) ); + return results; + + // Class selector + } else if ( ( m = match[ 3 ] ) && support.getElementsByClassName && + context.getElementsByClassName ) { + + push.apply( results, context.getElementsByClassName( m ) ); + return results; + } + } + + // Take advantage of querySelectorAll + if ( support.qsa && + !nonnativeSelectorCache[ selector + " " ] && + ( !rbuggyQSA || !rbuggyQSA.test( selector ) ) && + + // Support: IE 8 only + // Exclude object elements + ( nodeType !== 1 || context.nodeName.toLowerCase() !== "object" ) ) { + + newSelector = selector; + newContext = context; + + // qSA considers elements outside a scoping root when evaluating child or + // descendant combinators, which is not what we want. + // In such cases, we work around the behavior by prefixing every selector in the + // list with an ID selector referencing the scope context. + // The technique has to be used as well when a leading combinator is used + // as such selectors are not recognized by querySelectorAll. + // Thanks to Andrew Dupont for this technique. + if ( nodeType === 1 && + ( rdescend.test( selector ) || rcombinators.test( selector ) ) ) { + + // Expand context for sibling selectors + newContext = rsibling.test( selector ) && testContext( context.parentNode ) || + context; + + // We can use :scope instead of the ID hack if the browser + // supports it & if we're not changing the context. + if ( newContext !== context || !support.scope ) { + + // Capture the context ID, setting it first if necessary + if ( ( nid = context.getAttribute( "id" ) ) ) { + nid = nid.replace( rcssescape, fcssescape ); + } else { + context.setAttribute( "id", ( nid = expando ) ); + } + } + + // Prefix every selector in the list + groups = tokenize( selector ); + i = groups.length; + while ( i-- ) { + groups[ i ] = ( nid ? 
"#" + nid : ":scope" ) + " " + + toSelector( groups[ i ] ); + } + newSelector = groups.join( "," ); + } + + try { + push.apply( results, + newContext.querySelectorAll( newSelector ) + ); + return results; + } catch ( qsaError ) { + nonnativeSelectorCache( selector, true ); + } finally { + if ( nid === expando ) { + context.removeAttribute( "id" ); + } + } + } + } + } + + // All others + return select( selector.replace( rtrim, "$1" ), context, results, seed ); +} + +/** + * Create key-value caches of limited size + * @returns {function(string, object)} Returns the Object data after storing it on itself with + * property name the (space-suffixed) string and (if the cache is larger than Expr.cacheLength) + * deleting the oldest entry + */ +function createCache() { + var keys = []; + + function cache( key, value ) { + + // Use (key + " ") to avoid collision with native prototype properties (see Issue #157) + if ( keys.push( key + " " ) > Expr.cacheLength ) { + + // Only keep the most recent entries + delete cache[ keys.shift() ]; + } + return ( cache[ key + " " ] = value ); + } + return cache; +} + +/** + * Mark a function for special use by Sizzle + * @param {Function} fn The function to mark + */ +function markFunction( fn ) { + fn[ expando ] = true; + return fn; +} + +/** + * Support testing using an element + * @param {Function} fn Passed the created element and returns a boolean result + */ +function assert( fn ) { + var el = document.createElement( "fieldset" ); + + try { + return !!fn( el ); + } catch ( e ) { + return false; + } finally { + + // Remove from its parent by default + if ( el.parentNode ) { + el.parentNode.removeChild( el ); + } + + // release memory in IE + el = null; + } +} + +/** + * Adds the same handler for all of the specified attrs + * @param {String} attrs Pipe-separated list of attributes + * @param {Function} handler The method that will be applied + */ +function addHandle( attrs, handler ) { + var arr = attrs.split( "|" ), + i = arr.length; + + while ( i-- ) { + Expr.attrHandle[ arr[ i ] ] = handler; + } +} + +/** + * Checks document order of two siblings + * @param {Element} a + * @param {Element} b + * @returns {Number} Returns less than 0 if a precedes b, greater than 0 if a follows b + */ +function siblingCheck( a, b ) { + var cur = b && a, + diff = cur && a.nodeType === 1 && b.nodeType === 1 && + a.sourceIndex - b.sourceIndex; + + // Use IE sourceIndex if available on both nodes + if ( diff ) { + return diff; + } + + // Check if b follows a + if ( cur ) { + while ( ( cur = cur.nextSibling ) ) { + if ( cur === b ) { + return -1; + } + } + } + + return a ? 
1 : -1; +} + +/** + * Returns a function to use in pseudos for input types + * @param {String} type + */ +function createInputPseudo( type ) { + return function( elem ) { + var name = elem.nodeName.toLowerCase(); + return name === "input" && elem.type === type; + }; +} + +/** + * Returns a function to use in pseudos for buttons + * @param {String} type + */ +function createButtonPseudo( type ) { + return function( elem ) { + var name = elem.nodeName.toLowerCase(); + return ( name === "input" || name === "button" ) && elem.type === type; + }; +} + +/** + * Returns a function to use in pseudos for :enabled/:disabled + * @param {Boolean} disabled true for :disabled; false for :enabled + */ +function createDisabledPseudo( disabled ) { + + // Known :disabled false positives: fieldset[disabled] > legend:nth-of-type(n+2) :can-disable + return function( elem ) { + + // Only certain elements can match :enabled or :disabled + // https://html.spec.whatwg.org/multipage/scripting.html#selector-enabled + // https://html.spec.whatwg.org/multipage/scripting.html#selector-disabled + if ( "form" in elem ) { + + // Check for inherited disabledness on relevant non-disabled elements: + // * listed form-associated elements in a disabled fieldset + // https://html.spec.whatwg.org/multipage/forms.html#category-listed + // https://html.spec.whatwg.org/multipage/forms.html#concept-fe-disabled + // * option elements in a disabled optgroup + // https://html.spec.whatwg.org/multipage/forms.html#concept-option-disabled + // All such elements have a "form" property. + if ( elem.parentNode && elem.disabled === false ) { + + // Option elements defer to a parent optgroup if present + if ( "label" in elem ) { + if ( "label" in elem.parentNode ) { + return elem.parentNode.disabled === disabled; + } else { + return elem.disabled === disabled; + } + } + + // Support: IE 6 - 11 + // Use the isDisabled shortcut property to check for disabled fieldset ancestors + return elem.isDisabled === disabled || + + // Where there is no isDisabled, check manually + /* jshint -W018 */ + elem.isDisabled !== !disabled && + inDisabledFieldset( elem ) === disabled; + } + + return elem.disabled === disabled; + + // Try to winnow out elements that can't be disabled before trusting the disabled property. + // Some victims get caught in our net (label, legend, menu, track), but it shouldn't + // even exist on them, let alone have a boolean value. 
+ } else if ( "label" in elem ) { + return elem.disabled === disabled; + } + + // Remaining elements are neither :enabled nor :disabled + return false; + }; +} + +/** + * Returns a function to use in pseudos for positionals + * @param {Function} fn + */ +function createPositionalPseudo( fn ) { + return markFunction( function( argument ) { + argument = +argument; + return markFunction( function( seed, matches ) { + var j, + matchIndexes = fn( [], seed.length, argument ), + i = matchIndexes.length; + + // Match elements found at the specified indexes + while ( i-- ) { + if ( seed[ ( j = matchIndexes[ i ] ) ] ) { + seed[ j ] = !( matches[ j ] = seed[ j ] ); + } + } + } ); + } ); +} + +/** + * Checks a node for validity as a Sizzle context + * @param {Element|Object=} context + * @returns {Element|Object|Boolean} The input node if acceptable, otherwise a falsy value + */ +function testContext( context ) { + return context && typeof context.getElementsByTagName !== "undefined" && context; +} + +// Expose support vars for convenience +support = Sizzle.support = {}; + +/** + * Detects XML nodes + * @param {Element|Object} elem An element or a document + * @returns {Boolean} True iff elem is a non-HTML XML node + */ +isXML = Sizzle.isXML = function( elem ) { + var namespace = elem && elem.namespaceURI, + docElem = elem && ( elem.ownerDocument || elem ).documentElement; + + // Support: IE <=8 + // Assume HTML when documentElement doesn't yet exist, such as inside loading iframes + // https://bugs.jquery.com/ticket/4833 + return !rhtml.test( namespace || docElem && docElem.nodeName || "HTML" ); +}; + +/** + * Sets document-related variables once based on the current document + * @param {Element|Object} [doc] An element or document object to use to set the document + * @returns {Object} Returns the current document + */ +setDocument = Sizzle.setDocument = function( node ) { + var hasCompare, subWindow, + doc = node ? node.ownerDocument || node : preferredDoc; + + // Return early if doc is invalid or already selected + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + if ( doc == document || doc.nodeType !== 9 || !doc.documentElement ) { + return document; + } + + // Update global variables + document = doc; + docElem = document.documentElement; + documentIsHTML = !isXML( document ); + + // Support: IE 9 - 11+, Edge 12 - 18+ + // Accessing iframe documents after unload throws "permission denied" errors (jQuery #13936) + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + if ( preferredDoc != document && + ( subWindow = document.defaultView ) && subWindow.top !== subWindow ) { + + // Support: IE 11, Edge + if ( subWindow.addEventListener ) { + subWindow.addEventListener( "unload", unloadHandler, false ); + + // Support: IE 9 - 10 only + } else if ( subWindow.attachEvent ) { + subWindow.attachEvent( "onunload", unloadHandler ); + } + } + + // Support: IE 8 - 11+, Edge 12 - 18+, Chrome <=16 - 25 only, Firefox <=3.6 - 31 only, + // Safari 4 - 5 only, Opera <=11.6 - 12.x only + // IE/Edge & older browsers don't support the :scope pseudo-class. + // Support: Safari 6.0 only + // Safari 6.0 supports :scope but it's an alias of :root there. 
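+ // (Illustrative aside: every support flag below is probed with the
+ // assert() helper defined earlier, which runs a test function against a
+ // throwaway <fieldset> element and converts exceptions to false, e.g.:
+ //
+ //   var supportsClassList = assert( function( el ) {
+ //     return !!el.classList;
+ //   } );
+ //
+ // "supportsClassList" is a made-up example, not a flag Sizzle defines.)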
+ support.scope = assert( function( el ) { + docElem.appendChild( el ).appendChild( document.createElement( "div" ) ); + return typeof el.querySelectorAll !== "undefined" && + !el.querySelectorAll( ":scope fieldset div" ).length; + } ); + + /* Attributes + ---------------------------------------------------------------------- */ + + // Support: IE<8 + // Verify that getAttribute really returns attributes and not properties + // (excepting IE8 booleans) + support.attributes = assert( function( el ) { + el.className = "i"; + return !el.getAttribute( "className" ); + } ); + + /* getElement(s)By* + ---------------------------------------------------------------------- */ + + // Check if getElementsByTagName("*") returns only elements + support.getElementsByTagName = assert( function( el ) { + el.appendChild( document.createComment( "" ) ); + return !el.getElementsByTagName( "*" ).length; + } ); + + // Support: IE<9 + support.getElementsByClassName = rnative.test( document.getElementsByClassName ); + + // Support: IE<10 + // Check if getElementById returns elements by name + // The broken getElementById methods don't pick up programmatically-set names, + // so use a roundabout getElementsByName test + support.getById = assert( function( el ) { + docElem.appendChild( el ).id = expando; + return !document.getElementsByName || !document.getElementsByName( expando ).length; + } ); + + // ID filter and find + if ( support.getById ) { + Expr.filter[ "ID" ] = function( id ) { + var attrId = id.replace( runescape, funescape ); + return function( elem ) { + return elem.getAttribute( "id" ) === attrId; + }; + }; + Expr.find[ "ID" ] = function( id, context ) { + if ( typeof context.getElementById !== "undefined" && documentIsHTML ) { + var elem = context.getElementById( id ); + return elem ? [ elem ] : []; + } + }; + } else { + Expr.filter[ "ID" ] = function( id ) { + var attrId = id.replace( runescape, funescape ); + return function( elem ) { + var node = typeof elem.getAttributeNode !== "undefined" && + elem.getAttributeNode( "id" ); + return node && node.value === attrId; + }; + }; + + // Support: IE 6 - 7 only + // getElementById is not reliable as a find shortcut + Expr.find[ "ID" ] = function( id, context ) { + if ( typeof context.getElementById !== "undefined" && documentIsHTML ) { + var node, i, elems, + elem = context.getElementById( id ); + + if ( elem ) { + + // Verify the id attribute + node = elem.getAttributeNode( "id" ); + if ( node && node.value === id ) { + return [ elem ]; + } + + // Fall back on getElementsByName + elems = context.getElementsByName( id ); + i = 0; + while ( ( elem = elems[ i++ ] ) ) { + node = elem.getAttributeNode( "id" ); + if ( node && node.value === id ) { + return [ elem ]; + } + } + } + + return []; + } + }; + } + + // Tag + Expr.find[ "TAG" ] = support.getElementsByTagName ? 
+ function( tag, context ) {
+ if ( typeof context.getElementsByTagName !== "undefined" ) {
+ return context.getElementsByTagName( tag );
+
+ // DocumentFragment nodes don't have gEBTN
+ } else if ( support.qsa ) {
+ return context.querySelectorAll( tag );
+ }
+ } :
+
+ function( tag, context ) {
+ var elem,
+ tmp = [],
+ i = 0,
+
+ // By happy coincidence, a (broken) gEBTN appears on DocumentFragment nodes too
+ results = context.getElementsByTagName( tag );
+
+ // Filter out possible comments
+ if ( tag === "*" ) {
+ while ( ( elem = results[ i++ ] ) ) {
+ if ( elem.nodeType === 1 ) {
+ tmp.push( elem );
+ }
+ }
+
+ return tmp;
+ }
+ return results;
+ };
+
+ // Class
+ Expr.find[ "CLASS" ] = support.getElementsByClassName && function( className, context ) {
+ if ( typeof context.getElementsByClassName !== "undefined" && documentIsHTML ) {
+ return context.getElementsByClassName( className );
+ }
+ };
+
+ /* QSA/matchesSelector
+ ---------------------------------------------------------------------- */
+
+ // QSA and matchesSelector support
+
+ // matchesSelector(:active) reports false when true (IE9/Opera 11.5)
+ rbuggyMatches = [];
+
+ // qSa(:focus) reports false when true (Chrome 21)
+ // We allow this because of a bug in IE8/9 that throws an error
+ // whenever `document.activeElement` is accessed on an iframe
+ // So, we allow :focus to pass through QSA all the time to avoid the IE error
+ // See https://bugs.jquery.com/ticket/13378
+ rbuggyQSA = [];
+
+ if ( ( support.qsa = rnative.test( document.querySelectorAll ) ) ) {
+
+ // Build QSA regex
+ // Regex strategy adopted from Diego Perini
+ assert( function( el ) {
+
+ var input;
+
+ // Select is set to empty string on purpose
+ // This is to test IE's treatment of not explicitly
+ // setting a boolean content attribute,
+ // since its presence should be enough
+ // https://bugs.jquery.com/ticket/12359
+ docElem.appendChild( el ).innerHTML = "<a id='" + expando + "'></a>" +
+ "<select id='" + expando + "-\r\\' msallowcapture=''>" +
+ "<option selected=''></option></select>";
+
+ // Support: IE8, Opera 11-12.16
+ // Nothing should be selected when empty strings follow ^= or $= or *=
+ // The test attribute must be unknown in Opera but "safe" for WinRT
+ // https://msdn.microsoft.com/en-us/library/ie/hh465388.aspx#attribute_section
+ if ( el.querySelectorAll( "[msallowcapture^='']" ).length ) {
+ rbuggyQSA.push( "[*^$]=" + whitespace + "*(?:''|\"\")" );
+ }
+
+ // Support: IE8
+ // Boolean attributes and "value" are not treated correctly
+ if ( !el.querySelectorAll( "[selected]" ).length ) {
+ rbuggyQSA.push( "\\[" + whitespace + "*(?:value|" + booleans + ")" );
+ }
+
+ // Support: Chrome<29, Android<4.4, Safari<7.0+, iOS<7.0+, PhantomJS<1.9.8+
+ if ( !el.querySelectorAll( "[id~=" + expando + "-]" ).length ) {
+ rbuggyQSA.push( "~=" );
+ }
+
+ // Support: IE 11+, Edge 15 - 18+
+ // IE 11/Edge don't find elements on a `[name='']` query in some cases.
+ // Adding a temporary attribute to the document before the selection works
+ // around the issue.
+ // Interestingly, IE 10 & older don't seem to have the issue.
+ input = document.createElement( "input" );
+ input.setAttribute( "name", "" );
+ el.appendChild( input );
+ if ( !el.querySelectorAll( "[name='']" ).length ) {
+ rbuggyQSA.push( "\\[" + whitespace + "*name" + whitespace + "*=" +
+ whitespace + "*(?:''|\"\")" );
+ }
+
+ // Webkit/Opera - :checked should return selected option elements
+ // http://www.w3.org/TR/2011/REC-css3-selectors-20110929/#checked
+ // IE8 throws error here and will not see later tests
+ if ( !el.querySelectorAll( ":checked" ).length ) {
+ rbuggyQSA.push( ":checked" );
+ }
+
+ // Support: Safari 8+, iOS 8+
+ // https://bugs.webkit.org/show_bug.cgi?id=136851
+ // In-page `selector#id sibling-combinator selector` fails
+ if ( !el.querySelectorAll( "a#" + expando + "+*" ).length ) {
+ rbuggyQSA.push( ".#.+[+~]" );
+ }
+
+ // Support: Firefox <=3.6 - 5 only
+ // Old Firefox doesn't throw on a badly-escaped identifier.
+ el.querySelectorAll( "\\\f" );
+ rbuggyQSA.push( "[\\r\\n\\f]" );
+ } );
+
+ assert( function( el ) {
+ el.innerHTML = "<a href='' disabled='disabled'></a>" +
+ "<select disabled='disabled'><option/></select>";
+
+ // Support: Windows 8 Native Apps
+ // The type and name attributes are restricted during .innerHTML assignment
+ var input = document.createElement( "input" );
+ input.setAttribute( "type", "hidden" );
+ el.appendChild( input ).setAttribute( "name", "D" );
+
+ // Support: IE8
+ // Enforce case-sensitivity of name attribute
+ if ( el.querySelectorAll( "[name=d]" ).length ) {
+ rbuggyQSA.push( "name" + whitespace + "*[*^$|!~]?=" );
+ }
+
+ // FF 3.5 - :enabled/:disabled and hidden elements (hidden elements are still enabled)
+ // IE8 throws error here and will not see later tests
+ if ( el.querySelectorAll( ":enabled" ).length !== 2 ) {
+ rbuggyQSA.push( ":enabled", ":disabled" );
+ }
+
+ // Support: IE9-11+
+ // IE's :disabled selector does not pick up the children of disabled fieldsets
+ docElem.appendChild( el ).disabled = true;
+ if ( el.querySelectorAll( ":disabled" ).length !== 2 ) {
+ rbuggyQSA.push( ":enabled", ":disabled" );
+ }
+
+ // Support: Opera 10 - 11 only
+ // Opera 10-11 does not throw on post-comma invalid pseudos
+ el.querySelectorAll( "*,:x" );
+ rbuggyQSA.push( ",.*:" );
+ } );
+ }
+
+ if ( ( support.matchesSelector = rnative.test( ( matches = docElem.matches ||
+ docElem.webkitMatchesSelector ||
+ docElem.mozMatchesSelector ||
+ docElem.oMatchesSelector ||
+ docElem.msMatchesSelector ) ) ) ) {
+
+ assert( function( el ) {
+
+ // Check to see if it's possible to do matchesSelector
+ // on a disconnected node (IE 9)
+ support.disconnectedMatch = matches.call( el, "*" );
+
+ // This should fail with an exception
+ // Gecko does not error, returns false instead
+ matches.call( el, "[s!='']:x" );
+ rbuggyMatches.push( "!=", pseudos );
+ } );
+ }
+
+ rbuggyQSA = rbuggyQSA.length && new RegExp( rbuggyQSA.join( "|" ) );
+ rbuggyMatches = rbuggyMatches.length && new RegExp( rbuggyMatches.join( "|" ) );
+
+ /* Contains
+ ---------------------------------------------------------------------- */
+ hasCompare = rnative.test( docElem.compareDocumentPosition );
+
+ // Element contains another
+ // Purposefully self-exclusive
+ // As in, an element does not contain itself
+ contains = hasCompare || rnative.test( docElem.contains ) ?
+ function( a, b ) {
+ var adown = a.nodeType === 9 ? a.documentElement : a,
+ bup = b && b.parentNode;
+ return a === bup || !!( bup && bup.nodeType === 1 && (
+ adown.contains ?
+ adown.contains( bup ) : + a.compareDocumentPosition && a.compareDocumentPosition( bup ) & 16 + ) ); + } : + function( a, b ) { + if ( b ) { + while ( ( b = b.parentNode ) ) { + if ( b === a ) { + return true; + } + } + } + return false; + }; + + /* Sorting + ---------------------------------------------------------------------- */ + + // Document order sorting + sortOrder = hasCompare ? + function( a, b ) { + + // Flag for duplicate removal + if ( a === b ) { + hasDuplicate = true; + return 0; + } + + // Sort on method existence if only one input has compareDocumentPosition + var compare = !a.compareDocumentPosition - !b.compareDocumentPosition; + if ( compare ) { + return compare; + } + + // Calculate position if both inputs belong to the same document + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + compare = ( a.ownerDocument || a ) == ( b.ownerDocument || b ) ? + a.compareDocumentPosition( b ) : + + // Otherwise we know they are disconnected + 1; + + // Disconnected nodes + if ( compare & 1 || + ( !support.sortDetached && b.compareDocumentPosition( a ) === compare ) ) { + + // Choose the first element that is related to our preferred document + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + if ( a == document || a.ownerDocument == preferredDoc && + contains( preferredDoc, a ) ) { + return -1; + } + + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + if ( b == document || b.ownerDocument == preferredDoc && + contains( preferredDoc, b ) ) { + return 1; + } + + // Maintain original order + return sortInput ? + ( indexOf( sortInput, a ) - indexOf( sortInput, b ) ) : + 0; + } + + return compare & 4 ? -1 : 1; + } : + function( a, b ) { + + // Exit early if the nodes are identical + if ( a === b ) { + hasDuplicate = true; + return 0; + } + + var cur, + i = 0, + aup = a.parentNode, + bup = b.parentNode, + ap = [ a ], + bp = [ b ]; + + // Parentless nodes are either documents or disconnected + if ( !aup || !bup ) { + + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + /* eslint-disable eqeqeq */ + return a == document ? -1 : + b == document ? 1 : + /* eslint-enable eqeqeq */ + aup ? -1 : + bup ? 1 : + sortInput ? + ( indexOf( sortInput, a ) - indexOf( sortInput, b ) ) : + 0; + + // If the nodes are siblings, we can do a quick check + } else if ( aup === bup ) { + return siblingCheck( a, b ); + } + + // Otherwise we need full lists of their ancestors for comparison + cur = a; + while ( ( cur = cur.parentNode ) ) { + ap.unshift( cur ); + } + cur = b; + while ( ( cur = cur.parentNode ) ) { + bp.unshift( cur ); + } + + // Walk down the tree looking for a discrepancy + while ( ap[ i ] === bp[ i ] ) { + i++; + } + + return i ? + + // Do a sibling check if the nodes have a common ancestor + siblingCheck( ap[ i ], bp[ i ] ) : + + // Otherwise nodes in our document sort first + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. 
+ /* eslint-disable eqeqeq */ + ap[ i ] == preferredDoc ? -1 : + bp[ i ] == preferredDoc ? 1 : + /* eslint-enable eqeqeq */ + 0; + }; + + return document; +}; + +Sizzle.matches = function( expr, elements ) { + return Sizzle( expr, null, null, elements ); +}; + +Sizzle.matchesSelector = function( elem, expr ) { + setDocument( elem ); + + if ( support.matchesSelector && documentIsHTML && + !nonnativeSelectorCache[ expr + " " ] && + ( !rbuggyMatches || !rbuggyMatches.test( expr ) ) && + ( !rbuggyQSA || !rbuggyQSA.test( expr ) ) ) { + + try { + var ret = matches.call( elem, expr ); + + // IE 9's matchesSelector returns false on disconnected nodes + if ( ret || support.disconnectedMatch || + + // As well, disconnected nodes are said to be in a document + // fragment in IE 9 + elem.document && elem.document.nodeType !== 11 ) { + return ret; + } + } catch ( e ) { + nonnativeSelectorCache( expr, true ); + } + } + + return Sizzle( expr, document, null, [ elem ] ).length > 0; +}; + +Sizzle.contains = function( context, elem ) { + + // Set document vars if needed + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + if ( ( context.ownerDocument || context ) != document ) { + setDocument( context ); + } + return contains( context, elem ); +}; + +Sizzle.attr = function( elem, name ) { + + // Set document vars if needed + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + if ( ( elem.ownerDocument || elem ) != document ) { + setDocument( elem ); + } + + var fn = Expr.attrHandle[ name.toLowerCase() ], + + // Don't get fooled by Object.prototype properties (jQuery #13807) + val = fn && hasOwn.call( Expr.attrHandle, name.toLowerCase() ) ? + fn( elem, name, !documentIsHTML ) : + undefined; + + return val !== undefined ? + val : + support.attributes || !documentIsHTML ? + elem.getAttribute( name ) : + ( val = elem.getAttributeNode( name ) ) && val.specified ? 
+ val.value : + null; +}; + +Sizzle.escape = function( sel ) { + return ( sel + "" ).replace( rcssescape, fcssescape ); +}; + +Sizzle.error = function( msg ) { + throw new Error( "Syntax error, unrecognized expression: " + msg ); +}; + +/** + * Document sorting and removing duplicates + * @param {ArrayLike} results + */ +Sizzle.uniqueSort = function( results ) { + var elem, + duplicates = [], + j = 0, + i = 0; + + // Unless we *know* we can detect duplicates, assume their presence + hasDuplicate = !support.detectDuplicates; + sortInput = !support.sortStable && results.slice( 0 ); + results.sort( sortOrder ); + + if ( hasDuplicate ) { + while ( ( elem = results[ i++ ] ) ) { + if ( elem === results[ i ] ) { + j = duplicates.push( i ); + } + } + while ( j-- ) { + results.splice( duplicates[ j ], 1 ); + } + } + + // Clear input after sorting to release objects + // See https://github.com/jquery/sizzle/pull/225 + sortInput = null; + + return results; +}; + +/** + * Utility function for retrieving the text value of an array of DOM nodes + * @param {Array|Element} elem + */ +getText = Sizzle.getText = function( elem ) { + var node, + ret = "", + i = 0, + nodeType = elem.nodeType; + + if ( !nodeType ) { + + // If no nodeType, this is expected to be an array + while ( ( node = elem[ i++ ] ) ) { + + // Do not traverse comment nodes + ret += getText( node ); + } + } else if ( nodeType === 1 || nodeType === 9 || nodeType === 11 ) { + + // Use textContent for elements + // innerText usage removed for consistency of new lines (jQuery #11153) + if ( typeof elem.textContent === "string" ) { + return elem.textContent; + } else { + + // Traverse its children + for ( elem = elem.firstChild; elem; elem = elem.nextSibling ) { + ret += getText( elem ); + } + } + } else if ( nodeType === 3 || nodeType === 4 ) { + return elem.nodeValue; + } + + // Do not include comment or processing instruction nodes + + return ret; +}; + +Expr = Sizzle.selectors = { + + // Can be adjusted by the user + cacheLength: 50, + + createPseudo: markFunction, + + match: matchExpr, + + attrHandle: {}, + + find: {}, + + relative: { + ">": { dir: "parentNode", first: true }, + " ": { dir: "parentNode" }, + "+": { dir: "previousSibling", first: true }, + "~": { dir: "previousSibling" } + }, + + preFilter: { + "ATTR": function( match ) { + match[ 1 ] = match[ 1 ].replace( runescape, funescape ); + + // Move the given value to match[3] whether quoted or unquoted + match[ 3 ] = ( match[ 3 ] || match[ 4 ] || + match[ 5 ] || "" ).replace( runescape, funescape ); + + if ( match[ 2 ] === "~=" ) { + match[ 3 ] = " " + match[ 3 ] + " "; + } + + return match.slice( 0, 4 ); + }, + + "CHILD": function( match ) { + + /* matches from matchExpr["CHILD"] + 1 type (only|nth|...) + 2 what (child|of-type) + 3 argument (even|odd|\d*|\d*n([+-]\d+)?|...) + 4 xn-component of xn+y argument ([+-]?\d*n|) + 5 sign of xn-component + 6 x of xn-component + 7 sign of y-component + 8 y of y-component + */ + match[ 1 ] = match[ 1 ].toLowerCase(); + + if ( match[ 1 ].slice( 0, 3 ) === "nth" ) { + + // nth-* requires argument + if ( !match[ 3 ] ) { + Sizzle.error( match[ 0 ] ); + } + + // numeric x and y parameters for Expr.filter.CHILD + // remember that false/true cast respectively to 0/1 + match[ 4 ] = +( match[ 4 ] ? 
+ match[ 5 ] + ( match[ 6 ] || 1 ) : + 2 * ( match[ 3 ] === "even" || match[ 3 ] === "odd" ) ); + match[ 5 ] = +( ( match[ 7 ] + match[ 8 ] ) || match[ 3 ] === "odd" ); + + // other types prohibit arguments + } else if ( match[ 3 ] ) { + Sizzle.error( match[ 0 ] ); + } + + return match; + }, + + "PSEUDO": function( match ) { + var excess, + unquoted = !match[ 6 ] && match[ 2 ]; + + if ( matchExpr[ "CHILD" ].test( match[ 0 ] ) ) { + return null; + } + + // Accept quoted arguments as-is + if ( match[ 3 ] ) { + match[ 2 ] = match[ 4 ] || match[ 5 ] || ""; + + // Strip excess characters from unquoted arguments + } else if ( unquoted && rpseudo.test( unquoted ) && + + // Get excess from tokenize (recursively) + ( excess = tokenize( unquoted, true ) ) && + + // advance to the next closing parenthesis + ( excess = unquoted.indexOf( ")", unquoted.length - excess ) - unquoted.length ) ) { + + // excess is a negative index + match[ 0 ] = match[ 0 ].slice( 0, excess ); + match[ 2 ] = unquoted.slice( 0, excess ); + } + + // Return only captures needed by the pseudo filter method (type and argument) + return match.slice( 0, 3 ); + } + }, + + filter: { + + "TAG": function( nodeNameSelector ) { + var nodeName = nodeNameSelector.replace( runescape, funescape ).toLowerCase(); + return nodeNameSelector === "*" ? + function() { + return true; + } : + function( elem ) { + return elem.nodeName && elem.nodeName.toLowerCase() === nodeName; + }; + }, + + "CLASS": function( className ) { + var pattern = classCache[ className + " " ]; + + return pattern || + ( pattern = new RegExp( "(^|" + whitespace + + ")" + className + "(" + whitespace + "|$)" ) ) && classCache( + className, function( elem ) { + return pattern.test( + typeof elem.className === "string" && elem.className || + typeof elem.getAttribute !== "undefined" && + elem.getAttribute( "class" ) || + "" + ); + } ); + }, + + "ATTR": function( name, operator, check ) { + return function( elem ) { + var result = Sizzle.attr( elem, name ); + + if ( result == null ) { + return operator === "!="; + } + if ( !operator ) { + return true; + } + + result += ""; + + /* eslint-disable max-len */ + + return operator === "=" ? result === check : + operator === "!=" ? result !== check : + operator === "^=" ? check && result.indexOf( check ) === 0 : + operator === "*=" ? check && result.indexOf( check ) > -1 : + operator === "$=" ? check && result.slice( -check.length ) === check : + operator === "~=" ? ( " " + result.replace( rwhitespace, " " ) + " " ).indexOf( check ) > -1 : + operator === "|=" ? result === check || result.slice( 0, check.length + 1 ) === check + "-" : + false; + /* eslint-enable max-len */ + + }; + }, + + "CHILD": function( type, what, _argument, first, last ) { + var simple = type.slice( 0, 3 ) !== "nth", + forward = type.slice( -4 ) !== "last", + ofType = what === "of-type"; + + return first === 1 && last === 0 ? + + // Shortcut for :nth-*(n) + function( elem ) { + return !!elem.parentNode; + } : + + function( elem, _context, xml ) { + var cache, uniqueCache, outerCache, node, nodeIndex, start, + dir = simple !== forward ? "nextSibling" : "previousSibling", + parent = elem.parentNode, + name = ofType && elem.nodeName.toLowerCase(), + useCache = !xml && !ofType, + diff = false; + + if ( parent ) { + + // :(first|last|only)-(child|of-type) + if ( simple ) { + while ( dir ) { + node = elem; + while ( ( node = node[ dir ] ) ) { + if ( ofType ? 
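+ // Illustrative: for :first/:last/:only-(child|of-type), finding any
+ // element (of the same name for -of-type) while walking this sibling
+ // direction disqualifies `elem`, so the matcher returns false below.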
+ node.nodeName.toLowerCase() === name : + node.nodeType === 1 ) { + + return false; + } + } + + // Reverse direction for :only-* (if we haven't yet done so) + start = dir = type === "only" && !start && "nextSibling"; + } + return true; + } + + start = [ forward ? parent.firstChild : parent.lastChild ]; + + // non-xml :nth-child(...) stores cache data on `parent` + if ( forward && useCache ) { + + // Seek `elem` from a previously-cached index + + // ...in a gzip-friendly way + node = parent; + outerCache = node[ expando ] || ( node[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ node.uniqueID ] || + ( outerCache[ node.uniqueID ] = {} ); + + cache = uniqueCache[ type ] || []; + nodeIndex = cache[ 0 ] === dirruns && cache[ 1 ]; + diff = nodeIndex && cache[ 2 ]; + node = nodeIndex && parent.childNodes[ nodeIndex ]; + + while ( ( node = ++nodeIndex && node && node[ dir ] || + + // Fallback to seeking `elem` from the start + ( diff = nodeIndex = 0 ) || start.pop() ) ) { + + // When found, cache indexes on `parent` and break + if ( node.nodeType === 1 && ++diff && node === elem ) { + uniqueCache[ type ] = [ dirruns, nodeIndex, diff ]; + break; + } + } + + } else { + + // Use previously-cached element index if available + if ( useCache ) { + + // ...in a gzip-friendly way + node = elem; + outerCache = node[ expando ] || ( node[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ node.uniqueID ] || + ( outerCache[ node.uniqueID ] = {} ); + + cache = uniqueCache[ type ] || []; + nodeIndex = cache[ 0 ] === dirruns && cache[ 1 ]; + diff = nodeIndex; + } + + // xml :nth-child(...) + // or :nth-last-child(...) or :nth(-last)?-of-type(...) + if ( diff === false ) { + + // Use the same loop as above to seek `elem` from the start + while ( ( node = ++nodeIndex && node && node[ dir ] || + ( diff = nodeIndex = 0 ) || start.pop() ) ) { + + if ( ( ofType ? + node.nodeName.toLowerCase() === name : + node.nodeType === 1 ) && + ++diff ) { + + // Cache the index of each encountered element + if ( useCache ) { + outerCache = node[ expando ] || + ( node[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ node.uniqueID ] || + ( outerCache[ node.uniqueID ] = {} ); + + uniqueCache[ type ] = [ dirruns, diff ]; + } + + if ( node === elem ) { + break; + } + } + } + } + } + + // Incorporate the offset, then check against cycle size + diff -= last; + return diff === first || ( diff % first === 0 && diff / first >= 0 ); + } + }; + }, + + "PSEUDO": function( pseudo, argument ) { + + // pseudo-class names are case-insensitive + // http://www.w3.org/TR/selectors/#pseudo-classes + // Prioritize by case sensitivity in case custom pseudos are added with uppercase letters + // Remember that setFilters inherits from pseudos + var args, + fn = Expr.pseudos[ pseudo ] || Expr.setFilters[ pseudo.toLowerCase() ] || + Sizzle.error( "unsupported pseudo: " + pseudo ); + + // The user may use createPseudo to indicate that + // arguments are needed to create the filter function + // just as Sizzle does + if ( fn[ expando ] ) { + return fn( argument ); + } + + // But maintain support for old signatures + if ( fn.length > 1 ) { + args = [ pseudo, pseudo, "", argument ]; + return Expr.setFilters.hasOwnProperty( pseudo.toLowerCase() ) ? 
+ markFunction( function( seed, matches ) { + var idx, + matched = fn( seed, argument ), + i = matched.length; + while ( i-- ) { + idx = indexOf( seed, matched[ i ] ); + seed[ idx ] = !( matches[ idx ] = matched[ i ] ); + } + } ) : + function( elem ) { + return fn( elem, 0, args ); + }; + } + + return fn; + } + }, + + pseudos: { + + // Potentially complex pseudos + "not": markFunction( function( selector ) { + + // Trim the selector passed to compile + // to avoid treating leading and trailing + // spaces as combinators + var input = [], + results = [], + matcher = compile( selector.replace( rtrim, "$1" ) ); + + return matcher[ expando ] ? + markFunction( function( seed, matches, _context, xml ) { + var elem, + unmatched = matcher( seed, null, xml, [] ), + i = seed.length; + + // Match elements unmatched by `matcher` + while ( i-- ) { + if ( ( elem = unmatched[ i ] ) ) { + seed[ i ] = !( matches[ i ] = elem ); + } + } + } ) : + function( elem, _context, xml ) { + input[ 0 ] = elem; + matcher( input, null, xml, results ); + + // Don't keep the element (issue #299) + input[ 0 ] = null; + return !results.pop(); + }; + } ), + + "has": markFunction( function( selector ) { + return function( elem ) { + return Sizzle( selector, elem ).length > 0; + }; + } ), + + "contains": markFunction( function( text ) { + text = text.replace( runescape, funescape ); + return function( elem ) { + return ( elem.textContent || getText( elem ) ).indexOf( text ) > -1; + }; + } ), + + // "Whether an element is represented by a :lang() selector + // is based solely on the element's language value + // being equal to the identifier C, + // or beginning with the identifier C immediately followed by "-". + // The matching of C against the element's language value is performed case-insensitively. + // The identifier C does not have to be a valid language name." + // http://www.w3.org/TR/selectors/#lang-pseudo + "lang": markFunction( function( lang ) { + + // lang value must be a valid identifier + if ( !ridentifier.test( lang || "" ) ) { + Sizzle.error( "unsupported lang: " + lang ); + } + lang = lang.replace( runescape, funescape ).toLowerCase(); + return function( elem ) { + var elemLang; + do { + if ( ( elemLang = documentIsHTML ? 
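+ // Illustrative: HTML documents expose the inherited language through
+ // elem.lang, while XML falls back to attributes; per the spec note
+ // above, :lang(en) matches both lang="en" and lang="en-US".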
+ elem.lang : + elem.getAttribute( "xml:lang" ) || elem.getAttribute( "lang" ) ) ) { + + elemLang = elemLang.toLowerCase(); + return elemLang === lang || elemLang.indexOf( lang + "-" ) === 0; + } + } while ( ( elem = elem.parentNode ) && elem.nodeType === 1 ); + return false; + }; + } ), + + // Miscellaneous + "target": function( elem ) { + var hash = window.location && window.location.hash; + return hash && hash.slice( 1 ) === elem.id; + }, + + "root": function( elem ) { + return elem === docElem; + }, + + "focus": function( elem ) { + return elem === document.activeElement && + ( !document.hasFocus || document.hasFocus() ) && + !!( elem.type || elem.href || ~elem.tabIndex ); + }, + + // Boolean properties + "enabled": createDisabledPseudo( false ), + "disabled": createDisabledPseudo( true ), + + "checked": function( elem ) { + + // In CSS3, :checked should return both checked and selected elements + // http://www.w3.org/TR/2011/REC-css3-selectors-20110929/#checked + var nodeName = elem.nodeName.toLowerCase(); + return ( nodeName === "input" && !!elem.checked ) || + ( nodeName === "option" && !!elem.selected ); + }, + + "selected": function( elem ) { + + // Accessing this property makes selected-by-default + // options in Safari work properly + if ( elem.parentNode ) { + // eslint-disable-next-line no-unused-expressions + elem.parentNode.selectedIndex; + } + + return elem.selected === true; + }, + + // Contents + "empty": function( elem ) { + + // http://www.w3.org/TR/selectors/#empty-pseudo + // :empty is negated by element (1) or content nodes (text: 3; cdata: 4; entity ref: 5), + // but not by others (comment: 8; processing instruction: 7; etc.) + // nodeType < 6 works because attributes (2) do not appear as children + for ( elem = elem.firstChild; elem; elem = elem.nextSibling ) { + if ( elem.nodeType < 6 ) { + return false; + } + } + return true; + }, + + "parent": function( elem ) { + return !Expr.pseudos[ "empty" ]( elem ); + }, + + // Element/input types + "header": function( elem ) { + return rheader.test( elem.nodeName ); + }, + + "input": function( elem ) { + return rinputs.test( elem.nodeName ); + }, + + "button": function( elem ) { + var name = elem.nodeName.toLowerCase(); + return name === "input" && elem.type === "button" || name === "button"; + }, + + "text": function( elem ) { + var attr; + return elem.nodeName.toLowerCase() === "input" && + elem.type === "text" && + + // Support: IE<8 + // New HTML5 attribute values (e.g., "search") appear with elem.type === "text" + ( ( attr = elem.getAttribute( "type" ) ) == null || + attr.toLowerCase() === "text" ); + }, + + // Position-in-collection + "first": createPositionalPseudo( function() { + return [ 0 ]; + } ), + + "last": createPositionalPseudo( function( _matchIndexes, length ) { + return [ length - 1 ]; + } ), + + "eq": createPositionalPseudo( function( _matchIndexes, length, argument ) { + return [ argument < 0 ? argument + length : argument ]; + } ), + + "even": createPositionalPseudo( function( matchIndexes, length ) { + var i = 0; + for ( ; i < length; i += 2 ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ), + + "odd": createPositionalPseudo( function( matchIndexes, length ) { + var i = 1; + for ( ; i < length; i += 2 ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ), + + "lt": createPositionalPseudo( function( matchIndexes, length, argument ) { + var i = argument < 0 ? + argument + length : + argument > length ? 
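+ // Illustrative: the argument is clamped to the collection length, then
+ // every lower index is collected, e.g. :lt(3) yields [ 2, 1, 0 ].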
+ length : + argument; + for ( ; --i >= 0; ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ), + + "gt": createPositionalPseudo( function( matchIndexes, length, argument ) { + var i = argument < 0 ? argument + length : argument; + for ( ; ++i < length; ) { + matchIndexes.push( i ); + } + return matchIndexes; + } ) + } +}; + +Expr.pseudos[ "nth" ] = Expr.pseudos[ "eq" ]; + +// Add button/input type pseudos +for ( i in { radio: true, checkbox: true, file: true, password: true, image: true } ) { + Expr.pseudos[ i ] = createInputPseudo( i ); +} +for ( i in { submit: true, reset: true } ) { + Expr.pseudos[ i ] = createButtonPseudo( i ); +} + +// Easy API for creating new setFilters +function setFilters() {} +setFilters.prototype = Expr.filters = Expr.pseudos; +Expr.setFilters = new setFilters(); + +tokenize = Sizzle.tokenize = function( selector, parseOnly ) { + var matched, match, tokens, type, + soFar, groups, preFilters, + cached = tokenCache[ selector + " " ]; + + if ( cached ) { + return parseOnly ? 0 : cached.slice( 0 ); + } + + soFar = selector; + groups = []; + preFilters = Expr.preFilter; + + while ( soFar ) { + + // Comma and first run + if ( !matched || ( match = rcomma.exec( soFar ) ) ) { + if ( match ) { + + // Don't consume trailing commas as valid + soFar = soFar.slice( match[ 0 ].length ) || soFar; + } + groups.push( ( tokens = [] ) ); + } + + matched = false; + + // Combinators + if ( ( match = rcombinators.exec( soFar ) ) ) { + matched = match.shift(); + tokens.push( { + value: matched, + + // Cast descendant combinators to space + type: match[ 0 ].replace( rtrim, " " ) + } ); + soFar = soFar.slice( matched.length ); + } + + // Filters + for ( type in Expr.filter ) { + if ( ( match = matchExpr[ type ].exec( soFar ) ) && ( !preFilters[ type ] || + ( match = preFilters[ type ]( match ) ) ) ) { + matched = match.shift(); + tokens.push( { + value: matched, + type: type, + matches: match + } ); + soFar = soFar.slice( matched.length ); + } + } + + if ( !matched ) { + break; + } + } + + // Return the length of the invalid excess + // if we're just parsing + // Otherwise, throw an error or return tokens + return parseOnly ? + soFar.length : + soFar ? + Sizzle.error( selector ) : + + // Cache the tokens + tokenCache( selector, groups ).slice( 0 ); +}; + +function toSelector( tokens ) { + var i = 0, + len = tokens.length, + selector = ""; + for ( ; i < len; i++ ) { + selector += tokens[ i ].value; + } + return selector; +} + +function addCombinator( matcher, combinator, base ) { + var dir = combinator.dir, + skip = combinator.next, + key = skip || dir, + checkNonElements = base && key === "parentNode", + doneName = done++; + + return combinator.first ? 
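+ // Illustrative: "first" combinators ( ">" and "+" ) only test the
+ // nearest ancestor/preceding element, while " " and "~" below walk the
+ // whole chain, with dirruns-keyed caching on non-XML nodes.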
+ + // Check against closest ancestor/preceding element + function( elem, context, xml ) { + while ( ( elem = elem[ dir ] ) ) { + if ( elem.nodeType === 1 || checkNonElements ) { + return matcher( elem, context, xml ); + } + } + return false; + } : + + // Check against all ancestor/preceding elements + function( elem, context, xml ) { + var oldCache, uniqueCache, outerCache, + newCache = [ dirruns, doneName ]; + + // We can't set arbitrary data on XML nodes, so they don't benefit from combinator caching + if ( xml ) { + while ( ( elem = elem[ dir ] ) ) { + if ( elem.nodeType === 1 || checkNonElements ) { + if ( matcher( elem, context, xml ) ) { + return true; + } + } + } + } else { + while ( ( elem = elem[ dir ] ) ) { + if ( elem.nodeType === 1 || checkNonElements ) { + outerCache = elem[ expando ] || ( elem[ expando ] = {} ); + + // Support: IE <9 only + // Defend against cloned attroperties (jQuery gh-1709) + uniqueCache = outerCache[ elem.uniqueID ] || + ( outerCache[ elem.uniqueID ] = {} ); + + if ( skip && skip === elem.nodeName.toLowerCase() ) { + elem = elem[ dir ] || elem; + } else if ( ( oldCache = uniqueCache[ key ] ) && + oldCache[ 0 ] === dirruns && oldCache[ 1 ] === doneName ) { + + // Assign to newCache so results back-propagate to previous elements + return ( newCache[ 2 ] = oldCache[ 2 ] ); + } else { + + // Reuse newcache so results back-propagate to previous elements + uniqueCache[ key ] = newCache; + + // A match means we're done; a fail means we have to keep checking + if ( ( newCache[ 2 ] = matcher( elem, context, xml ) ) ) { + return true; + } + } + } + } + } + return false; + }; +} + +function elementMatcher( matchers ) { + return matchers.length > 1 ? + function( elem, context, xml ) { + var i = matchers.length; + while ( i-- ) { + if ( !matchers[ i ]( elem, context, xml ) ) { + return false; + } + } + return true; + } : + matchers[ 0 ]; +} + +function multipleContexts( selector, contexts, results ) { + var i = 0, + len = contexts.length; + for ( ; i < len; i++ ) { + Sizzle( selector, contexts[ i ], results ); + } + return results; +} + +function condense( unmatched, map, filter, context, xml ) { + var elem, + newUnmatched = [], + i = 0, + len = unmatched.length, + mapped = map != null; + + for ( ; i < len; i++ ) { + if ( ( elem = unmatched[ i ] ) ) { + if ( !filter || filter( elem, context, xml ) ) { + newUnmatched.push( elem ); + if ( mapped ) { + map.push( i ); + } + } + } + } + + return newUnmatched; +} + +function setMatcher( preFilter, selector, matcher, postFilter, postFinder, postSelector ) { + if ( postFilter && !postFilter[ expando ] ) { + postFilter = setMatcher( postFilter ); + } + if ( postFinder && !postFinder[ expando ] ) { + postFinder = setMatcher( postFinder, postSelector ); + } + return markFunction( function( seed, results, context, xml ) { + var temp, i, elem, + preMap = [], + postMap = [], + preexisting = results.length, + + // Get initial elements from seed or context + elems = seed || multipleContexts( + selector || "*", + context.nodeType ? [ context ] : context, + [] + ), + + // Prefilter to get matcher input, preserving a map for seed-results synchronization + matcherIn = preFilter && ( seed || !selector ) ? + condense( elems, preMap, preFilter, context, xml ) : + elems, + + matcherOut = matcher ? + + // If we have a postFinder, or filtered seed, or non-seed postFilter or preexisting results, + postFinder || ( seed ? preFilter : preexisting || postFilter ) ? 
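+ // Illustrative: an intermediate array is needed whenever matcher output
+ // must be post-processed ( postFinder / postFilter ) or kept apart from
+ // preexisting entries; otherwise matches can be written into `results`
+ // directly.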
+ + // ...intermediate processing is necessary + [] : + + // ...otherwise use results directly + results : + matcherIn; + + // Find primary matches + if ( matcher ) { + matcher( matcherIn, matcherOut, context, xml ); + } + + // Apply postFilter + if ( postFilter ) { + temp = condense( matcherOut, postMap ); + postFilter( temp, [], context, xml ); + + // Un-match failing elements by moving them back to matcherIn + i = temp.length; + while ( i-- ) { + if ( ( elem = temp[ i ] ) ) { + matcherOut[ postMap[ i ] ] = !( matcherIn[ postMap[ i ] ] = elem ); + } + } + } + + if ( seed ) { + if ( postFinder || preFilter ) { + if ( postFinder ) { + + // Get the final matcherOut by condensing this intermediate into postFinder contexts + temp = []; + i = matcherOut.length; + while ( i-- ) { + if ( ( elem = matcherOut[ i ] ) ) { + + // Restore matcherIn since elem is not yet a final match + temp.push( ( matcherIn[ i ] = elem ) ); + } + } + postFinder( null, ( matcherOut = [] ), temp, xml ); + } + + // Move matched elements from seed to results to keep them synchronized + i = matcherOut.length; + while ( i-- ) { + if ( ( elem = matcherOut[ i ] ) && + ( temp = postFinder ? indexOf( seed, elem ) : preMap[ i ] ) > -1 ) { + + seed[ temp ] = !( results[ temp ] = elem ); + } + } + } + + // Add elements to results, through postFinder if defined + } else { + matcherOut = condense( + matcherOut === results ? + matcherOut.splice( preexisting, matcherOut.length ) : + matcherOut + ); + if ( postFinder ) { + postFinder( null, results, matcherOut, xml ); + } else { + push.apply( results, matcherOut ); + } + } + } ); +} + +function matcherFromTokens( tokens ) { + var checkContext, matcher, j, + len = tokens.length, + leadingRelative = Expr.relative[ tokens[ 0 ].type ], + implicitRelative = leadingRelative || Expr.relative[ " " ], + i = leadingRelative ? 1 : 0, + + // The foundational matcher ensures that elements are reachable from top-level context(s) + matchContext = addCombinator( function( elem ) { + return elem === checkContext; + }, implicitRelative, true ), + matchAnyContext = addCombinator( function( elem ) { + return indexOf( checkContext, elem ) > -1; + }, implicitRelative, true ), + matchers = [ function( elem, context, xml ) { + var ret = ( !leadingRelative && ( xml || context !== outermostContext ) ) || ( + ( checkContext = context ).nodeType ? + matchContext( elem, context, xml ) : + matchAnyContext( elem, context, xml ) ); + + // Avoid hanging onto element (issue #299) + checkContext = null; + return ret; + } ]; + + for ( ; i < len; i++ ) { + if ( ( matcher = Expr.relative[ tokens[ i ].type ] ) ) { + matchers = [ addCombinator( elementMatcher( matchers ), matcher ) ]; + } else { + matcher = Expr.filter[ tokens[ i ].type ].apply( null, tokens[ i ].matches ); + + // Return special upon seeing a positional matcher + if ( matcher[ expando ] ) { + + // Find the next relative operator (if any) for proper handling + j = ++i; + for ( ; j < len; j++ ) { + if ( Expr.relative[ tokens[ j ].type ] ) { + break; + } + } + return setMatcher( + i > 1 && elementMatcher( matchers ), + i > 1 && toSelector( + + // If the preceding token was a descendant combinator, insert an implicit any-element `*` + tokens + .slice( 0, i - 1 ) + .concat( { value: tokens[ i - 2 ].type === " " ? 
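+ // Illustrative: when the token before the positional pseudo is a
+ // descendant combinator, an implicit universal "*" is appended so the
+ // rebuilt selector text stays parseable.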
"*" : "" } ) + ).replace( rtrim, "$1" ), + matcher, + i < j && matcherFromTokens( tokens.slice( i, j ) ), + j < len && matcherFromTokens( ( tokens = tokens.slice( j ) ) ), + j < len && toSelector( tokens ) + ); + } + matchers.push( matcher ); + } + } + + return elementMatcher( matchers ); +} + +function matcherFromGroupMatchers( elementMatchers, setMatchers ) { + var bySet = setMatchers.length > 0, + byElement = elementMatchers.length > 0, + superMatcher = function( seed, context, xml, results, outermost ) { + var elem, j, matcher, + matchedCount = 0, + i = "0", + unmatched = seed && [], + setMatched = [], + contextBackup = outermostContext, + + // We must always have either seed elements or outermost context + elems = seed || byElement && Expr.find[ "TAG" ]( "*", outermost ), + + // Use integer dirruns iff this is the outermost matcher + dirrunsUnique = ( dirruns += contextBackup == null ? 1 : Math.random() || 0.1 ), + len = elems.length; + + if ( outermost ) { + + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + outermostContext = context == document || context || outermost; + } + + // Add elements passing elementMatchers directly to results + // Support: IE<9, Safari + // Tolerate NodeList properties (IE: "length"; Safari: ) matching elements by id + for ( ; i !== len && ( elem = elems[ i ] ) != null; i++ ) { + if ( byElement && elem ) { + j = 0; + + // Support: IE 11+, Edge 17 - 18+ + // IE/Edge sometimes throw a "Permission denied" error when strict-comparing + // two documents; shallow comparisons work. + // eslint-disable-next-line eqeqeq + if ( !context && elem.ownerDocument != document ) { + setDocument( elem ); + xml = !documentIsHTML; + } + while ( ( matcher = elementMatchers[ j++ ] ) ) { + if ( matcher( elem, context || document, xml ) ) { + results.push( elem ); + break; + } + } + if ( outermost ) { + dirruns = dirrunsUnique; + } + } + + // Track unmatched elements for set filters + if ( bySet ) { + + // They will have gone through all possible matchers + if ( ( elem = !matcher && elem ) ) { + matchedCount--; + } + + // Lengthen the array for every element, matched or not + if ( seed ) { + unmatched.push( elem ); + } + } + } + + // `i` is now the count of elements visited above, and adding it to `matchedCount` + // makes the latter nonnegative. + matchedCount += i; + + // Apply set filters to unmatched elements + // NOTE: This can be skipped if there are no unmatched elements (i.e., `matchedCount` + // equals `i`), unless we didn't visit _any_ elements in the above loop because we have + // no element matchers and no seed. + // Incrementing an initially-string "0" `i` allows `i` to remain a string only in that + // case, which will result in a "00" `matchedCount` that differs from `i` but is also + // numerically zero. 
+ if ( bySet && i !== matchedCount ) { + j = 0; + while ( ( matcher = setMatchers[ j++ ] ) ) { + matcher( unmatched, setMatched, context, xml ); + } + + if ( seed ) { + + // Reintegrate element matches to eliminate the need for sorting + if ( matchedCount > 0 ) { + while ( i-- ) { + if ( !( unmatched[ i ] || setMatched[ i ] ) ) { + setMatched[ i ] = pop.call( results ); + } + } + } + + // Discard index placeholder values to get only actual matches + setMatched = condense( setMatched ); + } + + // Add matches to results + push.apply( results, setMatched ); + + // Seedless set matches succeeding multiple successful matchers stipulate sorting + if ( outermost && !seed && setMatched.length > 0 && + ( matchedCount + setMatchers.length ) > 1 ) { + + Sizzle.uniqueSort( results ); + } + } + + // Override manipulation of globals by nested matchers + if ( outermost ) { + dirruns = dirrunsUnique; + outermostContext = contextBackup; + } + + return unmatched; + }; + + return bySet ? + markFunction( superMatcher ) : + superMatcher; +} + +compile = Sizzle.compile = function( selector, match /* Internal Use Only */ ) { + var i, + setMatchers = [], + elementMatchers = [], + cached = compilerCache[ selector + " " ]; + + if ( !cached ) { + + // Generate a function of recursive functions that can be used to check each element + if ( !match ) { + match = tokenize( selector ); + } + i = match.length; + while ( i-- ) { + cached = matcherFromTokens( match[ i ] ); + if ( cached[ expando ] ) { + setMatchers.push( cached ); + } else { + elementMatchers.push( cached ); + } + } + + // Cache the compiled function + cached = compilerCache( + selector, + matcherFromGroupMatchers( elementMatchers, setMatchers ) + ); + + // Save selector and tokenization + cached.selector = selector; + } + return cached; +}; + +/** + * A low-level selection function that works with Sizzle's compiled + * selector functions + * @param {String|Function} selector A selector or a pre-compiled + * selector function built with Sizzle.compile + * @param {Element} context + * @param {Array} [results] + * @param {Array} [seed] A set of elements to match against + */ +select = Sizzle.select = function( selector, context, results, seed ) { + var i, tokens, token, type, find, + compiled = typeof selector === "function" && selector, + match = !seed && tokenize( ( selector = compiled.selector || selector ) ); + + results = results || []; + + // Try to minimize operations if there is only one selector in the list and no seed + // (the latter of which guarantees us context) + if ( match.length === 1 ) { + + // Reduce context if the leading compound selector is an ID + tokens = match[ 0 ] = match[ 0 ].slice( 0 ); + if ( tokens.length > 2 && ( token = tokens[ 0 ] ).type === "ID" && + context.nodeType === 9 && documentIsHTML && Expr.relative[ tokens[ 1 ].type ] ) { + + context = ( Expr.find[ "ID" ]( token.matches[ 0 ] + .replace( runescape, funescape ), context ) || [] )[ 0 ]; + if ( !context ) { + return results; + + // Precompiled matchers will still verify ancestry, so step up a level + } else if ( compiled ) { + context = context.parentNode; + } + + selector = selector.slice( tokens.shift().value.length ); + } + + // Fetch a seed set for right-to-left matching + i = matchExpr[ "needsContext" ].test( selector ) ? 
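+ // Illustrative right-to-left seeding: for "div .item" the last findable
+ // token ( ".item" ) supplies candidates via Expr.find, and the compiled
+ // matcher then verifies the remaining "div " ancestry; selectors that
+ // need context (e.g. positional pseudos) skip this shortcut.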
0 : tokens.length;
+ while ( i-- ) {
+ token = tokens[ i ];
+
+ // Abort if we hit a combinator
+ if ( Expr.relative[ ( type = token.type ) ] ) {
+ break;
+ }
+ if ( ( find = Expr.find[ type ] ) ) {
+
+ // Search, expanding context for leading sibling combinators
+ if ( ( seed = find(
+ token.matches[ 0 ].replace( runescape, funescape ),
+ rsibling.test( tokens[ 0 ].type ) && testContext( context.parentNode ) ||
+ context
+ ) ) ) {
+
+ // If seed is empty or no tokens remain, we can return early
+ tokens.splice( i, 1 );
+ selector = seed.length && toSelector( tokens );
+ if ( !selector ) {
+ push.apply( results, seed );
+ return results;
+ }
+
+ break;
+ }
+ }
+ }
+ }
+
+ // Compile and execute a filtering function if one is not provided
+ // Provide `match` to avoid retokenization if we modified the selector above
+ ( compiled || compile( selector, match ) )(
+ seed,
+ context,
+ !documentIsHTML,
+ results,
+ !context || rsibling.test( selector ) && testContext( context.parentNode ) || context
+ );
+ return results;
+};
+
+// One-time assignments
+
+// Sort stability
+support.sortStable = expando.split( "" ).sort( sortOrder ).join( "" ) === expando;
+
+// Support: Chrome 14-35+
+// Always assume duplicates if they aren't passed to the comparison function
+support.detectDuplicates = !!hasDuplicate;
+
+// Initialize against the default document
+setDocument();
+
+// Support: Webkit<537.32 - Safari 6.0.3/Chrome 25 (fixed in Chrome 27)
+// Detached nodes confoundingly follow *each other*
+support.sortDetached = assert( function( el ) {
+
+ // Should return 1, but returns 4 (following)
+ return el.compareDocumentPosition( document.createElement( "fieldset" ) ) & 1;
+} );
+
+// Support: IE<8
+// Prevent attribute/property "interpolation"
+// https://msdn.microsoft.com/en-us/library/ms536429%28VS.85%29.aspx
+if ( !assert( function( el ) {
+ el.innerHTML = "<a href='#'></a>";
+ return el.firstChild.getAttribute( "href" ) === "#";
+} ) ) {
+ addHandle( "type|href|height|width", function( elem, name, isXML ) {
+ if ( !isXML ) {
+ return elem.getAttribute( name, name.toLowerCase() === "type" ? 1 : 2 );
+ }
+ } );
+}
+
+// Support: IE<9
+// Use defaultValue in place of getAttribute("value")
+if ( !support.attributes || !assert( function( el ) {
+ el.innerHTML = "<input/>";
+ el.firstChild.setAttribute( "value", "" );
+ return el.firstChild.getAttribute( "value" ) === "";
+} ) ) {
+ addHandle( "value", function( elem, _name, isXML ) {
+ if ( !isXML && elem.nodeName.toLowerCase() === "input" ) {
+ return elem.defaultValue;
+ }
+ } );
+}
+
+// Support: IE<9
+// Use getAttributeNode to fetch booleans when getAttribute lies
+if ( !assert( function( el ) {
+ return el.getAttribute( "disabled" ) == null;
+} ) ) {
+ addHandle( booleans, function( elem, name, isXML ) {
+ var val;
+ if ( !isXML ) {
+ return elem[ name ] === true ? name.toLowerCase() :
+ ( val = elem.getAttributeNode( name ) ) && val.specified ?
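+ // Illustrative: when getAttribute misreports boolean attributes
+ // (old IE), a true property reports the attribute name (e.g.
+ // "disabled"), and getAttributeNode covers markup-specified values
+ // that the property alone may miss.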
+ val.value : + null; + } + } ); +} + +return Sizzle; + +} )( window ); + + + +jQuery.find = Sizzle; +jQuery.expr = Sizzle.selectors; + +// Deprecated +jQuery.expr[ ":" ] = jQuery.expr.pseudos; +jQuery.uniqueSort = jQuery.unique = Sizzle.uniqueSort; +jQuery.text = Sizzle.getText; +jQuery.isXMLDoc = Sizzle.isXML; +jQuery.contains = Sizzle.contains; +jQuery.escapeSelector = Sizzle.escape; + + + + +var dir = function( elem, dir, until ) { + var matched = [], + truncate = until !== undefined; + + while ( ( elem = elem[ dir ] ) && elem.nodeType !== 9 ) { + if ( elem.nodeType === 1 ) { + if ( truncate && jQuery( elem ).is( until ) ) { + break; + } + matched.push( elem ); + } + } + return matched; +}; + + +var siblings = function( n, elem ) { + var matched = []; + + for ( ; n; n = n.nextSibling ) { + if ( n.nodeType === 1 && n !== elem ) { + matched.push( n ); + } + } + + return matched; +}; + + +var rneedsContext = jQuery.expr.match.needsContext; + + + +function nodeName( elem, name ) { + + return elem.nodeName && elem.nodeName.toLowerCase() === name.toLowerCase(); + +} +var rsingleTag = ( /^<([a-z][^\/\0>:\x20\t\r\n\f]*)[\x20\t\r\n\f]*\/?>(?:<\/\1>|)$/i ); + + + +// Implement the identical functionality for filter and not +function winnow( elements, qualifier, not ) { + if ( isFunction( qualifier ) ) { + return jQuery.grep( elements, function( elem, i ) { + return !!qualifier.call( elem, i, elem ) !== not; + } ); + } + + // Single element + if ( qualifier.nodeType ) { + return jQuery.grep( elements, function( elem ) { + return ( elem === qualifier ) !== not; + } ); + } + + // Arraylike of elements (jQuery, arguments, Array) + if ( typeof qualifier !== "string" ) { + return jQuery.grep( elements, function( elem ) { + return ( indexOf.call( qualifier, elem ) > -1 ) !== not; + } ); + } + + // Filtered directly for both simple and complex selectors + return jQuery.filter( qualifier, elements, not ); +} + +jQuery.filter = function( expr, elems, not ) { + var elem = elems[ 0 ]; + + if ( not ) { + expr = ":not(" + expr + ")"; + } + + if ( elems.length === 1 && elem.nodeType === 1 ) { + return jQuery.find.matchesSelector( elem, expr ) ? [ elem ] : []; + } + + return jQuery.find.matches( expr, jQuery.grep( elems, function( elem ) { + return elem.nodeType === 1; + } ) ); +}; + +jQuery.fn.extend( { + find: function( selector ) { + var i, ret, + len = this.length, + self = this; + + if ( typeof selector !== "string" ) { + return this.pushStack( jQuery( selector ).filter( function() { + for ( i = 0; i < len; i++ ) { + if ( jQuery.contains( self[ i ], this ) ) { + return true; + } + } + } ) ); + } + + ret = this.pushStack( [] ); + + for ( i = 0; i < len; i++ ) { + jQuery.find( selector, self[ i ], ret ); + } + + return len > 1 ? jQuery.uniqueSort( ret ) : ret; + }, + filter: function( selector ) { + return this.pushStack( winnow( this, selector || [], false ) ); + }, + not: function( selector ) { + return this.pushStack( winnow( this, selector || [], true ) ); + }, + is: function( selector ) { + return !!winnow( + this, + + // If this is a positional/relative selector, check membership in the returned set + // so $("p:first").is("p:last") won't return true for a doc with two "p". + typeof selector === "string" && rneedsContext.test( selector ) ? 
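+ // Illustrative: positional selectors are resolved against the document
+ // first so membership can be tested; plain selectors fall through to
+ // winnow, with `selector || []` guarding against null input.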
+ jQuery( selector ) :
+ selector || [],
+ false
+ ).length;
+ }
+} );
+
+
+// Initialize a jQuery object
+
+
+// A central reference to the root jQuery(document)
+var rootjQuery,
+
+ // A simple way to check for HTML strings
+ // Prioritize #id over <tag> to avoid XSS via location.hash (#9521)
+ // Strict HTML recognition (#11290: must start with <)
+ // Shortcut simple #id case for speed
+ rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$/,
+
+ init = jQuery.fn.init = function( selector, context, root ) {
+ var match, elem;
+
+ // HANDLE: $(""), $(null), $(undefined), $(false)
+ if ( !selector ) {
+ return this;
+ }
+
+ // Method init() accepts an alternate rootjQuery
+ // so migrate can support jQuery.sub (gh-2101)
+ root = root || rootjQuery;
+
+ // Handle HTML strings
+ if ( typeof selector === "string" ) {
+ if ( selector[ 0 ] === "<" &&
+ selector[ selector.length - 1 ] === ">" &&
+ selector.length >= 3 ) {
+
+ // Assume that strings that start and end with <> are HTML and skip the regex check
+ match = [ null, selector, null ];
+
+ } else {
+ match = rquickExpr.exec( selector );
+ }
+
+ // Match html or make sure no context is specified for #id
+ if ( match && ( match[ 1 ] || !context ) ) {
+
+ // HANDLE: $(html) -> $(array)
+ if ( match[ 1 ] ) {
+ context = context instanceof jQuery ? context[ 0 ] : context;
+
+ // Option to run scripts is true for back-compat
+ // Intentionally let the error be thrown if parseHTML is not present
+ jQuery.merge( this, jQuery.parseHTML(
+ match[ 1 ],
+ context && context.nodeType ? context.ownerDocument || context : document,
+ true
+ ) );
+
+ // HANDLE: $(html, props)
+ if ( rsingleTag.test( match[ 1 ] ) && jQuery.isPlainObject( context ) ) {
+ for ( match in context ) {
+
+ // Properties of context are called as methods if possible
+ if ( isFunction( this[ match ] ) ) {
+ this[ match ]( context[ match ] );
+
+ // ...and otherwise set as attributes
+ } else {
+ this.attr( match, context[ match ] );
+ }
+ }
+ }
+
+ return this;
+
+ // HANDLE: $(#id)
+ } else {
+ elem = document.getElementById( match[ 2 ] );
+
+ if ( elem ) {
+
+ // Inject the element directly into the jQuery object
+ this[ 0 ] = elem;
+ this.length = 1;
+ }
+ return this;
+ }
+
+ // HANDLE: $(expr, $(...))
+ } else if ( !context || context.jquery ) {
+ return ( context || root ).find( selector );
+
+ // HANDLE: $(expr, context)
+ // (which is just equivalent to: $(context).find(expr)
+ } else {
+ return this.constructor( context ).find( selector );
+ }
+
+ // HANDLE: $(DOMElement)
+ } else if ( selector.nodeType ) {
+ this[ 0 ] = selector;
+ this.length = 1;
+ return this;
+
+ // HANDLE: $(function)
+ // Shortcut for document ready
+ } else if ( isFunction( selector ) ) {
+ return root.ready !== undefined ?
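+ // Illustrative: the common jQuery( function( $ ) { ... } )
+ // document-ready shorthand lands here and is queued through
+ // root.ready; when no ready implementation is present, the
+ // callback runs immediately.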
+ root.ready( selector ) : + + // Execute immediately if ready is not present + selector( jQuery ); + } + + return jQuery.makeArray( selector, this ); + }; + +// Give the init function the jQuery prototype for later instantiation +init.prototype = jQuery.fn; + +// Initialize central reference +rootjQuery = jQuery( document ); + + +var rparentsprev = /^(?:parents|prev(?:Until|All))/, + + // Methods guaranteed to produce a unique set when starting from a unique set + guaranteedUnique = { + children: true, + contents: true, + next: true, + prev: true + }; + +jQuery.fn.extend( { + has: function( target ) { + var targets = jQuery( target, this ), + l = targets.length; + + return this.filter( function() { + var i = 0; + for ( ; i < l; i++ ) { + if ( jQuery.contains( this, targets[ i ] ) ) { + return true; + } + } + } ); + }, + + closest: function( selectors, context ) { + var cur, + i = 0, + l = this.length, + matched = [], + targets = typeof selectors !== "string" && jQuery( selectors ); + + // Positional selectors never match, since there's no _selection_ context + if ( !rneedsContext.test( selectors ) ) { + for ( ; i < l; i++ ) { + for ( cur = this[ i ]; cur && cur !== context; cur = cur.parentNode ) { + + // Always skip document fragments + if ( cur.nodeType < 11 && ( targets ? + targets.index( cur ) > -1 : + + // Don't pass non-elements to Sizzle + cur.nodeType === 1 && + jQuery.find.matchesSelector( cur, selectors ) ) ) { + + matched.push( cur ); + break; + } + } + } + } + + return this.pushStack( matched.length > 1 ? jQuery.uniqueSort( matched ) : matched ); + }, + + // Determine the position of an element within the set + index: function( elem ) { + + // No argument, return index in parent + if ( !elem ) { + return ( this[ 0 ] && this[ 0 ].parentNode ) ? this.first().prevAll().length : -1; + } + + // Index in selector + if ( typeof elem === "string" ) { + return indexOf.call( jQuery( elem ), this[ 0 ] ); + } + + // Locate the position of the desired element + return indexOf.call( this, + + // If it receives a jQuery object, the first element is used + elem.jquery ? elem[ 0 ] : elem + ); + }, + + add: function( selector, context ) { + return this.pushStack( + jQuery.uniqueSort( + jQuery.merge( this.get(), jQuery( selector, context ) ) + ) + ); + }, + + addBack: function( selector ) { + return this.add( selector == null ? + this.prevObject : this.prevObject.filter( selector ) + ); + } +} ); + +function sibling( cur, dir ) { + while ( ( cur = cur[ dir ] ) && cur.nodeType !== 1 ) {} + return cur; +} + +jQuery.each( { + parent: function( elem ) { + var parent = elem.parentNode; + return parent && parent.nodeType !== 11 ? 
parent : null; + }, + parents: function( elem ) { + return dir( elem, "parentNode" ); + }, + parentsUntil: function( elem, _i, until ) { + return dir( elem, "parentNode", until ); + }, + next: function( elem ) { + return sibling( elem, "nextSibling" ); + }, + prev: function( elem ) { + return sibling( elem, "previousSibling" ); + }, + nextAll: function( elem ) { + return dir( elem, "nextSibling" ); + }, + prevAll: function( elem ) { + return dir( elem, "previousSibling" ); + }, + nextUntil: function( elem, _i, until ) { + return dir( elem, "nextSibling", until ); + }, + prevUntil: function( elem, _i, until ) { + return dir( elem, "previousSibling", until ); + }, + siblings: function( elem ) { + return siblings( ( elem.parentNode || {} ).firstChild, elem ); + }, + children: function( elem ) { + return siblings( elem.firstChild ); + }, + contents: function( elem ) { + if ( elem.contentDocument != null && + + // Support: IE 11+ + // elements with no `data` attribute has an object + // `contentDocument` with a `null` prototype. + getProto( elem.contentDocument ) ) { + + return elem.contentDocument; + } + + // Support: IE 9 - 11 only, iOS 7 only, Android Browser <=4.3 only + // Treat the template element as a regular one in browsers that + // don't support it. + if ( nodeName( elem, "template" ) ) { + elem = elem.content || elem; + } + + return jQuery.merge( [], elem.childNodes ); + } +}, function( name, fn ) { + jQuery.fn[ name ] = function( until, selector ) { + var matched = jQuery.map( this, fn, until ); + + if ( name.slice( -5 ) !== "Until" ) { + selector = until; + } + + if ( selector && typeof selector === "string" ) { + matched = jQuery.filter( selector, matched ); + } + + if ( this.length > 1 ) { + + // Remove duplicates + if ( !guaranteedUnique[ name ] ) { + jQuery.uniqueSort( matched ); + } + + // Reverse order for parents* and prev-derivatives + if ( rparentsprev.test( name ) ) { + matched.reverse(); + } + } + + return this.pushStack( matched ); + }; +} ); +var rnothtmlwhite = ( /[^\x20\t\r\n\f]+/g ); + + + +// Convert String-formatted options into Object-formatted ones +function createOptions( options ) { + var object = {}; + jQuery.each( options.match( rnothtmlwhite ) || [], function( _, flag ) { + object[ flag ] = true; + } ); + return object; +} + +/* + * Create a callback list using the following parameters: + * + * options: an optional list of space-separated options that will change how + * the callback list behaves or a more traditional option object + * + * By default a callback list will act like an event callback list and can be + * "fired" multiple times. + * + * Possible options: + * + * once: will ensure the callback list can only be fired once (like a Deferred) + * + * memory: will keep track of previous values and will call any callback added + * after the list has been fired right away with the latest "memorized" + * values (like a Deferred) + * + * unique: will ensure a callback can only be added once (no duplicate in the list) + * + * stopOnFalse: interrupt callings when a callback returns false + * + */ +jQuery.Callbacks = function( options ) { + + // Convert options from String-formatted to Object-formatted if needed + // (we check in cache first) + options = typeof options === "string" ? 
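+ // Illustrative usage: jQuery.Callbacks( "once memory" ) behaves like a
+ // Deferred list: it fires at most once and replays the memorized
+ // arguments to callbacks added after the fact.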
+ createOptions( options ) : + jQuery.extend( {}, options ); + + var // Flag to know if list is currently firing + firing, + + // Last fire value for non-forgettable lists + memory, + + // Flag to know if list was already fired + fired, + + // Flag to prevent firing + locked, + + // Actual callback list + list = [], + + // Queue of execution data for repeatable lists + queue = [], + + // Index of currently firing callback (modified by add/remove as needed) + firingIndex = -1, + + // Fire callbacks + fire = function() { + + // Enforce single-firing + locked = locked || options.once; + + // Execute callbacks for all pending executions, + // respecting firingIndex overrides and runtime changes + fired = firing = true; + for ( ; queue.length; firingIndex = -1 ) { + memory = queue.shift(); + while ( ++firingIndex < list.length ) { + + // Run callback and check for early termination + if ( list[ firingIndex ].apply( memory[ 0 ], memory[ 1 ] ) === false && + options.stopOnFalse ) { + + // Jump to end and forget the data so .add doesn't re-fire + firingIndex = list.length; + memory = false; + } + } + } + + // Forget the data if we're done with it + if ( !options.memory ) { + memory = false; + } + + firing = false; + + // Clean up if we're done firing for good + if ( locked ) { + + // Keep an empty list if we have data for future add calls + if ( memory ) { + list = []; + + // Otherwise, this object is spent + } else { + list = ""; + } + } + }, + + // Actual Callbacks object + self = { + + // Add a callback or a collection of callbacks to the list + add: function() { + if ( list ) { + + // If we have memory from a past run, we should fire after adding + if ( memory && !firing ) { + firingIndex = list.length - 1; + queue.push( memory ); + } + + ( function add( args ) { + jQuery.each( args, function( _, arg ) { + if ( isFunction( arg ) ) { + if ( !options.unique || !self.has( arg ) ) { + list.push( arg ); + } + } else if ( arg && arg.length && toType( arg ) !== "string" ) { + + // Inspect recursively + add( arg ); + } + } ); + } )( arguments ); + + if ( memory && !firing ) { + fire(); + } + } + return this; + }, + + // Remove a callback from the list + remove: function() { + jQuery.each( arguments, function( _, arg ) { + var index; + while ( ( index = jQuery.inArray( arg, list, index ) ) > -1 ) { + list.splice( index, 1 ); + + // Handle firing indexes + if ( index <= firingIndex ) { + firingIndex--; + } + } + } ); + return this; + }, + + // Check if a given callback is in the list. + // If no argument is given, return whether or not list has callbacks attached. + has: function( fn ) { + return fn ? + jQuery.inArray( fn, list ) > -1 : + list.length > 0; + }, + + // Remove all callbacks from the list + empty: function() { + if ( list ) { + list = []; + } + return this; + }, + + // Disable .fire and .add + // Abort any current/pending executions + // Clear all callbacks and values + disable: function() { + locked = queue = []; + list = memory = ""; + return this; + }, + disabled: function() { + return !list; + }, + + // Disable .fire + // Also disable .add unless we have memory (since it would have no effect) + // Abort any pending executions + lock: function() { + locked = queue = []; + if ( !memory && !firing ) { + list = memory = ""; + } + return this; + }, + locked: function() { + return !!locked; + }, + + // Call all callbacks with the given context and arguments + fireWith: function( context, args ) { + if ( !locked ) { + args = args || []; + args = [ context, args.slice ? 
args.slice() : args ]; + queue.push( args ); + if ( !firing ) { + fire(); + } + } + return this; + }, + + // Call all the callbacks with the given arguments + fire: function() { + self.fireWith( this, arguments ); + return this; + }, + + // To know if the callbacks have already been called at least once + fired: function() { + return !!fired; + } + }; + + return self; +}; + + +function Identity( v ) { + return v; +} +function Thrower( ex ) { + throw ex; +} + +function adoptValue( value, resolve, reject, noValue ) { + var method; + + try { + + // Check for promise aspect first to privilege synchronous behavior + if ( value && isFunction( ( method = value.promise ) ) ) { + method.call( value ).done( resolve ).fail( reject ); + + // Other thenables + } else if ( value && isFunction( ( method = value.then ) ) ) { + method.call( value, resolve, reject ); + + // Other non-thenables + } else { + + // Control `resolve` arguments by letting Array#slice cast boolean `noValue` to integer: + // * false: [ value ].slice( 0 ) => resolve( value ) + // * true: [ value ].slice( 1 ) => resolve() + resolve.apply( undefined, [ value ].slice( noValue ) ); + } + + // For Promises/A+, convert exceptions into rejections + // Since jQuery.when doesn't unwrap thenables, we can skip the extra checks appearing in + // Deferred#then to conditionally suppress rejection. + } catch ( value ) { + + // Support: Android 4.0 only + // Strict mode functions invoked without .call/.apply get global-object context + reject.apply( undefined, [ value ] ); + } +} + +jQuery.extend( { + + Deferred: function( func ) { + var tuples = [ + + // action, add listener, callbacks, + // ... .then handlers, argument index, [final state] + [ "notify", "progress", jQuery.Callbacks( "memory" ), + jQuery.Callbacks( "memory" ), 2 ], + [ "resolve", "done", jQuery.Callbacks( "once memory" ), + jQuery.Callbacks( "once memory" ), 0, "resolved" ], + [ "reject", "fail", jQuery.Callbacks( "once memory" ), + jQuery.Callbacks( "once memory" ), 1, "rejected" ] + ], + state = "pending", + promise = { + state: function() { + return state; + }, + always: function() { + deferred.done( arguments ).fail( arguments ); + return this; + }, + "catch": function( fn ) { + return promise.then( null, fn ); + }, + + // Keep pipe for back-compat + pipe: function( /* fnDone, fnFail, fnProgress */ ) { + var fns = arguments; + + return jQuery.Deferred( function( newDefer ) { + jQuery.each( tuples, function( _i, tuple ) { + + // Map tuples (progress, done, fail) to arguments (done, fail, progress) + var fn = isFunction( fns[ tuple[ 4 ] ] ) && fns[ tuple[ 4 ] ]; + + // deferred.progress(function() { bind to newDefer or newDefer.notify }) + // deferred.done(function() { bind to newDefer or newDefer.resolve }) + // deferred.fail(function() { bind to newDefer or newDefer.reject }) + deferred[ tuple[ 1 ] ]( function() { + var returned = fn && fn.apply( this, arguments ); + if ( returned && isFunction( returned.promise ) ) { + returned.promise() + .progress( newDefer.notify ) + .done( newDefer.resolve ) + .fail( newDefer.reject ); + } else { + newDefer[ tuple[ 0 ] + "With" ]( + this, + fn ? 
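+ // Illustrative: a thenable returned by the handler was chained
+ // into newDefer above; any other return value replaces the
+ // incoming arguments, mirroring then()'s value mapping for the
+ // legacy pipe API.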
[ returned ] : arguments + ); + } + } ); + } ); + fns = null; + } ).promise(); + }, + then: function( onFulfilled, onRejected, onProgress ) { + var maxDepth = 0; + function resolve( depth, deferred, handler, special ) { + return function() { + var that = this, + args = arguments, + mightThrow = function() { + var returned, then; + + // Support: Promises/A+ section 2.3.3.3.3 + // https://promisesaplus.com/#point-59 + // Ignore double-resolution attempts + if ( depth < maxDepth ) { + return; + } + + returned = handler.apply( that, args ); + + // Support: Promises/A+ section 2.3.1 + // https://promisesaplus.com/#point-48 + if ( returned === deferred.promise() ) { + throw new TypeError( "Thenable self-resolution" ); + } + + // Support: Promises/A+ sections 2.3.3.1, 3.5 + // https://promisesaplus.com/#point-54 + // https://promisesaplus.com/#point-75 + // Retrieve `then` only once + then = returned && + + // Support: Promises/A+ section 2.3.4 + // https://promisesaplus.com/#point-64 + // Only check objects and functions for thenability + ( typeof returned === "object" || + typeof returned === "function" ) && + returned.then; + + // Handle a returned thenable + if ( isFunction( then ) ) { + + // Special processors (notify) just wait for resolution + if ( special ) { + then.call( + returned, + resolve( maxDepth, deferred, Identity, special ), + resolve( maxDepth, deferred, Thrower, special ) + ); + + // Normal processors (resolve) also hook into progress + } else { + + // ...and disregard older resolution values + maxDepth++; + + then.call( + returned, + resolve( maxDepth, deferred, Identity, special ), + resolve( maxDepth, deferred, Thrower, special ), + resolve( maxDepth, deferred, Identity, + deferred.notifyWith ) + ); + } + + // Handle all other returned values + } else { + + // Only substitute handlers pass on context + // and multiple values (non-spec behavior) + if ( handler !== Identity ) { + that = undefined; + args = [ returned ]; + } + + // Process the value(s) + // Default process is resolve + ( special || deferred.resolveWith )( that, args ); + } + }, + + // Only normal processors (resolve) catch and reject exceptions + process = special ? + mightThrow : + function() { + try { + mightThrow(); + } catch ( e ) { + + if ( jQuery.Deferred.exceptionHook ) { + jQuery.Deferred.exceptionHook( e, + process.stackTrace ); + } + + // Support: Promises/A+ section 2.3.3.3.4.1 + // https://promisesaplus.com/#point-61 + // Ignore post-resolution exceptions + if ( depth + 1 >= maxDepth ) { + + // Only substitute handlers pass on context + // and multiple values (non-spec behavior) + if ( handler !== Thrower ) { + that = undefined; + args = [ e ]; + } + + deferred.rejectWith( that, args ); + } + } + }; + + // Support: Promises/A+ section 2.3.3.3.1 + // https://promisesaplus.com/#point-57 + // Re-resolve promises immediately to dodge false rejection from + // subsequent errors + if ( depth ) { + process(); + } else { + + // Call an optional hook to record the stack, in case of exception + // since it's otherwise lost when execution goes async + if ( jQuery.Deferred.getStackHook ) { + process.stackTrace = jQuery.Deferred.getStackHook(); + } + window.setTimeout( process ); + } + }; + } + + return jQuery.Deferred( function( newDefer ) { + + // progress_handlers.add( ... ) + tuples[ 0 ][ 3 ].add( + resolve( + 0, + newDefer, + isFunction( onProgress ) ? + onProgress : + Identity, + newDefer.notifyWith + ) + ); + + // fulfilled_handlers.add( ... 
) + tuples[ 1 ][ 3 ].add( + resolve( + 0, + newDefer, + isFunction( onFulfilled ) ? + onFulfilled : + Identity + ) + ); + + // rejected_handlers.add( ... ) + tuples[ 2 ][ 3 ].add( + resolve( + 0, + newDefer, + isFunction( onRejected ) ? + onRejected : + Thrower + ) + ); + } ).promise(); + }, + + // Get a promise for this deferred + // If obj is provided, the promise aspect is added to the object + promise: function( obj ) { + return obj != null ? jQuery.extend( obj, promise ) : promise; + } + }, + deferred = {}; + + // Add list-specific methods + jQuery.each( tuples, function( i, tuple ) { + var list = tuple[ 2 ], + stateString = tuple[ 5 ]; + + // promise.progress = list.add + // promise.done = list.add + // promise.fail = list.add + promise[ tuple[ 1 ] ] = list.add; + + // Handle state + if ( stateString ) { + list.add( + function() { + + // state = "resolved" (i.e., fulfilled) + // state = "rejected" + state = stateString; + }, + + // rejected_callbacks.disable + // fulfilled_callbacks.disable + tuples[ 3 - i ][ 2 ].disable, + + // rejected_handlers.disable + // fulfilled_handlers.disable + tuples[ 3 - i ][ 3 ].disable, + + // progress_callbacks.lock + tuples[ 0 ][ 2 ].lock, + + // progress_handlers.lock + tuples[ 0 ][ 3 ].lock + ); + } + + // progress_handlers.fire + // fulfilled_handlers.fire + // rejected_handlers.fire + list.add( tuple[ 3 ].fire ); + + // deferred.notify = function() { deferred.notifyWith(...) } + // deferred.resolve = function() { deferred.resolveWith(...) } + // deferred.reject = function() { deferred.rejectWith(...) } + deferred[ tuple[ 0 ] ] = function() { + deferred[ tuple[ 0 ] + "With" ]( this === deferred ? undefined : this, arguments ); + return this; + }; + + // deferred.notifyWith = list.fireWith + // deferred.resolveWith = list.fireWith + // deferred.rejectWith = list.fireWith + deferred[ tuple[ 0 ] + "With" ] = list.fireWith; + } ); + + // Make the deferred a promise + promise.promise( deferred ); + + // Call given func if any + if ( func ) { + func.call( deferred, deferred ); + } + + // All done! + return deferred; + }, + + // Deferred helper + when: function( singleValue ) { + var + + // count of uncompleted subordinates + remaining = arguments.length, + + // count of unprocessed arguments + i = remaining, + + // subordinate fulfillment data + resolveContexts = Array( i ), + resolveValues = slice.call( arguments ), + + // the primary Deferred + primary = jQuery.Deferred(), + + // subordinate callback factory + updateFunc = function( i ) { + return function( value ) { + resolveContexts[ i ] = this; + resolveValues[ i ] = arguments.length > 1 ? slice.call( arguments ) : value; + if ( !( --remaining ) ) { + primary.resolveWith( resolveContexts, resolveValues ); + } + }; + }; + + // Single- and empty arguments are adopted like Promise.resolve + if ( remaining <= 1 ) { + adoptValue( singleValue, primary.done( updateFunc( i ) ).resolve, primary.reject, + !remaining ); + + // Use .then() to unwrap secondary thenables (cf. gh-3000) + if ( primary.state() === "pending" || + isFunction( resolveValues[ i ] && resolveValues[ i ].then ) ) { + + return primary.then(); + } + } + + // Multiple arguments are aggregated like Promise.all array elements + while ( i-- ) { + adoptValue( resolveValues[ i ], updateFunc( i ), primary.reject ); + } + + return primary.promise(); + } +} ); + + +// These usually indicate a programmer mistake during development, +// warn about them ASAP rather than swallowing them by default. 
+var rerrorNames = /^(Eval|Internal|Range|Reference|Syntax|Type|URI)Error$/; + +jQuery.Deferred.exceptionHook = function( error, stack ) { + + // Support: IE 8 - 9 only + // Console exists when dev tools are open, which can happen at any time + if ( window.console && window.console.warn && error && rerrorNames.test( error.name ) ) { + window.console.warn( "jQuery.Deferred exception: " + error.message, error.stack, stack ); + } +}; + + + + +jQuery.readyException = function( error ) { + window.setTimeout( function() { + throw error; + } ); +}; + + + + +// The deferred used on DOM ready +var readyList = jQuery.Deferred(); + +jQuery.fn.ready = function( fn ) { + + readyList + .then( fn ) + + // Wrap jQuery.readyException in a function so that the lookup + // happens at the time of error handling instead of callback + // registration. + .catch( function( error ) { + jQuery.readyException( error ); + } ); + + return this; +}; + +jQuery.extend( { + + // Is the DOM ready to be used? Set to true once it occurs. + isReady: false, + + // A counter to track how many items to wait for before + // the ready event fires. See #6781 + readyWait: 1, + + // Handle when the DOM is ready + ready: function( wait ) { + + // Abort if there are pending holds or we're already ready + if ( wait === true ? --jQuery.readyWait : jQuery.isReady ) { + return; + } + + // Remember that the DOM is ready + jQuery.isReady = true; + + // If a normal DOM Ready event fired, decrement, and wait if need be + if ( wait !== true && --jQuery.readyWait > 0 ) { + return; + } + + // If there are functions bound, to execute + readyList.resolveWith( document, [ jQuery ] ); + } +} ); + +jQuery.ready.then = readyList.then; + +// The ready event handler and self cleanup method +function completed() { + document.removeEventListener( "DOMContentLoaded", completed ); + window.removeEventListener( "load", completed ); + jQuery.ready(); +} + +// Catch cases where $(document).ready() is called +// after the browser event has already occurred. +// Support: IE <=9 - 10 only +// Older IE sometimes signals "interactive" too soon +if ( document.readyState === "complete" || + ( document.readyState !== "loading" && !document.documentElement.doScroll ) ) { + + // Handle it asynchronously to allow scripts the opportunity to delay ready + window.setTimeout( jQuery.ready ); + +} else { + + // Use the handy event callback + document.addEventListener( "DOMContentLoaded", completed ); + + // A fallback to window.onload, that will always work + window.addEventListener( "load", completed ); +} + + + + +// Multifunctional method to get and set values of a collection +// The value/s can optionally be executed if it's a function +var access = function( elems, fn, key, value, chainable, emptyGet, raw ) { + var i = 0, + len = elems.length, + bulk = key == null; + + // Sets many values + if ( toType( key ) === "object" ) { + chainable = true; + for ( i in key ) { + access( elems, fn, i, key[ i ], true, emptyGet, raw ); + } + + // Sets one value + } else if ( value !== undefined ) { + chainable = true; + + if ( !isFunction( value ) ) { + raw = true; + } + + if ( bulk ) { + + // Bulk operations run against the entire set + if ( raw ) { + fn.call( elems, value ); + fn = null; + + // ...except when executing function values + } else { + bulk = fn; + fn = function( elem, _key, value ) { + return bulk.call( jQuery( elem ), value ); + }; + } + } + + if ( fn ) { + for ( ; i < len; i++ ) { + fn( + elems[ i ], key, raw ? 
+ value : + value.call( elems[ i ], i, fn( elems[ i ], key ) ) + ); + } + } + } + + if ( chainable ) { + return elems; + } + + // Gets + if ( bulk ) { + return fn.call( elems ); + } + + return len ? fn( elems[ 0 ], key ) : emptyGet; +}; + + +// Matches dashed string for camelizing +var rmsPrefix = /^-ms-/, + rdashAlpha = /-([a-z])/g; + +// Used by camelCase as callback to replace() +function fcamelCase( _all, letter ) { + return letter.toUpperCase(); +} + +// Convert dashed to camelCase; used by the css and data modules +// Support: IE <=9 - 11, Edge 12 - 15 +// Microsoft forgot to hump their vendor prefix (#9572) +function camelCase( string ) { + return string.replace( rmsPrefix, "ms-" ).replace( rdashAlpha, fcamelCase ); +} +var acceptData = function( owner ) { + + // Accepts only: + // - Node + // - Node.ELEMENT_NODE + // - Node.DOCUMENT_NODE + // - Object + // - Any + return owner.nodeType === 1 || owner.nodeType === 9 || !( +owner.nodeType ); +}; + + + + +function Data() { + this.expando = jQuery.expando + Data.uid++; +} + +Data.uid = 1; + +Data.prototype = { + + cache: function( owner ) { + + // Check if the owner object already has a cache + var value = owner[ this.expando ]; + + // If not, create one + if ( !value ) { + value = {}; + + // We can accept data for non-element nodes in modern browsers, + // but we should not, see #8335. + // Always return an empty object. + if ( acceptData( owner ) ) { + + // If it is a node unlikely to be stringify-ed or looped over + // use plain assignment + if ( owner.nodeType ) { + owner[ this.expando ] = value; + + // Otherwise secure it in a non-enumerable property + // configurable must be true to allow the property to be + // deleted when data is removed + } else { + Object.defineProperty( owner, this.expando, { + value: value, + configurable: true + } ); + } + } + } + + return value; + }, + set: function( owner, data, value ) { + var prop, + cache = this.cache( owner ); + + // Handle: [ owner, key, value ] args + // Always use camelCase key (gh-2257) + if ( typeof data === "string" ) { + cache[ camelCase( data ) ] = value; + + // Handle: [ owner, { properties } ] args + } else { + + // Copy the properties one-by-one to the cache object + for ( prop in data ) { + cache[ camelCase( prop ) ] = data[ prop ]; + } + } + return cache; + }, + get: function( owner, key ) { + return key === undefined ? + this.cache( owner ) : + + // Always use camelCase key (gh-2257) + owner[ this.expando ] && owner[ this.expando ][ camelCase( key ) ]; + }, + access: function( owner, key, value ) { + + // In cases where either: + // + // 1. No key was specified + // 2. A string key was specified, but no value provided + // + // Take the "read" path and allow the get method to determine + // which value to return, respectively either: + // + // 1. The entire cache object + // 2. The data stored at the key + // + if ( key === undefined || + ( ( key && typeof key === "string" ) && value === undefined ) ) { + + return this.get( owner, key ); + } + + // When the key is not a string, or both a key and value + // are specified, set or extend (existing objects) with either: + // + // 1. An object of properties + // 2. A key and value + // + this.set( owner, key, value ); + + // Since the "set" path can have two possible entry points + // return the expected data based on which path was taken[*] + return value !== undefined ? 
value : key; + }, + remove: function( owner, key ) { + var i, + cache = owner[ this.expando ]; + + if ( cache === undefined ) { + return; + } + + if ( key !== undefined ) { + + // Support array or space separated string of keys + if ( Array.isArray( key ) ) { + + // If key is an array of keys... + // We always set camelCase keys, so remove that. + key = key.map( camelCase ); + } else { + key = camelCase( key ); + + // If a key with the spaces exists, use it. + // Otherwise, create an array by matching non-whitespace + key = key in cache ? + [ key ] : + ( key.match( rnothtmlwhite ) || [] ); + } + + i = key.length; + + while ( i-- ) { + delete cache[ key[ i ] ]; + } + } + + // Remove the expando if there's no more data + if ( key === undefined || jQuery.isEmptyObject( cache ) ) { + + // Support: Chrome <=35 - 45 + // Webkit & Blink performance suffers when deleting properties + // from DOM nodes, so set to undefined instead + // https://bugs.chromium.org/p/chromium/issues/detail?id=378607 (bug restricted) + if ( owner.nodeType ) { + owner[ this.expando ] = undefined; + } else { + delete owner[ this.expando ]; + } + } + }, + hasData: function( owner ) { + var cache = owner[ this.expando ]; + return cache !== undefined && !jQuery.isEmptyObject( cache ); + } +}; +var dataPriv = new Data(); + +var dataUser = new Data(); + + + +// Implementation Summary +// +// 1. Enforce API surface and semantic compatibility with 1.9.x branch +// 2. Improve the module's maintainability by reducing the storage +// paths to a single mechanism. +// 3. Use the same single mechanism to support "private" and "user" data. +// 4. _Never_ expose "private" data to user code (TODO: Drop _data, _removeData) +// 5. Avoid exposing implementation details on user objects (eg. expando properties) +// 6. Provide a clear path for implementation upgrade to WeakMap in 2014 + +var rbrace = /^(?:\{[\w\W]*\}|\[[\w\W]*\])$/, + rmultiDash = /[A-Z]/g; + +function getData( data ) { + if ( data === "true" ) { + return true; + } + + if ( data === "false" ) { + return false; + } + + if ( data === "null" ) { + return null; + } + + // Only convert to a number if it doesn't change the string + if ( data === +data + "" ) { + return +data; + } + + if ( rbrace.test( data ) ) { + return JSON.parse( data ); + } + + return data; +} + +function dataAttr( elem, key, data ) { + var name; + + // If nothing was found internally, try to fetch any + // data from the HTML5 data-* attribute + if ( data === undefined && elem.nodeType === 1 ) { + name = "data-" + key.replace( rmultiDash, "-$&" ).toLowerCase(); + data = elem.getAttribute( name ); + + if ( typeof data === "string" ) { + try { + data = getData( data ); + } catch ( e ) {} + + // Make sure we set the data so it isn't changed later + dataUser.set( elem, key, data ); + } else { + data = undefined; + } + } + return data; +} + +jQuery.extend( { + hasData: function( elem ) { + return dataUser.hasData( elem ) || dataPriv.hasData( elem ); + }, + + data: function( elem, name, data ) { + return dataUser.access( elem, name, data ); + }, + + removeData: function( elem, name ) { + dataUser.remove( elem, name ); + }, + + // TODO: Now that all calls to _data and _removeData have been replaced + // with direct calls to dataPriv methods, these can be deprecated. 
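+
+	// Editor's usage sketch (illustrative, not part of the bundled library):
+	// dataUser backs the public data API above and the chainable .data()
+	// method defined on jQuery.fn below; keys are stored camelCased, and
+	// data-* attributes are parsed lazily through dataAttr()/getData(). In a
+	// page using this bundle, given an element <div id="box" data-row-id="7">:
+	//
+	//     jQuery( "#box" ).data( "rowId" );        // 7, a number (from data-row-id)
+	//     jQuery( "#box" ).data( "state", "on" );  // stored in the cache, not the DOM
+	//     jQuery( "#box" ).data( "state" );        // "on"
+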
+ _data: function( elem, name, data ) { + return dataPriv.access( elem, name, data ); + }, + + _removeData: function( elem, name ) { + dataPriv.remove( elem, name ); + } +} ); + +jQuery.fn.extend( { + data: function( key, value ) { + var i, name, data, + elem = this[ 0 ], + attrs = elem && elem.attributes; + + // Gets all values + if ( key === undefined ) { + if ( this.length ) { + data = dataUser.get( elem ); + + if ( elem.nodeType === 1 && !dataPriv.get( elem, "hasDataAttrs" ) ) { + i = attrs.length; + while ( i-- ) { + + // Support: IE 11 only + // The attrs elements can be null (#14894) + if ( attrs[ i ] ) { + name = attrs[ i ].name; + if ( name.indexOf( "data-" ) === 0 ) { + name = camelCase( name.slice( 5 ) ); + dataAttr( elem, name, data[ name ] ); + } + } + } + dataPriv.set( elem, "hasDataAttrs", true ); + } + } + + return data; + } + + // Sets multiple values + if ( typeof key === "object" ) { + return this.each( function() { + dataUser.set( this, key ); + } ); + } + + return access( this, function( value ) { + var data; + + // The calling jQuery object (element matches) is not empty + // (and therefore has an element appears at this[ 0 ]) and the + // `value` parameter was not undefined. An empty jQuery object + // will result in `undefined` for elem = this[ 0 ] which will + // throw an exception if an attempt to read a data cache is made. + if ( elem && value === undefined ) { + + // Attempt to get data from the cache + // The key will always be camelCased in Data + data = dataUser.get( elem, key ); + if ( data !== undefined ) { + return data; + } + + // Attempt to "discover" the data in + // HTML5 custom data-* attrs + data = dataAttr( elem, key ); + if ( data !== undefined ) { + return data; + } + + // We tried really hard, but the data doesn't exist. + return; + } + + // Set the data... 
+ this.each( function() { + + // We always store the camelCased key + dataUser.set( this, key, value ); + } ); + }, null, value, arguments.length > 1, null, true ); + }, + + removeData: function( key ) { + return this.each( function() { + dataUser.remove( this, key ); + } ); + } +} ); + + +jQuery.extend( { + queue: function( elem, type, data ) { + var queue; + + if ( elem ) { + type = ( type || "fx" ) + "queue"; + queue = dataPriv.get( elem, type ); + + // Speed up dequeue by getting out quickly if this is just a lookup + if ( data ) { + if ( !queue || Array.isArray( data ) ) { + queue = dataPriv.access( elem, type, jQuery.makeArray( data ) ); + } else { + queue.push( data ); + } + } + return queue || []; + } + }, + + dequeue: function( elem, type ) { + type = type || "fx"; + + var queue = jQuery.queue( elem, type ), + startLength = queue.length, + fn = queue.shift(), + hooks = jQuery._queueHooks( elem, type ), + next = function() { + jQuery.dequeue( elem, type ); + }; + + // If the fx queue is dequeued, always remove the progress sentinel + if ( fn === "inprogress" ) { + fn = queue.shift(); + startLength--; + } + + if ( fn ) { + + // Add a progress sentinel to prevent the fx queue from being + // automatically dequeued + if ( type === "fx" ) { + queue.unshift( "inprogress" ); + } + + // Clear up the last queue stop function + delete hooks.stop; + fn.call( elem, next, hooks ); + } + + if ( !startLength && hooks ) { + hooks.empty.fire(); + } + }, + + // Not public - generate a queueHooks object, or return the current one + _queueHooks: function( elem, type ) { + var key = type + "queueHooks"; + return dataPriv.get( elem, key ) || dataPriv.access( elem, key, { + empty: jQuery.Callbacks( "once memory" ).add( function() { + dataPriv.remove( elem, [ type + "queue", key ] ); + } ) + } ); + } +} ); + +jQuery.fn.extend( { + queue: function( type, data ) { + var setter = 2; + + if ( typeof type !== "string" ) { + data = type; + type = "fx"; + setter--; + } + + if ( arguments.length < setter ) { + return jQuery.queue( this[ 0 ], type ); + } + + return data === undefined ? 
+ this : + this.each( function() { + var queue = jQuery.queue( this, type, data ); + + // Ensure a hooks for this queue + jQuery._queueHooks( this, type ); + + if ( type === "fx" && queue[ 0 ] !== "inprogress" ) { + jQuery.dequeue( this, type ); + } + } ); + }, + dequeue: function( type ) { + return this.each( function() { + jQuery.dequeue( this, type ); + } ); + }, + clearQueue: function( type ) { + return this.queue( type || "fx", [] ); + }, + + // Get a promise resolved when queues of a certain type + // are emptied (fx is the type by default) + promise: function( type, obj ) { + var tmp, + count = 1, + defer = jQuery.Deferred(), + elements = this, + i = this.length, + resolve = function() { + if ( !( --count ) ) { + defer.resolveWith( elements, [ elements ] ); + } + }; + + if ( typeof type !== "string" ) { + obj = type; + type = undefined; + } + type = type || "fx"; + + while ( i-- ) { + tmp = dataPriv.get( elements[ i ], type + "queueHooks" ); + if ( tmp && tmp.empty ) { + count++; + tmp.empty.add( resolve ); + } + } + resolve(); + return defer.promise( obj ); + } +} ); +var pnum = ( /[+-]?(?:\d*\.|)\d+(?:[eE][+-]?\d+|)/ ).source; + +var rcssNum = new RegExp( "^(?:([+-])=|)(" + pnum + ")([a-z%]*)$", "i" ); + + +var cssExpand = [ "Top", "Right", "Bottom", "Left" ]; + +var documentElement = document.documentElement; + + + + var isAttached = function( elem ) { + return jQuery.contains( elem.ownerDocument, elem ); + }, + composed = { composed: true }; + + // Support: IE 9 - 11+, Edge 12 - 18+, iOS 10.0 - 10.2 only + // Check attachment across shadow DOM boundaries when possible (gh-3504) + // Support: iOS 10.0-10.2 only + // Early iOS 10 versions support `attachShadow` but not `getRootNode`, + // leading to errors. We need to check for `getRootNode`. + if ( documentElement.getRootNode ) { + isAttached = function( elem ) { + return jQuery.contains( elem.ownerDocument, elem ) || + elem.getRootNode( composed ) === elem.ownerDocument; + }; + } +var isHiddenWithinTree = function( elem, el ) { + + // isHiddenWithinTree might be called from jQuery#filter function; + // in that case, element will be second argument + elem = el || elem; + + // Inline style trumps all + return elem.style.display === "none" || + elem.style.display === "" && + + // Otherwise, check computed style + // Support: Firefox <=43 - 45 + // Disconnected elements can have computed display: none, so first confirm that elem is + // in the document. + isAttached( elem ) && + + jQuery.css( elem, "display" ) === "none"; + }; + + + +function adjustCSS( elem, prop, valueParts, tween ) { + var adjusted, scale, + maxIterations = 20, + currentValue = tween ? + function() { + return tween.cur(); + } : + function() { + return jQuery.css( elem, prop, "" ); + }, + initial = currentValue(), + unit = valueParts && valueParts[ 3 ] || ( jQuery.cssNumber[ prop ] ? 
"" : "px" ), + + // Starting value computation is required for potential unit mismatches + initialInUnit = elem.nodeType && + ( jQuery.cssNumber[ prop ] || unit !== "px" && +initial ) && + rcssNum.exec( jQuery.css( elem, prop ) ); + + if ( initialInUnit && initialInUnit[ 3 ] !== unit ) { + + // Support: Firefox <=54 + // Halve the iteration target value to prevent interference from CSS upper bounds (gh-2144) + initial = initial / 2; + + // Trust units reported by jQuery.css + unit = unit || initialInUnit[ 3 ]; + + // Iteratively approximate from a nonzero starting point + initialInUnit = +initial || 1; + + while ( maxIterations-- ) { + + // Evaluate and update our best guess (doubling guesses that zero out). + // Finish if the scale equals or crosses 1 (making the old*new product non-positive). + jQuery.style( elem, prop, initialInUnit + unit ); + if ( ( 1 - scale ) * ( 1 - ( scale = currentValue() / initial || 0.5 ) ) <= 0 ) { + maxIterations = 0; + } + initialInUnit = initialInUnit / scale; + + } + + initialInUnit = initialInUnit * 2; + jQuery.style( elem, prop, initialInUnit + unit ); + + // Make sure we update the tween properties later on + valueParts = valueParts || []; + } + + if ( valueParts ) { + initialInUnit = +initialInUnit || +initial || 0; + + // Apply relative offset (+=/-=) if specified + adjusted = valueParts[ 1 ] ? + initialInUnit + ( valueParts[ 1 ] + 1 ) * valueParts[ 2 ] : + +valueParts[ 2 ]; + if ( tween ) { + tween.unit = unit; + tween.start = initialInUnit; + tween.end = adjusted; + } + } + return adjusted; +} + + +var defaultDisplayMap = {}; + +function getDefaultDisplay( elem ) { + var temp, + doc = elem.ownerDocument, + nodeName = elem.nodeName, + display = defaultDisplayMap[ nodeName ]; + + if ( display ) { + return display; + } + + temp = doc.body.appendChild( doc.createElement( nodeName ) ); + display = jQuery.css( temp, "display" ); + + temp.parentNode.removeChild( temp ); + + if ( display === "none" ) { + display = "block"; + } + defaultDisplayMap[ nodeName ] = display; + + return display; +} + +function showHide( elements, show ) { + var display, elem, + values = [], + index = 0, + length = elements.length; + + // Determine new display value for elements that need to change + for ( ; index < length; index++ ) { + elem = elements[ index ]; + if ( !elem.style ) { + continue; + } + + display = elem.style.display; + if ( show ) { + + // Since we force visibility upon cascade-hidden elements, an immediate (and slow) + // check is required in this first loop unless we have a nonempty display value (either + // inline or about-to-be-restored) + if ( display === "none" ) { + values[ index ] = dataPriv.get( elem, "display" ) || null; + if ( !values[ index ] ) { + elem.style.display = ""; + } + } + if ( elem.style.display === "" && isHiddenWithinTree( elem ) ) { + values[ index ] = getDefaultDisplay( elem ); + } + } else { + if ( display !== "none" ) { + values[ index ] = "none"; + + // Remember what we're overwriting + dataPriv.set( elem, "display", display ); + } + } + } + + // Set the display of the elements in a second loop to avoid constant reflow + for ( index = 0; index < length; index++ ) { + if ( values[ index ] != null ) { + elements[ index ].style.display = values[ index ]; + } + } + + return elements; +} + +jQuery.fn.extend( { + show: function() { + return showHide( this, true ); + }, + hide: function() { + return showHide( this ); + }, + toggle: function( state ) { + if ( typeof state === "boolean" ) { + return state ? 
this.show() : this.hide();
+    }
+
+    return this.each( function() {
+        if ( isHiddenWithinTree( this ) ) {
+            jQuery( this ).show();
+        } else {
+            jQuery( this ).hide();
+        }
+    } );
+}
+} );
+var rcheckableType = ( /^(?:checkbox|radio)$/i );
+
+var rtagName = ( /<([a-z][^\/\0>\x20\t\r\n\f]*)/i );
+
+var rscriptType = ( /^$|^module$|\/(?:java|ecma)script/i );
+
+
+
+( function() {
+    var fragment = document.createDocumentFragment(),
+        div = fragment.appendChild( document.createElement( "div" ) ),
+        input = document.createElement( "input" );
+
+    // Support: Android 4.0 - 4.3 only
+    // Check state lost if the name is set (#11217)
+    // Support: Windows Web Apps (WWA)
+    // `name` and `type` must use .setAttribute for WWA (#14901)
+    input.setAttribute( "type", "radio" );
+    input.setAttribute( "checked", "checked" );
+    input.setAttribute( "name", "t" );
+
+    div.appendChild( input );
+
+    // Support: Android <=4.1 only
+    // Older WebKit doesn't clone checked state correctly in fragments
+    support.checkClone = div.cloneNode( true ).cloneNode( true ).lastChild.checked;
+
+    // Support: IE <=11 only
+    // Make sure textarea (and checkbox) defaultValue is properly cloned
+    div.innerHTML = "<textarea>x</textarea>";
+    support.noCloneChecked = !!div.cloneNode( true ).lastChild.defaultValue;
+
+    // Support: IE <=9 only
+    // IE <=9 replaces <option> elements with their contents when inserted outside of
+    // the select element.
+    div.innerHTML = "<option></option>";
+    support.option = !!div.lastChild;
+} )();
+
+
+// We have to close these tags to support XHTML (#13200)
+var wrapMap = {
+
+    // XHTML parsers do not magically insert elements in the
+    // same way that tag soup parsers do. So we cannot shorten
+    // this by omitting <tbody> or other required elements.
+    thead: [ 1, "<table>", "</table>" ],
+    col: [ 2, "<table><colgroup>", "</colgroup></table>" ],
+    tr: [ 2, "<table><tbody>", "</tbody></table>" ],
+    td: [ 3, "<table><tbody><tr>", "</tr></tbody></table>" ],
+
+    _default: [ 0, "", "" ]
+};
+
+wrapMap.tbody = wrapMap.tfoot = wrapMap.colgroup = wrapMap.caption = wrapMap.thead;
+wrapMap.th = wrapMap.td;
+
+// Support: IE <=9 only
+if ( !support.option ) {
+    wrapMap.optgroup = wrapMap.option = [ 1, "<select multiple='multiple'>", "</select>" ];
+}
+
+
+function getAll( context, tag ) {
+
+    // Support: IE <=9 - 11 only
+    // Use typeof to avoid zero-argument method invocation on host objects (#15151)
+    var ret;
+
+    if ( typeof context.getElementsByTagName !== "undefined" ) {
+        ret = context.getElementsByTagName( tag || "*" );
+
+    } else if ( typeof context.querySelectorAll !== "undefined" ) {
+        ret = context.querySelectorAll( tag || "*" );
+
+    } else {
+        ret = [];
+    }
+
+    if ( tag === undefined || tag && nodeName( context, tag ) ) {
+        return jQuery.merge( [ context ], ret );
+    }
+
+    return ret;
+}
+
+
+// Mark scripts as having already been evaluated
+function setGlobalEval( elems, refElements ) {
+    var i = 0,
+        l = elems.length;
+
+    for ( ; i < l; i++ ) {
+        dataPriv.set(
+            elems[ i ],
+            "globalEval",
+            !refElements || dataPriv.get( refElements[ i ], "globalEval" )
+        );
+    }
+}
+
+
+var rhtml = /<|&#?\w+;/;
+
+function buildFragment( elems, context, scripts, selection, ignored ) {
+    var elem, tmp, tag, wrap, attached, j,
+        fragment = context.createDocumentFragment(),
+        nodes = [],
+        i = 0,
+        l = elems.length;
+
+    for ( ; i < l; i++ ) {
+        elem = elems[ i ];
+
+        if ( elem || elem === 0 ) {
+
+            // Add nodes directly
+            if ( toType( elem ) === "object" ) {
+
+                // Support: Android <=4.0 only, PhantomJS 1 only
+                // push.apply(_, arraylike) throws on ancient WebKit
+                jQuery.merge( nodes, elem.nodeType ? [ elem ] : elem );
+
+            // Convert non-html into a text node
+            } else if ( !rhtml.test( elem ) ) {
+                nodes.push( context.createTextNode( elem ) );
+
+            // Convert html into DOM nodes
+            } else {
+                tmp = tmp || fragment.appendChild( context.createElement( "div" ) );
+
+                // Deserialize a standard representation
+                tag = ( rtagName.exec( elem ) || [ "", "" ] )[ 1 ].toLowerCase();
+                wrap = wrapMap[ tag ] || wrapMap._default;
+                tmp.innerHTML = wrap[ 1 ] + jQuery.htmlPrefilter( elem ) + wrap[ 2 ];
+
+                // Descend through wrappers to the right content
+                j = wrap[ 0 ];
+                while ( j-- ) {
+                    tmp = tmp.lastChild;
+                }
+
+                // Support: Android <=4.0 only, PhantomJS 1 only
+                // push.apply(_, arraylike) throws on ancient WebKit
+                jQuery.merge( nodes, tmp.childNodes );
+
+                // Remember the top-level container
+                tmp = fragment.firstChild;
+
+                // Ensure the created nodes are orphaned (#12392)
+                tmp.textContent = "";
+            }
+        }
+    }
+
+    // Remove wrapper from fragment
+    fragment.textContent = "";
+
+    i = 0;
+    while ( ( elem = nodes[ i++ ] ) ) {
+
+        // Skip elements already in the context collection (trac-4087)
+        if ( selection && jQuery.inArray( elem, selection ) > -1 ) {
+            if ( ignored ) {
+                ignored.push( elem );
+            }
+            continue;
+        }
+
+        attached = isAttached( elem );
+
+        // Append to fragment
+        tmp = getAll( fragment.appendChild( elem ), "script" );
+
+        // Preserve script evaluation history
+        if ( attached ) {
+            setGlobalEval( tmp );
+        }
+
+        // Capture executables
+        if ( scripts ) {
+            j = 0;
+            while ( ( elem = tmp[ j++ ] ) ) {
+                if ( rscriptType.test( elem.type || "" ) ) {
+                    scripts.push( elem );
+                }
+            }
+        }
+    }
+
+    return fragment;
+}
+
+
+var rtypenamespace = /^([^.]*)(?:\.(.+)|)/;
+
+function returnTrue() {
+    return true;
+}
+
+function returnFalse() {
+    return false;
+}
+
+// Support: IE <=9 - 11+
+// focus() and blur() are asynchronous, except when they are no-op.
+// So expect focus to be synchronous when the element is already active, +// and blur to be synchronous when the element is not already active. +// (focus and blur are always synchronous in other supported browsers, +// this just defines when we can count on it). +function expectSync( elem, type ) { + return ( elem === safeActiveElement() ) === ( type === "focus" ); +} + +// Support: IE <=9 only +// Accessing document.activeElement can throw unexpectedly +// https://bugs.jquery.com/ticket/13393 +function safeActiveElement() { + try { + return document.activeElement; + } catch ( err ) { } +} + +function on( elem, types, selector, data, fn, one ) { + var origFn, type; + + // Types can be a map of types/handlers + if ( typeof types === "object" ) { + + // ( types-Object, selector, data ) + if ( typeof selector !== "string" ) { + + // ( types-Object, data ) + data = data || selector; + selector = undefined; + } + for ( type in types ) { + on( elem, type, selector, data, types[ type ], one ); + } + return elem; + } + + if ( data == null && fn == null ) { + + // ( types, fn ) + fn = selector; + data = selector = undefined; + } else if ( fn == null ) { + if ( typeof selector === "string" ) { + + // ( types, selector, fn ) + fn = data; + data = undefined; + } else { + + // ( types, data, fn ) + fn = data; + data = selector; + selector = undefined; + } + } + if ( fn === false ) { + fn = returnFalse; + } else if ( !fn ) { + return elem; + } + + if ( one === 1 ) { + origFn = fn; + fn = function( event ) { + + // Can use an empty set, since event contains the info + jQuery().off( event ); + return origFn.apply( this, arguments ); + }; + + // Use same guid so caller can remove using origFn + fn.guid = origFn.guid || ( origFn.guid = jQuery.guid++ ); + } + return elem.each( function() { + jQuery.event.add( this, types, fn, data, selector ); + } ); +} + +/* + * Helper functions for managing events -- not part of the public interface. + * Props to Dean Edwards' addEvent library for many of the ideas. + */ +jQuery.event = { + + global: {}, + + add: function( elem, types, handler, data, selector ) { + + var handleObjIn, eventHandle, tmp, + events, t, handleObj, + special, handlers, type, namespaces, origType, + elemData = dataPriv.get( elem ); + + // Only attach events to objects that accept data + if ( !acceptData( elem ) ) { + return; + } + + // Caller can pass in an object of custom data in lieu of the handler + if ( handler.handler ) { + handleObjIn = handler; + handler = handleObjIn.handler; + selector = handleObjIn.selector; + } + + // Ensure that invalid selectors throw exceptions at attach time + // Evaluate against documentElement in case elem is a non-element node (e.g., document) + if ( selector ) { + jQuery.find.matchesSelector( documentElement, selector ); + } + + // Make sure that the handler has a unique ID, used to find/remove it later + if ( !handler.guid ) { + handler.guid = jQuery.guid++; + } + + // Init the element's event structure and main handler, if this is the first + if ( !( events = elemData.events ) ) { + events = elemData.events = Object.create( null ); + } + if ( !( eventHandle = elemData.handle ) ) { + eventHandle = elemData.handle = function( e ) { + + // Discard the second event of a jQuery.event.trigger() and + // when an event is called after a page has unloaded + return typeof jQuery !== "undefined" && jQuery.event.triggered !== e.type ? 
+ jQuery.event.dispatch.apply( elem, arguments ) : undefined; + }; + } + + // Handle multiple events separated by a space + types = ( types || "" ).match( rnothtmlwhite ) || [ "" ]; + t = types.length; + while ( t-- ) { + tmp = rtypenamespace.exec( types[ t ] ) || []; + type = origType = tmp[ 1 ]; + namespaces = ( tmp[ 2 ] || "" ).split( "." ).sort(); + + // There *must* be a type, no attaching namespace-only handlers + if ( !type ) { + continue; + } + + // If event changes its type, use the special event handlers for the changed type + special = jQuery.event.special[ type ] || {}; + + // If selector defined, determine special event api type, otherwise given type + type = ( selector ? special.delegateType : special.bindType ) || type; + + // Update special based on newly reset type + special = jQuery.event.special[ type ] || {}; + + // handleObj is passed to all event handlers + handleObj = jQuery.extend( { + type: type, + origType: origType, + data: data, + handler: handler, + guid: handler.guid, + selector: selector, + needsContext: selector && jQuery.expr.match.needsContext.test( selector ), + namespace: namespaces.join( "." ) + }, handleObjIn ); + + // Init the event handler queue if we're the first + if ( !( handlers = events[ type ] ) ) { + handlers = events[ type ] = []; + handlers.delegateCount = 0; + + // Only use addEventListener if the special events handler returns false + if ( !special.setup || + special.setup.call( elem, data, namespaces, eventHandle ) === false ) { + + if ( elem.addEventListener ) { + elem.addEventListener( type, eventHandle ); + } + } + } + + if ( special.add ) { + special.add.call( elem, handleObj ); + + if ( !handleObj.handler.guid ) { + handleObj.handler.guid = handler.guid; + } + } + + // Add to the element's handler list, delegates in front + if ( selector ) { + handlers.splice( handlers.delegateCount++, 0, handleObj ); + } else { + handlers.push( handleObj ); + } + + // Keep track of which events have ever been used, for event optimization + jQuery.event.global[ type ] = true; + } + + }, + + // Detach an event or set of events from an element + remove: function( elem, types, handler, selector, mappedTypes ) { + + var j, origCount, tmp, + events, t, handleObj, + special, handlers, type, namespaces, origType, + elemData = dataPriv.hasData( elem ) && dataPriv.get( elem ); + + if ( !elemData || !( events = elemData.events ) ) { + return; + } + + // Once for each type.namespace in types; type may be omitted + types = ( types || "" ).match( rnothtmlwhite ) || [ "" ]; + t = types.length; + while ( t-- ) { + tmp = rtypenamespace.exec( types[ t ] ) || []; + type = origType = tmp[ 1 ]; + namespaces = ( tmp[ 2 ] || "" ).split( "." ).sort(); + + // Unbind all events (on this namespace, if provided) for the element + if ( !type ) { + for ( type in events ) { + jQuery.event.remove( elem, type + types[ t ], handler, selector, true ); + } + continue; + } + + special = jQuery.event.special[ type ] || {}; + type = ( selector ? 
special.delegateType : special.bindType ) || type; + handlers = events[ type ] || []; + tmp = tmp[ 2 ] && + new RegExp( "(^|\\.)" + namespaces.join( "\\.(?:.*\\.|)" ) + "(\\.|$)" ); + + // Remove matching events + origCount = j = handlers.length; + while ( j-- ) { + handleObj = handlers[ j ]; + + if ( ( mappedTypes || origType === handleObj.origType ) && + ( !handler || handler.guid === handleObj.guid ) && + ( !tmp || tmp.test( handleObj.namespace ) ) && + ( !selector || selector === handleObj.selector || + selector === "**" && handleObj.selector ) ) { + handlers.splice( j, 1 ); + + if ( handleObj.selector ) { + handlers.delegateCount--; + } + if ( special.remove ) { + special.remove.call( elem, handleObj ); + } + } + } + + // Remove generic event handler if we removed something and no more handlers exist + // (avoids potential for endless recursion during removal of special event handlers) + if ( origCount && !handlers.length ) { + if ( !special.teardown || + special.teardown.call( elem, namespaces, elemData.handle ) === false ) { + + jQuery.removeEvent( elem, type, elemData.handle ); + } + + delete events[ type ]; + } + } + + // Remove data and the expando if it's no longer used + if ( jQuery.isEmptyObject( events ) ) { + dataPriv.remove( elem, "handle events" ); + } + }, + + dispatch: function( nativeEvent ) { + + var i, j, ret, matched, handleObj, handlerQueue, + args = new Array( arguments.length ), + + // Make a writable jQuery.Event from the native event object + event = jQuery.event.fix( nativeEvent ), + + handlers = ( + dataPriv.get( this, "events" ) || Object.create( null ) + )[ event.type ] || [], + special = jQuery.event.special[ event.type ] || {}; + + // Use the fix-ed jQuery.Event rather than the (read-only) native event + args[ 0 ] = event; + + for ( i = 1; i < arguments.length; i++ ) { + args[ i ] = arguments[ i ]; + } + + event.delegateTarget = this; + + // Call the preDispatch hook for the mapped type, and let it bail if desired + if ( special.preDispatch && special.preDispatch.call( this, event ) === false ) { + return; + } + + // Determine handlers + handlerQueue = jQuery.event.handlers.call( this, event, handlers ); + + // Run delegates first; they may want to stop propagation beneath us + i = 0; + while ( ( matched = handlerQueue[ i++ ] ) && !event.isPropagationStopped() ) { + event.currentTarget = matched.elem; + + j = 0; + while ( ( handleObj = matched.handlers[ j++ ] ) && + !event.isImmediatePropagationStopped() ) { + + // If the event is namespaced, then each handler is only invoked if it is + // specially universal or its namespaces are a superset of the event's. 
+ if ( !event.rnamespace || handleObj.namespace === false || + event.rnamespace.test( handleObj.namespace ) ) { + + event.handleObj = handleObj; + event.data = handleObj.data; + + ret = ( ( jQuery.event.special[ handleObj.origType ] || {} ).handle || + handleObj.handler ).apply( matched.elem, args ); + + if ( ret !== undefined ) { + if ( ( event.result = ret ) === false ) { + event.preventDefault(); + event.stopPropagation(); + } + } + } + } + } + + // Call the postDispatch hook for the mapped type + if ( special.postDispatch ) { + special.postDispatch.call( this, event ); + } + + return event.result; + }, + + handlers: function( event, handlers ) { + var i, handleObj, sel, matchedHandlers, matchedSelectors, + handlerQueue = [], + delegateCount = handlers.delegateCount, + cur = event.target; + + // Find delegate handlers + if ( delegateCount && + + // Support: IE <=9 + // Black-hole SVG instance trees (trac-13180) + cur.nodeType && + + // Support: Firefox <=42 + // Suppress spec-violating clicks indicating a non-primary pointer button (trac-3861) + // https://www.w3.org/TR/DOM-Level-3-Events/#event-type-click + // Support: IE 11 only + // ...but not arrow key "clicks" of radio inputs, which can have `button` -1 (gh-2343) + !( event.type === "click" && event.button >= 1 ) ) { + + for ( ; cur !== this; cur = cur.parentNode || this ) { + + // Don't check non-elements (#13208) + // Don't process clicks on disabled elements (#6911, #8165, #11382, #11764) + if ( cur.nodeType === 1 && !( event.type === "click" && cur.disabled === true ) ) { + matchedHandlers = []; + matchedSelectors = {}; + for ( i = 0; i < delegateCount; i++ ) { + handleObj = handlers[ i ]; + + // Don't conflict with Object.prototype properties (#13203) + sel = handleObj.selector + " "; + + if ( matchedSelectors[ sel ] === undefined ) { + matchedSelectors[ sel ] = handleObj.needsContext ? + jQuery( sel, this ).index( cur ) > -1 : + jQuery.find( sel, this, null, [ cur ] ).length; + } + if ( matchedSelectors[ sel ] ) { + matchedHandlers.push( handleObj ); + } + } + if ( matchedHandlers.length ) { + handlerQueue.push( { elem: cur, handlers: matchedHandlers } ); + } + } + } + } + + // Add the remaining (directly-bound) handlers + cur = this; + if ( delegateCount < handlers.length ) { + handlerQueue.push( { elem: cur, handlers: handlers.slice( delegateCount ) } ); + } + + return handlerQueue; + }, + + addProp: function( name, hook ) { + Object.defineProperty( jQuery.Event.prototype, name, { + enumerable: true, + configurable: true, + + get: isFunction( hook ) ? + function() { + if ( this.originalEvent ) { + return hook( this.originalEvent ); + } + } : + function() { + if ( this.originalEvent ) { + return this.originalEvent[ name ]; + } + }, + + set: function( value ) { + Object.defineProperty( this, name, { + enumerable: true, + configurable: true, + writable: true, + value: value + } ); + } + } ); + }, + + fix: function( originalEvent ) { + return originalEvent[ jQuery.expando ] ? + originalEvent : + new jQuery.Event( originalEvent ); + }, + + special: { + load: { + + // Prevent triggered image.load events from bubbling to window.load + noBubble: true + }, + click: { + + // Utilize native event to ensure correct state for checkable inputs + setup: function( data ) { + + // For mutual compressibility with _default, replace `this` access with a local var. + // `|| data` is dead code meant only to preserve the variable through minification. 
+ var el = this || data; + + // Claim the first handler + if ( rcheckableType.test( el.type ) && + el.click && nodeName( el, "input" ) ) { + + // dataPriv.set( el, "click", ... ) + leverageNative( el, "click", returnTrue ); + } + + // Return false to allow normal processing in the caller + return false; + }, + trigger: function( data ) { + + // For mutual compressibility with _default, replace `this` access with a local var. + // `|| data` is dead code meant only to preserve the variable through minification. + var el = this || data; + + // Force setup before triggering a click + if ( rcheckableType.test( el.type ) && + el.click && nodeName( el, "input" ) ) { + + leverageNative( el, "click" ); + } + + // Return non-false to allow normal event-path propagation + return true; + }, + + // For cross-browser consistency, suppress native .click() on links + // Also prevent it if we're currently inside a leveraged native-event stack + _default: function( event ) { + var target = event.target; + return rcheckableType.test( target.type ) && + target.click && nodeName( target, "input" ) && + dataPriv.get( target, "click" ) || + nodeName( target, "a" ); + } + }, + + beforeunload: { + postDispatch: function( event ) { + + // Support: Firefox 20+ + // Firefox doesn't alert if the returnValue field is not set. + if ( event.result !== undefined && event.originalEvent ) { + event.originalEvent.returnValue = event.result; + } + } + } + } +}; + +// Ensure the presence of an event listener that handles manually-triggered +// synthetic events by interrupting progress until reinvoked in response to +// *native* events that it fires directly, ensuring that state changes have +// already occurred before other listeners are invoked. +function leverageNative( el, type, expectSync ) { + + // Missing expectSync indicates a trigger call, which must force setup through jQuery.event.add + if ( !expectSync ) { + if ( dataPriv.get( el, type ) === undefined ) { + jQuery.event.add( el, type, returnTrue ); + } + return; + } + + // Register the controller as a special universal handler for all event namespaces + dataPriv.set( el, type, false ); + jQuery.event.add( el, type, { + namespace: false, + handler: function( event ) { + var notAsync, result, + saved = dataPriv.get( this, type ); + + if ( ( event.isTrigger & 1 ) && this[ type ] ) { + + // Interrupt processing of the outer synthetic .trigger()ed event + // Saved data should be false in such cases, but might be a leftover capture object + // from an async native handler (gh-4350) + if ( !saved.length ) { + + // Store arguments for use when handling the inner native event + // There will always be at least one argument (an event object), so this array + // will not be confused with a leftover capture object. + saved = slice.call( arguments ); + dataPriv.set( this, type, saved ); + + // Trigger the native event and capture its result + // Support: IE <=9 - 11+ + // focus() and blur() are asynchronous + notAsync = expectSync( this, type ); + this[ type ](); + result = dataPriv.get( this, type ); + if ( saved !== result || notAsync ) { + dataPriv.set( this, type, false ); + } else { + result = {}; + } + if ( saved !== result ) { + + // Cancel the outer synthetic event + event.stopImmediatePropagation(); + event.preventDefault(); + + // Support: Chrome 86+ + // In Chrome, if an element having a focusout handler is blurred by + // clicking outside of it, it invokes the handler synchronously. 
If + // that handler calls `.remove()` on the element, the data is cleared, + // leaving `result` undefined. We need to guard against this. + return result && result.value; + } + + // If this is an inner synthetic event for an event with a bubbling surrogate + // (focus or blur), assume that the surrogate already propagated from triggering the + // native event and prevent that from happening again here. + // This technically gets the ordering wrong w.r.t. to `.trigger()` (in which the + // bubbling surrogate propagates *after* the non-bubbling base), but that seems + // less bad than duplication. + } else if ( ( jQuery.event.special[ type ] || {} ).delegateType ) { + event.stopPropagation(); + } + + // If this is a native event triggered above, everything is now in order + // Fire an inner synthetic event with the original arguments + } else if ( saved.length ) { + + // ...and capture the result + dataPriv.set( this, type, { + value: jQuery.event.trigger( + + // Support: IE <=9 - 11+ + // Extend with the prototype to reset the above stopImmediatePropagation() + jQuery.extend( saved[ 0 ], jQuery.Event.prototype ), + saved.slice( 1 ), + this + ) + } ); + + // Abort handling of the native event + event.stopImmediatePropagation(); + } + } + } ); +} + +jQuery.removeEvent = function( elem, type, handle ) { + + // This "if" is needed for plain objects + if ( elem.removeEventListener ) { + elem.removeEventListener( type, handle ); + } +}; + +jQuery.Event = function( src, props ) { + + // Allow instantiation without the 'new' keyword + if ( !( this instanceof jQuery.Event ) ) { + return new jQuery.Event( src, props ); + } + + // Event object + if ( src && src.type ) { + this.originalEvent = src; + this.type = src.type; + + // Events bubbling up the document may have been marked as prevented + // by a handler lower down the tree; reflect the correct value. + this.isDefaultPrevented = src.defaultPrevented || + src.defaultPrevented === undefined && + + // Support: Android <=2.3 only + src.returnValue === false ? + returnTrue : + returnFalse; + + // Create target properties + // Support: Safari <=6 - 7 only + // Target should not be a text node (#504, #13143) + this.target = ( src.target && src.target.nodeType === 3 ) ? 
+ src.target.parentNode : + src.target; + + this.currentTarget = src.currentTarget; + this.relatedTarget = src.relatedTarget; + + // Event type + } else { + this.type = src; + } + + // Put explicitly provided properties onto the event object + if ( props ) { + jQuery.extend( this, props ); + } + + // Create a timestamp if incoming event doesn't have one + this.timeStamp = src && src.timeStamp || Date.now(); + + // Mark it as fixed + this[ jQuery.expando ] = true; +}; + +// jQuery.Event is based on DOM3 Events as specified by the ECMAScript Language Binding +// https://www.w3.org/TR/2003/WD-DOM-Level-3-Events-20030331/ecma-script-binding.html +jQuery.Event.prototype = { + constructor: jQuery.Event, + isDefaultPrevented: returnFalse, + isPropagationStopped: returnFalse, + isImmediatePropagationStopped: returnFalse, + isSimulated: false, + + preventDefault: function() { + var e = this.originalEvent; + + this.isDefaultPrevented = returnTrue; + + if ( e && !this.isSimulated ) { + e.preventDefault(); + } + }, + stopPropagation: function() { + var e = this.originalEvent; + + this.isPropagationStopped = returnTrue; + + if ( e && !this.isSimulated ) { + e.stopPropagation(); + } + }, + stopImmediatePropagation: function() { + var e = this.originalEvent; + + this.isImmediatePropagationStopped = returnTrue; + + if ( e && !this.isSimulated ) { + e.stopImmediatePropagation(); + } + + this.stopPropagation(); + } +}; + +// Includes all common event props including KeyEvent and MouseEvent specific props +jQuery.each( { + altKey: true, + bubbles: true, + cancelable: true, + changedTouches: true, + ctrlKey: true, + detail: true, + eventPhase: true, + metaKey: true, + pageX: true, + pageY: true, + shiftKey: true, + view: true, + "char": true, + code: true, + charCode: true, + key: true, + keyCode: true, + button: true, + buttons: true, + clientX: true, + clientY: true, + offsetX: true, + offsetY: true, + pointerId: true, + pointerType: true, + screenX: true, + screenY: true, + targetTouches: true, + toElement: true, + touches: true, + which: true +}, jQuery.event.addProp ); + +jQuery.each( { focus: "focusin", blur: "focusout" }, function( type, delegateType ) { + jQuery.event.special[ type ] = { + + // Utilize native event if possible so blur/focus sequence is correct + setup: function() { + + // Claim the first handler + // dataPriv.set( this, "focus", ... ) + // dataPriv.set( this, "blur", ... ) + leverageNative( this, type, expectSync ); + + // Return false to allow normal processing in the caller + return false; + }, + trigger: function() { + + // Force setup before trigger + leverageNative( this, type ); + + // Return non-false to allow normal event-path propagation + return true; + }, + + // Suppress native focus or blur as it's already being fired + // in leverageNative. + _default: function() { + return true; + }, + + delegateType: delegateType + }; +} ); + +// Create mouseenter/leave events using mouseover/out and event-time checks +// so that event delegation works in jQuery. +// Do the same for pointerenter/pointerleave and pointerover/pointerout +// +// Support: Safari 7 only +// Safari sends mouseenter too often; see: +// https://bugs.chromium.org/p/chromium/issues/detail?id=470258 +// for the description of the bug (it existed in older Chrome versions as well). 
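+
+// Editor's usage sketch (illustrative, not part of the bundled library):
+// because of the delegateType/bindType mapping set up below, delegated
+// mouseenter/mouseleave handlers work even though the native events do not
+// bubble; jQuery listens for mouseover/mouseout and fires the handler only
+// when the pointer actually crosses the matched element's boundary (the
+// relatedTarget check in handle() below). For example, in a page using this
+// bundle:
+//
+//     jQuery( document ).on( "mouseenter", ".menu-item", function() {
+//         jQuery( this ).addClass( "is-hovered" );  // ".menu-item" is a made-up selector
+//     } );
+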
+jQuery.each( {
+    mouseenter: "mouseover",
+    mouseleave: "mouseout",
+    pointerenter: "pointerover",
+    pointerleave: "pointerout"
+}, function( orig, fix ) {
+    jQuery.event.special[ orig ] = {
+        delegateType: fix,
+        bindType: fix,
+
+        handle: function( event ) {
+            var ret,
+                target = this,
+                related = event.relatedTarget,
+                handleObj = event.handleObj;
+
+            // For mouseenter/leave call the handler if related is outside the target.
+            // NB: No relatedTarget if the mouse left/entered the browser window
+            if ( !related || ( related !== target && !jQuery.contains( target, related ) ) ) {
+                event.type = handleObj.origType;
+                ret = handleObj.handler.apply( this, arguments );
+                event.type = fix;
+            }
+            return ret;
+        }
+    };
+} );
+
+jQuery.fn.extend( {
+
+    on: function( types, selector, data, fn ) {
+        return on( this, types, selector, data, fn );
+    },
+    one: function( types, selector, data, fn ) {
+        return on( this, types, selector, data, fn, 1 );
+    },
+    off: function( types, selector, fn ) {
+        var handleObj, type;
+        if ( types && types.preventDefault && types.handleObj ) {
+
+            // ( event ) dispatched jQuery.Event
+            handleObj = types.handleObj;
+            jQuery( types.delegateTarget ).off(
+                handleObj.namespace ?
+                    handleObj.origType + "." + handleObj.namespace :
+                    handleObj.origType,
+                handleObj.selector,
+                handleObj.handler
+            );
+            return this;
+        }
+        if ( typeof types === "object" ) {
+
+            // ( types-object [, selector] )
+            for ( type in types ) {
+                this.off( type, selector, types[ type ] );
+            }
+            return this;
+        }
+        if ( selector === false || typeof selector === "function" ) {
+
+            // ( types [, fn] )
+            fn = selector;
+            selector = undefined;
+        }
+        if ( fn === false ) {
+            fn = returnFalse;
+        }
+        return this.each( function() {
+            jQuery.event.remove( this, types, fn, selector );
+        } );
+    }
+} );
+
+
+var
+
+    // Support: IE <=10 - 11, Edge 12 - 13 only
+    // In IE/Edge using regex groups here causes severe slowdowns.
+    // See https://connect.microsoft.com/IE/feedback/details/1736512/
+    rnoInnerhtml = /<script|<style|<link/i,
+
+    // checked="checked" or checked
+    rchecked = /checked\s*(?:[^=]|=\s*.checked.)/i,
+
+    rcleanScript = /^\s*<!(?:\[CDATA\[|--)|(?:\]\]|--)>\s*$/g;
+
+// Prefer a tbody over its parent table for containing new rows
+function manipulationTarget( elem, content ) {
+    if ( nodeName( elem, "table" ) &&
+        nodeName( content.nodeType !== 11 ? content : content.firstChild, "tr" ) ) {
+
+        return jQuery( elem ).children( "tbody" )[ 0 ] || elem;
+    }
+
+    return elem;
+}
+
+// Replace/restore the type attribute of script elements for safe DOM manipulation
+function disableScript( elem ) {
+    elem.type = ( elem.getAttribute( "type" ) !== null ) + "/" + elem.type;
+    return elem;
+}
+function restoreScript( elem ) {
+    if ( ( elem.type || "" ).slice( 0, 5 ) === "true/" ) {
+        elem.type = elem.type.slice( 5 );
+    } else {
+        elem.removeAttribute( "type" );
+    }
+
+    return elem;
+}
+
+function cloneCopyEvent( src, dest ) {
+    var i, l, type, pdataOld, udataOld, udataCur, events;
+
+    if ( dest.nodeType !== 1 ) {
+        return;
+    }
+
+    // 1. Copy private data: events, handlers, etc.
+    if ( dataPriv.hasData( src ) ) {
+        pdataOld = dataPriv.get( src );
+        events = pdataOld.events;
+
+        if ( events ) {
+            dataPriv.remove( dest, "handle events" );
+
+            for ( type in events ) {
+                for ( i = 0, l = events[ type ].length; i < l; i++ ) {
+                    jQuery.event.add( dest, type, events[ type ][ i ] );
+                }
+            }
+        }
+    }
+
+    // 2.
Copy user data + if ( dataUser.hasData( src ) ) { + udataOld = dataUser.access( src ); + udataCur = jQuery.extend( {}, udataOld ); + + dataUser.set( dest, udataCur ); + } +} + +// Fix IE bugs, see support tests +function fixInput( src, dest ) { + var nodeName = dest.nodeName.toLowerCase(); + + // Fails to persist the checked state of a cloned checkbox or radio button. + if ( nodeName === "input" && rcheckableType.test( src.type ) ) { + dest.checked = src.checked; + + // Fails to return the selected option to the default selected state when cloning options + } else if ( nodeName === "input" || nodeName === "textarea" ) { + dest.defaultValue = src.defaultValue; + } +} + +function domManip( collection, args, callback, ignored ) { + + // Flatten any nested arrays + args = flat( args ); + + var fragment, first, scripts, hasScripts, node, doc, + i = 0, + l = collection.length, + iNoClone = l - 1, + value = args[ 0 ], + valueIsFunction = isFunction( value ); + + // We can't cloneNode fragments that contain checked, in WebKit + if ( valueIsFunction || + ( l > 1 && typeof value === "string" && + !support.checkClone && rchecked.test( value ) ) ) { + return collection.each( function( index ) { + var self = collection.eq( index ); + if ( valueIsFunction ) { + args[ 0 ] = value.call( this, index, self.html() ); + } + domManip( self, args, callback, ignored ); + } ); + } + + if ( l ) { + fragment = buildFragment( args, collection[ 0 ].ownerDocument, false, collection, ignored ); + first = fragment.firstChild; + + if ( fragment.childNodes.length === 1 ) { + fragment = first; + } + + // Require either new content or an interest in ignored elements to invoke the callback + if ( first || ignored ) { + scripts = jQuery.map( getAll( fragment, "script" ), disableScript ); + hasScripts = scripts.length; + + // Use the original fragment for the last item + // instead of the first because it can end up + // being emptied incorrectly in certain situations (#8070). + for ( ; i < l; i++ ) { + node = fragment; + + if ( i !== iNoClone ) { + node = jQuery.clone( node, true, true ); + + // Keep references to cloned scripts for later restoration + if ( hasScripts ) { + + // Support: Android <=4.0 only, PhantomJS 1 only + // push.apply(_, arraylike) throws on ancient WebKit + jQuery.merge( scripts, getAll( node, "script" ) ); + } + } + + callback.call( collection[ i ], node, i ); + } + + if ( hasScripts ) { + doc = scripts[ scripts.length - 1 ].ownerDocument; + + // Reenable scripts + jQuery.map( scripts, restoreScript ); + + // Evaluate executable scripts on first document insertion + for ( i = 0; i < hasScripts; i++ ) { + node = scripts[ i ]; + if ( rscriptType.test( node.type || "" ) && + !dataPriv.access( node, "globalEval" ) && + jQuery.contains( doc, node ) ) { + + if ( node.src && ( node.type || "" ).toLowerCase() !== "module" ) { + + // Optional AJAX dependency, but won't run scripts if not present + if ( jQuery._evalUrl && !node.noModule ) { + jQuery._evalUrl( node.src, { + nonce: node.nonce || node.getAttribute( "nonce" ) + }, doc ); + } + } else { + DOMEval( node.textContent.replace( rcleanScript, "" ), node, doc ); + } + } + } + } + } + } + + return collection; +} + +function remove( elem, selector, keepData ) { + var node, + nodes = selector ? 
[site_libs build asset: verbatim vendored jQuery library source (DOM manipulation and cloning, cleanData, CSS computation and box-model adjustment, Tween/Animation effects and fx queue, attribute/property/class/value hooks, event triggering and focusin simulation, parseXML) elided]
+ jQuery.map( parserErrorElem.childNodes, function( el ) { + return el.textContent; + } ).join( "\n" ) : + data + ) ); + } + return xml; +}; + + +var + rbracket = /\[\]$/, + rCRLF = /\r?\n/g, + rsubmitterTypes = /^(?:submit|button|image|reset|file)$/i, + rsubmittable = /^(?:input|select|textarea|keygen)/i; + +function buildParams( prefix, obj, traditional, add ) { + var name; + + if ( Array.isArray( obj ) ) { + + // Serialize array item. + jQuery.each( obj, function( i, v ) { + if ( traditional || rbracket.test( prefix ) ) { + + // Treat each array item as a scalar. + add( prefix, v ); + + } else { + + // Item is non-scalar (array or object), encode its numeric index. + buildParams( + prefix + "[" + ( typeof v === "object" && v != null ? i : "" ) + "]", + v, + traditional, + add + ); + } + } ); + + } else if ( !traditional && toType( obj ) === "object" ) { + + // Serialize object item. + for ( name in obj ) { + buildParams( prefix + "[" + name + "]", obj[ name ], traditional, add ); + } + + } else { + + // Serialize scalar item. + add( prefix, obj ); + } +} + +// Serialize an array of form elements or a set of +// key/values into a query string +jQuery.param = function( a, traditional ) { + var prefix, + s = [], + add = function( key, valueOrFunction ) { + + // If value is a function, invoke it and use its return value + var value = isFunction( valueOrFunction ) ? + valueOrFunction() : + valueOrFunction; + + s[ s.length ] = encodeURIComponent( key ) + "=" + + encodeURIComponent( value == null ? "" : value ); + }; + + if ( a == null ) { + return ""; + } + + // If an array was passed in, assume that it is an array of form elements. + if ( Array.isArray( a ) || ( a.jquery && !jQuery.isPlainObject( a ) ) ) { + + // Serialize the form elements + jQuery.each( a, function() { + add( this.name, this.value ); + } ); + + } else { + + // If traditional, encode the "old" way (the way 1.3.2 or older + // did it), otherwise encode params recursively. + for ( prefix in a ) { + buildParams( prefix, a[ prefix ], traditional, add ); + } + } + + // Return the resulting serialization + return s.join( "&" ); +}; + +jQuery.fn.extend( { + serialize: function() { + return jQuery.param( this.serializeArray() ); + }, + serializeArray: function() { + return this.map( function() { + + // Can add propHook for "elements" to filter or add form elements + var elements = jQuery.prop( this, "elements" ); + return elements ? 
jQuery.makeArray( elements ) : this; + } ).filter( function() { + var type = this.type; + + // Use .is( ":disabled" ) so that fieldset[disabled] works + return this.name && !jQuery( this ).is( ":disabled" ) && + rsubmittable.test( this.nodeName ) && !rsubmitterTypes.test( type ) && + ( this.checked || !rcheckableType.test( type ) ); + } ).map( function( _i, elem ) { + var val = jQuery( this ).val(); + + if ( val == null ) { + return null; + } + + if ( Array.isArray( val ) ) { + return jQuery.map( val, function( val ) { + return { name: elem.name, value: val.replace( rCRLF, "\r\n" ) }; + } ); + } + + return { name: elem.name, value: val.replace( rCRLF, "\r\n" ) }; + } ).get(); + } +} ); + + +var + r20 = /%20/g, + rhash = /#.*$/, + rantiCache = /([?&])_=[^&]*/, + rheaders = /^(.*?):[ \t]*([^\r\n]*)$/mg, + + // #7653, #8125, #8152: local protocol detection + rlocalProtocol = /^(?:about|app|app-storage|.+-extension|file|res|widget):$/, + rnoContent = /^(?:GET|HEAD)$/, + rprotocol = /^\/\//, + + /* Prefilters + * 1) They are useful to introduce custom dataTypes (see ajax/jsonp.js for an example) + * 2) These are called: + * - BEFORE asking for a transport + * - AFTER param serialization (s.data is a string if s.processData is true) + * 3) key is the dataType + * 4) the catchall symbol "*" can be used + * 5) execution will start with transport dataType and THEN continue down to "*" if needed + */ + prefilters = {}, + + /* Transports bindings + * 1) key is the dataType + * 2) the catchall symbol "*" can be used + * 3) selection will start with transport dataType and THEN go to "*" if needed + */ + transports = {}, + + // Avoid comment-prolog char sequence (#10098); must appease lint and evade compression + allTypes = "*/".concat( "*" ), + + // Anchor tag for parsing the document origin + originAnchor = document.createElement( "a" ); + +originAnchor.href = location.href; + +// Base "constructor" for jQuery.ajaxPrefilter and jQuery.ajaxTransport +function addToPrefiltersOrTransports( structure ) { + + // dataTypeExpression is optional and defaults to "*" + return function( dataTypeExpression, func ) { + + if ( typeof dataTypeExpression !== "string" ) { + func = dataTypeExpression; + dataTypeExpression = "*"; + } + + var dataType, + i = 0, + dataTypes = dataTypeExpression.toLowerCase().match( rnothtmlwhite ) || []; + + if ( isFunction( func ) ) { + + // For each dataType in the dataTypeExpression + while ( ( dataType = dataTypes[ i++ ] ) ) { + + // Prepend if requested + if ( dataType[ 0 ] === "+" ) { + dataType = dataType.slice( 1 ) || "*"; + ( structure[ dataType ] = structure[ dataType ] || [] ).unshift( func ); + + // Otherwise append + } else { + ( structure[ dataType ] = structure[ dataType ] || [] ).push( func ); + } + } + } + }; +} + +// Base inspection function for prefilters and transports +function inspectPrefiltersOrTransports( structure, options, originalOptions, jqXHR ) { + + var inspected = {}, + seekingTransport = ( structure === transports ); + + function inspect( dataType ) { + var selected; + inspected[ dataType ] = true; + jQuery.each( structure[ dataType ] || [], function( _, prefilterOrFactory ) { + var dataTypeOrTransport = prefilterOrFactory( options, originalOptions, jqXHR ); + if ( typeof dataTypeOrTransport === "string" && + !seekingTransport && !inspected[ dataTypeOrTransport ] ) { + + options.dataTypes.unshift( dataTypeOrTransport ); + inspect( dataTypeOrTransport ); + return false; + } else if ( seekingTransport ) { + return !( selected = dataTypeOrTransport ); + } + } 
); + return selected; + } + + return inspect( options.dataTypes[ 0 ] ) || !inspected[ "*" ] && inspect( "*" ); +} + +// A special extend for ajax options +// that takes "flat" options (not to be deep extended) +// Fixes #9887 +function ajaxExtend( target, src ) { + var key, deep, + flatOptions = jQuery.ajaxSettings.flatOptions || {}; + + for ( key in src ) { + if ( src[ key ] !== undefined ) { + ( flatOptions[ key ] ? target : ( deep || ( deep = {} ) ) )[ key ] = src[ key ]; + } + } + if ( deep ) { + jQuery.extend( true, target, deep ); + } + + return target; +} + +/* Handles responses to an ajax request: + * - finds the right dataType (mediates between content-type and expected dataType) + * - returns the corresponding response + */ +function ajaxHandleResponses( s, jqXHR, responses ) { + + var ct, type, finalDataType, firstDataType, + contents = s.contents, + dataTypes = s.dataTypes; + + // Remove auto dataType and get content-type in the process + while ( dataTypes[ 0 ] === "*" ) { + dataTypes.shift(); + if ( ct === undefined ) { + ct = s.mimeType || jqXHR.getResponseHeader( "Content-Type" ); + } + } + + // Check if we're dealing with a known content-type + if ( ct ) { + for ( type in contents ) { + if ( contents[ type ] && contents[ type ].test( ct ) ) { + dataTypes.unshift( type ); + break; + } + } + } + + // Check to see if we have a response for the expected dataType + if ( dataTypes[ 0 ] in responses ) { + finalDataType = dataTypes[ 0 ]; + } else { + + // Try convertible dataTypes + for ( type in responses ) { + if ( !dataTypes[ 0 ] || s.converters[ type + " " + dataTypes[ 0 ] ] ) { + finalDataType = type; + break; + } + if ( !firstDataType ) { + firstDataType = type; + } + } + + // Or just use first one + finalDataType = finalDataType || firstDataType; + } + + // If we found a dataType + // We add the dataType to the list if needed + // and return the corresponding response + if ( finalDataType ) { + if ( finalDataType !== dataTypes[ 0 ] ) { + dataTypes.unshift( finalDataType ); + } + return responses[ finalDataType ]; + } +} + +/* Chain conversions given the request and the original response + * Also sets the responseXXX fields on the jqXHR instance + */ +function ajaxConvert( s, response, jqXHR, isSuccess ) { + var conv2, current, conv, tmp, prev, + converters = {}, + + // Work with a copy of dataTypes in case we need to modify it for conversion + dataTypes = s.dataTypes.slice(); + + // Create converters map with lowercased keys + if ( dataTypes[ 1 ] ) { + for ( conv in s.converters ) { + converters[ conv.toLowerCase() ] = s.converters[ conv ]; + } + } + + current = dataTypes.shift(); + + // Convert to each sequential dataType + while ( current ) { + + if ( s.responseFields[ current ] ) { + jqXHR[ s.responseFields[ current ] ] = response; + } + + // Apply the dataFilter if provided + if ( !prev && isSuccess && s.dataFilter ) { + response = s.dataFilter( response, s.dataType ); + } + + prev = current; + current = dataTypes.shift(); + + if ( current ) { + + // There's only work to do if current dataType is non-auto + if ( current === "*" ) { + + current = prev; + + // Convert response if prev dataType is non-auto and differs from current + } else if ( prev !== "*" && prev !== current ) { + + // Seek a direct converter + conv = converters[ prev + " " + current ] || converters[ "* " + current ]; + + // If none found, seek a pair + if ( !conv ) { + for ( conv2 in converters ) { + + // If conv2 outputs current + tmp = conv2.split( " " ); + if ( tmp[ 1 ] === current ) { + + // If prev 
can be converted to accepted input + conv = converters[ prev + " " + tmp[ 0 ] ] || + converters[ "* " + tmp[ 0 ] ]; + if ( conv ) { + + // Condense equivalence converters + if ( conv === true ) { + conv = converters[ conv2 ]; + + // Otherwise, insert the intermediate dataType + } else if ( converters[ conv2 ] !== true ) { + current = tmp[ 0 ]; + dataTypes.unshift( tmp[ 1 ] ); + } + break; + } + } + } + } + + // Apply converter (if not an equivalence) + if ( conv !== true ) { + + // Unless errors are allowed to bubble, catch and return them + if ( conv && s.throws ) { + response = conv( response ); + } else { + try { + response = conv( response ); + } catch ( e ) { + return { + state: "parsererror", + error: conv ? e : "No conversion from " + prev + " to " + current + }; + } + } + } + } + } + } + + return { state: "success", data: response }; +} + +jQuery.extend( { + + // Counter for holding the number of active queries + active: 0, + + // Last-Modified header cache for next request + lastModified: {}, + etag: {}, + + ajaxSettings: { + url: location.href, + type: "GET", + isLocal: rlocalProtocol.test( location.protocol ), + global: true, + processData: true, + async: true, + contentType: "application/x-www-form-urlencoded; charset=UTF-8", + + /* + timeout: 0, + data: null, + dataType: null, + username: null, + password: null, + cache: null, + throws: false, + traditional: false, + headers: {}, + */ + + accepts: { + "*": allTypes, + text: "text/plain", + html: "text/html", + xml: "application/xml, text/xml", + json: "application/json, text/javascript" + }, + + contents: { + xml: /\bxml\b/, + html: /\bhtml/, + json: /\bjson\b/ + }, + + responseFields: { + xml: "responseXML", + text: "responseText", + json: "responseJSON" + }, + + // Data converters + // Keys separate source (or catchall "*") and destination types with a single space + converters: { + + // Convert anything to text + "* text": String, + + // Text to html (true = no transformation) + "text html": true, + + // Evaluate text as a json expression + "text json": JSON.parse, + + // Parse text as xml + "text xml": jQuery.parseXML + }, + + // For options that shouldn't be deep extended: + // you can add your own custom options here if + // and when you create one that shouldn't be + // deep extended (see ajaxExtend) + flatOptions: { + url: true, + context: true + } + }, + + // Creates a full fledged settings object into target + // with both ajaxSettings and settings fields. + // If target is omitted, writes into ajaxSettings. + ajaxSetup: function( target, settings ) { + return settings ? 
+ + // Building a settings object + ajaxExtend( ajaxExtend( target, jQuery.ajaxSettings ), settings ) : + + // Extending ajaxSettings + ajaxExtend( jQuery.ajaxSettings, target ); + }, + + ajaxPrefilter: addToPrefiltersOrTransports( prefilters ), + ajaxTransport: addToPrefiltersOrTransports( transports ), + + // Main method + ajax: function( url, options ) { + + // If url is an object, simulate pre-1.5 signature + if ( typeof url === "object" ) { + options = url; + url = undefined; + } + + // Force options to be an object + options = options || {}; + + var transport, + + // URL without anti-cache param + cacheURL, + + // Response headers + responseHeadersString, + responseHeaders, + + // timeout handle + timeoutTimer, + + // Url cleanup var + urlAnchor, + + // Request state (becomes false upon send and true upon completion) + completed, + + // To know if global events are to be dispatched + fireGlobals, + + // Loop variable + i, + + // uncached part of the url + uncached, + + // Create the final options object + s = jQuery.ajaxSetup( {}, options ), + + // Callbacks context + callbackContext = s.context || s, + + // Context for global events is callbackContext if it is a DOM node or jQuery collection + globalEventContext = s.context && + ( callbackContext.nodeType || callbackContext.jquery ) ? + jQuery( callbackContext ) : + jQuery.event, + + // Deferreds + deferred = jQuery.Deferred(), + completeDeferred = jQuery.Callbacks( "once memory" ), + + // Status-dependent callbacks + statusCode = s.statusCode || {}, + + // Headers (they are sent all at once) + requestHeaders = {}, + requestHeadersNames = {}, + + // Default abort message + strAbort = "canceled", + + // Fake xhr + jqXHR = { + readyState: 0, + + // Builds headers hashtable if needed + getResponseHeader: function( key ) { + var match; + if ( completed ) { + if ( !responseHeaders ) { + responseHeaders = {}; + while ( ( match = rheaders.exec( responseHeadersString ) ) ) { + responseHeaders[ match[ 1 ].toLowerCase() + " " ] = + ( responseHeaders[ match[ 1 ].toLowerCase() + " " ] || [] ) + .concat( match[ 2 ] ); + } + } + match = responseHeaders[ key.toLowerCase() + " " ]; + } + return match == null ? null : match.join( ", " ); + }, + + // Raw string + getAllResponseHeaders: function() { + return completed ? 
responseHeadersString : null; + }, + + // Caches the header + setRequestHeader: function( name, value ) { + if ( completed == null ) { + name = requestHeadersNames[ name.toLowerCase() ] = + requestHeadersNames[ name.toLowerCase() ] || name; + requestHeaders[ name ] = value; + } + return this; + }, + + // Overrides response content-type header + overrideMimeType: function( type ) { + if ( completed == null ) { + s.mimeType = type; + } + return this; + }, + + // Status-dependent callbacks + statusCode: function( map ) { + var code; + if ( map ) { + if ( completed ) { + + // Execute the appropriate callbacks + jqXHR.always( map[ jqXHR.status ] ); + } else { + + // Lazy-add the new callbacks in a way that preserves old ones + for ( code in map ) { + statusCode[ code ] = [ statusCode[ code ], map[ code ] ]; + } + } + } + return this; + }, + + // Cancel the request + abort: function( statusText ) { + var finalText = statusText || strAbort; + if ( transport ) { + transport.abort( finalText ); + } + done( 0, finalText ); + return this; + } + }; + + // Attach deferreds + deferred.promise( jqXHR ); + + // Add protocol if not provided (prefilters might expect it) + // Handle falsy url in the settings object (#10093: consistency with old signature) + // We also use the url parameter if available + s.url = ( ( url || s.url || location.href ) + "" ) + .replace( rprotocol, location.protocol + "//" ); + + // Alias method option to type as per ticket #12004 + s.type = options.method || options.type || s.method || s.type; + + // Extract dataTypes list + s.dataTypes = ( s.dataType || "*" ).toLowerCase().match( rnothtmlwhite ) || [ "" ]; + + // A cross-domain request is in order when the origin doesn't match the current origin. + if ( s.crossDomain == null ) { + urlAnchor = document.createElement( "a" ); + + // Support: IE <=8 - 11, Edge 12 - 15 + // IE throws exception on accessing the href property if url is malformed, + // e.g. 
http://example.com:80x/ + try { + urlAnchor.href = s.url; + + // Support: IE <=8 - 11 only + // Anchor's host property isn't correctly set when s.url is relative + urlAnchor.href = urlAnchor.href; + s.crossDomain = originAnchor.protocol + "//" + originAnchor.host !== + urlAnchor.protocol + "//" + urlAnchor.host; + } catch ( e ) { + + // If there is an error parsing the URL, assume it is crossDomain, + // it can be rejected by the transport if it is invalid + s.crossDomain = true; + } + } + + // Convert data if not already a string + if ( s.data && s.processData && typeof s.data !== "string" ) { + s.data = jQuery.param( s.data, s.traditional ); + } + + // Apply prefilters + inspectPrefiltersOrTransports( prefilters, s, options, jqXHR ); + + // If request was aborted inside a prefilter, stop there + if ( completed ) { + return jqXHR; + } + + // We can fire global events as of now if asked to + // Don't fire events if jQuery.event is undefined in an AMD-usage scenario (#15118) + fireGlobals = jQuery.event && s.global; + + // Watch for a new set of requests + if ( fireGlobals && jQuery.active++ === 0 ) { + jQuery.event.trigger( "ajaxStart" ); + } + + // Uppercase the type + s.type = s.type.toUpperCase(); + + // Determine if request has content + s.hasContent = !rnoContent.test( s.type ); + + // Save the URL in case we're toying with the If-Modified-Since + // and/or If-None-Match header later on + // Remove hash to simplify url manipulation + cacheURL = s.url.replace( rhash, "" ); + + // More options handling for requests with no content + if ( !s.hasContent ) { + + // Remember the hash so we can put it back + uncached = s.url.slice( cacheURL.length ); + + // If data is available and should be processed, append data to url + if ( s.data && ( s.processData || typeof s.data === "string" ) ) { + cacheURL += ( rquery.test( cacheURL ) ? "&" : "?" ) + s.data; + + // #9682: remove data so that it's not used in an eventual retry + delete s.data; + } + + // Add or update anti-cache param if needed + if ( s.cache === false ) { + cacheURL = cacheURL.replace( rantiCache, "$1" ); + uncached = ( rquery.test( cacheURL ) ? "&" : "?" ) + "_=" + ( nonce.guid++ ) + + uncached; + } + + // Put hash and anti-cache on the URL that will be requested (gh-1732) + s.url = cacheURL + uncached; + + // Change '%20' to '+' if this is encoded form body content (gh-2658) + } else if ( s.data && s.processData && + ( s.contentType || "" ).indexOf( "application/x-www-form-urlencoded" ) === 0 ) { + s.data = s.data.replace( r20, "+" ); + } + + // Set the If-Modified-Since and/or If-None-Match header, if in ifModified mode. + if ( s.ifModified ) { + if ( jQuery.lastModified[ cacheURL ] ) { + jqXHR.setRequestHeader( "If-Modified-Since", jQuery.lastModified[ cacheURL ] ); + } + if ( jQuery.etag[ cacheURL ] ) { + jqXHR.setRequestHeader( "If-None-Match", jQuery.etag[ cacheURL ] ); + } + } + + // Set the correct header, if data is being sent + if ( s.data && s.hasContent && s.contentType !== false || options.contentType ) { + jqXHR.setRequestHeader( "Content-Type", s.contentType ); + } + + // Set the Accepts header for the server, depending on the dataType + jqXHR.setRequestHeader( + "Accept", + s.dataTypes[ 0 ] && s.accepts[ s.dataTypes[ 0 ] ] ? + s.accepts[ s.dataTypes[ 0 ] ] + + ( s.dataTypes[ 0 ] !== "*" ? 
", " + allTypes + "; q=0.01" : "" ) : + s.accepts[ "*" ] + ); + + // Check for headers option + for ( i in s.headers ) { + jqXHR.setRequestHeader( i, s.headers[ i ] ); + } + + // Allow custom headers/mimetypes and early abort + if ( s.beforeSend && + ( s.beforeSend.call( callbackContext, jqXHR, s ) === false || completed ) ) { + + // Abort if not done already and return + return jqXHR.abort(); + } + + // Aborting is no longer a cancellation + strAbort = "abort"; + + // Install callbacks on deferreds + completeDeferred.add( s.complete ); + jqXHR.done( s.success ); + jqXHR.fail( s.error ); + + // Get transport + transport = inspectPrefiltersOrTransports( transports, s, options, jqXHR ); + + // If no transport, we auto-abort + if ( !transport ) { + done( -1, "No Transport" ); + } else { + jqXHR.readyState = 1; + + // Send global event + if ( fireGlobals ) { + globalEventContext.trigger( "ajaxSend", [ jqXHR, s ] ); + } + + // If request was aborted inside ajaxSend, stop there + if ( completed ) { + return jqXHR; + } + + // Timeout + if ( s.async && s.timeout > 0 ) { + timeoutTimer = window.setTimeout( function() { + jqXHR.abort( "timeout" ); + }, s.timeout ); + } + + try { + completed = false; + transport.send( requestHeaders, done ); + } catch ( e ) { + + // Rethrow post-completion exceptions + if ( completed ) { + throw e; + } + + // Propagate others as results + done( -1, e ); + } + } + + // Callback for when everything is done + function done( status, nativeStatusText, responses, headers ) { + var isSuccess, success, error, response, modified, + statusText = nativeStatusText; + + // Ignore repeat invocations + if ( completed ) { + return; + } + + completed = true; + + // Clear timeout if it exists + if ( timeoutTimer ) { + window.clearTimeout( timeoutTimer ); + } + + // Dereference transport for early garbage collection + // (no matter how long the jqXHR object will be used) + transport = undefined; + + // Cache response headers + responseHeadersString = headers || ""; + + // Set readyState + jqXHR.readyState = status > 0 ? 4 : 0; + + // Determine if successful + isSuccess = status >= 200 && status < 300 || status === 304; + + // Get response data + if ( responses ) { + response = ajaxHandleResponses( s, jqXHR, responses ); + } + + // Use a noop converter for missing script but not if jsonp + if ( !isSuccess && + jQuery.inArray( "script", s.dataTypes ) > -1 && + jQuery.inArray( "json", s.dataTypes ) < 0 ) { + s.converters[ "text script" ] = function() {}; + } + + // Convert no matter what (that way responseXXX fields are always set) + response = ajaxConvert( s, response, jqXHR, isSuccess ); + + // If successful, handle type chaining + if ( isSuccess ) { + + // Set the If-Modified-Since and/or If-None-Match header, if in ifModified mode. 
+ if ( s.ifModified ) { + modified = jqXHR.getResponseHeader( "Last-Modified" ); + if ( modified ) { + jQuery.lastModified[ cacheURL ] = modified; + } + modified = jqXHR.getResponseHeader( "etag" ); + if ( modified ) { + jQuery.etag[ cacheURL ] = modified; + } + } + + // if no content + if ( status === 204 || s.type === "HEAD" ) { + statusText = "nocontent"; + + // if not modified + } else if ( status === 304 ) { + statusText = "notmodified"; + + // If we have data, let's convert it + } else { + statusText = response.state; + success = response.data; + error = response.error; + isSuccess = !error; + } + } else { + + // Extract error from statusText and normalize for non-aborts + error = statusText; + if ( status || !statusText ) { + statusText = "error"; + if ( status < 0 ) { + status = 0; + } + } + } + + // Set data for the fake xhr object + jqXHR.status = status; + jqXHR.statusText = ( nativeStatusText || statusText ) + ""; + + // Success/Error + if ( isSuccess ) { + deferred.resolveWith( callbackContext, [ success, statusText, jqXHR ] ); + } else { + deferred.rejectWith( callbackContext, [ jqXHR, statusText, error ] ); + } + + // Status-dependent callbacks + jqXHR.statusCode( statusCode ); + statusCode = undefined; + + if ( fireGlobals ) { + globalEventContext.trigger( isSuccess ? "ajaxSuccess" : "ajaxError", + [ jqXHR, s, isSuccess ? success : error ] ); + } + + // Complete + completeDeferred.fireWith( callbackContext, [ jqXHR, statusText ] ); + + if ( fireGlobals ) { + globalEventContext.trigger( "ajaxComplete", [ jqXHR, s ] ); + + // Handle the global AJAX counter + if ( !( --jQuery.active ) ) { + jQuery.event.trigger( "ajaxStop" ); + } + } + } + + return jqXHR; + }, + + getJSON: function( url, data, callback ) { + return jQuery.get( url, data, callback, "json" ); + }, + + getScript: function( url, callback ) { + return jQuery.get( url, undefined, callback, "script" ); + } +} ); + +jQuery.each( [ "get", "post" ], function( _i, method ) { + jQuery[ method ] = function( url, data, callback, type ) { + + // Shift arguments if data argument was omitted + if ( isFunction( data ) ) { + type = type || callback; + callback = data; + data = undefined; + } + + // The url can be an options object (which then must have .url) + return jQuery.ajax( jQuery.extend( { + url: url, + type: method, + dataType: type, + data: data, + success: callback + }, jQuery.isPlainObject( url ) && url ) ); + }; +} ); + +jQuery.ajaxPrefilter( function( s ) { + var i; + for ( i in s.headers ) { + if ( i.toLowerCase() === "content-type" ) { + s.contentType = s.headers[ i ] || ""; + } + } +} ); + + +jQuery._evalUrl = function( url, options, doc ) { + return jQuery.ajax( { + url: url, + + // Make this explicit, since user can override this through ajaxSetup (#11264) + type: "GET", + dataType: "script", + cache: true, + async: false, + global: false, + + // Only evaluate the response if it is successful (gh-4126) + // dataFilter is not invoked for failure responses, so using it instead + // of the default converter is kludgy but it works. 
+ converters: { + "text script": function() {} + }, + dataFilter: function( response ) { + jQuery.globalEval( response, options, doc ); + } + } ); +}; + + +jQuery.fn.extend( { + wrapAll: function( html ) { + var wrap; + + if ( this[ 0 ] ) { + if ( isFunction( html ) ) { + html = html.call( this[ 0 ] ); + } + + // The elements to wrap the target around + wrap = jQuery( html, this[ 0 ].ownerDocument ).eq( 0 ).clone( true ); + + if ( this[ 0 ].parentNode ) { + wrap.insertBefore( this[ 0 ] ); + } + + wrap.map( function() { + var elem = this; + + while ( elem.firstElementChild ) { + elem = elem.firstElementChild; + } + + return elem; + } ).append( this ); + } + + return this; + }, + + wrapInner: function( html ) { + if ( isFunction( html ) ) { + return this.each( function( i ) { + jQuery( this ).wrapInner( html.call( this, i ) ); + } ); + } + + return this.each( function() { + var self = jQuery( this ), + contents = self.contents(); + + if ( contents.length ) { + contents.wrapAll( html ); + + } else { + self.append( html ); + } + } ); + }, + + wrap: function( html ) { + var htmlIsFunction = isFunction( html ); + + return this.each( function( i ) { + jQuery( this ).wrapAll( htmlIsFunction ? html.call( this, i ) : html ); + } ); + }, + + unwrap: function( selector ) { + this.parent( selector ).not( "body" ).each( function() { + jQuery( this ).replaceWith( this.childNodes ); + } ); + return this; + } +} ); + + +jQuery.expr.pseudos.hidden = function( elem ) { + return !jQuery.expr.pseudos.visible( elem ); +}; +jQuery.expr.pseudos.visible = function( elem ) { + return !!( elem.offsetWidth || elem.offsetHeight || elem.getClientRects().length ); +}; + + + + +jQuery.ajaxSettings.xhr = function() { + try { + return new window.XMLHttpRequest(); + } catch ( e ) {} +}; + +var xhrSuccessStatus = { + + // File protocol always yields status code 0, assume 200 + 0: 200, + + // Support: IE <=9 only + // #1450: sometimes IE returns 1223 when it should be 204 + 1223: 204 + }, + xhrSupported = jQuery.ajaxSettings.xhr(); + +support.cors = !!xhrSupported && ( "withCredentials" in xhrSupported ); +support.ajax = xhrSupported = !!xhrSupported; + +jQuery.ajaxTransport( function( options ) { + var callback, errorCallback; + + // Cross domain only allowed if supported through XMLHttpRequest + if ( support.cors || xhrSupported && !options.crossDomain ) { + return { + send: function( headers, complete ) { + var i, + xhr = options.xhr(); + + xhr.open( + options.type, + options.url, + options.async, + options.username, + options.password + ); + + // Apply custom fields if provided + if ( options.xhrFields ) { + for ( i in options.xhrFields ) { + xhr[ i ] = options.xhrFields[ i ]; + } + } + + // Override mime type if needed + if ( options.mimeType && xhr.overrideMimeType ) { + xhr.overrideMimeType( options.mimeType ); + } + + // X-Requested-With header + // For cross-domain requests, seeing as conditions for a preflight are + // akin to a jigsaw puzzle, we simply never set it to be sure. + // (it can always be set on a per-request basis or even using ajaxSetup) + // For same-domain requests, won't change header if already provided. 
+ if ( !options.crossDomain && !headers[ "X-Requested-With" ] ) { + headers[ "X-Requested-With" ] = "XMLHttpRequest"; + } + + // Set headers + for ( i in headers ) { + xhr.setRequestHeader( i, headers[ i ] ); + } + + // Callback + callback = function( type ) { + return function() { + if ( callback ) { + callback = errorCallback = xhr.onload = + xhr.onerror = xhr.onabort = xhr.ontimeout = + xhr.onreadystatechange = null; + + if ( type === "abort" ) { + xhr.abort(); + } else if ( type === "error" ) { + + // Support: IE <=9 only + // On a manual native abort, IE9 throws + // errors on any property access that is not readyState + if ( typeof xhr.status !== "number" ) { + complete( 0, "error" ); + } else { + complete( + + // File: protocol always yields status 0; see #8605, #14207 + xhr.status, + xhr.statusText + ); + } + } else { + complete( + xhrSuccessStatus[ xhr.status ] || xhr.status, + xhr.statusText, + + // Support: IE <=9 only + // IE9 has no XHR2 but throws on binary (trac-11426) + // For XHR2 non-text, let the caller handle it (gh-2498) + ( xhr.responseType || "text" ) !== "text" || + typeof xhr.responseText !== "string" ? + { binary: xhr.response } : + { text: xhr.responseText }, + xhr.getAllResponseHeaders() + ); + } + } + }; + }; + + // Listen to events + xhr.onload = callback(); + errorCallback = xhr.onerror = xhr.ontimeout = callback( "error" ); + + // Support: IE 9 only + // Use onreadystatechange to replace onabort + // to handle uncaught aborts + if ( xhr.onabort !== undefined ) { + xhr.onabort = errorCallback; + } else { + xhr.onreadystatechange = function() { + + // Check readyState before timeout as it changes + if ( xhr.readyState === 4 ) { + + // Allow onerror to be called first, + // but that will not handle a native abort + // Also, save errorCallback to a variable + // as xhr.onerror cannot be accessed + window.setTimeout( function() { + if ( callback ) { + errorCallback(); + } + } ); + } + }; + } + + // Create the abort callback + callback = callback( "abort" ); + + try { + + // Do send the request (this may raise an exception) + xhr.send( options.hasContent && options.data || null ); + } catch ( e ) { + + // #14683: Only rethrow if this hasn't been notified as an error yet + if ( callback ) { + throw e; + } + } + }, + + abort: function() { + if ( callback ) { + callback(); + } + } + }; + } +} ); + + + + +// Prevent auto-execution of scripts when no explicit dataType was provided (See gh-2432) +jQuery.ajaxPrefilter( function( s ) { + if ( s.crossDomain ) { + s.contents.script = false; + } +} ); + +// Install script dataType +jQuery.ajaxSetup( { + accepts: { + script: "text/javascript, application/javascript, " + + "application/ecmascript, application/x-ecmascript" + }, + contents: { + script: /\b(?:java|ecma)script\b/ + }, + converters: { + "text script": function( text ) { + jQuery.globalEval( text ); + return text; + } + } +} ); + +// Handle cache's special case and crossDomain +jQuery.ajaxPrefilter( "script", function( s ) { + if ( s.cache === undefined ) { + s.cache = false; + } + if ( s.crossDomain ) { + s.type = "GET"; + } +} ); + +// Bind script tag hack transport +jQuery.ajaxTransport( "script", function( s ) { + + // This transport only deals with cross domain or forced-by-attrs requests + if ( s.crossDomain || s.scriptAttrs ) { + var script, callback; + return { + send: function( _, complete ) { + script = jQuery( " + + + + + + + + + + + + + + + + + + + + + +
        +
        + +
        + + + +
        +

        20  Spreadsheets


20.1 Introduction

In Chapter 7 you learned about importing data from plain text files like .csv and .tsv. Now it's time to learn how to get data out of a spreadsheet, either an Excel spreadsheet or a Google Sheet. This will build on much of what you've learned in Chapter 7, but we will also discuss additional considerations and complexities when working with data from spreadsheets.


        If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper “Data Organization in Spreadsheets” by Karl Broman and Kara Woo: https://doi.org/10.1080/00031305.2017.1375989. The best practices presented in this paper will save you much headache when you import data from a spreadsheet into R to analyze and visualize.


20.2 Excel

        Microsoft Excel is a widely used spreadsheet software program where data are organized in worksheets inside of spreadsheet files.

20.2.1 Prerequisites

        In this section, you’ll learn how to load data from Excel spreadsheets in R with the readxl package. This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package. Later, we’ll also use the writexl package, which allows us to create Excel spreadsheets.

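As a minimal setup sketch for the examples that follow (readxl loaded explicitly, writexl only needed for the writing examples later in the chapter):

library(readxl)
library(tidyverse)
library(writexl)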

20.2.2 Getting started

        Most of readxl’s functions allow you to load Excel spreadsheets into R:

• read_xls() reads Excel files with xls format.
• read_xlsx() reads Excel files with xlsx format.
• read_excel() can read files with both xls and xlsx format. It guesses the file type based on the input.

These functions all have similar syntax, just like the other functions we have previously introduced for reading other types of files, e.g., read_csv(), read_table(), etc. For the rest of the chapter we will focus on using read_excel().
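
As a quick sketch of that shared interface, assuming the data/students.xlsx file introduced in the next section:

read_xlsx("data/students.xlsx")   # works for xlsx files only
read_excel("data/students.xlsx")  # guesses xls vs. xlsx from the input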


20.2.3 Reading Excel spreadsheets

Figure 20.1 shows what the spreadsheet we're going to read into R looks like in Excel. You can download this spreadsheet as an Excel file from https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/.

Figure 20.1: Spreadsheet called students.xlsx in Excel. The spreadsheet contains information on 6 students: their ID, full name, favourite food, meal plan, and age.

        The first argument to read_excel() is the path to the file to read.

students <- read_excel("data/students.xlsx")

        read_excel() will read the file in as a tibble.

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2            2 Barclay Lynn     French fries       Lunch only          5    
#> 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

We have six students in the data and five variables on each student. However, there are a few things we might want to address in this dataset:

1. The column names are all over the place. You can provide column names that follow a consistent format (we recommend snake_case) using the col_names argument.

read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
)
#> # A tibble: 7 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>   <chr>      <chr>            <chr>              <chr>               <chr>
#> 1 Student ID Full Name        favourite.food     mealPlan            AGE  
#> 2 1          Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 3 2          Barclay Lynn     French fries       Lunch only          5    
#> 4 3          Jayendra Lyne    N/A                Breakfast and lunch 7    
#> 5 4          Leon Rossini     Anchovies          Lunch only          <NA> 
#> 6 5          Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 7 6          Güvenç Attila    Ice cream          Lunch only          6

          Unfortunately, this didn’t quite do the trick. We now have the variable names we want, but what was previously the header row now shows up as the first observation in the data. You can explicitly skip that row using the skip argument.

read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1
)
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    N/A                Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
2. In the favourite_food column, one of the observations is N/A, which stands for "not available", but it's currently not recognized as an NA (note the contrast between this N/A and the age of the fourth student in the list). You can specify which character strings should be recognized as NAs with the na argument. By default, only "" (the empty string, or, when reading from a spreadsheet, an empty cell or a cell with the formula =NA()) is recognized as an NA.

read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1,
  na = c("", "N/A")
)
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
3. One other remaining issue is that age is read in as a character variable, but it really should be numeric. Just like with read_csv() and friends for reading data from flat files, you can supply a col_types argument to read_excel() and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are "skip", "guess", "logical", "numeric", "date", "text", or "list".

read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1,
  na = c("", "N/A"),
  col_types = c("numeric", "text", "text", "text", "numeric")
)
#> Warning: Expecting numeric in E6 / R6C5: got 'five'
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <chr>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch    NA
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

          However, this didn’t quite produce the desired result either. By specifying that age should be numeric, we have turned the one cell with the non-numeric entry (which had the value five) into an NA. In this case, we should read age in as "text" and then make the change once the data is loaded in R.

students <- read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1,
  na = c("", "N/A"),
  col_types = c("numeric", "text", "text", "text", "text")
)

students <- students |>
  mutate(
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <chr>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

It took us multiple steps and some trial-and-error to load the data in exactly the format we want, and this is not unexpected. Data science is an iterative process, and iteration can be even more tedious when reading data from spreadsheets than from other plain-text, rectangular data files, because humans tend to input data into spreadsheets and use them not just for data storage but also for sharing and communication.


        There is no way to know exactly what the data will look like until you load it and take a look at it. Well, there is one way, actually. You can open the file in Excel and take a peek. If you’re going to do so, we recommend making a copy of the Excel file to open and browse interactively while leaving the original data file untouched and reading into R from the untouched file. This will ensure you don’t accidentally overwrite anything in the spreadsheet while inspecting it. You should also not be afraid of doing what we did here: load the data, take a peek, make adjustments to your code, load it again, and repeat until you’re happy with the result.
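
A minimal sketch of that workflow, assuming the students.xlsx file from above (the copy's file name is just an illustration):

# Make a copy to browse in Excel; keep reading from the untouched original.
file.copy("data/students.xlsx", "data/students-to-browse.xlsx")
students <- read_excel("data/students.xlsx")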

20.2.4 Reading worksheets

An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets, called worksheets. Figure 20.2 shows an Excel spreadsheet with multiple worksheets. The data come from the palmerpenguins package, and you can download this spreadsheet as an Excel file from https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY/. Each worksheet contains information on penguins from a different island where data were collected.

Figure 20.2: Spreadsheet called penguins.xlsx in Excel, containing three worksheets: Torgersen Island, Biscoe Island, and Dream Island.

        You can read a single worksheet from a spreadsheet with the sheet argument in read_excel(). The default, which we’ve been relying on up until now, is the first sheet.

read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
#> # A tibble: 52 × 8
#>   species island    bill_length_mm     bill_depth_mm      flipper_length_mm
#>   <chr>   <chr>     <chr>              <chr>              <chr>            
#> 1 Adelie  Torgersen 39.1               18.7               181              
#> 2 Adelie  Torgersen 39.5               17.399999999999999 186              
#> 3 Adelie  Torgersen 40.299999999999997 18                 195              
#> 4 Adelie  Torgersen NA                 NA                 NA               
#> 5 Adelie  Torgersen 36.700000000000003 19.3               193              
#> 6 Adelie  Torgersen 39.299999999999997 20.6               190              
#> # ℹ 46 more rows
#> # ℹ 3 more variables: body_mass_g <chr>, sex <chr>, year <dbl>

        Some variables that appear to contain numerical data are read in as characters due to the character string "NA" not being recognized as a true NA.

penguins_torgersen <- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")

penguins_torgersen
#> # A tibble: 52 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm
#>   <chr>   <chr>              <dbl>         <dbl>             <dbl>
#> 1 Adelie  Torgersen           39.1          18.7               181
#> 2 Adelie  Torgersen           39.5          17.4               186
#> 3 Adelie  Torgersen           40.3          18                 195
#> 4 Adelie  Torgersen           NA            NA                  NA
#> 5 Adelie  Torgersen           36.7          19.3               193
#> 6 Adelie  Torgersen           39.3          20.6               190
#> # ℹ 46 more rows
#> # ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

        Alternatively, you can use excel_sheets() to get information on all worksheets in an Excel spreadsheet, and then read the one(s) you’re interested in.

excel_sheets("data/penguins.xlsx")
#> [1] "Torgersen Island" "Biscoe Island"    "Dream Island"

        Once you know the names of the worksheets, you can read them in individually with read_excel().

penguins_biscoe <- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
penguins_dream  <- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")

In this case, the full penguins dataset is spread across three worksheets in the spreadsheet. Each worksheet has the same number of columns but a different number of rows.

dim(penguins_torgersen)
#> [1] 52  8
dim(penguins_biscoe)
#> [1] 168   8
dim(penguins_dream)
#> [1] 124   8

        We can put them together with bind_rows().

penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
#> # A tibble: 344 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm
#>   <chr>   <chr>              <dbl>         <dbl>             <dbl>
#> 1 Adelie  Torgersen           39.1          18.7               181
#> 2 Adelie  Torgersen           39.5          17.4               186
#> 3 Adelie  Torgersen           40.3          18                 195
#> 4 Adelie  Torgersen           NA            NA                  NA
#> 5 Adelie  Torgersen           36.7          19.3               193
#> 6 Adelie  Torgersen           39.3          20.6               190
#> # ℹ 338 more rows
#> # ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

        In Capítulo 26 we’ll talk about ways of doing this sort of task without repetitive code.

20.2.5 Reading part of a sheet

Since many people use Excel spreadsheets for presentation as well as for data storage, it's quite common to find cell entries in a spreadsheet that are not part of the data you want to read into R. Figure 20.3 shows such a spreadsheet: in the middle of the sheet is what looks like a data frame, but there is extraneous text in cells above and below the data.

Figure 20.3: Spreadsheet called deaths.xlsx in Excel. A data frame with information on the deaths of 10 famous people (name, profession, age, whether they have kids, and dates of birth and death) sits in the middle of the sheet, with several rows of non-data notes spread across cells above and below it.

        This spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the readxl_example() function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in read_excel() as usual.

        +
        +
        deaths_path <- readxl_example("deaths.xlsx")
        +deaths <- read_excel(deaths_path)
        +#> New names:
        +#> • `` -> `...2`
        +#> • `` -> `...3`
        +#> • `` -> `...4`
        +#> • `` -> `...5`
        +#> • `` -> `...6`
        +deaths
        +#> # A tibble: 18 × 6
        +#>   `Lots of people`    ...2       ...3  ...4     ...5          ...6           
        +#>   <chr>               <chr>      <chr> <chr>    <chr>         <chr>          
        +#> 1 simply cannot resi… <NA>       <NA>  <NA>     <NA>          some notes     
        +#> 2 at                  the        top   <NA>     of            their spreadsh…
        +#> 3 or                  merging    <NA>  <NA>     <NA>          cells          
        +#> 4 Name                Profession Age   Has kids Date of birth Date of death  
        +#> 5 David Bowie         musician   69    TRUE     17175         42379          
        +#> 6 Carrie Fisher       actor      60    TRUE     20749         42731          
        +#> # ℹ 12 more rows
        +
        +

        The top three rows and the bottom four rows are not part of the data frame. It’s possible to eliminate these extraneous rows using the skip and n_max arguments, but we recommend using cell ranges. In Excel, the top left cell is A1. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1, C1, etc. And as you move down a column, the number in the cell label increases, i.e. A2, A3, etc.
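For comparison, here is a sketch of the skip/n_max approach mentioned above; it should produce the same tibble as the range-based call shown next (skip = 4 skips the four rows of notes, so row 5 supplies the column names, and n_max = 10 keeps just the ten data rows):

read_excel(deaths_path, skip = 4, n_max = 10)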

        +

        Here the data we want to read in starts in cell A5 and ends in cell F15. In spreadsheet notation, this is A5:F15, which we supply to the range argument:

        +
        +
        read_excel(deaths_path, range = "A5:F15")
        +#> # A tibble: 10 × 6
        +#>   Name          Profession   Age `Has kids` `Date of birth`    
        +#>   <chr>         <chr>      <dbl> <lgl>      <dttm>             
        +#> 1 David Bowie   musician      69 TRUE       1947-01-08 00:00:00
        +#> 2 Carrie Fisher actor         60 TRUE       1956-10-21 00:00:00
        +#> 3 Chuck Berry   musician      90 TRUE       1926-10-18 00:00:00
        +#> 4 Bill Paxton   actor         61 TRUE       1955-05-17 00:00:00
        +#> 5 Prince        musician      57 TRUE       1958-06-07 00:00:00
        +#> 6 Alan Rickman  actor         69 FALSE      1946-02-21 00:00:00
        +#> # ℹ 4 more rows
        +#> # ℹ 1 more variable: `Date of death` <dttm>
        +
        +

        +20.2.6 Data types

        +

        In CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.

        +

        The underlying data in Excel spreadsheets is more complex. A cell can be one of four things:

        +
          +
• A boolean, like TRUE, FALSE, or NA.

• A number, like “10” or “10.5”.

• A datetime, which can also include time like “11/1/21” or “11/1/21 3:00 PM”.

• A text string, like “ten”.

When working with spreadsheet data, it’s important to keep in mind that the underlying data can be very different from what you see in the cell. For example, Excel has no notion of an integer. All numbers are stored as floating points, but you can choose to display the data with a customizable number of decimal points. Similarly, dates are actually stored as numbers, specifically the number of days since January 1, 1900 (which is why the unformatted deaths data above shows dates of birth like 17175). You can customize how you display the date by applying formatting in Excel. Confusingly, it’s also possible to have something that looks like a number but is actually a string (e.g., type '10 into a cell in Excel).

        +

These differences between how the underlying data are stored vs. how they’re displayed can cause surprises when the data are loaded into R. By default readxl will guess the data type in a given column. A recommended workflow is to let readxl guess the column types, confirm that you’re happy with the guessed column types, and if not, go back and re-import specifying col_types as shown in Section 20.2.3.
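For instance, here is a minimal sketch of that re-import step for the deaths data; the decision to read Age as text (rather than a number) is purely illustrative:

read_excel(
  deaths_path,
  range = "A5:F15",
  col_types = c("text", "text", "text", "logical", "date", "date")
)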

        +

        Another challenge is when you have a column in your Excel spreadsheet that has a mix of these types, e.g., some cells are numeric, others text, others dates. When importing the data into R readxl has to make some decisions. In these cases you can set the type for this column to "list", which will load the column as a list of length 1 vectors, where the type of each element of the vector is guessed.
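A minimal sketch of that option, assuming a hypothetical file data/mixed.xlsx whose column x mixes numbers, text, and dates:

# "data/mixed.xlsx" and the column name `x` are hypothetical
mixed <- read_excel("data/mixed.xlsx", col_types = "list")
# Each cell is now a length-1 vector; inspect the type guessed per cell:
sapply(mixed$x, class)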

        +
        +
        +
        + +
        +
        +

        Sometimes data is stored in more exotic ways, like the color of the cell background, or whether or not the text is bold. In such cases, you might find the tidyxl package useful. See https://nacnudus.github.io/spreadsheet-munging-strategies/ for more on strategies for working with non-tabular data from Excel.

        +
        +
        +
        +

        +20.2.7 Writing to Excel

        +

Let’s create a small data frame that we can then write out. Note that item is a factor and quantity is a double.

        +
        +
        bake_sale <- tibble(
        +  item     = factor(c("brownie", "cupcake", "cookie")),
        +  quantity = c(10, 5, 8)
        +)
        +
        +bake_sale
        +#> # A tibble: 3 × 2
        +#>   item    quantity
        +#>   <fct>      <dbl>
        +#> 1 brownie       10
        +#> 2 cupcake        5
        +#> 3 cookie         8
        +
        +

You can write data back to disk as an Excel file using the write_xlsx() function from the writexl package:

        +
        +
        write_xlsx(bake_sale, path = "data/bake-sale.xlsx")
        +
        +

Figure 20.4 shows what the data looks like in Excel. Note that column names are included and bolded. These can be turned off by setting the col_names and format_headers arguments to FALSE.
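For example, a sketch that writes a second, hypothetical file with neither column names nor header formatting:

write_xlsx(
  bake_sale,
  path = "data/bake-sale-plain.xlsx",  # hypothetical file name
  col_names = FALSE,
  format_headers = FALSE
)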

        +
        +
        +
        +

        Bake sale data frame created earlier in Excel.

        +
Figure 20.4: Spreadsheet called bake_sale.xlsx in Excel.
        +
        +
        +
        +

Just like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see Section 7.5.

        +
        +
        read_excel("data/bake-sale.xlsx")
        +#> # A tibble: 3 × 2
        +#>   item    quantity
        +#>   <chr>      <dbl>
        +#> 1 brownie       10
        +#> 2 cupcake        5
        +#> 3 cookie         8
        +
        +

        +20.2.8 Formatted output

        +

        The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the openxlsx package. We won’t go into the details of using this package here, but we recommend reading https://ycphs.github.io/openxlsx/articles/Formatting.html for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.
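As a small taste of that workflow, here is a sketch of writing bake_sale to a styled workbook; the sheet name, file name, and styling choices below are our own illustrations, not anything prescribed by openxlsx:

library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, sheetName = "Sales")
writeData(wb, sheet = "Sales", x = bake_sale)
# Bold the header row:
addStyle(wb, sheet = "Sales", style = createStyle(textDecoration = "bold"), rows = 1, cols = 1:2)
saveWorkbook(wb, "data/bake-sale-styled.xlsx", overwrite = TRUE)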

        +

        Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.

        +

        +20.2.9 Exercises

        +
          +
1. In an Excel file, create the following dataset and save it as survey.xlsx. Alternatively, you can download it as an Excel file from here.

          +
          +
          +

A spreadsheet with 2 columns (survey_id and n_pets) and 6 rows of data.

          +
          +
          +

          Then, read it into R, with survey_id as a character variable and n_pets as a numerical variable.

          +
          +
          #> # A tibble: 6 × 2
          +#>   survey_id n_pets
          +#>   <chr>      <dbl>
          +#> 1 1              0
          +#> 2 2              1
          +#> 3 3             NA
          +#> 4 4              2
          +#> 5 5              2
          +#> 6 6             NA
          +
          +
2. In another Excel file, create the following dataset and save it as roster.xlsx. Alternatively, you can download it as an Excel file from here.

          +
          +
          +

          A spreadsheet with 3 columns (group, subgroup, and id) and 12 rows. The group column has two values: 1 (spanning 7 merged rows) and 2 (spanning 5 merged rows). The subgroup column has four values: A (spanning 3 merged rows), B (spanning 4 merged rows), A (spanning 2 merged rows), and B (spanning 3 merged rows). The id column has twelve values, numbers 1 through 12.

          +
          +
          +

          Then, read it into R. The resulting data frame should be called roster and should look like the following.

          +
          +
          #> # A tibble: 12 × 3
          +#>    group subgroup    id
          +#>    <dbl> <chr>    <dbl>
          +#>  1     1 A            1
          +#>  2     1 A            2
          +#>  3     1 A            3
          +#>  4     1 B            4
          +#>  5     1 B            5
          +#>  6     1 B            6
          +#>  7     1 B            7
          +#>  8     2 A            8
          +#>  9     2 A            9
          +#> 10     2 B           10
          +#> 11     2 B           11
          +#> 12     2 B           12
          +
          +
3. In a new Excel file, create the following dataset and save it as sales.xlsx. Alternatively, you can download it as an Excel file from here.

          +
          +
          +

          A spreadsheet with 2 columns and 13 rows. The first two rows have text containing information about the sheet. Row 1 says "This file contains information on sales". Row 2 says "Data are organized by brand name, and for each brand, we have the ID number for the item sold, and how many are sold.". Then there are two empty rows, and then 9 rows of data.

          +
          +
          +

          a. Read sales.xlsx in and save as sales. The data frame should look like the following, with id and n as column names and with 9 rows.

          +
          +
          #> # A tibble: 9 × 2
          +#>   id      n    
          +#>   <chr>   <chr>
          +#> 1 Brand 1 n    
          +#> 2 1234    8    
          +#> 3 8721    2    
          +#> 4 1822    3    
          +#> 5 Brand 2 n    
          +#> 6 3333    1    
          +#> 7 2156    3    
          +#> 8 3987    6    
          +#> 9 3216    5
          +
          +

b. Modify sales further to get it into the following tidy format with three columns (brand, id, and n) and 7 rows of data. Note that id and n are numeric and brand is a character variable.

          +
          +
          #> # A tibble: 7 × 3
          +#>   brand      id     n
          +#>   <chr>   <dbl> <dbl>
          +#> 1 Brand 1  1234     8
          +#> 2 Brand 1  8721     2
          +#> 3 Brand 1  1822     3
          +#> 4 Brand 2  3333     1
          +#> 5 Brand 2  2156     3
          +#> 6 Brand 2  3987     6
          +#> 7 Brand 2  3216     5
          +
          +
4. Recreate the bake_sale data frame and write it out to an Excel file using the write.xlsx() function from the openxlsx package.

5. In Chapter 7 you learned about the janitor::clean_names() function to turn column names into snake case. Read the students.xlsx file that we introduced earlier in this section and use this function to “clean” the column names.

6. What happens if you try to read in a file with .xlsx extension with read_xls()?


        +20.3 Google Sheets

        +

        Google Sheets is another widely used spreadsheet program. It’s free and web-based. Just like with Excel, in Google Sheets data are organized in worksheets (also called sheets) inside of spreadsheet files.

        +

        +20.3.1 Prerequisites

        +

This section will also focus on spreadsheets, but this time you’ll be loading data from a Google Sheet with the googlesheets4 package. This package is non-core tidyverse as well, so you need to load it explicitly.

library(googlesheets4)
library(tidyverse)

A quick note about the name of the package: googlesheets4 uses v4 of the Sheets API to provide an R interface to Google Sheets, hence the name.

        +

        +20.3.2 Getting started

        +

        The main function of the googlesheets4 package is read_sheet(), which reads a Google Sheet from a URL or a file id. This function also goes by the name range_read().

        +

        You can also create a brand new sheet with gs4_create() or write to an existing sheet with sheet_write() and friends.
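For example, a sketch of creating a brand new sheet directly from a data frame (this requires authentication, and the sheet and worksheet names here are arbitrary):

ss <- gs4_create("bake-sale-demo", sheets = list(Sales = bake_sale))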

        +

In this section we’ll work with the same datasets as the ones in the Excel section to highlight similarities and differences between workflows for reading data from Excel and Google Sheets. The readxl and googlesheets4 packages are both designed to mimic the functionality of the readr package, which provides the read_csv() function you’ve seen in Chapter 7. Therefore, many of the tasks can be accomplished by simply swapping out read_excel() for read_sheet(). However, you’ll also see that Excel and Google Sheets don’t behave in exactly the same way, so other tasks may require further updates to the function calls.

        +

        +20.3.3 Reading Google Sheets

        +

Figure 20.5 shows what the spreadsheet we’re going to read into R looks like in Google Sheets. This is the same dataset as in Figure 20.1, except it’s stored in a Google Sheet instead of Excel.

        +
        +
        +
        +

        A look at the students spreadsheet in Google Sheets. The spreadsheet contains information on 6 students, their ID, full name, favourite food, meal plan, and age.

        +
Figure 20.5: Google Sheet called students in a browser window.
        +
        +
        +
        +

        The first argument to read_sheet() is the URL of the file to read, and it returns a tibble:
        https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w. These URLs are not pleasant to work with, so you’ll often want to identify a sheet by its ID.

        + +
        +
        students_sheet_id <- "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
        +students <- read_sheet(students_sheet_id)
        +#> ✔ Reading from students.
        +#> ✔ Range Sheet1.
        +students
        +#> # A tibble: 6 × 5
        +#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE   
        +#>          <dbl> <chr>            <chr>              <chr>               <list>
        +#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          <dbl> 
        +#> 2            2 Barclay Lynn     French fries       Lunch only          <dbl> 
        +#> 3            3 Jayendra Lyne    N/A                Breakfast and lunch <dbl> 
        +#> 4            4 Leon Rossini     Anchovies          Lunch only          <NULL>
        +#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch <chr> 
        +#> 6            6 Güvenç Attila    Ice cream          Lunch only          <dbl>
        +
        +

        Just like we did with read_excel(), we can supply column names, NA strings, and column types to read_sheet().

        +
        +
        students <- read_sheet(
        +  students_sheet_id,
        +  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
        +  skip = 1,
        +  na = c("", "N/A"),
        +  col_types = "dcccc"
        +)
        +#> ✔ Reading from students.
        +#> ✔ Range 2:10000000.
        +
        +students
        +#> # A tibble: 6 × 5
        +#>   student_id full_name        favourite_food     meal_plan           age  
        +#>        <dbl> <chr>            <chr>              <chr>               <chr>
        +#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
        +#> 2          2 Barclay Lynn     French fries       Lunch only          5    
        +#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
        +#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
        +#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
        +#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
        +
        +

        Note that we defined column types a bit differently here, using short codes. For example, “dcccc” stands for “double, character, character, character, character”.

        +

It’s also possible to read individual sheets from a Google Sheet. Let’s read the “Torgersen Island” sheet from the penguins Google Sheet:

        +
        +
        penguins_sheet_id <- "1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"
        +read_sheet(penguins_sheet_id, sheet = "Torgersen Island")
        +#> ✔ Reading from penguins.
        +#> ✔ Range ''Torgersen Island''.
        +#> # A tibble: 52 × 8
        +#>   species island    bill_length_mm bill_depth_mm flipper_length_mm
        +#>   <chr>   <chr>     <list>         <list>        <list>           
        +#> 1 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>        
        +#> 2 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>        
        +#> 3 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>        
        +#> 4 Adelie  Torgersen <chr [1]>      <chr [1]>     <chr [1]>        
        +#> 5 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>        
        +#> 6 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>        
        +#> # ℹ 46 more rows
        +#> # ℹ 3 more variables: body_mass_g <list>, sex <chr>, year <dbl>
        +
        +

        You can obtain a list of all sheets within a Google Sheet with sheet_names():

        +
        +
        sheet_names(penguins_sheet_id)
        +#> [1] "Torgersen Island" "Biscoe Island"    "Dream Island"
        +
        +

        Finally, just like with read_excel(), we can read in a portion of a Google Sheet by defining a range in read_sheet(). Note that we’re also using the gs4_example() function below to locate an example Google Sheet that comes with the googlesheets4 package.

        +
        +
        deaths_url <- gs4_example("deaths")
        +deaths <- read_sheet(deaths_url, range = "A5:F15")
        +#> ✔ Reading from deaths.
        +#> ✔ Range A5:F15.
        +deaths
        +#> # A tibble: 10 × 6
        +#>   Name          Profession   Age `Has kids` `Date of birth`    
        +#>   <chr>         <chr>      <dbl> <lgl>      <dttm>             
        +#> 1 David Bowie   musician      69 TRUE       1947-01-08 00:00:00
        +#> 2 Carrie Fisher actor         60 TRUE       1956-10-21 00:00:00
        +#> 3 Chuck Berry   musician      90 TRUE       1926-10-18 00:00:00
        +#> 4 Bill Paxton   actor         61 TRUE       1955-05-17 00:00:00
        +#> 5 Prince        musician      57 TRUE       1958-06-07 00:00:00
        +#> 6 Alan Rickman  actor         69 FALSE      1946-02-21 00:00:00
        +#> # ℹ 4 more rows
        +#> # ℹ 1 more variable: `Date of death` <dttm>
        +
        +

        +20.3.4 Writing to Google Sheets

        +

        You can write from R to Google Sheets with write_sheet(). The first argument is the data frame to write, and the second argument is the name (or other identifier) of the Google Sheet to write to:

        +
        +
        write_sheet(bake_sale, ss = "bake-sale")
        +
        +

        If you’d like to write your data to a specific (work)sheet inside a Google Sheet, you can specify that with the sheet argument as well.

        +
        +
        write_sheet(bake_sale, ss = "bake-sale", sheet = "Sales")
        +
        +

        +20.3.5 Authentication

        +

        While you can read from a public Google Sheet without authenticating with your Google account and with gs4_deauth(), reading a private sheet or writing to a sheet requires authentication so that googlesheets4 can view and manage your Google Sheets.

        +

When you attempt to read in a sheet that requires authentication, googlesheets4 will direct you to a web browser with a prompt to sign in to your Google account and grant permission to operate on your behalf with Google Sheets. However, if you want to specify a specific Google account, authentication scope, etc. you can do so with gs4_auth(), e.g., gs4_auth(email = "mine@example.com"), which will force the use of a token associated with a specific email. For further authentication details, we recommend reading the googlesheets4 auth vignette: https://googlesheets4.tidyverse.org/articles/auth.html.
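For example, two common patterns in scripts:

# Only touching public sheets: skip the OAuth flow entirely.
gs4_deauth()

# Working with private sheets: authenticate, optionally pinning the account.
gs4_auth(email = "mine@example.com")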

        +

        +20.3.6 Exercises

        +
          +
        1. Read the students dataset from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the read_excel() and read_sheet() functions. Are the resulting data frames in R exactly the same? If not, how are they different?

2. Read the Google Sheet titled survey from https://pos.it/r4ds-survey, with survey_id as a character variable and n_pets as a numerical variable.

3. Read the Google Sheet titled roster from https://pos.it/r4ds-roster. The resulting data frame should be called roster and should look like the following.

          +
          +
          #> # A tibble: 12 × 3
          +#>    group subgroup    id
          +#>    <dbl> <chr>    <dbl>
          +#>  1     1 A            1
          +#>  2     1 A            2
          +#>  3     1 A            3
          +#>  4     1 B            4
          +#>  5     1 B            5
          +#>  6     1 B            6
          +#>  7     1 B            7
          +#>  8     2 A            8
          +#>  9     2 A            9
          +#> 10     2 B           10
          +#> 11     2 B           11
          +#> 12     2 B           12
          +
          +

        +20.4 Summary

        +

Microsoft Excel and Google Sheets are two of the most popular spreadsheet systems. Being able to interact with data stored in Excel and Google Sheets files directly from R is a superpower! In this chapter you learned how to read data into R from spreadsheets: from Excel with read_excel() from the readxl package, and from Google Sheets with read_sheet() from the googlesheets4 package. These functions work very similarly to each other and have similar arguments for specifying column names, NA strings, rows to skip at the top of the file you’re reading in, etc. Additionally, both functions make it possible to read a single sheet from a spreadsheet as well.

        +

On the other hand, writing to an Excel file requires a different package and function (writexl::write_xlsx()), while you can write to a Google Sheet with the same googlesheets4 package, using write_sheet().

        +

        In the next chapter, you’ll learn about a different data source and how to read data from that source into R: databases.

        + + +
        +
        +
        +
        + + + \ No newline at end of file diff --git a/strings.html b/strings.html new file mode 100644 index 000000000..eeeca7449 --- /dev/null +++ b/strings.html @@ -0,0 +1,1349 @@ + + + + + + + +R para Ciência de Dados (2ª edição) - 14  Strings + + + + + + + + + + + + + + + + + + + + + + + + +
        +
        + +
        + + + +
        +

        14  Strings

        +
        + + + +
        + + + + +
        + + +

        +14.1 Introduction

        +

        So far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.

        +

We’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite: extracting strings from data. We’ll then discuss tools that work with individual letters, and the chapter finishes with a brief discussion of where your expectations from English might steer you wrong when working with other languages.

        +

        We’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.

        +

        +14.1.1 Prerequisites

        +

        In this chapter, we’ll use functions from the stringr package, which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.

library(tidyverse)
library(babynames)

        You can quickly tell when you’re using a stringr function because all stringr functions start with str_. This is particularly useful if you use RStudio because typing str_ will trigger autocomplete, allowing you to jog your memory of the available functions.

        +
        +
        +

str_c typed into the RStudio console with the autocomplete tooltip shown on top, which lists functions beginning with str_c. The function signature and beginning of the man page for the highlighted function from the autocomplete list are shown in a panel to its right.

        +
        +
        +

        +14.2 Creating a string

        +

        We’ve created strings in passing earlier in the book but didn’t discuss the details. Firstly, you can create a string using either single quotes (') or double quotes ("). There’s no difference in behavior between the two, so in the interests of consistency, the tidyverse style guide recommends using ", unless the string contains multiple ".

        +
        +
        string1 <- "This is a string"
        +string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
        +
        +

        If you forget to close a quote, you’ll see +, the continuation prompt:

        +
        > "This is a string without a closing quote
        ++ 
        ++ 
        ++ HELP I'M STUCK IN A STRING
        +

        If this happens to you and you can’t figure out which quote to close, press Escape to cancel and try again.

        +

        +14.2.1 Escapes

        +

        To include a literal single or double quote in a string, you can use \ to “escape” it:

        +
        +
        double_quote <- "\"" # or '"'
        +single_quote <- '\'' # or "'"
        +
        +

        So if you want to include a literal backslash in your string, you’ll need to escape it: "\\":

        +
        +
        backslash <- "\\"
        +
        +

        Beware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use str_view()1:

        +
        +
        x <- c(single_quote, double_quote, backslash)
        +x
        +#> [1] "'"  "\"" "\\"
        +
        +str_view(x)
        +#> [1] │ '
        +#> [2] │ "
        +#> [3] │ \
        +
        +

        +14.2.2 Raw strings

        +

        Creating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let’s create a string that contains the contents of the code block where we define the double_quote and single_quote variables:

        +
        +
        tricky <- "double_quote <- \"\\\"\" # or '\"'
        +single_quote <- '\\'' # or \"'\""
        +str_view(tricky)
        +#> [1] │ double_quote <- "\"" # or '"'
        +#>     │ single_quote <- '\'' # or "'"
        +
        +

        That’s a lot of backslashes! (This is sometimes called leaning toothpick syndrome.) To eliminate the escaping, you can instead use a raw string2:

        +
        +
        tricky <- r"(double_quote <- "\"" # or '"'
        +single_quote <- '\'' # or "'")"
        +str_view(tricky)
        +#> [1] │ double_quote <- "\"" # or '"'
        +#>     │ single_quote <- '\'' # or "'"
        +
        +

        A raw string usually starts with r"( and finishes with )". But if your string contains )" you can instead use r"[]" or r"{}", and if that’s still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., r"--()--", r"---()---", etc. Raw strings are flexible enough to handle any text.
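For example, a minimal sketch of a string that itself contains )", so it needs one of the alternative delimiters:

x <- r"[a )" b]"
str_view(x)
#> [1] │ a )" b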

        +

        +14.2.3 Other special characters

        +

        As well as \", \', and \\, there are a handful of other special characters that may come in handy. The most common are \n, a new line, and \t, tab. You’ll also sometimes see strings containing Unicode escapes that start with \u or \U. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in ?Quotes.

        +
        +
        x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
        +x
        +#> [1] "one\ntwo" "one\ttwo" "µ"        "😄"
        +str_view(x)
        +#> [1] │ one
        +#>     │ two
        +#> [2] │ one{\t}two
        +#> [3] │ µ
        +#> [4] │ 😄
        +
        +

        Note that str_view() uses curly braces for tabs to make them easier to spot3. One of the challenges of working with text is that there’s a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.

        +

        +14.2.4 Exercises

        +
          +
1. Create strings that contain the following values:

          +
            +
   1. He said "That's amazing!"

   2. \a\b\c\d

   3. \\\\\\
          +
2. Create the string in your R session and print it. What happens to the special “\u00a0”? How does str_view() display it? Can you do a little googling to figure out what this special character is?

          +
          +
          x <- "This\u00a0is\u00a0tricky"
          +
          +

        +14.3 Creating many strings from data

        +

        Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame. For example, you might combine “Hello” with a name variable to create a greeting. We’ll show you how to do this with str_c() and str_glue() and how you can use them with mutate(). That naturally raises the question of what stringr functions you might use with summarize(), so we’ll finish this section with a discussion of str_flatten(), which is a summary function for strings.

        +

        +14.3.1 str_c() +

        +

        str_c() takes any number of vectors as arguments and returns a character vector:

        +
        +
        str_c("x", "y")
        +#> [1] "xy"
        +str_c("x", "y", "z")
        +#> [1] "xyz"
        +str_c("Hello ", c("John", "Susan"))
        +#> [1] "Hello John"  "Hello Susan"
        +
        +

        str_c() is very similar to the base paste0(), but is designed to be used with mutate() by obeying the usual tidyverse rules for recycling and propagating missing values:

        +
        +
        df <- tibble(name = c("Flora", "David", "Terra", NA))
        +df |> mutate(greeting = str_c("Hi ", name, "!"))
        +#> # A tibble: 4 × 2
        +#>   name  greeting 
        +#>   <chr> <chr>    
        +#> 1 Flora Hi Flora!
        +#> 2 David Hi David!
        +#> 3 Terra Hi Terra!
        +#> 4 <NA>  <NA>
        +
        +

        If you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c():

        +
        +
        df |> 
        +  mutate(
        +    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
        +    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
        +  )
        +#> # A tibble: 4 × 3
        +#>   name  greeting1 greeting2
        +#>   <chr> <chr>     <chr>    
        +#> 1 Flora Hi Flora! Hi Flora!
        +#> 2 David Hi David! Hi David!
        +#> 3 Terra Hi Terra! Hi Terra!
        +#> 4 <NA>  Hi you!   Hi!
        +
        +

        +14.3.2 str_glue() +

        +

        If you are mixing many fixed and variable strings with str_c(), you’ll notice that you type a lot of "s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue()4. You give it a single string that has a special feature: anything inside {} will be evaluated like it’s outside of the quotes:

        +
        +
        df |> mutate(greeting = str_glue("Hi {name}!"))
        +#> # A tibble: 4 × 2
        +#>   name  greeting 
        +#>   <chr> <glue>   
        +#> 1 Flora Hi Flora!
        +#> 2 David Hi David!
        +#> 3 Terra Hi Terra!
        +#> 4 <NA>  Hi NA!
        +
        +

        As you can see, str_glue() currently converts missing values to the string "NA", unfortunately making it inconsistent with str_c().

        +

You also might wonder what happens if you need to include a regular { or } in your string. You’re on the right track if you guess you’ll need to escape it somehow. The trick is that glue uses a slightly different escaping technique: instead of prefixing with a special character like \, you double up the special characters:

        +
        +
        df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
        +#> # A tibble: 4 × 2
        +#>   name  greeting   
        +#>   <chr> <glue>     
        +#> 1 Flora {Hi Flora!}
        +#> 2 David {Hi David!}
        +#> 3 Terra {Hi Terra!}
        +#> 4 <NA>  {Hi NA!}
        +
        +

        +14.3.3 str_flatten() +

        +

        str_c() and str_glue() work well with mutate() because their output is the same length as their inputs. What if you want a function that works well with summarize(), i.e. something that always returns a single string? That’s the job of str_flatten()5: it takes a character vector and combines each element of the vector into a single string:

        +
        +
        str_flatten(c("x", "y", "z"))
        +#> [1] "xyz"
        +str_flatten(c("x", "y", "z"), ", ")
        +#> [1] "x, y, z"
        +str_flatten(c("x", "y", "z"), ", ", last = ", and ")
        +#> [1] "x, y, and z"
        +
        +

        This makes it work well with summarize():

        +
        +
        df <- tribble(
        +  ~ name, ~ fruit,
        +  "Carmen", "banana",
        +  "Carmen", "apple",
        +  "Marvin", "nectarine",
        +  "Terence", "cantaloupe",
        +  "Terence", "papaya",
        +  "Terence", "mandarin"
        +)
        +df |>
        +  group_by(name) |> 
        +  summarize(fruits = str_flatten(fruit, ", "))
        +#> # A tibble: 3 × 2
        +#>   name    fruits                      
        +#>   <chr>   <chr>                       
        +#> 1 Carmen  banana, apple               
        +#> 2 Marvin  nectarine                   
        +#> 3 Terence cantaloupe, papaya, mandarin
        +
        +

        +14.3.4 Exercises

        +
          +
1. Compare and contrast the results of paste0() with str_c() for the following inputs:

          +
          +
          str_c("hi ", NA)
          +str_c(letters[1:2], letters[1:3])
          +
          +
2. What’s the difference between paste() and paste0()? How can you recreate the equivalent of paste() with str_c()?

3. Convert the following expressions from str_c() to str_glue() or vice versa:

          +
            +
   1. str_c("The price of ", food, " is ", price)

   2. str_glue("I'm {age} years old and live in {country}")

   3. str_c("\\section{", title, "}")
          +

        +14.4 Extracting data from strings

        +

        It’s very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:

        +
          +
• df |> separate_longer_delim(col, delim)

• df |> separate_longer_position(col, width)

• df |> separate_wider_delim(col, delim, names)

• df |> separate_wider_position(col, widths)
        +

        If you look closely, you can see there’s a common pattern here: separate_, then longer or wider, then _, then by delim or position. That’s because these four functions are composed of two simpler primitives:

        +
          +
• Just like with pivot_longer() and pivot_wider(), _longer functions make the input data frame longer by creating new rows and _wider functions make the input data frame wider by generating new columns.

• delim splits up a string with a delimiter like ", " or " "; position splits at specified widths, like c(3, 5, 2).
        +

We’ll return to the last member of this family, separate_wider_regex(), in Chapter 15. It’s the most flexible of the wider functions, but you need to know something about regular expressions before you can use it.

        +

        The following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns. We’ll finish off by discussing the tools that the wider functions give you to diagnose problems.

        +

        +14.4.1 Separating into rows

        +

        Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring separate_longer_delim() to split based on a delimiter:

        +
        +
        df1 <- tibble(x = c("a,b,c", "d,e", "f"))
        +df1 |> 
        +  separate_longer_delim(x, delim = ",")
        +#> # A tibble: 6 × 1
        +#>   x    
        +#>   <chr>
        +#> 1 a    
        +#> 2 b    
        +#> 3 c    
        +#> 4 d    
        +#> 5 e    
        +#> 6 f
        +
        +

        It’s rarer to see separate_longer_position() in the wild, but some older datasets do use a very compact format where each character is used to record a value:

        +
        +
        df2 <- tibble(x = c("1211", "131", "21"))
        +df2 |> 
        +  separate_longer_position(x, width = 1)
        +#> # A tibble: 9 × 1
        +#>   x    
        +#>   <chr>
        +#> 1 1    
        +#> 2 2    
        +#> 3 1    
        +#> 4 1    
        +#> 5 1    
        +#> 6 3    
        +#> # ℹ 3 more rows
        +
        +

        +14.4.2 Separating into columns

        +

Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their longer equivalents because you need to name the columns. For example, in the following dataset, x is made up of a code, an edition number, and a year, separated by ".". To use separate_wider_delim(), we supply the delimiter and the names in two arguments:

        +
        +
        df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
        +df3 |> 
        +  separate_wider_delim(
        +    x,
        +    delim = ".",
        +    names = c("code", "edition", "year")
        +  )
        +#> # A tibble: 3 × 3
        +#>   code  edition year 
        +#>   <chr> <chr>   <chr>
        +#> 1 a10   1       2022 
        +#> 2 b10   2       2011 
        +#> 3 e15   1       2015
        +
        +

        If a specific piece is not useful you can use an NA name to omit it from the results:

        +
        +
        df3 |> 
        +  separate_wider_delim(
        +    x,
        +    delim = ".",
        +    names = c("code", NA, "year")
        +  )
        +#> # A tibble: 3 × 2
        +#>   code  year 
        +#>   <chr> <chr>
        +#> 1 a10   2022 
        +#> 2 b10   2011 
        +#> 3 e15   2015
        +
        +

        separate_wider_position() works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them:

        +
        +
        df4 <- tibble(x = c("202215TX", "202122LA", "202325CA")) 
        +df4 |> 
        +  separate_wider_position(
        +    x,
        +    widths = c(year = 4, age = 2, state = 2)
        +  )
        +#> # A tibble: 3 × 3
        +#>   year  age   state
        +#>   <chr> <chr> <chr>
        +#> 1 2022  15    TX   
        +#> 2 2021  22    LA   
        +#> 3 2023  25    CA
        +
        +

        +14.4.3 Diagnosing widening problems

        +

        separate_wider_delim()6 requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so separate_wider_delim() provides two arguments to help: too_few and too_many. Let’s first look at the too_few case with the following sample dataset:

        +
        +
        df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
        +
        +df |> 
        +  separate_wider_delim(
        +    x,
        +    delim = "-",
        +    names = c("x", "y", "z")
        +  )
        +#> Error in `separate_wider_delim()`:
        +#> ! Expected 3 pieces in each element of `x`.
        +#> ! 2 values were too short.
        +#> ℹ Use `too_few = "debug"` to diagnose the problem.
        +#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
        +
        +

        You’ll notice that we get an error, but the error gives us some suggestions on how you might proceed. Let’s start by debugging the problem:

        +
        +
        debug <- df |> 
        +  separate_wider_delim(
        +    x,
        +    delim = "-",
        +    names = c("x", "y", "z"),
        +    too_few = "debug"
        +  )
        +#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
        +#> `x_remainder`.
        +debug
        +#> # A tibble: 5 × 6
        +#>   x     y     z     x_ok  x_pieces x_remainder
        +#>   <chr> <chr> <chr> <lgl>    <int> <chr>      
        +#> 1 1-1-1 1     1     TRUE         3 ""         
        +#> 2 1-1-2 1     2     TRUE         3 ""         
        +#> 3 1-3   3     <NA>  FALSE        2 ""         
        +#> 4 1-3-2 3     2     TRUE         3 ""         
        +#> 5 1     <NA>  <NA>  FALSE        1 ""
        +
        +

        When you use the debug mode, you get three extra columns added to the output: x_ok, x_pieces, and x_remainder (if you separate a variable with a different name, you’ll get a different prefix). Here, x_ok lets you quickly find the inputs that failed:

        +
        +
        debug |> filter(!x_ok)
        +#> # A tibble: 2 × 6
        +#>   x     y     z     x_ok  x_pieces x_remainder
        +#>   <chr> <chr> <chr> <lgl>    <int> <chr>      
        +#> 1 1-3   3     <NA>  FALSE        2 ""         
        +#> 2 1     <NA>  <NA>  FALSE        1 ""
        +
        +

        x_pieces tells us how many pieces were found, compared to the expected 3 (the length of names). x_remainder isn’t useful when there are too few pieces, but we’ll see it again shortly.

        +

        Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove too_few = "debug" to ensure that new problems become errors.

        +

        In other cases, you may want to fill in the missing pieces with NAs and move on. That’s the job of too_few = "align_start" and too_few = "align_end" which allow you to control where the NAs should go:

        +
        +
        df |> 
        +  separate_wider_delim(
        +    x,
        +    delim = "-",
        +    names = c("x", "y", "z"),
        +    too_few = "align_start"
        +  )
        +#> # A tibble: 5 × 3
        +#>   x     y     z    
        +#>   <chr> <chr> <chr>
        +#> 1 1     1     1    
        +#> 2 1     1     2    
        +#> 3 1     3     <NA> 
        +#> 4 1     3     2    
        +#> 5 1     <NA>  <NA>
        +
        +

        The same principles apply if you have too many pieces:

        +
        +
        df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
        +
        +df |> 
        +  separate_wider_delim(
        +    x,
        +    delim = "-",
        +    names = c("x", "y", "z")
        +  )
        +#> Error in `separate_wider_delim()`:
        +#> ! Expected 3 pieces in each element of `x`.
        +#> ! 2 values were too long.
        +#> ℹ Use `too_many = "debug"` to diagnose the problem.
        +#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
        +
        +

        But now, when we debug the result, you can see the purpose of x_remainder:

        +
        +
        debug <- df |> 
        +  separate_wider_delim(
        +    x,
        +    delim = "-",
        +    names = c("x", "y", "z"),
        +    too_many = "debug"
        +  )
        +#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
        +#> `x_remainder`.
        +debug |> filter(!x_ok)
        +#> # A tibble: 2 × 6
        +#>   x         y     z     x_ok  x_pieces x_remainder
        +#>   <chr>     <chr> <chr> <lgl>    <int> <chr>      
        +#> 1 1-3-5-6   3     5     FALSE        4 -6         
        +#> 2 1-3-5-7-9 3     5     FALSE        5 -7-9
        +
        +

        You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:

        +
        +
        df |> 
        +  separate_wider_delim(
        +    x,
        +    delim = "-",
        +    names = c("x", "y", "z"),
        +    too_many = "drop"
        +  )
        +#> # A tibble: 5 × 3
        +#>   x     y     z    
        +#>   <chr> <chr> <chr>
        +#> 1 1     1     1    
        +#> 2 1     1     2    
        +#> 3 1     3     5    
        +#> 4 1     3     2    
        +#> 5 1     3     5
        +
        +
        +df |> 
        +  separate_wider_delim(
        +    x,
        +    delim = "-",
        +    names = c("x", "y", "z"),
        +    too_many = "merge"
        +  )
        +#> # A tibble: 5 × 3
        +#>   x     y     z    
        +#>   <chr> <chr> <chr>
        +#> 1 1     1     1    
        +#> 2 1     1     2    
        +#> 3 1     3     5-6  
        +#> 4 1     3     2    
        +#> 5 1     3     5-7-9
        +
        +

        +14.5 Letters

        +

        In this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.

        +

        +14.5.1 Length

        +

        str_length() tells you the number of letters in the string:

        +
        +
        str_length(c("a", "R for data science", NA))
        +#> [1]  1 18 NA
        +
        +

        You could use this with count() to find the distribution of lengths of US babynames and then with filter() to look at the longest names, which happen to have 15 letters7:

        +
        +
        babynames |>
        +  count(length = str_length(name), wt = n)
        +#> # A tibble: 14 × 2
        +#>   length        n
        +#>    <int>    <int>
        +#> 1      2   338150
        +#> 2      3  8589596
        +#> 3      4 48506739
        +#> 4      5 87011607
        +#> 5      6 90749404
        +#> 6      7 72120767
        +#> # ℹ 8 more rows
        +
        +babynames |> 
        +  filter(str_length(name) == 15) |> 
        +  count(name, wt = n, sort = TRUE)
        +#> # A tibble: 34 × 2
        +#>   name                n
        +#>   <chr>           <int>
        +#> 1 Franciscojavier   123
        +#> 2 Christopherjohn   118
        +#> 3 Johnchristopher   118
        +#> 4 Christopherjame   108
        +#> 5 Christophermich    52
        +#> 6 Ryanchristopher    45
        +#> # ℹ 28 more rows
        +
        +

        +14.5.2 Subsetting

        +

        You can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end. The start and end arguments are inclusive, so the length of the returned string will be end - start + 1:

        +
        +
        x <- c("Apple", "Banana", "Pear")
        +str_sub(x, 1, 3)
        +#> [1] "App" "Ban" "Pea"
        +
        +

        You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.

        +
        +
        str_sub(x, -3, -1)
        +#> [1] "ple" "ana" "ear"
        +
        +

        Note that str_sub() won’t fail if the string is too short: it will just return as much as possible:

        +
        +
        str_sub("a", 1, 5)
        +#> [1] "a"
        +
        +

        We could use str_sub() with mutate() to find the first and last letter of each name:

        +
        +
        babynames |> 
        +  mutate(
        +    first = str_sub(name, 1, 1),
        +    last = str_sub(name, -1, -1)
        +  )
        +#> # A tibble: 1,924,665 × 7
        +#>    year sex   name          n   prop first last 
        +#>   <dbl> <chr> <chr>     <int>  <dbl> <chr> <chr>
        +#> 1  1880 F     Mary       7065 0.0724 M     y    
        +#> 2  1880 F     Anna       2604 0.0267 A     a    
        +#> 3  1880 F     Emma       2003 0.0205 E     a    
        +#> 4  1880 F     Elizabeth  1939 0.0199 E     h    
        +#> 5  1880 F     Minnie     1746 0.0179 M     e    
        +#> 6  1880 F     Margaret   1578 0.0162 M     t    
        +#> # ℹ 1,924,659 more rows
        +
        +

        +14.5.3 Exercises

        +
          +
1. When computing the distribution of the length of babynames, why did we use wt = n?

2. Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

3. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?

        +14.6 Non-English text

        +

        So far, we’ve focused on English language text which is particularly easy to work with for two reasons. Firstly, the English alphabet is relatively simple: there are just 26 letters. Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers. Unfortunately, we don’t have room for a full treatment of non-English languages. Still, we wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.

        +

        +14.6.1 Encoding

        +

        When working with non-English text, the first challenge is often the encoding. To understand what’s going on, we need to dive into how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():

        +
        +
        charToRaw("Hadley")
        +#> [1] 48 61 64 6c 65 79
        +
        +

        Each of these six hexadecimal numbers represents one letter: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case, the encoding is called ASCII. ASCII does a great job of representing English characters because it’s the American Standard Code for Information Interchange.

        +

        Things aren’t so easy for languages other than English. In the early days of computing, there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte b1 is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.

        +

        readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times, you’ll get complete gibberish. For example here are two inline CSVs with unusual encodings8:

        +
        +
        x1 <- "text\nEl Ni\xf1o was particularly bad this year"
        +read_csv(x1)$text
        +#> [1] "El Ni\xf1o was particularly bad this year"
        +
        +x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
        +read_csv(x2)$text
        +#> [1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"
        +
        +

        To read these correctly, you specify the encoding via the locale argument:

        +
        +
        read_csv(x1, locale = locale(encoding = "Latin1"))$text
        +#> [1] "El Niño was particularly bad this year"
        +
        +read_csv(x2, locale = locale(encoding = "Shift-JIS"))$text
        +#> [1] "こんにちは"
        +
        +

        How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof and works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.
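For example, a sketch using x1 from above; guess_encoding() accepts a file path or a raw vector and returns a tibble of candidate encodings with confidence values (with this little text, expect the guesses to be rough):

guess_encoding(charToRaw(x1))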

        +

        Encodings are a rich and complex topic; we’ve only scratched the surface here. If you’d like to learn more, we recommend reading the detailed explanation at http://kunststube.net/encoding/.

        +

        +14.6.2 Letter variations

        +

        Working in languages with accents poses a significant challenge when determining the position of letters (e.g., with str_length() and str_sub()) as accented letters might be encoded as a single individual character (e.g., ü) or as two characters by combining an unaccented letter (e.g., u) with a diacritic mark (e.g., ¨). For example, this code shows two ways of representing ü that look identical:

        +
        +
        u <- c("\u00fc", "u\u0308")
        +str_view(u)
        +#> [1] │ ü
        +#> [2] │ ü
        +
        +

        But both strings differ in length, and their first characters are different:

str_length(u)
#> [1] 1 2
str_sub(u, 1, 1)
#> [1] "ü" "u"

Finally, note that comparing these strings with == treats them as different, while the handy str_equal() function in stringr recognizes that both have the same appearance:

u[[1]] == u[[2]]
#> [1] FALSE

str_equal(u[[1]], u[[2]])
#> [1] TRUE
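If you need the two representations to behave identically, one option is to normalize both strings to a single canonical form first. This is a sketch using the lower-level stringi package (which stringr builds on); stri_trans_nfc() converts to the composed (NFC) form:

library(stringi)

# After NFC normalization, both spellings become the single composed character
u_nfc <- stri_trans_nfc(u)
str_length(u_nfc)
#> [1] 1 1
str_sub(u_nfc, 1, 1)
#> [1] "ü" "ü"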

14.6.3 Locale-dependent functions

        Finally, there are a handful of stringr functions whose behavior depends on your locale. A locale is similar to a language but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a _ and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, Wikipedia has a good list, and you can see which are supported in stringr by looking at stringi::stri_locale_list().


Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to English rules by using the “en” locale, and requires you to specify the locale argument to override it. Fortunately, only two sets of functions really depend on the locale: changing case and sorting.

The rules for changing case differ among languages. For example, Turkish has two i’s: with and without a dot. Since they’re two distinct letters, they’re capitalized differently:

str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"

        Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language9! Here’s an example: in Czech, “ch” is a compound letter that appears after h in the alphabet.

str_sort(c("a", "c", "ch", "h", "z"))
#> [1] "a"  "c"  "ch" "h"  "z"
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
#> [1] "a"  "c"  "h"  "ch" "z"

        This also comes up when sorting strings with dplyr::arrange(), which is why it also has a locale argument.
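A quick sketch of the same Czech example with dplyr, assuming dplyr 1.1.0 or later (where the argument is spelled .locale and requires stringi to be installed):

library(dplyr)

df <- tibble(x = c("a", "c", "ch", "h", "z"))
df |> arrange(x)                  # English ordering: "ch" sorts between "c" and "h"
df |> arrange(x, .locale = "cs")  # Czech ordering: "ch" sorts after "h"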

14.7 Summary

        In this chapter, you’ve learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.


1. Or use the base R function writeLines().

2. Available in R 4.0.0 and above.

3. str_view() also uses color to bring tabs, spaces, matches, etc. to your attention. The colors don’t currently show up in the book, but you’ll notice them when running code interactively.

4. If you’re not using stringr, you can also access it directly with glue::glue().

5. The base R equivalent is paste() used with the collapse argument.

6. The same principles apply to separate_wider_position() and separate_wider_regex().

7. Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.

8. Here I’m using the special \x to encode binary data directly into a string.

9. Sorting in languages that don’t have an alphabet, like Chinese, is more complicated still.

diff --git a/transform.html b/transform.html

• Capítulo 12 teaches you about logical vectors. These are the simplest types of vectors, but they are extremely powerful. You’ll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for conditional transformations.

• Capítulo 13 dives into tools for vectors of numbers, the powerhouse of data science. You’ll learn more about counting and a bunch of important transformation and summary functions.

• Capítulo 14 will give you the tools to work with strings: you’ll slice them, you’ll dice them, and you’ll stick them back together again. This chapter mostly focuses on the stringr package, but you’ll also learn some more tidyr functions devoted to extracting data from character strings.

• Capítulo 15 introduces you to regular expressions, a powerful tool for manipulating strings. This chapter will take you from thinking that a cat walked over your keyboard to reading and writing complex string patterns.

• Capítulo 16 introduces factors: the data type that R uses to store categorical data. You use a factor when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.

• Capítulo 17 will give you the key tools for working with dates and date-times. Unfortunately, the more you learn about date-times, the more complicated they seem to get, but with the help of the lubridate package, you’ll learn how to overcome the most common challenges.

• Capítulo 18 discusses missing values in depth. We’ve discussed them a couple of times in isolation, but now it’s time to discuss them holistically, helping you come to grips with the difference between implicit and explicit missing values, and how and why you might convert between them.

• Capítulo 19 finishes up this part of the book by giving you tools to join two (or more) data frames together. Learning about joins will force you to grapple with the idea of keys, and think about how you identify each row in a dataset.

diff --git a/visualize.html b/visualize.html

Each chapter addresses one or more aspects of creating a data visualization.

• In Capítulo 9, you will learn about the grammar of graphics.

• In Capítulo 10, you will combine visualization with your curiosity and skepticism to ask and answer interesting questions about your data.

• Finally, in Capítulo 11, you will learn how to take your exploratory graphics, improve them, and turn them into expository graphics: graphics that help newcomers to your analysis understand what is going on as quickly and easily as possible.

These three chapters get you started in the world of visualization, but there is much more to learn. The absolute best place to learn more is the ggplot2 book: ggplot2: Elegant graphics for data analysis. It goes much deeper into the underlying theory and has many examples of how to combine the package's functions to solve practical problems. Another great resource is the ggplot2 extensions gallery, https://exts.ggplot2.tidyverse.org/gallery/. This site lists many packages that extend ggplot2 with new geoms and scales, and it's a great place to start if you're trying to do something that seems hard with ggplot2.

diff --git a/webscraping.html b/webscraping.html

24  Web scraping

24.1 Introduction

        This chapter introduces you to the basics of web scraping with rvest. Web scraping is a very useful tool for extracting data from web pages. Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from Capítulo 23. Where possible, you should use the API1, because typically it will give you more reliable data. Unfortunately, however, programming with web APIs is out of scope for this book. Instead, we are teaching scraping, a technique that works whether or not a site provides an API.
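APIs themselves are out of scope, but to give you a flavor, here is a minimal sketch of what a JSON API call often looks like with the httr2 package. The endpoint is hypothetical, and httr2 is our choice of illustration, not something this chapter covers:

library(httr2)

# A made-up JSON endpoint, for illustration only
films <- request("https://api.example.com/films") |>
  req_perform() |>
  resp_body_json()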


        In this chapter, we’ll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. You’ll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R. We’ll then discuss some techniques to figure out what CSS selector you need for the page you’re scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.

24.1.1 Prerequisites

In this chapter, we’ll focus on tools provided by rvest. rvest is a member of the tidyverse, but is not a core member, so you’ll need to load it explicitly. We’ll also load the full tidyverse, since we’ll find it generally useful when working with the data we’ve scraped.

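In code, that is:

library(rvest)
library(tidyverse)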

24.2 Scraping ethics and legalities

        Before we get started discussing the code you’ll need to perform web scraping, we need to talk about whether it’s legal and ethical for you to do so. Overall, the situation is complicated with regards to both of these.


        Legalities depend a lot on where you live. However, as a general principle, if the data is public, non-personal, and factual, you’re likely to be ok2. These three factors are important because they’re connected to the site’s terms and conditions, personally identifiable information, and copyright, as we’ll discuss below.


        If the data isn’t public, non-personal, or factual or you’re scraping the data specifically to make money with it, you’ll need to talk to a lawyer. In any case, you should be respectful of the resources of the server hosting the pages you are scraping. Most importantly, this means that if you’re scraping many pages, you should make sure to wait a little between each request. One easy way to do so is to use the polite package by Dmytro Perepolkin. It will automatically pause between requests and cache the results so you never ask for the same page twice.
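Here is a minimal sketch of the polite workflow (the URL is just an example):

library(polite)

# bow() introduces your scraper to the host and reads its robots.txt;
# scrape() then fetches the page, respecting the negotiated crawl
# delay and caching results so repeated requests are free.
session <- bow("https://rvest.tidyverse.org/")
page <- scrape(session)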

24.2.1 Terms of service

        If you look closely, you’ll find many websites include a “terms and conditions” or “terms of service” link somewhere on the page, and if you read that page closely you’ll often discover that the site specifically prohibits web scraping. These pages tend to be a legal land grab where companies make very broad claims. It’s polite to respect these terms of service where possible, but take any claims with a grain of salt.


US courts have generally found that simply putting the terms of service in the footer of the website isn’t sufficient for you to be bound by them, e.g., HiQ Labs v. LinkedIn. Generally, to be bound to the terms of service, you must have taken some explicit action like creating an account or checking a box. This is why whether or not the data is public is important; if you don’t need an account to access them, it is unlikely that you are bound to the terms of service. Note, however, that the situation is rather different in Europe, where courts have found that terms of service are enforceable even if you don’t explicitly agree to them.

24.2.2 Personally identifiable information

Even if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc. Europe has particularly strict laws about the collection or storage of such data (GDPR), and regardless of where you live you’re likely to be entering an ethical quagmire. For example, in 2016, a group of researchers scraped public profile information (e.g., usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempt at anonymization. While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around the identifiability of users whose information was released in the dataset. If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study3 as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.

24.3 HTML basics

        To scrape webpages, you need to first understand a little bit about HTML, the language that describes web pages. HTML stands for HyperText Markup Language and looks something like this:

<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>

        HTML has a hierarchical structure formed by elements which consist of a start tag (e.g., <tag>), optional attributes (id='first'), an end tag4 (like </tag>), and contents (everything in between the start and end tag).


        Since < and > are used for start and end tags, you can’t write them directly. Instead you have to use the HTML escapes &gt; (greater than) and &lt; (less than). And since those escapes use &, if you want a literal ampersand you have to escape it as &amp;. There are a wide range of possible HTML escapes but you don’t need to worry about them too much because rvest automatically handles them for you.
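For example, escapes in the source come back as plain text. This small sketch uses minimal_html(), an rvest helper for writing HTML inline that is introduced properly below:

html <- minimal_html("<p>x &lt; y &amp; y &gt; z</p>")
html |> html_element("p") |> html_text2()
#> [1] "x < y & y > z"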


        Web scraping is possible because most pages that contain data that you want to scrape generally have a consistent structure.

24.3.1 Elements

        There are over 100 HTML elements. Some of the most important are:

• Every HTML page must be in an <html> element, and it must have two children: <head>, which contains document metadata like the page title, and <body>, which contains the content you see in the browser.

• Block tags like <h1> (heading 1), <section> (section), <p> (paragraph), and <ol> (ordered list) form the overall structure of the page.

• Inline tags like <b> (bold), <i> (italics), and <a> (link) format text inside block tags.

If you encounter a tag that you’ve never seen before, you can find out what it does with a little googling. Another good place to start is the MDN Web Docs, which describe just about every aspect of web programming.

Most elements can have content in between their start and end tags. This content can either be text or more elements. For example, the following HTML contains a paragraph of text, with one word in bold.

<p>
  Hi! My <b>name</b> is Hadley.
</p>

        The children are the elements it contains, so the <p> element above has one child, the <b> element. The <b> element has no children, but it does have contents (the text “name”).

24.3.2 Attributes

        Tags can have named attributes which look like name1='value1' name2='value2'. Two of the most important attributes are id and class, which are used in conjunction with CSS (Cascading Style Sheets) to control the visual appearance of the page. These are often useful when scraping data off a page. Attributes are also used to record the destination of links (the href attribute of <a> elements) and the source of images (the src attribute of the <img> element).

24.4 Extracting data

        To get started scraping, you’ll need the URL of the page you want to scrape, which you can usually copy from your web browser. You’ll then need to read the HTML for that page into R with read_html(). This returns an xml_document5 object which you’ll then manipulate using rvest functions:

html <- read_html("http://rvest.tidyverse.org/")
html
#> {html_document}
#> <html lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UT ...
#> [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Ski ...

        rvest also includes a function that lets you write HTML inline. We’ll use this a bunch in this chapter as we teach how the various rvest functions work with simple examples.

html <- minimal_html("
  <p>This is a paragraph</p>
  <ul>
    <li>This is a bulleted list</li>
  </ul>
")
html
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UT ...
#> [2] <body>\n<p>This is a paragraph</p>\n  <ul>\n<li>This is a bulleted lis ...

        Now that you have the HTML in R, it’s time to extract the data of interest. You’ll first learn about the CSS selectors that allow you to identify the elements of interest and the rvest functions that you can use to extract data from them. Then we’ll briefly cover HTML tables, which have some special tools.

24.4.1 Find elements

        CSS is short for cascading style sheets, and is a tool for defining the visual styling of HTML documents. CSS includes a miniature language for selecting elements on a page called CSS selectors. CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.


        We’ll come back to CSS selectors in more detail in Seção 24.5, but luckily you can get a long way with just three:

• p selects all <p> elements.

• .title selects all elements with class “title”.

• #title selects the element with the id attribute that equals “title”. Id attributes must be unique within a document, so this will only ever select a single element.

        Let’s try out these selectors with a simple example:

html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

        Use html_elements() to find all elements that match the selector:

html |> html_elements("p")
#> {xml_nodeset (2)}
#> [1] <p id="first">This is a paragraph</p>
#> [2] <p class="important">This is an important paragraph</p>
html |> html_elements(".important")
#> {xml_nodeset (1)}
#> [1] <p class="important">This is an important paragraph</p>
html |> html_elements("#first")
#> {xml_nodeset (1)}
#> [1] <p id="first">This is a paragraph</p>

Another important function is html_element(), which always returns the same number of outputs as inputs. If you apply it to a whole document it’ll give you the first match:

html |> html_element("p")
#> {html_node}
#> <p id="first">

There’s an important difference between html_element() and html_elements() when you use a selector that doesn’t match any elements: html_elements() returns a vector of length 0, whereas html_element() returns a missing value. This will be important shortly.

html |> html_elements("b")
#> {xml_nodeset (0)}
html |> html_element("b")
#> {xml_missing}
#> <NA>

24.4.2 Nesting selections

        In most cases, you’ll use html_elements() and html_element() together, typically using html_elements() to identify elements that will become observations then using html_element() to find elements that will become variables. Let’s see this in action using a simple example. Here we have an unordered list (<ul>) where each list item (<li>) contains some information about four characters from StarWars:

html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
  </ul>
  ")

        We can use html_elements() to make a vector where each element corresponds to a different character:

characters <- html |> html_elements("li")
characters
#> {xml_nodeset (4)}
#> [1] <li>\n<b>C-3PO</b> is a <i>droid</i> that weighs <span class="weight"> ...
#> [2] <li>\n<b>R4-P17</b> is a <i>droid</i>\n</li>
#> [3] <li>\n<b>R2-D2</b> is a <i>droid</i> that weighs <span class="weight"> ...
#> [4] <li>\n<b>Yoda</b> weighs <span class="weight">66 kg</span>\n</li>

        To extract the name of each character, we use html_element(), because when applied to the output of html_elements() it’s guaranteed to return one response per element:

characters |> html_element("b")
#> {xml_nodeset (4)}
#> [1] <b>C-3PO</b>
#> [2] <b>R4-P17</b>
#> [3] <b>R2-D2</b>
#> [4] <b>Yoda</b>

        The distinction between html_element() and html_elements() isn’t important for name, but it is important for weight. We want to get one weight for each character, even if there’s no weight <span>. That’s what html_element() does:

characters |> html_element(".weight")
#> {xml_nodeset (4)}
#> [1] <span class="weight">167 kg</span>
#> [2] <NA>
#> [3] <span class="weight">96 kg</span>
#> [4] <span class="weight">66 kg</span>

html_elements() finds all weight <span>s that are children of characters. There are only three of these, so we lose the connection between names and weights:

characters |> html_elements(".weight")
#> {xml_nodeset (3)}
#> [1] <span class="weight">167 kg</span>
#> [2] <span class="weight">96 kg</span>
#> [3] <span class="weight">66 kg</span>

        Now that you’ve selected the elements of interest, you’ll need to extract the data, either from the text contents or some attributes.

24.4.3 Text and attributes

        html_text2()6 extracts the plain text contents of an HTML element:

characters |> 
  html_element("b") |> 
  html_text2()
#> [1] "C-3PO"  "R4-P17" "R2-D2"  "Yoda"

characters |> 
  html_element(".weight") |> 
  html_text2()
#> [1] "167 kg" NA       "96 kg"  "66 kg"

        Note that any escapes will be automatically handled; you’ll only ever see HTML escapes in the source HTML, not in the data returned by rvest.


        html_attr() extracts data from attributes:

html <- minimal_html("
  <p><a href='https://en.wikipedia.org/wiki/Cat'>cats</a></p>
  <p><a href='https://en.wikipedia.org/wiki/Dog'>dogs</a></p>
")

html |> 
  html_elements("p") |> 
  html_element("a") |> 
  html_attr("href")
#> [1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"

        html_attr() always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.
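For example, a numeric attribute comes back as a character vector, and readr's parse_number() is one convenient way to finish the job. A small sketch with made-up markup:

html <- minimal_html("
  <img src='a.png' width='100'>
  <img src='b.png' width='250'>
")

# html_attr() returns strings; parse_number() converts them to numbers
html |>
  html_elements("img") |>
  html_attr("width") |>
  parse_number()
#> [1] 100 250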

24.4.4 Tables

        If you’re lucky, your data will be already stored in an HTML table, and it’ll be a matter of just reading it from that table. It’s usually straightforward to recognize a table in your browser: it’ll have a rectangular structure of rows and columns, and you can copy and paste it into a tool like Excel.


        HTML tables are built up from four main elements: <table>, <tr> (table row), <th> (table heading), and <td> (table data). Here’s a simple HTML table with two columns and three rows:

html <- minimal_html("
  <table class='mytable'>
    <tr><th>x</th>   <th>y</th></tr>
    <tr><td>1.5</td> <td>2.7</td></tr>
    <tr><td>4.9</td> <td>1.3</td></tr>
    <tr><td>7.2</td> <td>8.1</td></tr>
  </table>
  ")

        rvest provides a function that knows how to read this sort of data: html_table(). It returns a list containing one tibble for each table found on the page. Use html_element() to identify the table you want to extract:

html |> 
  html_element(".mytable") |> 
  html_table()
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1   1.5   2.7
#> 2   4.9   1.3
#> 3   7.2   8.1

        Note that x and y have automatically been converted to numbers. This automatic conversion doesn’t always work, so in more complex scenarios you may want to turn it off with convert = FALSE and then do your own conversion.
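Here is what that looks like for the same table; with convert = FALSE every cell comes back as a string, ready for your own post-processing:

html |> 
  html_element(".mytable") |> 
  html_table(convert = FALSE)
# x and y are now character columns (<chr>) rather than doubles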

24.5 Finding the right selectors

        Figuring out the selector you need for your data is typically the hardest part of the problem. You’ll often need to do some experimenting to find a selector that is both specific (i.e. it doesn’t select things you don’t care about) and sensitive (i.e. it does select everything you care about). Lots of trial and error is a normal part of the process! There are two main tools that are available to help you with this process: SelectorGadget and your browser’s developer tools.


        SelectorGadget is a javascript bookmarklet that automatically generates CSS selectors based on the positive and negative examples that you provide. It doesn’t always work, but when it does, it’s magic! You can learn how to install and use SelectorGadget either by reading https://rvest.tidyverse.org/articles/selectorgadget.html or watching Mine’s video at https://www.youtube.com/watch?v=PetWV5g1Xsc.


        Every modern browser comes with some toolkit for developers, but we recommend Chrome, even if it isn’t your regular browser: its web developer tools are some of the best and they’re immediately available. Right click on an element on the page and click Inspect. This will open an expandable view of the complete HTML page, centered on the element that you just clicked. You can use this to explore the page and get a sense of what selectors might work. Pay particular attention to the class and id attributes, since these are often used to form the visual structure of the page, and hence make for good tools to extract the data that you’re looking for.


        Inside the Elements view, you can also right click on an element and choose Copy as Selector to generate a selector that will uniquely identify the element of interest.


        If either SelectorGadget or Chrome DevTools have generated a CSS selector that you don’t understand, try Selectors Explained which translates CSS selectors into plain English. If you find yourself doing this a lot, you might want to learn more about CSS selectors generally. We recommend starting with the fun CSS dinner tutorial and then referring to the MDN web docs.

24.6 Putting it all together

        Let’s put this all together to scrape some websites. There’s some risk that these examples may no longer work when you run them — that’s the fundamental challenge of web scraping; if the structure of the site changes, then you’ll have to change your scraping code.

24.6.1 StarWars

        rvest includes a very simple example in vignette("starwars"). This is a simple page with minimal HTML so it’s a good place to start. I’d encourage you to navigate to that page now and use “Inspect Element” to inspect one of the headings that’s the title of a Star Wars movie. Use the keyboard or mouse to explore the hierarchy of the HTML and see if you can get a sense of the shared structure used by each movie.


        You should be able to see that each movie has a shared structure that looks like this:

<section>
  <h2 data-id="1">The Phantom Menace</h2>
  <p>Released: 1999-05-19</p>
  <p>Director: <span class="director">George Lucas</span></p>

  <div class="crawl">
    <p>...</p>
    <p>...</p>
    <p>...</p>
  </div>
</section>

        Our goal is to turn this data into a 7 row data frame with variables title, year, director, and intro. We’ll start by reading the HTML and extracting all the <section> elements:

url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

section <- html |> html_elements("section")
section
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased:  ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-2 ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleas ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 20 ...
        +

        This retrieves seven elements matching the seven movies found on that page, suggesting that using section as a selector is good. Extracting the individual elements is straightforward since the data is always found in the text. It’s just a matter of finding the right selector:

section |> html_element("h2") |> html_text2()
#> [1] "The Phantom Menace"      "Attack of the Clones"   
#> [3] "Revenge of the Sith"     "A New Hope"             
#> [5] "The Empire Strikes Back" "Return of the Jedi"     
#> [7] "The Force Awakens"

section |> html_element(".director") |> html_text2()
#> [1] "George Lucas"     "George Lucas"     "George Lucas"    
#> [4] "George Lucas"     "Irvin Kershner"   "Richard Marquand"
#> [7] "J. J. Abrams"

        Once we’ve done that for each component, we can wrap all the results up into a tibble:

tibble(
  title = section |> 
    html_element("h2") |> 
    html_text2(),
  released = section |> 
    html_element("p") |> 
    html_text2() |> 
    str_remove("Released: ") |> 
    parse_date(),
  director = section |> 
    html_element(".director") |> 
    html_text2(),
  intro = section |> 
    html_element(".crawl") |> 
    html_text2()
)
#> # A tibble: 7 × 4
#>   title                   released   director         intro                  
#>   <chr>                   <date>     <chr>            <chr>                  
#> 1 The Phantom Menace      1999-05-19 George Lucas     "Turmoil has engulfed …
#> 2 Attack of the Clones    2002-05-16 George Lucas     "There is unrest in th…
#> 3 Revenge of the Sith     2005-05-19 George Lucas     "War! The Republic is …
#> 4 A New Hope              1977-05-25 George Lucas     "It is a period of civ…
#> 5 The Empire Strikes Back 1980-05-17 Irvin Kershner   "It is a dark time for…
#> 6 Return of the Jedi      1983-05-25 Richard Marquand "Luke Skywalker has re…
#> # ℹ 1 more row

        We did a little more processing of released to get a variable that will be easy to use later in our analysis.

24.6.2 IMDB top films

        For our next task we’ll tackle something a little trickier, extracting the top 250 movies from the internet movie database (IMDb). At the time we wrote this chapter, the page looked like Figura 24.1.

[Figura 24.1: Screenshot of the IMDb top movies web page taken on 2022-12-05. The screenshot shows a table with columns "Rank and Title", "IMDb Rating", and "Your Rating". Nine movies out of the top 250 are shown; the top five are The Shawshank Redemption, The Godfather, The Dark Knight, The Godfather: Part II, and 12 Angry Men.]

        This data has a clear tabular structure so it’s worth starting with html_table():

url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)

table <- html |> 
  html_element("table") |> 
  html_table()
table
#> # A tibble: 250 × 5
#>   ``    `Rank & Title`                    `IMDb Rating` `Your Rating`   ``   
#>   <lgl> <chr>                                     <dbl> <chr>           <lgl>
#> 1 NA    "1.\n      The Shawshank Redempt…           9.2 "12345678910\n… NA   
#> 2 NA    "2.\n      The Godfather\n      …           9.1 "12345678910\n… NA   
#> 3 NA    "3.\n      The Godfather: Part I…           9   "12345678910\n… NA   
#> 4 NA    "4.\n      The Dark Knight\n    …           9   "12345678910\n… NA   
#> 5 NA    "5.\n      12 Angry Men\n       …           8.9 "12345678910\n… NA   
#> 6 NA    "6.\n      Schindler's List\n   …           8.9 "12345678910\n… NA   
#> # ℹ 244 more rows

        This includes a few empty columns, but overall does a good job of capturing the information from the table. However, we need to do some more processing to make it easier to use. First, we’ll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title. We will do this with select() (instead of rename()) to do the renaming and selecting of just these two columns in one step. Then we’ll remove the new lines and extra spaces, and then apply separate_wider_regex() (from Seção 15.3.4) to pull out the title, year, and rank into their own variables.

ratings <- table |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |> 
  mutate(
    rank_title_year = str_replace_all(rank_title_year, "\n +", " ")
  ) |> 
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )
ratings
#> # A tibble: 250 × 4
#>   rank  title                    year  rating
#>   <chr> <chr>                    <chr>  <dbl>
#> 1 1     The Shawshank Redemption 1994     9.2
#> 2 2     The Godfather            1972     9.1
#> 3 3     The Godfather: Part II   1974     9  
#> 4 4     The Dark Knight          2008     9  
#> 5 5     12 Angry Men             1957     8.9
#> 6 6     Schindler's List         1993     8.9
#> # ℹ 244 more rows

        Even in this case where most of the data comes from table cells, it’s still worth looking at the raw HTML. If you do so, you’ll discover that we can add a little extra data by using one of the attributes. This is one of the reasons it’s worth spending a little time spelunking the source of the page; you might find extra data, or might find a parsing route that’s slightly easier.

html |> 
  html_elements("td strong") |> 
  head() |> 
  html_attr("title")
#> [1] "9.2 based on 2,536,415 user ratings"
#> [2] "9.1 based on 1,745,675 user ratings"
#> [3] "9.0 based on 1,211,032 user ratings"
#> [4] "9.0 based on 2,486,931 user ratings"
#> [5] "8.9 based on 749,563 user ratings"  
#> [6] "8.9 based on 1,295,705 user ratings"

        We can combine this with the tabular data and again apply separate_wider_regex() to extract out the bit of data we care about:

ratings |>
  mutate(
    rating_n = html |> html_elements("td strong") |> html_attr("title")
  ) |> 
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |> 
  mutate(
    number = parse_number(number)
  )
#> # A tibble: 250 × 5
#>   rank  title                    year  rating  number
#>   <chr> <chr>                    <chr>  <dbl>   <dbl>
#> 1 1     The Shawshank Redemption 1994     9.2 2536415
#> 2 2     The Godfather            1972     9.1 1745675
#> 3 3     The Godfather: Part II   1974     9   1211032
#> 4 4     The Dark Knight          2008     9   2486931
#> 5 5     12 Angry Men             1957     8.9  749563
#> 6 6     Schindler's List         1993     8.9 1295705
#> # ℹ 244 more rows

24.7 Dynamic sites

        So far we have focused on websites where html_elements() returns what you see in the browser and discussed how to parse what it returns and how to organize that information in tidy data frames. From time-to-time, however, you’ll hit a site where html_elements() and friends don’t return anything like what you see in the browser. In many cases, that’s because you’re trying to scrape a website that dynamically generates the content of the page with javascript. This doesn’t currently work with rvest, because rvest downloads the raw HTML and doesn’t run any javascript.


        It’s still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all javascript. This functionality is not available at the time of writing, but it’s something we’re actively working on and might be available by the time you read this. It uses the chromote package which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons. Check out the rvest website for more details.
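To give you a feel for what this involves, here is a rough sketch that drives Chrome directly with chromote and then hands the rendered HTML to rvest. Treat it as an illustration of the idea, not the polished rvest interface described above:

library(chromote)

b <- ChromoteSession$new()
b$Page$navigate("https://example.com")
b$Page$loadEventFired()  # block until the page (and its javascript) has loaded

# Ask the browser for the rendered HTML, then parse it with rvest as usual
rendered <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
html <- rvest::read_html(rendered)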

24.8 Summary

        In this chapter, you’ve learned about the why, the why not, and the how of scraping data from web pages. First, you’ve learned about the basics of HTML and using CSS selectors to refer to specific elements, then you’ve learned about using the rvest package to get data out of HTML into R. We then demonstrated web scraping with two case studies: a simpler scenario on scraping data on StarWars films from the rvest package website and a more complex scenario on scraping the top 250 films from IMDB.


Technical details of scraping data off the web can be complex, particularly when dealing with dynamic sites; legal and ethical considerations, however, can be even more complex. It’s important for you to educate yourself about both of these before setting out to scrape data.


        This brings us to the end of the import part of the book where you’ve learned techniques to get data from where it lives (spreadsheets, databases, JSON files, and web sites) into a tidy form in R. Now it’s time to turn our sights to a new topic: making the most of R as a programming language.


1. And many popular APIs already have CRAN packages that wrap them, so start with a little research first!

2. Obviously we’re not lawyers, and this is not legal advice. But this is the best summary we can give having read a bunch about this topic.

3. One example of an article on the OkCupid study was published by Wired, https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science.

4. A number of tags (including <p> and <li>) don’t require end tags, but we think it’s best to include them because it makes seeing the structure of the HTML a little easier.

5. This class comes from the xml2 package. xml2 is a low-level package that rvest builds on top of.

6. rvest also provides html_text() but you should almost always use html_text2() since it does a better job of converting nested HTML to text.

diff --git a/whole-game.html b/whole-game.html

In Capítulo 1, you'll dive into visualization, learning the basic structure of a ggplot2 plot and powerful techniques for turning data into plots.

• Visualization alone is usually not enough, so in Capítulo 3 you'll learn the key verbs that let you select important variables, filter out key observations, create new variables, and compute summaries.

• In Capítulo 5, you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modeling easier. You'll learn the underlying principles of tidy data and how to get your data into this form.

• Before you can transform and visualize your data, you first need to get it into R. In Capítulo 7, you'll learn the basics of importing .csv files into R.

Interspersed among these chapters are four others that focus on your R workflow. In Capítulo 2, Capítulo 4, and Capítulo 6, you'll learn good workflow practices for writing and organizing your R code. These will set you up for success in the long run, giving you the tools to stay organized when you tackle real projects. Finally, in Capítulo 8, you'll learn how to get help and keep learning.

diff --git a/workflow-basics.html b/workflow-basics.html

algumas.pessoas.usam.pontos
E_aLgumas.Pessoas_nAoUsamConvencao

We'll come back to names again when we discuss code style in Capítulo 4.

You can see the contents of an object (let's call this inspecting it) by typing its name:

diff --git a/workflow-help.html b/workflow-help.html

8  Workflow: getting help

        This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help and to help you keep learning.

8.1 Google is your friend

        If you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Additionally, adding package names like “tidyverse” or “ggplot2” will help narrow down the results to code that will feel more familiar to you as well, e.g., “how to make a boxplot in R” vs. “how to make a boxplot in R with ggplot2”. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn’t in English, run Sys.setenv(LANGUAGE = "en") and re-run the code; you’re more likely to find help for English error messages.)


If Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [R] in your query to restrict your search to questions and answers that use R.

8.2 Making a reprex

        If your googling doesn’t find anything useful, it’s a really good idea to prepare a reprex, short for minimal reproducible example. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:

• First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any library() calls and create all necessary objects. The easiest way to make sure you’ve done this is by using the reprex package.

• Second, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you’re facing in real life or even using built-in data.

        That sounds like a lot of work! And it can be, but it has a great payoff:

• 80% of the time, creating an excellent reprex reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.

• The other 20% of the time, you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!

        When creating a reprex by hand, it’s easy to accidentally miss something, meaning your code can’t be run on someone else’s computer. Avoid this problem by using the reprex package, which is installed as part of the tidyverse. Let’s say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):

y <- 1:4
mean(y)

        Then call reprex(), where the default output is formatted for GitHub:

reprex::reprex()

        A nicely rendered HTML preview will display in RStudio’s Viewer (if you’re in RStudio) or your default browser otherwise. The reprex is automatically copied to your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):

``` r
y <- 1:4
mean(y)
#> [1] 2.5
```

This text is formatted in a special way, called Markdown, which can be pasted into sites like Stack Overflow or GitHub, and they will automatically render it to look like code. Here’s what that Markdown would look like rendered on GitHub:

y <- 1:4
mean(y)
#> [1] 2.5

        Anyone else can copy, paste, and run this immediately.


        There are three things you need to include to make your example reproducible: required packages, data, and code.

1. Packages should be loaded at the top of the script so it’s easy to see which ones the example needs. This is a good time to check that you’re using the latest version of each package; you may have discovered a bug that’s been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run tidyverse_update().

2. The easiest way to include data is to use dput() to generate the R code needed to recreate it. For example, to recreate the mtcars dataset in R, perform the following steps:

   1. Run dput(mtcars) in R.
   2. Copy the output.
   3. In your reprex, type mtcars <-, then paste.

   Try to use the smallest subset of your data that still reveals the problem (see the sketch after this list).

3. Spend a little bit of time ensuring that your code is easy for others to read:

   • Make sure you’ve used spaces and your variable names are concise yet informative.

   • Use comments to indicate where your problem lies.

   • Do your best to remove everything that is not related to the problem.

   The shorter your code is, the easier it is to understand and the easier it is to fix.
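As promised above, a quick sketch of the dput() round trip:

# Generate R code that recreates a small subset of your data...
dput(head(mtcars, 3))

# ...then, in your reprex, paste that output after an assignment, e.g.:
# mtcars_small <- structure(list(mpg = c(21, 21, 22.8), ...), ...)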

        Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script.


        Creating reprexes is not trivial, and it will take some practice to learn to create good, truly minimal reprexes. However, learning to ask questions that include the code, and investing the time to make it reproducible will continue to pay off as you learn and master R.

8.3 Investing in yourself

        You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the tidyverse blog. To keep up with the R community more broadly, we recommend reading R Weekly: it’s a community effort to aggregate the most interesting news in the R community each week.

8.4 Summary

This chapter concludes the Whole Game part of the book. You’ve now seen the most important parts of the data science process: visualization, transformation, tidying, and importing. Now you’ve got a holistic view of the whole process, and next we start to get into the details of the individual pieces.

        The next part of the book, Visualize, does a deeper dive into the grammar of graphics and creating data visualizations with ggplot2, showcases how to use the tools you’ve learned so far to conduct exploratory data analysis, and introduces good practices for creating plots for communication.

        6  Workflow: scripts and projects


        This chapter will introduce you to two essential tools for organizing your code: scripts and projects.

6.1 Scripts

So far, you have used the console to run code. That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. To give yourself more room to work, use the script editor. Open it up by clicking the File menu, selecting New File, then R Script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you'll see four panes, as in Figure 6.1. The script editor is a great place to experiment with your code. When you want to change something, you don't have to re-type the whole thing; you can just edit the script and re-run it. And once you have written code that works and does what you want, you can save it as a script file to easily return to later.

Figure 6.1: Opening the script editor adds a new pane at the top-left of the IDE.

6.1.1 Running code

        The script editor is an excellent place for building complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below.

library(dplyr)
library(nycflights13)

not_cancelled <- flights |> 
  filter(!is.na(dep_delay)█, !is.na(arr_delay))

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(mean = mean(dep_delay))

        If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates not_cancelled. It will also move the cursor to the following statement (beginning with not_cancelled |>). That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.


        Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you’ve captured all the important parts of your code in the script.


We recommend you always start your script with the packages you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include install.packages() in a script you share. It's inconsiderate to hand off a script that will change something on someone else's computer if they're not careful!
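For example, the top of a shared script might look like this (the package choices here are just illustrative):

# Load every package the script needs up front, so readers can see at a
# glance what they must have installed
library(tidyverse)
library(nycflights13)

# Never ship this line in a shared script; leave installing to the reader:
# install.packages("nycflights13")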


        When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won’t even think about it.

6.1.2 RStudio diagnostics

        In the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:

Script editor with the script x y <- 10. A red X in the sidebar indicates a syntax error, which is also highlighted with a red squiggly line.

        Hover over the cross to see what the problem is:

Script editor with the script x y <- 10. Hovering over the red X shows a text box with the text "unexpected token y" and "unexpected token <-".

        RStudio will also let you know about potential problems:

Script editor with the script 3 == NA. A yellow exclamation mark in the sidebar indicates a potential problem; hovering over it shows a text box suggesting you use is.na to check whether the expression evaluates to NA.

6.1.3 Saving and naming

RStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open. Nevertheless, it's a good idea to avoid Untitled1, Untitled2, Untitled3, and so on, and instead save your scripts with informative names.


        It might be tempting to name your files code.R or myscript.R, but you should think a bit harder before choosing a name for your file. Three important principles for file naming are as follows:

1. File names should be machine readable: avoid spaces, symbols, and special characters. Don't rely on case sensitivity to distinguish files.

2. File names should be human readable: use file names to describe what's in the file.

3. File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.

        For example, suppose you have the following files in a project folder.

alternative model.R
code for exploratory analysis.r
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
run-first.r
temp.txt

There are a variety of problems here: it's hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. FinalReport¹), and some names don't describe their contents (run-first and temp).


        Here’s a better way of naming and organizing the same set of files:

01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
report-2022-03-20.qmd
report-2022-04-02.qmd
report-draft-notes.txt

Numbering the key scripts makes it obvious in which order to run them, and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and temp is renamed to report-draft-notes to better describe its contents. If you have a lot of files in a directory, taking organization one step further and placing different types of files (scripts, figures, etc.) in different directories is recommended.

6.2 Projects

        One day, you will need to quit R, go do something else, and return to your analysis later. One day, you will be working on multiple analyses simultaneously and you want to keep them separate. One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.


To handle these real-life situations, you need to make two decisions:

1. What is the source of truth? What will you save as your lasting record of what happened?

2. Where does your analysis live?

6.2.1 What is the source of truth?

        As a beginner, it’s okay to rely on your current Environment to contain all the objects you have created throughout your analysis. However, to make it easier to work on larger projects or collaborate with others, your source of truth should be the R scripts. With your R scripts (and your data files), you can recreate the environment. With only your environment, it’s much harder to recreate your R scripts: you’ll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you’ll have to carefully mine your R history.


To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running usethis::use_blank_slate()² or by mimicking the options shown in Figure 6.2. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time, nor will the objects you created or the datasets you read be available to use. But this short-term pain saves you long-term agony, because it forces you to capture all important procedures in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your environment, not the calculation itself in your code.
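If you'd rather configure this from the console than through the options dialog, the usethis call mentioned above does it in one line:

# One-time setup: never save or restore the workspace between sessions
usethis::use_blank_slate()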

Figure 6.2: Copy these options in your RStudio options to always start your RStudio session with a clean slate: uncheck "Restore .RData into workspace at startup" and set "Save workspace to .RData on exit" to "Never".

        There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:

1. Press Cmd/Ctrl + Shift + 0/F10 to restart R.

2. Press Cmd/Ctrl + Shift + S to re-run the current script.

        We collectively use this pattern hundreds of times a week.


        Alternatively, if you don’t use keyboard shortcuts, you can go to Session > Restart R and then highlight and re-run your current script.

RStudio server

        If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a clean slate.


6.2.2 Where does your analysis live?

        R has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:

The Console tab shows the current working directory as ~/Documents/r4ds.

        And you can print this out in R code by running getwd():

getwd()
#> [1] "/Users/hadley/Documents/r4ds"

        In this R session, the current working directory (think of it as “home”) is in hadley’s Documents folder, in a subfolder called r4ds. This code will return a different result when you run it, because your computer has a different directory structure than Hadley’s!


        As a beginning R user, it’s OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you’re seven chapters into this book, and you’re no longer a beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, set R’s working directory to the associated directory.


        You can set the working directory from within R but we do not recommend it:

setwd("/path/to/my/CoolProject")

        There’s a better way; a way that also puts you on the path to managing your R work like an expert. That way is the RStudio project.

6.2.3 RStudio projects

Keeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via projects. Let's make a project for you to use while you're working through the rest of this book. Click File > New Project, then follow the steps shown in Figure 6.3.

Figure 6.3: To create a new project: (top) first click New Directory, then (middle) click New Project, then (bottom) fill in the directory (project) name, choose a good subdirectory for its home, and click Create Project.

        Call your project r4ds and think carefully about which subdirectory you put the project in. If you don’t store it somewhere sensible, it will be hard to find it in the future!


        Once this process is complete, you’ll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:

getwd()
#> [1] "/Users/hadley/Documents/r4ds"

        Now enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Then, create a new folder called “data”. You can do this by clicking on the “New Folder” button in the Files pane in RStudio. Finally, run the complete script which will save a PNG and CSV file into your project directory. Don’t worry about the details, you’ll learn them later in the book.

library(tidyverse)

ggplot(diamonds, aes(x = carat, y = price)) + 
  geom_hex()
ggsave("diamonds.png")

write_csv(diamonds, "data/diamonds.csv")

        Quit RStudio. Inspect the folder associated with your project — notice the .Rproj file. Double-click that file to re-open the project. Notice you get back to where you left off: it’s the same working directory and command history, and all the files you were working on are still open. Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you’re starting with a clean slate.


        In your favorite OS-specific way, search your computer for diamonds.png and you will find the PNG (no surprise) but also the script that created it (diamonds.R). This is a huge win! One day, you will want to remake a figure or just understand where it came from. If you rigorously save figures to files with R code and never with the mouse or the clipboard, you will be able to reproduce old work with ease!

6.2.4 Relative and absolute paths

        Once you’re inside a project, you should only ever use relative paths not absolute paths. What’s the difference? A relative path is relative to the working directory, i.e. the project’s home. When Hadley wrote data/diamonds.csv above it was a shortcut for /Users/hadley/Documents/r4ds/data/diamonds.csv. But importantly, if Mine ran this code on her computer, it would point to /Users/Mine/Documents/r4ds/data/diamonds.csv. This is why relative paths are important: they’ll work regardless of where the R project folder ends up.


        Absolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g., C:) or two backslashes (e.g., \\servername) and on Mac/Linux they start with a slash “/” (e.g., /users/hadley). You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
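Here's a sketch of the difference in practice, reusing the diamonds.csv example from above:

library(readr)

# Relative path: resolved against the project's working directory, so it
# works for anyone who has a copy of the project
diamonds <- read_csv("data/diamonds.csv")

# Absolute path: tied to one specific computer -- avoid in scripts
# diamonds <- read_csv("/Users/hadley/Documents/r4ds/data/diamonds.csv")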


        There’s another important difference between operating systems: how you separate the components of the path. Mac and Linux uses slashes (e.g., data/diamonds.csv) and Windows uses backslashes (e.g., data\diamonds.csv). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.

6.3 Exercises

1. Go to the RStudio Tips Twitter account, https://twitter.com/rstudiotips, and find one tip that looks interesting. Practice using it!

2. What other common mistakes will RStudio diagnostics report? Read https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics to find out.

6.4 Summary

In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up-front organization can save you a bunch of time down the road.


        In summary, scripts and projects give you a solid workflow that will serve you well in the future:

• Create one RStudio project for each data analysis project.

• Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.

• Only ever use relative paths, not absolute paths.

        Then everything you need is in one place and cleanly separated from all the other projects that you are working on.


        So far, we’ve worked with datasets bundled inside of R packages. This makes it easier to get some practice on pre-prepared data, but obviously your data won’t be available in this way. So in the next chapter, you’re going to learn how load data from disk into your R session using the readr package.

1. Not to mention that you're tempting fate by using "final" in the name 😆 The comic Piled Higher and Deeper has a fun strip on this.↩︎

2. If you don't have usethis installed, you can install it with install.packages("usethis").↩︎

        4  Workflow: code style


        Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer, it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work and is particularly important if you need to get help from someone else. This chapter will introduce the most important points of the tidyverse style guide, which is used throughout this book.


Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the styler package by Lorenz Walthert. Once you've installed it with install.packages("styler"), an easy way to use it is via RStudio's command palette. The command palette lets you use any built-in RStudio command and many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type "styler" to see all the shortcuts offered by styler. Figure 4.1 shows the results.

Figure 4.1: RStudio's command palette makes it easy to access every RStudio command using only the keyboard; typing "styler" shows the four styling tools provided by the package.
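You can also call styler directly from the console; a minimal sketch, assuming styler is installed (the file name below is hypothetical):

# Restyle an entire script in place, following the tidyverse style guide
styler::style_file("01-load-data.R")

# Or restyle a snippet supplied as text
styler::style_text("z<-( a + b ) ^ 2/d")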

        We’ll use the tidyverse and nycflights13 packages for code examples in this chapter.

library(tidyverse)
library(nycflights13)

4.1 Names

We talked briefly about names in Section 2.3. Remember that variable names (those created by <- and those created by mutate()) should use only lowercase letters, numbers, and _. Use _ to separate words within a name.

# Strive for:
short_flights <- flights |> filter(air_time < 60)

# Avoid:
SHORTFLIGHTS <- flights |> filter(air_time < 60)

        As a general rule of thumb, it’s better to prefer long, descriptive names that are easy to understand rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but it can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.


        If you have a bunch of names for related things, do your best to be consistent. It’s easy for inconsistencies to arise when you forget a previous convention, so don’t feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme, you’re better off giving them a common prefix rather than a common suffix because autocomplete works best on the start of a variable.
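For example (with hypothetical names), compare how these two schemes behave under autocomplete:

# Strive for: a common prefix keeps the family together under autocomplete
delay_mean <- mean(flights$dep_delay, na.rm = TRUE)
delay_max  <- max(flights$dep_delay, na.rm = TRUE)

# Avoid: a common suffix hides the relationship when you start typing
mean_delay <- mean(flights$dep_delay, na.rm = TRUE)
max_delay  <- max(flights$dep_delay, na.rm = TRUE)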

4.2 Spaces

        Put spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, <, …), and around the assignment operator (<-).

# Strive for
z <- (a + b)^2 / d

# Avoid
z<-( a + b ) ^ 2/d

        Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in standard English.

# Strive for
mean(x, na.rm = TRUE)

# Avoid
mean (x ,na.rm=TRUE)

        It’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in mutate(), you might want to add spaces so that all the = line up.1 This makes it easier to skim the code.

flights |> 
  mutate(
    speed      = distance / air_time,
    dep_hour   = dep_time %/% 100,
    dep_minute = dep_time %%  100
  )

4.3 Pipes

        |> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 10,000 ft view by skimming the verbs on the left-hand side.

# Strive for
flights |>  
  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
  count(dest)

# Avoid
flights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)

        If the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.

# Strive for
flights |>  
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# Avoid
flights |>
  group_by(
    tailnum
  ) |> 
  summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())

After the first step of the pipeline, indent each line by two spaces. RStudio will automatically put the spaces in for you after a line break following a |>. If you're putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.

# Strive for
flights |>  
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# Avoid
flights|>
  group_by(tailnum) |> 
  summarize(
             delay = mean(arr_delay, na.rm = TRUE), 
             n = n()
           )

# Avoid
flights|>
  group_by(tailnum) |> 
  summarize(
  delay = mean(arr_delay, na.rm = TRUE), 
  n = n()
  )

        It’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.

# This fits compactly on one line
df |> mutate(y = x + 1)

# While this takes up 4x as many lines, it's easily extended to 
# more variables and more steps in the future
df |> 
  mutate(
    y = x + 1
  )

Finally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into what's happening and make it easier to check that intermediate results are as expected. Whenever you can give something an informative name, you should, for example when you fundamentally change the structure of the data, e.g., after pivoting or summarizing. Don't expect to get it right the first time! This means breaking up long pipelines if there are intermediate states that can get good names.
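For example, here's a sketch of splitting a pipeline at a natural naming point, right after a summarize() changes the data's structure:

# Name the intermediate result where the structure of the data changes...
daily_delays <- flights |> 
  group_by(year, month, day) |> 
  summarize(delay = mean(dep_delay, na.rm = TRUE))

# ...then continue the analysis from the named object
daily_delays |> 
  filter(delay > 30)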

4.4 ggplot2

        The same basic rules that apply to the pipe also apply to ggplot2; just treat + the same way as |>.

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE)
  ) |> 
  ggplot(aes(x = month, y = delay)) +
  geom_point() + 
  geom_line()

        Again, if you can’t fit all of the arguments to a function on to a single line, put each argument on its own line:

flights |> 
  group_by(dest) |> 
  summarize(
    distance = mean(distance),
    speed = mean(distance / air_time, na.rm = TRUE)
  ) |> 
  ggplot(aes(x = distance, y = speed)) +
  geom_smooth(
    method = "loess",
    span = 0.5,
    se = FALSE, 
    color = "white", 
    linewidth = 4
  ) +
  geom_point()

        Watch for the transition from |> to +. We wish this transition wasn’t necessary, but unfortunately, ggplot2 was written before the pipe was discovered.

4.5 Sectioning comments

        As your scripts get longer, you can use sectioning comments to break up your file into manageable pieces:

# Load data --------------------------------------

# Plot data --------------------------------------

RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in Figure 4.2.

Figure 4.2: After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor.

4.6 Exercises

1. Restyle the following pipelines following the guidelines above.

   flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),
   delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)

   flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>
   0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(
   arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)

4.7 Summary

        In this chapter, you’ve learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.


        In the next chapter, we switch back to data science tools, learning about tidy data. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data.

        1. Since dep_time is in HMM or HHMM format, we use integer division (%/%) to get hour and remainder (also known as modulo, %%) to get minute.↩︎
