10_factors.Rmd

# (PART) Data analysis 2 {-} 

# Be the boss of your factors {#factors-boss}

```{r include = FALSE}
source("common.R")
```

<!--Original content: https://stat545.com/block029_factors.html-->

## Factors: where they fit in

We've spent a lot of time working with big, beautiful data frames, like the Gapminder data. But we also need to manage the individual variables housed within.

Factors are the variable type that useRs love to hate. It is how we store truly categorical information in R. The values a factor can take on are called the **levels**. For example, the levels of the factor `continent` in Gapminder are are "Africa", "Americas", etc. and this is what's usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like "male/female" and "control/treated". But *never ever ever* forget that, under the hood, R is really storing integer codes 1, 2, 3, etc.

This [Janus][wiki-janus]-like nature of factors means they are rich with booby traps for the unsuspecting but they are a necessary evil. I recommend you learn how to be the boss of your factors. The pros far outweigh the cons. Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit.

**The worst kind of factor is the stealth factor.** The variable that you think of as character, but that is actually a factor (numeric!!). This is a classic R gotcha. Check your variable types explicitly when things seem weird. It happens to the best of us.

Where do stealth factors come from? Base R has a burning desire to turn character information into factor. The happens most commonly at data import via `read.table()` and friends. But `data.frame()` and other functions are also eager to convert character to factor. To shut this down, use `stringsAsFactors = FALSE` in `read.table()` and `data.frame()` or -- even better -- **use the tidyverse**! For data import, use `readr::read_csv()`, `readr::read_tsv()`, etc. For data frame creation, use `tibble::tibble()`. And so on.

Good articles about how the factor fiasco came to be:

* [stringsAsFactors: An unauthorized biography][bio-strings-as-factors] by Roger Peng
* [stringsAsFactors = \<sigh\>][blog-strings-as-factors] by Thomas Lumley

## The forcats package

[forcats][forcats-web] is a core package in the tidyverse. It is installed via `install.packages("tidyverse")`, and loaded with `library(tidyverse)`. You can also install via `install.packages("forcats")`and load it yourself separately as needed via `library(forcats)`. Main functions start with `fct_`. There really is no coherent family of base functions that forcats replaces -- that's why it's such a welcome addition.

Currently this lesson will be mostly code vs prose. See the previous lesson for more discussion during the transition.

## Load forcats and gapminder

I choose to load the tidyverse, which will load forcats, among other packages we use incidentally below. 

```{r start_factors}
library(tidyverse)
```

Also load gapminder.

```{r message = FALSE, warning = FALSE}
library(gapminder)
```

## Factor inspection

Get to know your factor before you start touching it! It's polite. Let's use `gapminder$continent` as our example.

```{r}
str(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
class(gapminder$continent)
```

To get a frequency table as a tibble, from a tibble, use `dplyr::count()`. To get similar from a free-range factor, use `forcats::fct_count()`.

```{r}
gapminder %>% 
  count(continent)
fct_count(gapminder$continent)
```

## Dropping unused levels

Just because you drop all the rows corresponding to a specific factor level, the levels of the factor itself do not change. Sometimes all these unused levels can come back to haunt you later, e.g., in figure legends.

Watch what happens to the levels of `country` (= nothing) when we filter Gapminder to a handful of countries.

```{r}
nlevels(gapminder$country)
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
  filter(country %in% h_countries)
nlevels(h_gap$country)
```

Even though `h_gap` only has data for a handful of countries, we are still schlepping around all `r nlevels(gapminder$country)` levels from the original `gapminder` tibble.

How can you get rid of them? The base function `droplevels()` operates on all the factors in a data frame or on a single factor. The function `forcats::fct_drop()` operates on a factor.

```{r}
h_gap_dropped <- h_gap %>% 
  droplevels()
nlevels(h_gap_dropped$country)

## use forcats::fct_drop() on a free-range factor
h_gap$country %>%
  fct_drop() %>%
  levels()
```

__Exercise:__ Filter the gapminder data down to rows where population is less than a quarter of a million, i.e. 250,000. Get rid of the unused factor levels for `country` and `continent` in different ways, such as:

* `droplevels()`
* `fct_drop()` inside `mutate()`
* `fct_dopr()` with `mutate_at()` or `mutate_if()`

```{r, eval = FALSE, include = FALSE}
low_pop <- gapminder %>% 
  filter(pop < 250000)

low_pop %>%
  droplevels() %>% 
  group_by(country, continent) %>% 
  summarize(max_pop = max(pop))

low_pop %>%
  mutate(country = fct_drop(country),
         continent = fct_drop(continent)) %>% 
  group_by(country, continent) %>% 
  summarize(max_pop = max(pop))
```

## Change order of the levels, principled {#reorder-factors}

By default, factor levels are ordered alphabetically. Which might as well be random, when you think about it! It is preferable to order the levels according to some principle:

* Frequency. Make the most common level the first and so on.
* Another variable. Order factor levels according to a summary statistic for another variable. __Example:__ order Gapminder countries by life expectancy.

First, let's order continent by frequency, forwards and backwards. This is often a great idea for tables and figures, esp. frequency barplots.

```{r}
## default order is alphabetical
gapminder$continent %>%
  levels()

## order by frequency
gapminder$continent %>% 
  fct_infreq() %>%
  levels()

## backwards!
gapminder$continent %>% 
  fct_infreq() %>%
  fct_rev() %>% 
  levels()
```

These two barcharts of frequency by continent differ only in the order of the continents. Which do you prefer?

```{r, fig.show = 'hold', out.width = '49%', echo = FALSE}
p <- ggplot(gapminder, aes(x = continent)) +
  geom_bar() +
  coord_flip()
p

gap_tmp <- gapminder %>% 
  mutate(continent = continent %>% fct_infreq() %>% fct_rev())

p <- ggplot(gap_tmp, aes(x = continent)) +
  geom_bar() +
  coord_flip()
p
```

Now we order `country` by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor according to a grouped summary. The factor is the grouping variable and the default summarizing function is `median()` but you can specify something else.

```{r}
## order countries by median life expectancy
fct_reorder(gapminder$country, gapminder$lifeExp) %>% 
  levels() %>% head()

## order accoring to minimum life exp instead of median
fct_reorder(gapminder$country, gapminder$lifeExp, min) %>% 
  levels() %>% head()

## backwards!
fct_reorder(gapminder$country, gapminder$lifeExp, .desc = TRUE) %>% 
  levels() %>% head()
```

Example of why we reorder factor levels: often makes plots much better! When a factor is mapped to x or y, it should almost always be reordered by the quantitative variable you are mapping to the other one.

```{r include = FALSE, eval = FALSE}
boxplot(Sepal.Width ~ Species, data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width), data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)
```

Compare the interpretability of these two plots of life expectancy in Asian countries in 2007. The *only difference* is the order of the `country` factor. Which one do you find easier to learn from?

```{r alpha-order-silly, fig.show = 'hold', out.width = '49%'}
gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")
ggplot(gap_asia_2007, aes(x = lifeExp, y = country)) + geom_point()
ggplot(gap_asia_2007, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()
```

Use `fct_reorder2()` when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. This way the legend appears in some order as the data! Contrast the legend on the left with the one on the right.

```{r legends-made-for-humans, fig.show = 'hold', out.width = '49%'}
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
  filter(country %in% h_countries) %>% 
  droplevels()
ggplot(h_gap, aes(x = year, y = lifeExp, color = country)) +
  geom_line()
ggplot(h_gap, aes(x = year, y = lifeExp,
                  color = fct_reorder2(country, year, lifeExp))) +
  geom_line() +
  labs(color = "country")
```

## Change order of the levels, "because I said so"

Sometimes you just want to hoist one or more levels to the front. Why? Because I said so. This resembles what we do when we move variables to the front with `dplyr::select(special_var, everything())`.

```{r}
h_gap$country %>% levels()
h_gap$country %>% fct_relevel("Romania", "Haiti") %>% levels()
```

This might be useful if you are preparing a report for, say, the Romanian government. The reason for always putting Romania first has nothing to do with the *data*, it is important for external reasons and you need a way to express this.

## Recode the levels

Sometimes you have better ideas about what certain levels should be. This is called recoding.

```{r}
i_gap <- gapminder %>% 
  filter(country %in% c("United States", "Sweden", "Australia")) %>% 
  droplevels()
i_gap$country %>% levels()
i_gap$country %>%
  fct_recode("USA" = "United States", "Oz" = "Australia") %>% levels()
```

__Exercise:__ Isolate the data for `"Australia"`, `"Korea, Dem. Rep."`, and `"Korea, Rep."` in the 2000x. Revalue the country factor levels to `"Oz"`, `"North Korea"`, and `"South Korea"`.

## Grow a factor

Let's create two data frames, each with data from two countries, dropping unused factor levels.

```{r}
df1 <- gapminder %>%
  filter(country %in% c("United States", "Mexico"), year > 2000) %>%
  droplevels()
df2 <- gapminder %>%
  filter(country %in% c("France", "Germany"), year > 2000) %>%
  droplevels()
```

The `country` factors in `df1` and `df2` have different levels.

```{r}
levels(df1$country)
levels(df2$country)
```

Can you just combine them?

```{r}
c(df1$country, df2$country)
```

Umm, no. That is wrong on many levels! Use `fct_c()` to do this.

```{r}
fct_c(df1$country, df2$country)
```

__Exercise:__ Explore how different forms of row binding work behave here, in terms of the `country` variable in the result.

```{r end_factors}
bind_rows(df1, df2)
rbind(df1, df2)
```


```{r links, child="links.md"}
```