05-tidy.Rmd

---
title: "Tidy"
output: github_document
---

## Setup

```{r setup}
library(tidyverse)
```


## Warm up

This is a toy dataset. Why is it messy?

```{r}
cases <- tribble(
  ~Country, ~"2011", ~"2012", ~"2013",
      "FR",    7000,    6900,    7000,
      "DE",    5800,    6000,    6200,
      "US",   15000,   14000,   13000
)

cases
```


* Can you explain the two different approaches to achieve the same tidy dataset?
* Why is this dataset tidy?

```{r}
cases %>% pivot_longer(cols = "2011":"2013")

cases %>% pivot_longer(cols = -Country)
```


## Data

Read a messy dataset, then have a look.

```{r}
gap_wide <- readr::read_csv('https://raw.githubusercontent.com/carpentries-incubator/open-science-with-r/gh-pages/data/gapminder_wide.csv')

gap_wide
```


* Pivot on the columns that `start_with("gdp")`. The result should be longer.
* How long is the input and output?
* What happened?

```{r}
gap_wide %>% 
  pivot_______(cols = starts_____("gdp"))
```


* Pivot on all columns except `continent` and `country`.
* Refer to the columns you want to pivot on using `starts_with()`
* Why is this longer?

```{r}
gap_wide %>% 
  pivot_longer(
    cols = ___________("gdp") | ___________("life") | ___________("pop")
  )
```


This alternative is identical but shorter to type.

* The result is still messy. Why?

```{r}
gap_wide %>% pivot_longer(cols = -continent:-country)
```


* Now also use `separate()` to separate the column `name` into two columns: "metric" and "year". Hint: Use the vector `c("metric", "year")`.
* We now achieved a tidy dataset. Why?

```{r}
tidy <- gap_wide %>% 
  pivot_longer(cols = -continent:-country) %>% 
  ________(col = name, into = c("______", "____"))

tidy
```


It's easy to mess things up back again with `unite()` and `pivot_wider()`.

```{r}
messy <- tidy %>% 
  unite("name", metric, year) %>% 
  pivot_wider()

messy

gap_wide
```

## Filling missing data with `complete()`

This toy dataset has implicit missing data in year 2000. Why?

```{r}
kelp <- tibble(
  year = c(1999, 1999, 2000, 2004, 2004),
  taxon = c("Agarum", "Saccharina", "Saccharina", "Agarum", "Saccharina"),
  abundance = c(1, 4, 5, 8, 2)
)

kelp
```


* Use `complete()` to make explicit the missing data in `year` and `taxon`.

```{r}
kelp %>% ________(year, _____)
```


Pretend in `year = 2000` you found no individual of `taxon` "Agarum".

You can use the argument `fill` to fill with `0` the missing data in `abundance`.

```{r}
kelp %>% complete(year, taxon, fill = list(abundance = 0))
```


Pretend you surveyed every year from 1999 to 2004 but found nothing in 2001-3.

```{r}
success_years <- kelp %>% 
  distinct(year) %>% 
  pull(year)

success_years
```

* Use `full_seq()` to produce the full sequence of the years you surveyed.

```{r}
_________ <- success_years %>% ________(period = 1)
all_years
```


* Use `all_years` to `complete()` your dataset with explicit missing values.

```{r}
kelp %>% 
  complete(year = _________, taxon = taxon)
```


You now want to study "Agarum" only. Here is your dataset.

```{r}
agarum <- kelp %>% 
  filter(taxon == "Agarum") %>% 
  complete(year = all_years)

agarum
```


* `fill()` the missing values of `taxon`.

```{r}
agarum %>% ____(taxon)
```

* Now also `replace_na()` in `abundance` with the value `0`. Hint: One solution uses `list(abundance = 0)`

```{r}
agarum %>% 
  fill(taxon) %>% 
  __________(list(abundance = _))
```

Bored? 

* See `?replace_na()` and find another solution with using `mutate()`.


***

# Take Aways

Data comes in many formats but R prefers just one: _tidy data_.

A data set is tidy if and only if:

1. Every variable is in its own column
2. Every observation is in its own row
3. Every value is in its own cell (which follows from the above)

What is a variable and an observation may depend on your immediate goal.