-
Notifications
You must be signed in to change notification settings - Fork 54
/
Copy pathout-type-stability.qmd
99 lines (71 loc) · 3.34 KB
/
out-type-stability.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
# Type-stability {#sec-out-type-stability}
```{r}
#| include = FALSE
source("common.R")
```
The less you need to know about a function's inputs to predict the type of its output, the better.
Ideally, a function should either always return the same type of thing, or return something that can be trivially computed from its inputs.
If a function is **type-stable** it satisfies two conditions:
- You can predict the output type based only on the input types (not their values).
- If the function uses `...`, the order of arguments in does not affect the output type.
```{r}
#| label = "setup"
library(vctrs)
```
## Simple examples
- `purrr::map()` and `base::lapply()` are trivially type-stable because they always return lists.
- `paste()` is type stable because it always returns a character vector.
```{r}
vec_ptype(paste(1))
vec_ptype(paste("x"))
```
- `base::mean(x)` almost always returns the same type of output as `x`.
For example, the mean of a numeric vector is a numeric vector, and the mean of a date-time is a date-time.
```{r}
vec_ptype(mean(1))
vec_ptype(mean(Sys.time()))
```
- `ifelse()` is not type-stable because the output type depends on the value:
```{r}
vec_ptype(ifelse(NA, 1L, 2))
vec_ptype(ifelse(FALSE, 1L, 2))
vec_ptype(ifelse(TRUE, 1L, 2))
```
## More complicated examples
Some functions are more complex because they take multiple input types and have to return a single output type.
This includes functions like `c()` and `ifelse()`.
The rules governing base R functions are idiosyncratic, and each function tends to apply it's own slightly different set of rules.
Tidy functions should use the consistent set of rules provided by the [vctrs](https://vctrs.r-lib.org) package.
## Challenge: the median
A more challenging example is `median()`.
The median of a vector is a value that (as evenly as possible) splits the vector into a lower half and an upper half.
In the absence of ties, `mean(x > median(x)) == mean(x <= median(x)) == 0.5`.
The median is straightforward to compute for odd lengths: you simply order the vector and pick the value in the middle, i.e. `sort(x)[(length(x) - 1) / 2]`.
It's clear that the type of the output should be the same type as `x`, and this algorithm can be applied to any vector that can be ordered.
But what if the vector has an even length?
In this case, there's no longer a unique median, and by convention we usually take the mean of the middle two numbers.
In R, this makes the `median()` not type-stable:
```{r}
typeof(median(1:3))
typeof(median(1:4))
```
Base R doesn't appear to follow a consistent principle when computing the median of a vector of length 2.
Factors throw an error, but dates do not (even though there's no date half way between two days that differ by an odd number of days).
```{r}
#| error = TRUE
median(factor(1:2))
median(Sys.Date() + 0:1)
```
To be clear, the problems caused by this behaviour are quite small in practice, but it makes the analysis of `median()` more complex, and it makes it difficult to decide what principle you should adhere to when creating `median` methods for new vector classes.
```{r}
#| error = TRUE
median("foo")
median(c("foo", "bar"))
```
## Exercises
1. How is a date like an integer?
Why is this inconsistent?
```{r}
vec_ptype(mean(Sys.Date()))
vec_ptype(mean(1L))
```