-
Notifications
You must be signed in to change notification settings - Fork 288
/
06-fitting-models.Rmd
379 lines (263 loc) · 21.5 KB
/
06-fitting-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
```{r models-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(kknn)
library(kableExtra)
library(tidyr)
tidymodels_prefer()
source("ames_snippets.R")
```
# Fitting Models with parsnip {#models}
The `r pkg(parsnip)` package, one of the R packages that are part of the `r pkg(tidymodels)` metapackage, provides a fluent and standardized interface for a variety of different models. In this chapter, we give some motivation for why a common interface is beneficial for understanding and building models in practice and show how to use the `r pkg(parsnip)` package.
Specifically, we will focus on how to `fit()` and `predict()` directly with a `r pkg(parsnip)` object, which may be a good fit for some straightforward modeling problems. The next chapter illustrates a better approach for many modeling tasks by combining models and preprocessors together into something called a `workflow` object.
## Create a Model
Once the data have been encoded in a format ready for a modeling algorithm, such as a numeric matrix, they can be used in the model building process.
Suppose that a linear regression model was our initial choice. This is equivalent to specifying that the outcome data is numeric and that the predictors are related to the outcome in terms of simple slopes and intercepts:
$$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}$$
A variety of methods can be used to estimate the model parameters:
* _Ordinary linear regression_ uses the traditional method of least squares to solve for the model parameters.
* _Regularized linear regression_ adds a penalty to the least squares method to encourage simplicity by removing predictors and/or shrinking their coefficients towards zero. This can be executed using Bayesian or non-Bayesian techniques.
In R, the `r pkg(stats)` package can be used for the first case. The syntax for linear regression using the function `lm()` is:
```r
model <- lm(formula, data, ...)
```
where `...` symbolizes other options to pass to `lm()`. The function does _not_ have an `x`/`y` interface, where we might pass in our outcome as `y` and our predictors as `x`.
To estimate with regularization, the second case, a Bayesian model can be fit using the `r pkg(rstanarm)` package:
```r
model <- stan_glm(formula, data, family = "gaussian", ...)
```
In this case, the other options passed via `...` would include arguments for the prior distributions of the parameters as well as specifics about the numerical aspects of the model. As with `lm()`, only the formula interface is available.
A popular non-Bayesian approach to regularized regression is the `r pkg(glmnet)` model [@glmnet]. Its syntax is:
```r
model <- glmnet(x = matrix, y = vector, family = "gaussian", ...)
```
In this case, the predictor data must already be formatted into a numeric matrix; there is only an `x`/`y` method and no formula method.
Note that these interfaces are heterogeneous in either how the data are passed to the model function or in terms of their arguments. The first issue is that, to fit models across different packages, the data must be formatted in different ways. `lm()` and `stan_glm()` only have formula interfaces while `glmnet()` does not. For other types of models, the interfaces may be even more disparate. For a person trying to do data analysis, these differences require the memorization of each package's syntax and can be very frustrating.
For tidymodels, the approach to specifying a model is intended to be more unified:
1. *Specify the _type_ of model based on its mathematical structure* (e.g., linear regression, random forest, KNN, etc).
2. *Specify the _engine_ for fitting the model.* Most often this reflects the software package that should be used, like Stan or `r pkg(glmnet)`. These are models in their own right, and `r pkg(parsnip)` provides consistent interfaces by using these as engines for modeling.
3. *When required, declare the _mode_ of the model.* The mode reflects the type of prediction outcome. For numeric outcomes, the mode is regression; for qualitative outcomes, it is classification.^[Note that `r pkg(parsnip)` constrains the outcome column of a classification model to be encoded as a _factor_; using binary numeric values will result in an error.] If a model algorithm can only address one type of prediction outcome, such as linear regression, the mode is already set.
These specifications are built without referencing the data. For example, for the three cases we outlined:
```{r models-lin-reg-spec}
library(tidymodels)
tidymodels_prefer()
linear_reg() %>% set_engine("lm")
linear_reg() %>% set_engine("glmnet")
linear_reg() %>% set_engine("stan")
```
Once the details of the model have been specified, the model estimation can be done with either the `fit()` function (to use a formula) or the `fit_xy()` function (when your data are already pre-processed). The `r pkg(parsnip)` package allows the user to be indifferent to the interface of the underlying model; you can always use a formula even if the modeling package's function only has the `x`/`y` interface.
The `translate()` function can provide details on how `r pkg(parsnip)` converts the user's code to the package's syntax:
```{r models-lin-reg-trans}
linear_reg() %>% set_engine("lm") %>% translate()
linear_reg(penalty = 1) %>% set_engine("glmnet") %>% translate()
linear_reg() %>% set_engine("stan") %>% translate()
```
Note that `missing_arg()` is just a placeholder for the data that has yet to be provided.
:::rmdnote
We supplied a required `penalty` argument for the glmnet engine. Also, for the Stan and glmnet engines, the `family` argument was automatically added as a default. As will be shown later in this section, this option can be changed.
:::
Let's walk through how to predict the sale price of houses in the Ames data as a function of only longitude and latitude:[^fitxy]
```{r models-ames-geocodes}
lm_model <-
linear_reg() %>%
set_engine("lm")
lm_form_fit <-
lm_model %>%
# Recall that Sale_Price has been pre-logged
fit(Sale_Price ~ Longitude + Latitude, data = ames_train)
lm_xy_fit <-
lm_model %>%
fit_xy(
x = ames_train %>% select(Longitude, Latitude),
y = ames_train %>% pull(Sale_Price)
)
lm_form_fit
lm_xy_fit
```
[^fitxy]: What are the differences between `fit()` and `fit_xy()`? The `fit_xy()` function always passes the data as is to the underlying model function. It will not create dummy/indicator variables before doing so. When `fit()` is used with a model specification, this almost always means that dummy variables will be created from qualitative predictors. If the underlying function requires a matrix (like glmnet), it will make the matrix. However, if the underlying function uses a formula, `fit()` just passes the formula to that function. We estimate that 99% of modeling functions using formulas make dummy variables. The other 1% include tree-based methods that do not require purely numeric predictors. See Section \@ref(workflow-encoding) for more about using formulas in tidymodels.
Not only does `r pkg(parsnip)` enable a consistent model interface for different packages, it also provides consistency in the model arguments. It is common for different functions that fit the same model to have different argument names. Random forest model functions are a good example. Three commonly used arguments are the number of trees in the ensemble, the number of predictors to randomly sample with each split within a tree, and the number of data points required to make a split. For three different R packages implementing this algorithm, those arguments are shown in Table \@ref(tab:rand-forest-args).
```{r, models-rf-arg-names, echo = FALSE, results = "asis"}
arg_info <-
tribble(
~ `Argument Type`, ~parsnip,
"# trees", "trees",
"# sampled predictors", "mtry",
"# data points to split", "min_n"
)
arg_info <-
get_from_env("rand_forest_args") %>%
select(engine, parsnip, original) %>%
full_join(arg_info, by = "parsnip") %>%
mutate(package = ifelse(engine == "spark", "sparklyr", engine))
arg_info %>%
select(package, `Argument Type`, original) %>%
# mutate(original = paste0("<tt>", original, "</tt>")) %>%
pivot_wider(
id_cols = c(`Argument Type`),
values_from = c(original),
names_from = c(package)
) %>%
kable(
caption = "Example argument names for different random forest functions.",
label = "rand-forest-args",
escape = FALSE
) %>%
kable_styling() %>%
column_spec(2:4, monospace = TRUE)
```
In an effort to make argument specification less painful, `r pkg(parsnip)` uses common argument names within and between packages. Table \@ref(tab:parsnip-args) shows, for random forests, what `r pkg(parsnip)` models use.
```{r, models-parsnip-names, echo = FALSE, results = "asis"}
arg_info %>%
select(`Argument Type`, parsnip) %>%
distinct() %>%
# mutate(parsnip = paste0("<tt>", parsnip, "</tt>")) %>%
kable(
caption = "Random forest argument names used by parsnip.",
label = "parsnip-args",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(2, monospace = TRUE)
```
Admittedly, this is one more set of arguments to memorize. However, when other types of models have the same argument types, these names still apply. For example, boosted tree ensembles also create a large number of tree-based models, so `trees` is also used there, as is `min_n`, and so on.
Some of the original argument names can be fairly jargon-y. For example, to specify the amount of regularization to use in a glmnet model, the Greek letter `lambda` is used. While this mathematical notation is commonly used in the statistics literature, it is not obvious to many people what `lambda` represents (especially those who consume the model results). Since this is the penalty used in regularization, `r pkg(parsnip)` standardizes on the argument name `penalty`. Similarly, the number of neighbors in a KNN model is called `neighbors` instead of `k`. Our rule of thumb when standardizing argument names is:
> If a practitioner were to include these names in a plot or table, would the people viewing those results understand the name?
To understand how the `r pkg(parsnip)` argument names map to the original names, use the help file for the model (available via `?rand_forest`) as well as the `translate()` function:
```{r models-glmnet-trans}
rand_forest(trees = 1000, min_n = 5) %>%
set_engine("ranger") %>%
set_mode("regression") %>%
translate()
```
Modeling functions in `r pkg(parsnip)` separate model arguments into two categories:
* _Main arguments_ are more commonly used and tend to be available across engines.
* _Engine arguments_ are either specific to a particular engine or used more rarely.
For example, in the translation of the previous random forest code, the arguments `num.threads`, `verbose`, and `seed` were added by default. These arguments are specific to the `r pkg(ranger)` implementation of random forest models and wouldn't make sense as main arguments. Engine-specific arguments can be specified in `set_engine()`. For example, to have the `ranger::ranger()` function print out more information about the fit:
```{r models-ranger-verb}
rand_forest(trees = 1000, min_n = 5) %>%
set_engine("ranger", verbose = TRUE) %>%
set_mode("regression")
```
## Use the Model Results
Once the model is created and fit, we can use the results in a variety of ways; we might want to plot, print, or otherwise examine the model output. Several quantities are stored in a `r pkg(parsnip)` model object, including the fitted model. This can be found in an element called `fit`, which can be returned using the `extract_fit_engine()` function:
```{r models-pluck}
lm_form_fit %>% extract_fit_engine()
```
Normal methods can be applied to this object, such as printing and plotting:
```{r models-pluck-coef}
lm_form_fit %>% extract_fit_engine() %>% vcov()
```
:::rmdwarning
Never pass the `fit` element of a `r pkg(parsnip)` model to a model prediction function, i.e., use `predict(lm_form_fit)` but *do not* use `predict(lm_form_fit$fit)`. If the data were preprocessed in any way, incorrect predictions will be generated (sometimes, without errors). The underlying model's prediction function has no idea if any transformations have been made to the data prior to running the model. See Section \@ref(parsnip-predictions) for more on making predictions.
:::
One issue with some existing methods in base R is that the results are stored in a manner that may not be the most useful. For example, the `summary()` method for `lm` objects can be used to print the results of the model fit, including a table with parameter values, their uncertainty estimates, and p-values. These particular results can also be saved:
```{r models-lm-param}
model_res <-
lm_form_fit %>%
extract_fit_engine() %>%
summary()
# The model coefficient table is accessible via the `coef` method.
param_est <- coef(model_res)
class(param_est)
param_est
```
There are a few things to notice about this result. First, the object is a numeric matrix. This data structure was mostly likely chosen since all of the calculated results are numeric and a matrix object is stored more efficiently than a data frame. This choice was probably made in the late 1970s when computational efficiency was extremely critical. Second, the non-numeric data (the labels for the coefficients) are contained in the row names. Keeping the parameter labels as row names is very consistent with the conventions in the original S language.
A reasonable next step might be to create a visualization of the parameter values. To do this, it would be sensible to convert the parameter matrix to a data frame. We could add the row names as a column so that they can be used in a plot. However, notice that several of the existing matrix column names would not be valid R column names for ordinary data frames (e.g., `"Pr(>|t|)"`). Another complication is the consistency of the column names. For `lm` objects, the column for the p-value is `"Pr(>|t|)"`, but for other models, a different test might be used and, as a result, the column name would be different (e.g., `"Pr(>|z|)"`) and the type of test would be encoded in the column name.
While these additional data formatting steps are not impossible to overcome, they are a hindrance, especially since they might be different for different types of models. The matrix is not a highly reusable data structure mostly because it constrains the data to be of a single type (e.g., numeric). Additionally, keeping some data in the dimension names is also problematic since those data must be extracted to be of general use.
As a solution, the `r pkg(broom)` package can convert many types of model objects to a tidy structure. For example, using the `tidy()` method on the linear model produces:
```{r models-tidy-lm}
tidy(lm_form_fit)
```
The column names are standardized across models and do not contain any additional data (such as the type of statistical test). The data previously contained in the row names are now in a column called `term`. One important principle in the tidymodels ecosystem is that a function should return values that are _predictable, consistent,_ and _unsurprising_.
## Make Predictions {#parsnip-predictions}
Another area where `r pkg(parsnip)` diverges from conventional R modeling functions is the format of values returned from `predict()`. For predictions, `r pkg(parsnip)` always conforms to the following rules:
1. The results are always a tibble.
2. The column names of the tibble are always predictable.
3. There are always as many rows in the tibble as there are in the input data set.
For example, when numeric data are predicted:
```{r models-small-pred}
ames_test_small <- ames_test %>% slice(1:5)
predict(lm_form_fit, new_data = ames_test_small)
```
The row order of the predictions are always the same as the original data.
:::rmdnote
Why the leading dot in some of the column names? Some tidyverse and tidymodels arguments and return values contain periods. This is to protect against merging data with duplicate names. There are some data sets that contain predictors named `pred`!
:::
These three rules make it easier to merge predictions with the original data:
```{r models-small-int}
ames_test_small %>%
select(Sale_Price) %>%
bind_cols(predict(lm_form_fit, ames_test_small)) %>%
# Add 95% prediction intervals to the results:
bind_cols(predict(lm_form_fit, ames_test_small, type = "pred_int"))
```
The motivation for the first rule comes from some R packages producing dissimilar data types from prediction functions. For example, the `r pkg(ranger)` package is an excellent tool for computing random forest models. However, instead of returning a data frame or vector as output, it returns a specialized object that has multiple values embedded within it (including the predicted values). This is just one more step for the data analyst to work around in their scripts. As another example, the native `r pkg(glmnet)` model can return at least four different output types for predictions, depending on the model specifics and characteristics of the data. These are shown in Table \@ref(tab:predict-types).
```{r model-pred-types, echo = FALSE, results = "asis"}
tribble(
~ `Type of Prediction`, ~ `Returns a:`,
"numeric", "numeric matrix",
"class", "character matrix",
"probability (2 classes)", "numeric matrix (2nd level only)",
"probability (3+ classes)", "3D numeric array (all levels)",
) %>%
kable(
caption = "Different return values for glmnet prediction types.",
label = "predict-types"
) %>%
kable_styling(full_width = FALSE)
```
Additionally, the column names of the results contain coded values that map to a vector called `lambda` within the glmnet model object. This excellent statistical method can be discouraging to use in practice because of all of the special cases an analyst might encounter that require additional code to be useful.
For the second tidymodels prediction rule, the predictable column names for different types of predictions are shown in Table \@ref(tab:predictable-column-names).
```{r model-pred-info, echo = FALSE, results = "asis"}
tribble(
~ `type value`, ~ `column name(s)`,
"numeric", ".pred",
"class", ".pred_class",
"prob", ".pred_{class levels}",
"conf_int", ".pred_lower, .pred_upper",
"pred_int", ".pred_lower, .pred_upper"
) %>%
kable(
caption = "The tidymodels mapping of prediction types and column names.",
label = "predictable-column-names",
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1:2, monospace = TRUE)
```
The third rule regarding the number of rows in the output is critical. For example, if any rows of the new data contain missing values, the output will be padded with missing results for those rows.
A main advantage of standardizing the model interface and prediction types in `r pkg(parsnip)` is that, when different models are used, the syntax is identical. Suppose that we used a decision tree to model the Ames data. Outside of the model specification, there are no significant differences in the code pipeline:
```{r models-cart}
tree_model <-
decision_tree(min_n = 2) %>%
set_engine("rpart") %>%
set_mode("regression")
tree_fit <-
tree_model %>%
fit(Sale_Price ~ Longitude + Latitude, data = ames_train)
ames_test_small %>%
select(Sale_Price) %>%
bind_cols(predict(tree_fit, ames_test_small))
```
This demonstrates the benefit of homogenizing the data analysis process and syntax across different models. It enables users to spend their time on the results and interpretation rather than having to focus on the syntactical differences between R packages.
## parsnip-Extension Packages
The `r pkg(parsnip)` package itself contains interfaces to a number of models. However, for ease of package installation and maintenance, there are other tidymodels packages that have `r pkg(parsnip)` model definitions for other sets of models. The `r pkg(discrim)` package has model definitions for the set of classification techniques called discriminant analysis methods (such as linear or quadratic discriminant analysis). In this way, the package dependencies required for installing `r pkg(parsnip)` are reduced. A list of all of the models that can be used with `r pkg(parsnip)` (across different packages that are on CRAN) can be found at <https://www.tidymodels.org/find/>.
## Creating Model Specifications {#parsnip-addin}
It may become tedious to write many model specifications, or to remember how to write the code to generate them. The `r pkg(parsnip)` package includes an RStudio addin^[<https://rstudio.github.io/rstudioaddins/>] that can help. Either choosing this addin from the _Addins_ toolbar menu or running the code:
```{r models-add-in, eval = FALSE}
parsnip_addin()
```
will open a window in the Viewer panel of the RStudio IDE with a list of possible models for each model mode. These can be written to the source code panel.
The model list includes models from `r pkg(parsnip)` and `r pkg(parsnip)`-extension packages that are on CRAN.
## Chapter Summary {#models-summary}
This chapter introduced the `r pkg(parsnip)` package, which provides a common interface for models across R packages using a standard syntax. The interface and resulting objects have a predictable structure.
The code for modeling the Ames data that we will use moving forward is:
```{r models-summary, eval = FALSE}
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
lm_model <- linear_reg() %>% set_engine("lm")
```