# Exploring categorical data {#sec-explore-categorical}
```{r}
#| include: false
source("_common.R")
```
::: {.chapterintro data-latex=""}
This chapter focuses on exploring **categorical** data using summary statistics and visualizations.
The summaries and graphs presented in this chapter are created using statistical software; however, since this might be your first exposure to the concepts, we take our time in this chapter to detail how to create them.
Where possible, we present multivariate plots, i.e., plots that visualize the relationship between multiple variables.
Mastery of the content presented in this chapter will be crucial for understanding the methods and techniques introduced in the rest of the book.
:::
In this chapter we will work with data on loans from Lending Club that you've previously seen in @sec-data-hello.
The `loan50` dataset from @sec-data-hello represents a sample from a larger loan dataset called `loans`.
This larger dataset contains information on 10,000 loans made through Lending Club.
We will examine the relationship between `homeownership`, which for the `loans` data can take a value of `rent`, `mortgage` (owns but has a mortgage), or `own`, and `application_type`, which indicates whether the loan application was made with a partner or whether it was an individual application.
::: {.data data-latex=""}
The [`loans_full_schema`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
Based on the data in this dataset we have modified the `homeownership` and `application_type` variables.
We will refer to this modified dataset as `loans`.
:::
## Contingency tables and bar plots
```{r}
loans <- loans_full_schema |>
  mutate(application_type = as.character(application_type)) |>
  filter(application_type != "") |>
  mutate(
    homeownership = tolower(homeownership),
    homeownership = fct_relevel(homeownership, "rent", "mortgage", "own"),
    application_type = fct_relevel(application_type, "joint", "individual")
  )

loans_individual_rent <- loans |>
  filter(
    application_type == "individual",
    homeownership == "rent"
  ) |>
  nrow()
```
@tbl-loan-home-app-type-totals summarizes two variables: `application_type` and `homeownership`.
Note that loans from Lending Club are typically for small items or for cash, not for homes.
The individuals in the dataset are taking out loans for their personal use, and we categorize them based on their `homeownership` status (which is unrelated to the purpose of the loan).
A table that summarizes data for two categorical variables in this way is called a **contingency table**\index{contingency table}.
Each value in the table represents the number of times a particular combination of variable outcomes occurred.
For example, the value `r loans_individual_rent` corresponds to the number of loans in the dataset where the borrower rents their home and the application type was by an individual.
Row and column totals are also included.
The **row totals**\index{row totals} provide the total counts across each row and the **column totals**\index{column totals} down each column.
We can also create a table that shows only the overall percentages or proportions for each combination of categories, or we can create a table for a single variable, such as the one shown in @tbl-loan-homeownership-totals for the `homeownership` variable.
```{r}
#| include: false
terms_chp_04 <- c("contingency table", "row totals", "column totals")
```
```{r}
#| label: tbl-loan-home-app-type-totals
#| tbl-cap: A contingency table for application type and homeownership.
#| tbl-pos: H
loans |>
  count(application_type, homeownership) |>
  pivot_wider(names_from = homeownership, values_from = n) |>
  select(application_type, rent, mortgage, own) |>
  adorn_totals(where = c("row", "col")) |>
  kbl(linesep = "", booktabs = TRUE) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  add_header_above(c(" " = 1, "homeownership" = 3, " " = 1)) |>
  column_spec(1, width = "8em") |>
  column_spec(2:5, width = "5em")
```
```{r}
#| label: tbl-loan-homeownership-totals
#| tbl-cap: |
#|   A table summarizing the frequencies for each value of the homeownership
#|   variable -- mortgage, own, and rent.
#| tbl-pos: H
loans |>
  count(homeownership, name = "Count") |>
  adorn_totals(where = "row") |>
  kbl(linesep = "", booktabs = TRUE) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped"), full_width = FALSE
  ) |>
  column_spec(1:2, width = "10em")
```
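The tables above can also be rebuilt by hand, which makes the role of the row and column totals explicit.
The following is a minimal base R sketch, not the code used to render the tables: the counts are typed in directly and are assumed to match @tbl-loan-home-app-type-totals, and the object names `loan_counts` and `loan_totals` are ours.

```r
# Counts entered by hand; assumed to match the rendered contingency table.
loan_counts <- as.table(matrix(
  c(362, 950, 183,
    3496, 3839, 1170),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("joint", "individual"),
    homeownership = c("rent", "mortgage", "own")
  )
))

# addmargins() appends the row totals and column totals,
# labelled "Sum", to the table of counts.
loan_totals <- addmargins(loan_counts)
loan_totals

# The one-variable table for homeownership is the set of column totals.
colSums(loan_counts)
```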
A bar plot is a common way to display a single categorical variable.
@fig-loan-homeownership-bar-plot-1 displays a **bar plot** of the `homeownership` variable.
In @fig-loan-homeownership-bar-plot-2 the counts are converted into proportions, showing the proportion of observations that are in each level.
```{r}
#| label: fig-loan-homeownership-bar-plot
#| fig-cap: Distribution of homeownership.
#| fig-subcap:
#|   - Counts of homeownership.
#|   - Proportions of homeownership.
#| fig-alt: |
#|   Counts and proportions of values of the homeownership variable. The
#|   highest proportion of borrowers have a home mortgage. The next highest
#|   group rents. The smallest group of people own their home
#|   outright.
#| fig-width: 4
#| layout-ncol: 2
ggplot(loans, aes(x = homeownership)) +
  geom_bar(fill = IMSCOL["green", "full"]) +
  labs(x = "Homeownership", y = "Count")

loans |>
  count(homeownership) |>
  mutate(proportion = n / sum(n)) |>
  ggplot(aes(x = homeownership, y = proportion)) +
  geom_col(fill = IMSCOL["green", "full"]) +
  labs(x = "Homeownership", y = "Proportion")
```
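Converting counts to proportions only rescales the bar heights.
As a minimal base R sketch of that step (the counts are typed in by hand and assumed to match @tbl-loan-homeownership-totals; `home_counts` and `home_props` are illustrative names):

```r
# Homeownership counts entered by hand; assumed to match the table of totals.
home_counts <- c(rent = 3858, mortgage = 4789, own = 1353)

# Each proportion is the count for a level divided by the total number of loans.
home_props <- home_counts / sum(home_counts)
round(home_props, 3)

# The two bar plots have the same shape; only the scale of the y-axis differs.
barplot(home_counts, xlab = "Homeownership", ylab = "Count")
barplot(home_props, xlab = "Homeownership", ylab = "Proportion")
```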
## Visualizing two categorical variables
### Bar plots with two variables
We can display the distributions of two categorical variables on a bar plot concurrently.
Such plots are generally useful for visualizing the relationship between two categorical variables.
@fig-loan-homeownership-app-type-bar-plot shows three such plots that visualize the relationship between the `homeownership` and `application_type` variables.
@fig-loan-homeownership-app-type-bar-plot-1 is a **stacked bar plot**\index{plot!stacked bar}\index{stacked bar plot}.
This plot most clearly displays that loan applicants most commonly live in mortgaged homes.
It is difficult to say, based on this plot alone, how different application types vary across the levels of homeownership.
@fig-loan-homeownership-app-type-bar-plot-2 is a **standardized bar plot**\index{plot!standardized bar}\index{standardized bar plot} (also known as **filled bar plot**\index{plot!filled bar}\index{filled bar plot}).
This type of visualization is helpful in understanding the fraction of individual or joint loan applications for borrowers in each level of `homeownership`.
Additionally, since the proportions of joint and individual loans vary across the groups, we can conclude that the two variables are associated for this sample.
Finally, @fig-loan-homeownership-app-type-bar-plot-3 is a **dodged bar plot**\index{plot!dodged bar}\index{dodged bar plot}.
This plot most clearly displays that within each level of homeownership, individual applications are more common than joint applications.
It also shows that joint applications are most common among loans for applicants who live in mortgaged homes, compared to renters and owners.
```{r}
#| include: false
terms_chp_04 <- c(terms_chp_04, "stacked bar plot", "dodged bar plot", "filled bar plot", "standardized bar plot")
```
```{r}
#| label: fig-loan-homeownership-app-type-bar-plot
#| fig-cap: |
#|   Three bar plots displaying homeownership and application type variables.
#| fig-subcap:
#|   - Stacked bar plot
#|   - Standardized bar plot
#|   - Dodged bar plot
#| fig-alt: |
#|   Three bar plots (stacked, dodged, and standardized) displaying homeownership
#|   and application type variables. There are three or four times as many
#|   individual applications as joint applications. The highest proportion of
#|   borrowers has a home mortgage. The next highest group rents. The smallest
#|   group of people own their home outright.
#| fig-width: 3.5
#| layout: [[50, 50], [-22, 56, -22]]
ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_manual(values = c(IMSCOL["blue", "full"], IMSCOL["yellow", "full"])) +
  labs(x = "Homeownership", y = "Count")

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "fill", show.legend = FALSE) +
  scale_fill_manual(values = c(IMSCOL["blue", "full"], IMSCOL["yellow", "full"])) +
  labs(x = "Homeownership", y = "Proportion")

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c(IMSCOL["blue", "full"], IMSCOL["yellow", "full"])) +
  labs(x = "Homeownership", y = "Count", fill = "Application type") +
  theme(legend.position = "bottom")
```
::: {.workedexample data-latex=""}
Examine the three bar plots in @fig-loan-homeownership-app-type-bar-plot.
When is the stacked, dodged, or standardized bar plot the most useful?
------------------------------------------------------------------------
The stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`), since we are effectively grouping by one variable first and then breaking it down by the other.
Dodged bar plots are more agnostic in their display about which variable, if any, represents the explanatory and which the response variable.
It is also easy to discern the number of cases in each of the six different group combinations.
However, one downside is that it tends to require more horizontal space; the narrowness of the dodged bar plot in @fig-loan-homeownership-app-type-bar-plot-3 compared to the other two makes the plot feel a bit cramped.
Additionally, when two groups are of very different sizes, as we see in the group `own` relative to either of the other two groups, it is difficult to discern if there is an association between the variables.
The standardized stacked bar plot is helpful if the primary variable in the stacked bar plot is relatively imbalanced, e.g., the category `own` has only about a third of the observations in the category `mortgage`, making the simple stacked bar plot less useful for checking for an association.
The major downside of the standardized version is that we lose all sense of how many cases each of the bars represents.
:::
\vspace{-5mm}
### Mosaic plots
A **mosaic plot**\index{plot!mosaic}\index{mosaic plot} is a visualization technique suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well.
```{r}
#| include: false
terms_chp_04 <- c(terms_chp_04, "mosaic plot")
```
To get started in creating our first mosaic plot, we'll break a square into columns for each category of the `homeownership` variable, with the result shown in @fig-loan-homeownership-type-mosaic-plot-1.
Each column represents a level of `homeownership`, and the column widths correspond to the proportion of loans in each of those categories.
For instance, there are fewer loans where the borrower is an owner than where the borrower has a mortgage.
In general, mosaic plots use box *areas* to represent the number of cases in each category.
@fig-loan-homeownership-type-mosaic-plot-2 displays the relationship between homeownership and application type.
Each column is split proportionally to the number of loans from individual and joint borrowers.
For example, the second column represents loans where the borrower has a mortgage, and it was divided into individual loans (upper) and joint loans (lower).
As another example, the bottom segment of the third column represents loans where the borrower owns their home and applied jointly, while the upper segment of this column represents borrowers who are homeowners and filed individually.
We can again use this plot to see that the `homeownership` and `application_type` variables are associated, since some columns are divided in different vertical locations than others, which was the same technique used for checking an association in the standardized stacked bar plot.
```{r}
#| label: fig-loan-homeownership-type-mosaic-plot
#| fig-cap: |
#|   Two mosaic plots, one for homeownership alone and the other displaying the
#|   relationship between homeownership and application type.
#| fig-subcap:
#|   - Homeownership.
#|   - Homeownership vs. application type.
#| fig-alt: |
#|   Two mosaic plots, one for homeownership alone and the other displaying the
#|   relationship between homeownership and application type. Again, the majority
#|   of borrowers are individuals, as compared with joint applications. The
#|   highest proportion of borrowers have a mortgage; the next highest proportion
#|   rent their home; and the smallest group owns their home outright.
#| layout-ncol: 2
#| fig-width: 3.5
#| fig-asp: 0.8
ggplot(loans) +
  geom_mosaic(aes(x = product(homeownership)), fill = IMSCOL["green", "full"]) +
  labs(x = "Homeownership", y = "") +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )

# guides(fill = "none") replaces the deprecated guides(fill = FALSE)
ggplot(loans) +
  geom_mosaic(aes(x = product(homeownership), fill = application_type)) +
  scale_fill_manual(values = c(IMSCOL["blue", "full"], IMSCOL["yellow", "full"])) +
  labs(x = "Homeownership", y = "Application type") +
  guides(fill = "none")
```
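The statement that box *areas* represent the number of cases can be checked with a little arithmetic.
The sketch below is a hand computation in base R under stated assumptions: the counts are typed in and assumed to match @tbl-loan-home-app-type-totals, the square is taken to have side length 1, and `col_widths`, `seg_heights`, and `box_areas` are illustrative names rather than anything computed by the plotting code above.

```r
# Counts entered by hand; assumed to match the rendered contingency table.
loan_counts <- as.table(matrix(
  c(362, 950, 183,
    3496, 3839, 1170),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("joint", "individual"),
    homeownership = c("rent", "mortgage", "own")
  )
))
n <- sum(loan_counts)

# Column widths: the share of all loans in each homeownership level.
col_widths <- colSums(loan_counts) / n

# Segment heights: the split by application type within each column.
seg_heights <- prop.table(loan_counts, margin = 2)

# Box area = width x height, which recovers each cell's share of all loans.
box_areas <- sweep(seg_heights, 2, col_widths, `*`)
round(box_areas, 3)

# Base R can draw a comparable plot; t() puts homeownership on the x-axis.
mosaicplot(t(loan_counts), main = "",
           xlab = "Homeownership", ylab = "Application type")
```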
In @fig-loan-homeownership-type-mosaic-plot, we chose to first split by the homeowner status of the borrower.
However, we could have instead first split by the application type, as in @fig-loan-app-type-mosaic-plot.
Like with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable if these labels are reasonable to attach to the variables under consideration.
```{r}
#| label: fig-loan-app-type-mosaic-plot
#| fig-cap: |
#|   Mosaic plot where loans are grouped by homeownership after they have been
#|   divided into individual and joint application types.
#| fig-alt: |
#|   Mosaic plot where loans are grouped by homeownership after they have been
#|   divided into individual and joint application types. Again, the majority of
#|   borrowers are individuals, as compared with joint applications. The highest
#|   proportion of borrowers have a mortgage; the next highest proportion rent
#|   their home; and the smallest group owns their home outright.
#| fig-width: 6
#| fig-asp: 0.5
# guides(fill = "none") replaces the deprecated guides(fill = FALSE)
ggplot(loans) +
  geom_mosaic(aes(x = product(application_type), fill = homeownership)) +
  scale_fill_openintro("hot") +
  labs(x = "Application type", y = "Homeownership") +
  guides(fill = "none")
```
\clearpage
## Row and column proportions
In the previous sections we inspected visualizations of two categorical variables in bar plots and mosaic plots.
However, we have not discussed how the values in the bar and mosaic plots that show proportions are calculated.
In this section we will investigate the fractional breakdown of one variable within the levels of another, and we will modify our contingency table to provide such a view.
@tbl-loan-home-app-type-row-proportions shows **row proportions**\index{row proportions} for @tbl-loan-home-app-type-totals, which are computed as the counts divided by their row totals.
The value 3496 at the intersection of individual and rent is replaced by $3496 / 8505 = 0.411,$ i.e., 3496 divided by its row total, 8505.
So, what does 0.411 represent?
It corresponds to the proportion of individual applicants who rent.
```{r}
#| label: tbl-loan-home-app-type-row-proportions
#| tbl-cap: |
#|   A contingency table with row proportions for application type and
#|   homeownership.
#| tbl-pos: H
loans |>
  count(application_type, homeownership) |>
  group_by(application_type) |>
  mutate(proportion = n / sum(n)) |>
  select(-n) |>
  pivot_wider(names_from = homeownership, values_from = proportion) |>
  adorn_totals(where = "col") |>
  kbl(linesep = "", booktabs = TRUE) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  add_header_above(c(" " = 1, "homeownership" = 3, " " = 1)) |>
  column_spec(1, width = "8em") |>
  column_spec(2:5, width = "5em")
```
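The row proportions can be reproduced directly from the counts.
This is a minimal base R sketch rather than the code behind the table: the counts are typed in by hand and assumed to match @tbl-loan-home-app-type-totals, and `row_props` is an illustrative name.

```r
# Counts entered by hand; assumed to match the rendered contingency table.
loan_counts <- as.table(matrix(
  c(362, 950, 183,
    3496, 3839, 1170),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("joint", "individual"),
    homeownership = c("rent", "mortgage", "own")
  )
))

# margin = 1 divides every count by the total of its row.
row_props <- prop.table(loan_counts, margin = 1)
round(row_props, 3)

# Each row of a row-proportion table sums to 1.
rowSums(row_props)
```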
A contingency table of the **column proportions**\index{column proportions} is computed in a similar way, where each is computed as the count divided by the corresponding column total.
@tbl-loan-home-app-type-column-proportions shows such a table, and here the value 0.906 indicates that 90.6% of renters applied as individuals for the loan.
This rate is higher than the corresponding rate for loans from people with mortgages (80.2%) or who own their home (86.5%).
Because these rates vary between the three levels of `homeownership` (`rent`, `mortgage`, `own`), this provides evidence that the `application_type` and `homeownership` variables may be associated.
```{r}
#| label: tbl-loan-home-app-type-column-proportions
#| tbl-cap: |
#|   A contingency table with column proportions for application type
#|   and homeownership.
#| tbl-pos: H
loans |>
  count(application_type, homeownership) |>
  group_by(homeownership) |>
  mutate(proportion = n / sum(n)) |>
  select(-n) |>
  pivot_wider(names_from = homeownership, values_from = proportion) |>
  adorn_totals(where = "row") |>
  kbl(linesep = "", booktabs = TRUE) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped")
  ) |>
  add_header_above(c(" " = 1, "homeownership" = 3)) |>
  column_spec(1, width = "8em") |>
  column_spec(2:4, width = "5em")
```
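Column proportions use the same counts but divide by the column totals instead.
As before, this is a hedged base R sketch with hand-entered counts assumed to match @tbl-loan-home-app-type-totals; `col_props` is an illustrative name.

```r
# Counts entered by hand; assumed to match the rendered contingency table.
loan_counts <- as.table(matrix(
  c(362, 950, 183,
    3496, 3839, 1170),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("joint", "individual"),
    homeownership = c("rent", "mortgage", "own")
  )
))

# margin = 2 divides every count by the total of its column.
col_props <- prop.table(loan_counts, margin = 2)
round(col_props, 3)

# Share of individual applications within each homeownership level;
# differences across levels are what suggest an association.
round(col_props["individual", ], 3)
```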
Row and column proportions can also be thought of as **conditional proportions**\index{conditional proportions} as they tell us about the proportion of observations in a given level of a categorical variable conditional on the level of another categorical variable.
```{r}
#| include: false
terms_chp_04 <- c(terms_chp_04, "row proportions", "column proportions", "conditional proportions")
```
We could also have checked for an association between `application_type` and `homeownership` in @tbl-loan-home-app-type-row-proportions using row proportions.
When comparing these row proportions, we would look down columns to see if the fraction of loans where the borrower rents, has a mortgage, or owns varied across the application types.
::: {.guidedpractice data-latex=""}
What does 0.451 represent in @tbl-loan-home-app-type-row-proportions?
What does 0.802 represent in @tbl-loan-home-app-type-column-proportions?[^04-explore-categorical-1]
:::
[^04-explore-categorical-1]: 0.451 represents the proportion of individual applicants who have a mortgage.
0.802 represents the fraction of applicants with mortgages who applied as individuals.
::: {.guidedpractice data-latex=""}
What does 0.122 represent in @tbl-loan-home-app-type-row-proportions?
What does 0.135 represent in @tbl-loan-home-app-type-column-proportions?[^04-explore-categorical-2]
:::
[^04-explore-categorical-2]: 0.122 represents the fraction of joint borrowers who own their home.
0.135 represents the fraction of home-owning borrowers who had a joint application for the loan.
::: {.workedexample data-latex=""}
Data scientists use statistics to build email spam filters.
By noting specific characteristics of an email, a data scientist may be able to classify some emails as spam or not spam with high accuracy.
One such characteristic is the email format, which indicates whether an email has any HTML content, such as bolded text.
We'll focus on email format and spam status using the `email` dataset; these variables are summarized in a contingency table in @tbl-email-count-table.
Which would be more helpful to someone hoping to classify email as spam or regular email for this table: row or column proportions?
------------------------------------------------------------------------
A data scientist would be interested in how the proportion of spam changes within each email format.
This corresponds to column proportions: the proportion of spam in plain text emails and the proportion of spam in HTML emails.
If we generate the column proportions, we can see that a higher fraction of plain text emails are spam ($209/1195 = 17.5\%$) than of HTML emails ($158/2726 = 5.8\%$).
This information on its own is insufficient to classify an email as spam or not spam, as over 80% of plain text emails are not spam.
Yet, when we carefully combine this information with many other characteristics, we stand a reasonable chance of being able to classify some emails as spam or not spam with confidence.
This example points out that row and column proportions are not equivalent.
Before settling on one form for a table, it is important to consider each to ensure that the most useful table is constructed.
However, sometimes it simply isn't clear which, if either, is more useful.
:::
::: {.data data-latex=""}
The [`email`](http://openintrostat.github.io/openintro/reference/email.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
```{r}
#| label: tbl-email-count-table
#| tbl-cap: A contingency table for spam and format.
#| tbl-pos: H
email |>
  mutate(
    format = if_else(format == 0, "text", "HTML"),
    spam = if_else(spam == 0, "not spam", "spam"),
  ) |>
  count(spam, format) |>
  pivot_wider(names_from = format, values_from = n) |>
  adorn_totals(where = c("row", "col")) |>
  kbl(linesep = "", booktabs = TRUE) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped"), full_width = FALSE
  ) |>
  column_spec(1, width = "8em") |>
  column_spec(2:4, width = "5em")
```
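The spam rates quoted in the worked example above can be recomputed as column proportions.
In this base R sketch the spam counts and format totals are those quoted in the worked example, the "not spam" counts are derived by subtraction from those totals, and `email_counts` and `spam_rate` are illustrative names.

```r
# Spam counts (209, 158) and format totals (1195, 2726) come from the worked
# example; the "not spam" counts are derived by subtraction.
email_counts <- as.table(matrix(
  c(1195 - 209, 2726 - 158,
    209, 158),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    spam = c("not spam", "spam"),
    format = c("text", "HTML")
  )
))

# Conditioning on format (the columns) gives the spam rate within each format.
spam_rate <- prop.table(email_counts, margin = 2)["spam", ]
round(100 * spam_rate, 1)
```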
::: {.workedexample data-latex=""}
Look back to @tbl-loan-home-app-type-row-proportions and @tbl-loan-home-app-type-column-proportions.
Are there any obvious scenarios where one might be more useful than the other?
------------------------------------------------------------------------
None that we think are obvious!
Unlike in the email example, the two loan variables do not have a clear explanatory-response relationship that we might hypothesize.
Usually it is most useful to "condition" on the explanatory variable.
For instance, in the email example, the email format was seen as a possible explanatory variable of whether the message was spam, so we would find it more interesting to compute the relative frequencies (proportions) for each email format.
:::
## Pie charts
A **pie chart** is shown in @fig-loan-homeownership-pie-chart-1 alongside a bar plot representing the same information in @fig-loan-homeownership-pie-chart-2.
Pie charts can be useful for giving a high-level overview to show how a set of cases break down.
However, it is also difficult to decipher certain details in a pie chart.
For example, it's not immediately obvious that there are more loans where the borrower has a mortgage than rent when looking at the pie chart, while this detail is very obvious in the bar plot.
```{r}
#| label: fig-loan-homeownership-pie-chart
#| fig-cap: A pie chart and bar plot of homeownership.
#| fig-subcap:
#|   - Pie chart
#|   - Bar plot
#| fig-alt: |
#|   A pie chart and bar plot of homeownership. Both plots show that about half
#|   of the individuals taking out a loan have a mortgage. A slightly smaller
#|   group of individuals rents. The smallest group of borrowers owns their home.
#| layout: [[46, -6, 46]]
#| fig-width: 5
#| out-width: 100%
loans |>
  mutate(homeownership = fct_infreq(homeownership)) |>
  count(homeownership) |>
  mutate(text_y = cumsum(n) - n / 2) |>
  ggplot(aes(x = "", fill = homeownership, y = n)) +
  geom_col(position = position_stack(reverse = TRUE), show.legend = FALSE) +
  geom_text_repel(aes(x = 1, label = homeownership, y = text_y)) +
  coord_polar("y", start = 0) +
  scale_fill_openintro("hot") +
  theme_void() +
  labs(title = "Homeownership")

loans |>
  mutate(homeownership = fct_infreq(homeownership)) |>
  ggplot(aes(x = homeownership, fill = homeownership)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_openintro("hot") +
  labs(x = "Homeownership", y = "Count")
```
\vspace{-5mm}
Pie charts can work well when the goal is to visualize a categorical variable with very few levels, and especially if each level represents a simple fraction (e.g., one-half, one-quarter, etc.).
However, they can be quite difficult to read when they are used to visualize a categorical variable with many levels.
For example, the pie chart in @fig-loan-grade-pie-chart-1 and the bar plot in @fig-loan-grade-pie-chart-2 both represent the distribution of loan grades (A through G).
In this case, it is far easier to compare the counts of each loan grade using the bar plot than the pie chart.
\vspace{-5mm}
```{r}
#| label: fig-loan-grade-pie-chart
#| fig-cap: A pie chart and bar plot of loan grades.
#| fig-subcap:
#|   - Pie chart
#|   - Bar plot
#| fig-alt: |
#|   A pie chart and a bar plot of loan grades. Both plots show that
#|   the most frequent grades are A, B, and C. The bar plot makes it easier to
#|   count the number of loans in each grade.
#| layout: [[46, -6, 46]]
#| fig-width: 5
#| out-width: 100%
loans |>
  count(grade) |>
  mutate(text_y = cumsum(n) - n / 2) |>
  ggplot(aes(x = "", fill = grade, y = n)) +
  geom_col(position = position_stack(reverse = TRUE), show.legend = FALSE) +
  geom_text_repel(
    aes(x = 1.4, label = grade, y = text_y), nudge_x = 0.3, segment.size = 0.5
  ) +
  coord_polar(theta = "y") +
  scale_fill_openintro("cool") +
  theme_void() +
  labs(title = "Loan grade")

loans |>
  ggplot(aes(x = grade, fill = grade)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_openintro("cool") +
  labs(x = "Loan grade", y = "Count")
```
\vspace{-5mm}
## Waffle charts
Another useful technique for visualizing categorical data is a **waffle chart**.
Waffle charts can be used to communicate the proportion of the data that falls into each level of a categorical variable.
Just like with pie charts, they work best when the number of levels represented is low.
However, unlike pie charts, they can make it easier to compare proportions that represent non-simple fractions.
@fig-loan-waffle-1 is a waffle chart of homeownership and @fig-loan-waffle-2 is a waffle chart of loan status.
```{r}
#| label: fig-loan-waffle
#| fig-cap: Waffle charts of homeownership and loan status.
#| fig-subcap:
#|   - "Homeownership: rent, mortgage, and own"
#|   - "Loan status: fully paid, in grace period, and late"
#| fig-alt: |
#|   Waffle chart of homeownership, with levels rent, mortgage, and own, and
#|   waffle chart of loan status, with levels current, fully paid, in grace
#|   period, and late. The waffle charts are broken down into a 10 by 10 grid
#|   where each square represents 1 percent of the data. The squares are
#|   colored proportionally to the variable distributions.
#| layout: [[46, -6, 46]]
#| fig-width: 5
#| out-width: 100%
loans |>
  count(homeownership) |>
  ggplot(aes(fill = homeownership, values = n)) +
  geom_waffle(
    color = "white", flip = TRUE, make_proportional = TRUE, na.rm = TRUE
  ) +
  labs(fill = NULL, title = "Homeownership") +
  scale_fill_openintro("hot") +
  coord_equal() +
  theme_enhance_waffle() +
  theme(
    legend.position = "bottom",
    legend.text = element_text(size = 13)
  )

loans |>
  count(loan_status) |>
  ggplot(aes(fill = loan_status, values = n)) +
  geom_waffle(
    color = "white", flip = TRUE, make_proportional = TRUE, na.rm = TRUE
  ) +
  labs(fill = NULL, title = "Loan status") +
  scale_fill_openintro("four") +
  coord_equal() +
  theme_enhance_waffle() +
  theme(
    legend.position = "bottom",
    legend.text = element_text(size = 13)
  ) +
  guides(fill = guide_legend(nrow = 2))
```
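Turning proportions into whole squares on a 10 by 10 grid involves rounding, and simple rounding does not always fill the grid exactly.
The sketch below shows one way to allocate the 100 squares, largest-remainder rounding, in base R.
It is an illustration under our own assumptions (hand-entered counts assumed to match @tbl-loan-homeownership-totals, illustrative names `pct` and `squares`), and it is not necessarily the rule that the plotting function above applies internally.

```r
# Homeownership counts entered by hand; assumed to match the table of totals.
home_counts <- c(rent = 3858, mortgage = 4789, own = 1353)
pct <- 100 * home_counts / sum(home_counts)

# Simple rounding gives 39 + 48 + 14 = 101 squares, one too many for the grid.
sum(round(pct))

# Largest-remainder rounding: start from the whole-number parts, then hand the
# leftover squares to the levels with the largest fractional parts.
squares <- floor(pct)
leftover <- 100 - sum(squares)
bump <- order(pct - squares, decreasing = TRUE)[seq_len(leftover)]
squares[bump] <- squares[bump] + 1
squares
```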
## Comparing numerical data across groups
Some of the more interesting investigations can be considered by examining numerical data across groups.
In this section we will expand on a few methods we have already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.
We will revisit the `county` dataset and compare the median household income for counties that gained population from 2010 to 2017 versus counties that had no gain.
While we might like to make a causal connection between income and population growth, remember that these are observational data and so such an interpretation would be, at best, half-baked.
```{r}
n_county <- nrow(county)

n_missing_pop2017 <- county |>
  filter(is.na(pop2017)) |>
  nrow()

n_with_pop_change <- n_county - n_missing_pop2017

county <- county |>
  mutate(
    pop_change_3levels = case_when(
      pop_change < 0 ~ "loss",
      pop_change == 0 ~ "no change",
      pop_change > 0 ~ "gain"
    ),
    pop_change_2levels = if_else(pop_change_3levels == "gain", "gain", "no gain")
  )

n_pop_no_gain <- county |>
  filter(pop_change_2levels == "no gain") |>
  nrow()

n_pop_loss <- county |>
  filter(pop_change_3levels == "loss") |>
  nrow()

n_pop_no_change <- county |>
  filter(pop_change_3levels == "no change") |>
  nrow()

n_pop_gain <- county |>
  filter(pop_change_3levels == "gain") |>
  nrow()
```
We have data on `r n_county` counties in the United States.
We are missing 2017 population data from `r n_missing_pop2017` of them, and of the remaining `r n_with_pop_change` counties, in `r n_pop_gain` the population increased from 2010 to 2017 and in the remaining `r n_pop_no_gain` the population did not increase (it decreased in `r n_pop_loss` counties and was unchanged in `r n_pop_no_change`).
@tbl-countyIncomeSplitByPopGainTable shows a sample of four observations from each group.
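The grouping logic above can be checked on a small example. The following is a minimal sketch that assumes a made-up data frame, `toy_counties` (hypothetical, not part of the `county` data); it shows that the "no gain" level combines losses and unchanged counties, and that a missing `pop_change` stays missing in both derived variables.

```r
library(dplyr)

# Hypothetical toy data: percent population change for six made-up counties
toy_counties <- tibble(
  name = c("A", "B", "C", "D", "E", "F"),
  pop_change = c(2.5, -1.2, 0, 4.1, NA, -0.3)
)

toy_counties <- toy_counties |>
  mutate(
    # No condition is TRUE for a missing value, so case_when() returns NA
    pop_change_3levels = case_when(
      pop_change < 0 ~ "loss",
      pop_change == 0 ~ "no change",
      pop_change > 0 ~ "gain"
    ),
    # if_else() passes the NA through rather than labeling it "no gain"
    pop_change_2levels = if_else(pop_change_3levels == "gain", "gain", "no gain")
  )

# Cross-tabulate the two groupings to see how the levels nest
toy_group_counts <- toy_counties |>
  count(pop_change_2levels, pop_change_3levels)

toy_group_counts
```

Counting by both variables at once is a quick way to confirm that every "loss" and "no change" county ends up in the "no gain" group.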
```{r}
#| label: tbl-countyIncomeSplitByPopGainTable
#| tbl-cap: |
#| The median household income from a random sample of four counties with
#| population gain between 2010 and 2017 and another random sample of four counties
#| with no population gain.
#| tbl-pos: H
county |>
select(state, name, pop_change, pop_change_2levels, median_hh_income) |>
filter(!is.na(pop_change)) |>
group_by(pop_change_2levels) |>
slice_sample(n = 4) |>
arrange(pop_change_2levels, state, name) |>
rename(
State = state,
County = name,
`Population change (%)` = pop_change,
`Gain / No gain` = pop_change_2levels,
`Median household income` = median_hh_income
) |>
kbl(linesep = "", booktabs = TRUE, align = "llccc") |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"), full_width = FALSE
) |>
column_spec(3, width = "8em") |>
column_spec(4, width = "4em") |>
column_spec(5, width = "8em")
```
Color can be used to split histograms (see @sec-histograms for an introduction to histograms) for numerical variables by levels of a categorical variable.
An example of this is shown in @fig-countyIncomeSplitByPopGain-1.
The **side-by-side box plot**\index{plot!side-by-side box}\index{side-by-side box plot} is another traditional tool for comparing across groups.
An example is shown in @fig-countyIncomeSplitByPopGain-2, where there are two box plots (see @sec-boxplots for an introduction to box plots), one for each group, placed into one plotting window and drawn on the same scale.
```{r}
#| include: false
terms_chp_04 <- c(terms_chp_04, "side-by-side box plot")
```
```{r}
#| label: fig-countyIncomeSplitByPopGain
#| fig-cap: |
#| Visualizations of median household income of counties by change in
#| population (gain or no gain).
#| fig-subcap:
#| - Histograms
#| - Side-by-side box plots
#| fig-alt: |
#| Histograms and side-by-side box plots of median household income, where
#| counties are split by whether there was a population gain or not.
#| In both plots, the counties that have had a population gain have a household
#| income distribution with a higher center. Additionally, the histogram
#| (but not the box plot) shows that there are more counties that have had
#| a population gain than that have not had a population gain.
#| fig-asp: 0.23
#| out-width: 90%
county |>
filter(!is.na(pop_change)) |>
ggplot(aes(x = median_hh_income, fill = pop_change_2levels)) +
geom_histogram(binwidth = 5000, alpha = 0.5) +
scale_fill_openintro("two") +
scale_x_continuous(labels = label_dollar(scale = 1/1000, big.mark = "K")) +
labs(x = "Median household income", y = NULL, fill = "Change in\npopulation")
county |>
filter(!is.na(pop_change)) |>
ggplot(aes(x = median_hh_income, y = pop_change_2levels, color = pop_change_2levels)) +
geom_boxplot() +
scale_color_openintro("two") +
scale_x_continuous(labels = label_dollar(scale = 1/1000, big.mark = "K")) +
labs(x = "Median household income", y = NULL, color = "Change in\npopulation")
```
::: {.guidedpractice data-latex=""}
Use the plots in @fig-countyIncomeSplitByPopGain to compare the incomes for counties across the two groups.
What do you notice about the approximate center of each group?
What do you notice about the variability between groups?
Is the shape relatively consistent between groups?
How many *prominent* modes are there for each group?[^04-explore-categorical-3]
:::
[^04-explore-categorical-3]: Answers may vary a little.
The counties with population gains tend to have higher income (median of about \$45,000) versus counties without a gain (median of about \$40,000).
The variability is also slightly larger for the population gain group.
This is evident in the IQR, which is about 50% bigger in the *gain* group.
Both distributions show slight to moderate right skew and are unimodal.
The box plots indicate there are many observations far above the median in each group, though we should anticipate that many observations will fall beyond the whiskers when examining any dataset that contains more than a few hundred data points.
\vspace{-5mm}
::: {.guidedpractice data-latex=""}
What components of each plot in @fig-countyIncomeSplitByPopGain do you find most useful?[^04-explore-categorical-4]
:::
[^04-explore-categorical-4]: Answers will vary.
The side-by-side box plots are especially useful for comparing centers and spreads, while the histograms are more useful for seeing distribution shape, skew, modes, and potential anomalies.
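The centers and spreads that the box plots display can also be computed directly. The sketch below is illustrative only: `toy_income` is a hypothetical data frame with invented incomes, which borrows the column names used in this section, and the summaries it produces are not the values for the `county` data.

```r
library(dplyr)

# Hypothetical incomes for ten made-up counties, five per group
toy_income <- tibble(
  pop_change_2levels = rep(c("gain", "no gain"), each = 5),
  median_hh_income = c(
    40000, 45000, 50000, 55000, 70000,
    32000, 38000, 40000, 43000, 60000
  )
)

# One row per group: group size, median (center), and IQR (spread)
toy_summary <- toy_income |>
  group_by(pop_change_2levels) |>
  summarize(
    n = n(),
    median_income = median(median_hh_income),
    iqr_income = IQR(median_hh_income)
  )

toy_summary
```

The median corresponds to the line inside each box and the IQR to the length of the box, so a table like this one is a numerical companion to the side-by-side box plots.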
Another useful visualization for comparing numerical data across groups is a **ridge plot**\index{plot!ridge}\index{ridge plot}, which combines density plots (see @sec-boxplots for an introduction to density plots) for various groups drawn on the same scale in a single plotting window.
@fig-countyIncomeSplitByPopGainRidge displays a ridge plot for the distribution of median household income in counties, split by whether there was a population gain or not.
```{r}
#| include: false
terms_chp_04 <- c(terms_chp_04, "ridge plot")
```
```{r}
#| label: fig-countyIncomeSplitByPopGainRidge
#| fig-cap: |
#| Ridge plot for median household income, where counties are split by whether
#| there was a population gain or not.
#| fig-alt: |
#| Ridge plot for median household income, where counties are split by whether
#| there was a population gain or not. The figure shows that the counties that
#| have had a population gain have a household income distribution with a
#| higher center.
#| fig-asp: 0.32
#| out-width: 90%
county |>
filter(!is.na(pop_change)) |>
ggplot(
aes(
x = median_hh_income, y = pop_change_2levels,
fill = pop_change_2levels, color = pop_change_2levels
)
) +
geom_density_ridges(alpha = 0.5) +
scale_fill_openintro("two") +
scale_color_openintro("two") +
scale_x_continuous(labels = label_dollar(scale = 1/1000, big.mark = "K")) +
labs(
x = "Median household income",
y = NULL,
fill = "Change in\npopulation",
color = "Change in\npopulation"
)
```
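To see what each ridge represents, it can help to estimate the group densities by hand. The following sketch assumes simulated incomes (`toy_gain` and `toy_no_gain` are hypothetical, generated with `rnorm()`) and uses base R's `density()` to compute a kernel density estimate for each group over a shared range, which is what allows the curves to be compared on the same scale.

```r
# Simulated (hypothetical) incomes for two groups of 200 counties each
set.seed(1)
toy_gain <- rnorm(200, mean = 50000, sd = 10000)
toy_no_gain <- rnorm(200, mean = 42000, sd = 8000)

# Use one common range so both curves share the same x values
income_range <- range(c(toy_gain, toy_no_gain))

dens_gain <- density(
  toy_gain,
  from = income_range[1], to = income_range[2], n = 256
)
dens_no_gain <- density(
  toy_no_gain,
  from = income_range[1], to = income_range[2], n = 256
)

# Income at which each estimated density is highest (the mode of each curve)
peak_gain <- dens_gain$x[which.max(dens_gain$y)]
peak_no_gain <- dens_no_gain$x[which.max(dens_no_gain$y)]
```

Plotting `dens_gain` and `dens_no_gain` one above the other would give a rough, hand-built version of a ridge plot.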
::: {.guidedpractice data-latex=""}
What components of the ridge plot in @fig-countyIncomeSplitByPopGainRidge do you find most useful compared to those in @fig-countyIncomeSplitByPopGain?[^04-explore-categorical-5]
:::
[^04-explore-categorical-5]: The ridge plot gives us a better sense of the shape, and especially modality, of the data.
One last visualization technique we'll highlight for comparing numerical data across groups is **faceting**\index{plot!faceted}\index{faceted plot}.
In this technique we split (facet) the graphical display of the data across plotting windows based on groups.
@fig-countyIncomeSplitByPopGainFacetHist-1 displays the same information as @fig-countyIncomeSplitByPopGain-1; however, here the distributions of median household income for counties with and without population gain are faceted across two plotting windows.
We preserve the same scale on the x and y axes for easier comparison.
An advantage of this approach is that it extends to splitting the data across levels of two categorical variables, which allows for displaying relationships between three variables.
In @fig-countyIncomeSplitByPopGainFacetHist-2 we have now split the data into four groups using the `pop_change` and `metro` variables:
- top left represents counties that are *not* in a metropolitan area with population gain,
- top right represents counties that are in a metropolitan area with population gain,
- bottom left represents counties that are *not* in a metropolitan area without population gain, and finally
- bottom right represents counties that are in a metropolitan area without population gain.
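The two-way layout described above can be sketched on simulated data. In this example, `toy_facets` is a hypothetical data frame with invented incomes; `facet_grid()` places the levels of the first variable in rows and the levels of the second in columns, and `scales = "fixed"` (the default, written out here for emphasis) keeps the x and y axes identical across panels.

```r
library(ggplot2)

# Simulated (hypothetical) incomes: 50 made-up counties in each of four groups
set.seed(2)
toy_facets <- data.frame(
  median_hh_income = c(
    rnorm(50, 45000, 8000), rnorm(50, 55000, 9000),
    rnorm(50, 38000, 7000), rnorm(50, 48000, 8000)
  ),
  pop_change = rep(c("gain", "no gain"), each = 100),
  metro = rep(rep(c("no", "yes"), each = 50), times = 2)
)

# Rows are split by pop_change, columns by metro, all on a common scale
toy_facet_plot <- ggplot(toy_facets, aes(x = median_hh_income)) +
  geom_histogram(binwidth = 5000) +
  facet_grid(pop_change ~ metro, labeller = label_both, scales = "fixed")

# Number of panels equals the number of distinct row-by-column combinations
n_panels <- nrow(unique(toy_facets[, c("pop_change", "metro")]))

toy_facet_plot
```

Setting `scales = "free"` would instead let each panel choose its own axes, which can reveal the shape within a panel but makes comparisons across panels harder.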
```{r}
#| include: false
terms_chp_04 <- c(terms_chp_04, "faceted plot")
```
::: {#fig-countyIncomeSplitByPopGainFacetHist layout="[[ 30, 70 ]]" layout-valign="bottom"}
```{r}
#| label: fig-countyIncomeSplitByPopGainFacetHist-1
#| fig-cap: By population gain.
#| fig-alt: |
#| Distribution of median income in counties, faceted by whether there was a population gain or not.
#| fig-width: 2.14
#| fig-asp: 1.752
#| out-width: 100%
county |>
filter(!is.na(pop_change) & !is.na(metro)) |>
# for better labeling on plot
rename(
pop_change_num = pop_change,
pop_change = pop_change_2levels
) |>
ggplot(aes(x = median_hh_income, fill = pop_change)) +
geom_histogram(binwidth = 7500) +
scale_fill_openintro("two") +
scale_x_continuous(labels = label_dollar(scale = 1/1000, big.mark = "K")) +
facet_grid(pop_change ~ ., labeller = label_both) +
labs(x = "Median household income", y = NULL) +
guides(fill = "none") +
theme(axis.title.x = element_text(size = 10))
```
```{r}
#| label: fig-countyIncomeSplitByPopGainFacetHist-2
#| fig-cap: By both population gain and metropolitan area.
#| fig-alt: |
#| Distribution of median income in counties, faceted by whether there was a
#| population gain and whether the county is in a metropolitan area.
#| Those counties in metropolitan areas have household income distributions
#| which are higher than those which are not in metropolitan areas.
#| fig-width: 5
#| fig-asp: 0.75
#| out-width: 100%
county |>
filter(!is.na(pop_change) & !is.na(metro)) |>
# for better labeling on plot
rename(
pop_change_num = pop_change,
pop_change = pop_change_2levels
) |>
ggplot(aes(x = median_hh_income, fill = pop_change)) +
geom_histogram(binwidth = 5000) +
scale_fill_openintro("two") +
scale_x_continuous(labels = label_dollar(scale = 1/1000, big.mark = "K")) +
facet_grid(pop_change ~ metro, labeller = label_both) +
labs(x = "Median household income", y = NULL) +
guides(fill = "none") +
theme(
axis.title.x = element_text(size = 10),
strip.placement = "outside"
)
```
Distribution of median income in counties using faceted histograms.
:::
We can continue building upon this visualization to add one more variable, `median_edu`, which is the median education level in the county.
In @fig-countyIncomeRidgeMulti, we represent median education level using color and line type, where pink (solid line) represents counties where the median education level is a high school diploma, yellow (dashed line) is some college, and red (dotted line) is a Bachelor's degree.
::: {.guidedpractice data-latex=""}
Based on @fig-countyIncomeRidgeMulti, what can you say about how median household income in counties varies depending on population gain/no gain, metropolitan area/not, and median degree?[^04-explore-categorical-6]
:::
[^04-explore-categorical-6]: Regardless of the location (metropolitan or not) or change in population, it seems like there is an increase in median household income from counties where the median education level is a high school diploma, to counties where it is some college, to counties where it is a Bachelor's degree.
\vspace{-5mm}
```{r}
#| label: fig-countyIncomeRidgeMulti
#| fig-cap: |
#| Distribution of median income in counties using a ridge plot, faceted by
#| whether the county had a population gain or not as well as whether the county is
#| in a metropolitan area and colored by the median education level in the county.
#| fig-alt: |
#| Distribution of median income in counties using a ridge plot, faceted by
#| whether the county had a population gain or not as well as whether the county is
#| in a metropolitan area and colored by the median education level in the county.
#| Those counties where the median education level is a bachelor's degree have
#| household income distributions that are substantially higher than counties
#| with some college or high school degree only as their education level.
#| out-width: 100%
#| fig-asp: 0.45
county |>
filter(!is.na(pop_change) & !is.na(metro) & !is.na(median_edu) & median_edu != "below_hs") |>
# for better labeling on plot
rename(
pop_change_num = pop_change,
pop_change = pop_change_2levels
) |>
ggplot(aes(x = median_hh_income, y = median_edu, fill = median_edu, color = median_edu)) +
geom_density_ridges(alpha = 0.5, aes(linetype = median_edu)) +
scale_fill_openintro("hot") +
scale_color_openintro("hot") +
scale_linetype_manual(values = c("solid", "dashed", "dotted")) +
scale_x_continuous(labels = label_dollar(scale = 1/1000, big.mark = "K")) +
facet_grid(pop_change ~ metro, labeller = label_both) +
labs(x = "Median household income", y = NULL) +
guides(fill = "none", color = "none", linetype = "none")
```
\clearpage
## Chapter review {#sec-chp4-review}
### Summary
Fluently working with categorical variables is an important skill for data analysts.
In this chapter we have introduced different visualizations and numerical summaries applied to categorical variables.
The graphical visualizations are even more descriptive when two variables are presented simultaneously.
We presented bar plots, mosaic plots, pie charts, and estimations of conditional proportions.
### Terms
The terms introduced in this chapter are presented in @tbl-terms-chp-04.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.
```{r}
#| label: tbl-terms-chp-04
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_04)
```
\clearpage
## Exercises {#sec-chp4-exercises}
Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-04].
::: {.exercises data-latex=""}
{{< include exercises/_04-ex-explore-categorical.qmd >}}
:::