## Cleaning Data
We often talk about data as being "clean," "messy," "raw," "processed," or some combination of those adjectives. Clean data is easy to understand, manipulate, and distill. Messy data can take a lot of work to refine into something more useful. "Raw" data is a little more complicated. We make consequential decisions at every step about what kinds of information we will collect, what units of measurement we will use, and which terms we will use to describe our observations, and each of those decisions transforms the data. With that in mind, data can never truly be "raw" [@gitelman_raw_2013]. For a field archaeology example, we make a choice to measure and draw all artefacts and inclusions in a soil layer, or only some (leaving out exhaustive detail about categories like "rubble"). We make a choice to record the dimensions of artefacts and stratigraphy visually with a ruler to the centimeter, or to the half centimeter, or with precision instrumentation to fractions of a millimeter. We might excavate by natural soil layers or by arbitrary levels. Each of these decisions influences the resulting data.
When we're working with someone else's pre-existing, or "legacy" data, it's important to think about what kinds of choices and assumptions may have been made by the original investigator. What were the original project's goals? What *didn't* the original creator attempt to capture?
The way you choose to clean your data will be informed by the structure you want the dataset to take. Have a look at the following section on [arranging and storing data](/book/arranging-and-storing-data-for-the-long-haul-databases.html) before you settle on a methodology for data cleaning. And note that a key element here, as in the prior sections on useful data, preservation, and data collection, is *planning*.
### Tools
You can clean data using [regular expressions](https://programminghistorian.org/en/lessons/understanding-regular-expressions), or in Microsoft Excel using built-in functionality. You could also use a programming language like Python or R to perform text operations. But one of the most accessible and powerful free tools for cleaning up messy data is [OpenRefine](http://openrefine.org/).
<iframe width="560" height="315" src="https://www.youtube.com/embed/B70J_H_zAWM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Before you go any further, take a few moments to watch the OpenRefine videos, download the program, and learn how it works. You won't be sorry, but you may shed a tear for time wasted manually cleaning up spreadsheet data in the past.
### Cleaning Tasks
**Gathering from many datasets**
When working with legacy data, or even on your own large projects, you'll need to begin by gathering many datasets into a harmonious collection. Don't transform "original" data. Any cleanup should happen on derivative copies. Make copies of everything at the very beginning and store them somewhere safe.
Building on ideas discussed in the previous section on [Designing Data Collection](/book/designing-data-collection-and-curation.html), documentation and transparency are key concepts. Where did you get your datasets? Write that down. Did you do anything (like transcription) to make them computer-readable? Write that down too, describing your methods.
Are you bringing data from multiple places together? Will they fit nicely, or will they need to be massaged a bit in order to be unified?
Examine your dataset(s) closely. Does each row describe one thing? What is that thing? Sometimes articulating the structure of data is more difficult than it seems. For example, take this dataset describing National Register of Historic Places listings in Virginia.
```text
Name Listing_Date ID
<chr> <chr> <chr>
1 Fifth Street Historic District 2/8/2012 118-5318
2 Armstrong School 8/22/2012 118-5320
3 Pierce Street Historic District 8/25/2014 118-5238
4 Doyle Florist Inc./ H. R. Schenkel Inc. Greenhouse Range 11/19/2014 118-5294
5 Lynchburg Hosiery Mill #1 2/21/2017 118-0126
```
On first glance, each row in the dataset might appear to be describing a building, historic district, or archaeological site. But on closer examination, it becomes clear that each row is describing a *document* which then itself describes a historic property. Whether you created a dataset or you're working with inherited data, it is worthwhile to ensure that your data are describing what you think they are.
Define your metadata. Create a document that explains each column heading. Avoid calculated values inside your datasets themselves; leave that bit for the analysis.
Normalization refers to each fact being expressed in exactly one place. The objective is to divide your information into as many standard subsets as is practical. However, perfect atomic specificity is impossible and is not going to help anyone; getting too granular may produce a huge, unwieldy dataset. Ultimately, analysis will likely require recombining data, but that task will be straightforward if your data are normalized. Whether you're working in a relational database or performing analyses on derived tables, the appropriate degree of normalization may vary, but considering normalization from the start can keep things clean. A spreadsheet that mixes variables between rows and columns may be useful at a glance for visual interpretation, but the same data may not work very easily for analysis. A collection of over-normalized tables, on the other hand, can be quite challenging to simply read and interpret (and may carry a computing performance cost later on if you're working with large amounts of information).
Generally, each row should represent an observation and each column should contain a variable.<sup>[1](#foot_2_1a_1)</sup> Watch out for multiple kinds of information packed into one column or field. Splitting these data into separate columns will make things much easier to work with.
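As a quick illustration of that principle, here is a minimal sketch in R with the tidyverse; the table and its counts are invented for the example, not drawn from any of the datasets below.

```r
library(tidyr)
library(tibble)

# A spreadsheet-style layout: one column per period, with counts spread across them.
counts_wide <- tribble(
  ~site,       ~Archaic, ~Woodland,
  "44LD0500",        12,        30,
  "44SP0637",         4,         7
)

# Tidy layout: each row is one observation (site x period), each column one variable.
counts_tidy <- pivot_longer(
  counts_wide,
  cols      = c(Archaic, Woodland),
  names_to  = "period",
  values_to = "artefact_count"
)

counts_tidy
```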
Following are several steps to prepare data for analysis and reuse, using examples from several different kinds of messy datasets.
1. **Characterize the messiness**
Before doing any real cleaning or reshaping of your data, take some time to explore. Are there leading or trailing spaces in any of your values? Are there typos or nonstandard formatting? Do dates and numbers look the way they are supposed to? Don't make any changes immediately, but note what you'd like to do. After a bit of exploration, the best order of operations will become clearer. Handling spaces and formatting issues at the outset is usually good advice, however.
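In R, for instance, a few quick checks can surface these issues before you touch anything. This is only a sketch, using a made-up column modeled on the genealogical example that follows; OpenRefine's facets do the same job interactively.

```r
library(dplyr)
library(stringr)
library(tibble)

# A made-up messy column, standing in for real data.
births <- tibble(
  birth = c("abt 1863 (Richmond)", " 16 Mar 1898 (North Carolina)", "15 Jun 1877 ")
)

# How many distinct values (and spellings) are there?
births %>% count(birth, sort = TRUE)

# Which entries have leading or trailing whitespace?
births %>% filter(str_detect(birth, "^\\s|\\s$"))

# Which entries are only approximate dates?
births %>% filter(str_detect(birth, regex("^\\s*abt", ignore_case = TRUE)))
```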
```text
NAME BIRTH (date, location) DEATH SPOUSE(S)
<chr> <chr> <chr> <chr>
1 Bacchus, Ellen P. abt 1863 (Richmond) 9 May 1916 n/a
2 Ballard (Reese), Bessie E. Abt 1879 (Richmond) 29 Dec 1928 John L. Ballard Sr. (√) (m. 18 Sep 1898, in Manhattan)
3 Ballard, Dora Maude [went by Maude] 16 Mar 1898 (North Carolina) 31 Oct 1973 John L. Ballard Sr. (√)
4 Ballard, John Lee, Sr. 15 Jun 1877 9 Sep 1949 Dora M. — (√); Bessie E. Reese (√) (m. 18 Sep 1898, in Manhattan)
```
In the example above, notice that we've got multiple types of data stored in individual columns. In the BIRTH column we have information about both the date and the birth place (within parentheses), but we've also got "abt" indicating an approximate date. Some dates include the day, month, and year, but others only include a year. The SPOUSE(S) column includes not only one or more names, but also a "√" symbol to indicate something, and potentially a marriage date and location. Each of those elements of information deserves its own column or row. Even the name can be separated into First, Middle, Last, Maiden, Other, Honorific, Suffix, etc. The lengths to which you need to go depend on the end goals of your project.
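To give a flavour of what that splitting can look like in code, here is a sketch in R that pulls the BIRTH column apart into an approximate-date flag, a date string, and a place. The data frame and column names are assumptions for the example, and OpenRefine or another tool would do the job equally well.

```r
library(dplyr)
library(stringr)
library(tibble)

people <- tibble(
  birth = c("abt 1863 (Richmond)", "Abt 1879 (Richmond)",
            "16 Mar 1898 (North Carolina)", "15 Jun 1877")
)

people_split <- people %>%
  mutate(
    # Flag approximate dates marked with "abt"/"Abt".
    birth_approximate = str_detect(birth, regex("^abt", ignore_case = TRUE)),
    # The birth place sits inside the parentheses.
    birth_place = str_match(birth, "\\(([^)]+)\\)")[, 2],
    # Whatever is left, minus the "abt" and the parentheses, is the date string.
    birth_date = str_trim(str_remove(str_remove(birth,
                   regex("^abt\\s*", ignore_case = TRUE)), "\\s*\\([^)]*\\)"))
  )

people_split
```

Splitting the date further into day, month, and year columns, and breaking the names apart, follows the same pattern; the result is a table like the one below.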
```text
cf_ppl_surname cf_ppl_first_name cf_ppl_middle_name cf_ppl_other_name cf_ppl_honorif cf_ppl_birth_month cf_ppl_birth_year cf_ppl_birth_day
<chr> <chr> <chr> <chr> <chr> <chr> <int> <int>
1 Ballard Dora Maude NA NA NA 1898 16
2 Ballard Bessie E. Reese NA NA 1879 NA
3 Bacchus Ellen P. NA NA NA 1850 NA
4 Ballard John Lee NA NA June 1877 15
```
2. **Consider multiple related tables.**
The data below come from a project to create a finding aid for a large collection of archaeological field notes. Notice that the "Associated Site Numbers" column contains a mix of individual site numbers, ranges, and multiple values separated by commas. This was a practical choice when the data were collected, since multiple people were entering data via a simple [KoboToolbox](https://www.kobotoolbox.org/) form on a mobile device. But to move these data into a relational database or perform any analysis, we'll need to split those values into one table about the recordsets and another about the sites, as sketched below the table.
```text
Box ID Recordset ID Recordset Name Year Associated Site Numbers
<chr> <chr> <chr> <int> <chr>
1 FN097 RS 57214 Dulles International Airport Live Fir… 1991 44LD0500
2 FN097 RS 90948 Dulles International Airport Satellit… 1992 44LD0423
3 FN097 RS 22491 Fall HIll Avenue Bridge Replacement 2010 44SP0637
4 FN097 RS 40525 Bluestone Ph I 2007 44TZ0157-44TZ0168
5 FN097 RS 25592 Bluestone Ph II 2008 44TZ0160-44TZ0162, 44TZ0165
6 FN097 RS 1583 Route 5 Improvements 2007 44HE0057, 44HE1079, 44HE1080, 44HE1081
```
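One way to do this split in R is `tidyr::separate_rows()`, which handles the comma-separated values; expanding the hyphenated ranges (e.g. 44TZ0157-44TZ0168 into twelve site numbers) needs an extra step not shown here. A minimal sketch, with two rows of the table above typed in by hand:

```r
library(dplyr)
library(tidyr)
library(tibble)

field_notes <- tribble(
  ~`Box ID`, ~`Recordset ID`, ~`Recordset Name`,     ~Year, ~`Associated Site Numbers`,
  "FN097",   "RS 25592",      "Bluestone Ph II",      2008, "44TZ0160-44TZ0162, 44TZ0165",
  "FN097",   "RS 1583",       "Route 5 Improvements", 2007, "44HE0057, 44HE1079, 44HE1080, 44HE1081"
)

# The recordset-level table keeps one row per recordset.
field_notes_info <- field_notes %>%
  select(`Box ID`, `Recordset ID`, `Recordset Name`, Year)

# The site table gets one row per recordset-site pair.
field_notes_sites <- field_notes %>%
  select(`Recordset ID`, `Associated Site Numbers`) %>%
  separate_rows(`Associated Site Numbers`, sep = ",\\s*") %>%
  rename(Split_Site_Nums = `Associated Site Numbers`)
```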
Again, there are lots of methods to parse those data, but in this case we used OpenRefine. After processing, we end up with two tables, field_notes_info and field_notes_sites:
```text
Box ID Recordset ID Recordset Name Year
<chr> <chr> <chr> <int>
1 FN097 RS 57214 Dulles International Airport Live Fire and Police/Fire 1991
2 FN097 RS 90948 Dulles International Airport Satellite Parking 1992
3 FN097 RS 22491 Fall HIll Avenue Bridge Replacement 2010
4 FN097 RS 40525 Bluestone Ph I 2007
5 FN097 RS 25592 Bluestone Ph II 2008
6 FN097 RS 1583 Route 5 Improvements 2007
```
*Table 2.1.x, field_notes_info*
```text
`Recordset ID` Split_Site_Nums
<chr> <chr>
1 RS 57214 44LD0500
2 RS 90948 44LD0423
3 RS 22491 44SP0637
4 RS 40525 44TZ0157
5 RS 40525 44TZ0158
6 RS 40525 44TZ0159
7 RS 40525 44TZ0160
8 RS 40525 44TZ0161
9 RS 40525 44TZ0162
10 RS 40525 44TZ0163
11 RS 40525 44TZ0164
12 RS 40525 44TZ0165
13 RS 40525 44TZ0166
14 RS 40525 44TZ0167
15 RS 40525 44TZ0168
16 RS 25592 44TZ0160
17 RS 25592 44TZ0161
18 RS 25592 44TZ0162
19 RS 25592 44TZ0165
20 RS 1583 44HE0057
21 RS 1583 44HE1079
22 RS 1583 44HE1080
23 RS 1583 44HE1081
```
*Table 2.1.x, field_notes_sites*
3. **Bring them together**
We can now join those two tables on the "Recordset ID" column that appears in both. You can find more about data structure and joining in the [Databases](https://o-date.github.io/draft/book/arranging-and-storing-data-for-the-long-haul-databases.html) section.
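In R, for instance, that join is a single `dplyr::left_join()` call (a sketch, assuming the two tables exist as data frames named as above):

```r
library(dplyr)

# One row per recordset-site pair, with the recordset details carried alongside.
field_notes_joined <- left_join(field_notes_sites, field_notes_info,
                                by = "Recordset ID")
```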
### Takeaways
+ Don't clean or transform your original data. Always work on copies and keep track of your methods.
+ Take time to explore your data to develop a cleaning strategy. Ask a lot of questions of your data.
+ Don't do cleanup manually if you don't have to. Learning tools/languages like OpenRefine, Python, and R will pay off in time saved and mistakes avoided.
### Further Reading
[Cleaning Data with OpenRefine](https://programminghistorian.org/en/lessons/cleaning-data-with-openrefine), from The Programming Historian
[Formatting Problems](https://datacarpentry.org/spreadsheet-ecology-lesson/), from Data Carpentry's *Data Organization in Spreadsheets* lesson
Broman, Karl W., and Kara H. Woo 2018 Data Organization in Spreadsheets. *The American Statistician* 72(1):2–10. DOI:[10.1080/00031305.2017.1375989](https://doi.org/10.1080/00031305.2017.1375989).
Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal 2017 Good enough practices in scientific computing. *PLOS Computational Biology* 13(6):e1005510. DOI:[10.1371/journal.pcbi.1005510](https://doi.org/10.1371/journal.pcbi.1005510).
<a name="foot_2_1a_1">1</a>: These principles generally reflect Hadley Wickham's notion of "tidy data" for use in R; see [@hadley_wickham_tidy_2014].