-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathimport_to_report.qmd
405 lines (287 loc) · 17.9 KB
/
import_to_report.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
# From importing to reporting
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
In [First Steps in RStudio](first_steps_rstudio.html) we typed data into R. This is not very practical when you have a lot of data! Instead, we much more commonly import data from a file. In this chapter we go through the workflow from importing, through summarising and plotting to saving a saving the figure.
## Importing data from files
There are two things you need to know before you can import data from a file.
1. What format the data are in
The format of the data determines what *function* you will use to import it. Often the file extension indicates format.
- `.txt` a plain text file[^import_to_report-1], where the columns are often separated by a space but might also be separated by a tab, a backslash or forward slash, or some other character
- `.csv` a plain text file where the columns are separated by commas
- `.xlsx` an Excel file More detail on file types was covered in [Understanding file systems](file_systems.html#files)
However, you should always check the file to make sure it is in the format you expect because there is little to force a match between a file's contents and the extension in its name. You can check by opening the file in a text editor (e.g., Notepad on Windows, TextEdit on Mac) or in RStudio (see below).
2. Where the file is relative to your working directory
`R` can only read in a file if you say where it is, *i.e.*, you give its **relative path**. More detail on relative file paths and working directories was covered in [Understanding file systems](file_systems.html#working-directories)
🎬 Your turn! If you want to code along you will need to start a new [RStudio project](workflow_rstudio.html#rstudio-projects) then a new script.
This chapter covers reading `.txt` files and `.csv` files using **`tidyverse`** [@tidyverse] functions and excel files using the **`readxl`** [@readxl] package. We will demonstrate what needs to be done differently if the file is not in your working directory.
```{r}
library(tidyverse)
library(readxl)
```
### Importing data from `.txt` file
The data in [adipocytes.txt](data-raw/adipocytes.txt) give the concentration of a hormone called adiponectin in some cells. There are two columns: the first gives the adiponectin concentration and the second, treatment, indicates whether the cells were treated with nicotinic acid or not. Save this file to the project folder.
A `.txt` extension suggests this is plain text file with columns separated by spaces. However, before we attempt to read it in, when should take a look at it. We can do this from RStudio by clicking on the file in the Files pane. Any plain text file will open in the top left pane.
![The adipocytes.txt data file open. We can see the columns are separated by spaces](images/text-data-file-open.png){#fig-text-data-file-open fig-alt="screenshot of RStudio showing the data file open" width="600"}
The files are separated by spaces as we suspected. We use the `read_table()` command to read in plain text files of single columns, or where the columns are separated by spaces:
```{r}
#| eval: false
adipo <- read_table("adipocytes.txt")
```
The data from the file has been read into a dataframe called `adipo`. You will and you will be able to see it in the Environment window. Clicking on it in the Environment window will open a spreadsheet-like view of the dataframe.
### Importing a from a`.csv` file
The data [seal.csv](data-raw/seal.csv) give the myoglobin concentration of skeletal muscle for three species of seal. There are two columns: the first gives the myoglobin concentration and the second indicates species.
The `.csv` extension suggests this is plain text file with columns separated by commas. We will again check this before we attempt to read it in. Click on the file in the Files pane - a pop-up will appear for files ending `.csv` or `.xlsx`. Choose View File[^import_to_report-2].
![Rstudio Files pane showing the data files and the View File option that appears when you click on the a particular file](images/csv-data-file-view.png){#csv-data-file-view fig-alt="screenshot of RStudio showing the View File option that appears when you click on the a particular file" width="600"}
CSV files will open in the top left pane (Excel files will launch Excel). We can see that the file does contain comma separated values. There is a`read_csv()` function which works very like `read_table()`\[\^working_with_data_rstudio-3\]:
```{r}
#| eval: false
seal <- read_csv("seals.csv")
```
### Importing a from a`.xlsx` file
The data in [blood.xlsx](data-raw/blood.xlsx) are measurements of several blood parameters from fifty people with Crohn's disease, a lifelong condition where parts of the digestive system become inflamed. Twenty-five of people are in the early stages of diagnosis and 25 have started treatment.
```{r}
#| eval: false
blood <- read_excel("blood.xlsx")
```
### Importing data from a file not in your working directory
When using an RStudio project, your working directory is the project folder (the folder containing the `.Rproj` file. Suppose our file [adipocytes.txt](data-raw/adipocytes.txt) is in a folder, `data-raw`, in our working directory.
```
-- myproject
|__myproject.Rproj
|__import.R
|__data-raw
|__adipocytes.txt
|__blood.xlsx
|__seal.csv
```
We need to adjust the code to give the *relative path* to the datafile:
```{r}
adipo <- read_table("data-raw/adipocytes.txt")
```
The [relative path](https://3mmarand.github.io/comp4biosci/file_systems.html#relative-paths) is the path from the working directory to the file. In this case, the working directory is `myproject` and the relative path is `data-raw/adipocytes.txt`.
🎬 Your turn! Create a folder called `data-raw` inside the project folder and move the data files to it. Now modify the data import code to import seal.csv from `data-raw`.
```{r}
#| code-fold: true
seal <- read_csv("data-raw/seal.csv")
```
## Summarising data
We summarise data using the the **`dplyr`** [@dplyr] package, which provides a set of functions designed for efficient data manipulation. This is a **`tidyverse`** [@tidyverse] package which you already loaded. The approach replies on the data being in a tidy format, meaning each column represents a variable, each row represents an observation, and each cell contains a single value. The pipeline is:
- Group the data: If you want to summarize your data based on certain groups, you can use the `group_by()`
- Summarise: Once your data grouped (if necessary), you use `summarise()` with functions like` mean()`, `median()`, `sd()`, `min()`and `max()` within it.
We will demonstrate summarising using the `adipo` dataframe. `adiponectin` is the response and is continuous and `treatment` is an explanatory with categorical with two levels (groups).
The most useful summary statistics for a continuous variable like `adiponectin` are the means, standard deviations, sample sizes and standard errors. We use the `group_by()` and `summarise()` functions along with the functions that do the calculations.
To create a data frame called `adipo_summary` that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples:
```{r}
#| include: false
adipo <- read_table("data-raw/adipocytes.txt")
```
```{r}
adipo_summary <- adipo |>
group_by(treatment) |>
summarise(mean = mean(adiponectin),
std = sd(adiponectin),
n = length(adiponectin),
se = std/sqrt(n))
```
You can type:
```{r}
adipo_summary
```
or click on environment to open a spreadsheet-like view of the dataframe.
### Visualise data
Most commonly, we put the explanatory variable on the *x* axis and the response variable on the *y* axis. A continuous response, particularly one that follows the normal distribution, is best summarised with the mean and the standard error. In my opinion, you should also show all the raw data points if possible.
We are going to create a figure like this:
```{r}
#| echo: false
ggplot() +
geom_point(data = adipo, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "gray50") +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Adiponectin (pg/mL)",
limits = c(0, 12),
expand = c(0, 0)) +
scale_x_discrete(name = "Treatment",
labels = c("Control", "Nicotinic acid")) +
theme_classic()
```
#### **`ggplot2`**
**`ggplot2`** [@ggplot2] is a powerful data visualisation package in R that is part of the tidyverse. It provides a flexible and layered approach to creating high-quality and customizable graphics.
The core concept of **`ggplot2`** is to build a plot layer by layer. The basic structure consists of three main components:
1. Data: the data frame to be used for plotting.
2. Aesthetic mappings (aes): how variables in the data map to visual elements such as *x* and *y* positions, colours, shapes, etc.
3. Geometric objects (geoms): the actual graphical elements used to visualize the data, such as points, lines, bars, etc.
To create a basic plot, you start with the `ggplot()` function and provide the data and aesthetic mappings.
```
ggplot(data = adipo, aes(x = treatment, y = adiponectin))
```
You can add geometric layers to the plot using specific functions such as `geom_point()`, `geom_line()`, `geom_bar()`, etc. These functions define the type of plot you want to create:
```
ggplot(data = adipo, aes(x = treatment, y = adiponectin)) +
geom_point()
```
In the figure we are aiming for, we are plotting two dataframes: the `adipo` dataframe which contains the data points themselves; and the `adipo_summary` dataframe containing and the means and standard errors.
The dataframes and aesthetics for ggplot can be specified *within a `geom_xxxx`* (rather than in the `ggplot()`). This is very useful if the geom only applies to some of the data you want to plot.
::: callout-tip
## Tip: `ggplot()`
You put the `data` argument and `aes()` inside `ggplot()` if you want all the `geoms` to use that dataframe and variables. If you want a different dataframe for a `geom`, put the `data` argument and `aes()` inside the `geom_xxxx()`
:::
I will build the plot up in small steps you should edit your *existing* `ggplot()` command as we go.
We will plot the data points first. Notice that we have given the data argument and the aesthetics *inside* the `geom_point()`. The variables `treatment` and `adiponectin` are in the `adipo` dataframe
```{r}
ggplot() +
geom_point(data = adipo,
aes(x = treatment, y = adiponectin))
```
So the data points don't overlap, we can add some random jitter in the *x* direction (edit your existing code):
```{r}
ggplot() +
geom_point(data = adipo,
aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0))
```
Note that `position = position_jitter(width = 0.1, height = 0)` is inside the `geom_point()` parentheses, after the `aes()` and a comma.
We've set the vertical jitter to 0 because, in contrast to the categorical *x*-axis, movement on the *y*-axis has meaning (the adiponectin levels).
Let's make the points a light grey (edit your existing code):
```{r}
ggplot() +
geom_point(data = adipo,
aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50")
```
Now to add the errorbars. These go from one standard error below the mean to one standard error above the mean.
Add a `geom_errorbar()` for errorbars (edit your existing code):
```{r}
ggplot() +
geom_point(data = adipo, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3)
```
We have specified the `adipo_summary` dataframe and the variables `treatment`, `mean` and `se` are in that.
There are several ways you could add the mean. You could use `geom_point()` but I like to use `geom_errorbar()` again with the `ymin` and `ymax` both set to the mean.
Add a `geom_errorbar()` for the mean (edit your existing code):
```{r}
ggplot() +
geom_point(data = adipo, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2)
```
Alter the axis labels and limits using `scale_y_continuous()` and `scale_x_discrete()` (edit your existing code):
```{r}
ggplot() +
geom_point(data = adipo, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Adiponectin (pg/mL)",
limits = c(0, 12),
expand = c(0, 0)) +
scale_x_discrete(name = "Treatment",
labels = c("Control", "Nicotinic acid"))
```
You only need to use `scale_y_continuous()` and `scale_x_discrete()` to use labels that are different from those in the dataset. Often this is to use proper terminology and captialisation.
Format the figure in a way that is more suitable for including in a report using `theme_classic()` (edit your existing code):
```{r}
ggplot() +
geom_point(data = adipo, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "gray50") +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Adiponectin (pg/mL)",
limits = c(0, 12),
expand = c(0, 0)) +
scale_x_discrete(name = "Treatment",
labels = c("Control", "Nicotinic acid")) +
theme_classic()
```
[^import_to_report-1]: Plain text files can be opened in notepad or other similar editor and still be readable.
[^import_to_report-2]: There is also and option to import the dataset. **Do not** be tempted to import data this way! Unless you are careful and know what you are doing, your data import will not be scripted or will not be scripted correctly.
The `ggsave()` function is used to save a `ggplot` object as an image file. It provides a convenient way to export your plots to various file formats, such as PNG, PDF, SVG, or JPEG.
The basic syntax of `ggsave()` is as follows:
```
ggsave(filename,
plot,
device,
width,
height,
units,
dpi)
```
You must give a file name for the output file but all the other options have defaults.
- plot: The ggplot object you want to save. Defaults to the last created plot.
- device: one of "png", "eps", "ps", "tex", "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). Defaults to the format given by the file extension in filename
- width, height units: Plot size in units ("in", "cm", "mm", or "px"). Defaults to the size of the plot in the Plots window.
- dpi: Plot resolution.
We can save the figure we just created as a 3 inch x 3 inch png file as follows:
```{r}
ggsave("adipocytes.png",
device = "png",
width = 3,
height = 3,
units = "in",
dpi = 300)
```
It is often a good idea to explicitly assign the ggplot object to a variable and use that in the `ggsave()`
```{r}
fig1 <- ggplot() +
geom_point(data = adipo, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "gray50") +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipo_summary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Adiponectin (pg/mL)",
limits = c(0, 12),
expand = c(0, 0)) +
scale_x_discrete(name = "Treatment",
labels = c("Control", "Nicotinic acid")) +
theme_classic()
ggsave("adipocytes.png",
plot = fig1,
device = "png",
width = 3,
height = 3,
units = "in",
dpi = 300)
```
## Tidy data
Data that is in a format that is easy to work with is often referred to as "tidy data". Tidy data is a concept that was introduced by Hadley Wickham in his paper Tidy Data [@Wickham2014-nl]. The paper is available online at <http://vita.had.co.nz/papers/tidy-data.pdf>.
Tidy data:
- Each variable should be in one column.
- Each different observation of that variable should be in a different row.
- There should be one table for each "kind" of data.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
These concepts have been around for a long time and underlie formats enforced in several statistical packages such as SPSS, minitab and SAS.
Recognising the structure of your data and organising it in tidy format is one of the most important step in data analysis.