-
Notifications
You must be signed in to change notification settings - Fork 21
/
LakeSuperior.Rmd
131 lines (99 loc) · 6.92 KB
/
LakeSuperior.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
title: "Lake Superior water quality study"
author: "Douglas Bates"
date: "November 5, 2015"
output:
html_document:
fig_caption: yes
number_sections: yes
toc: yes
---
# Introduction
Those taking the Master's Exam in the Department of Statistics at the University of Wisconsin - Madison are provided with data and a problem description plus access to the client. In the Fall 2014 exam one of the problems related to changes in characteristics of Lake Superior over time. The description of the problem stated
> Understanding the “health” of Wisconsin’s Great Lakes as well as other Great Lakes, has been receiving increasing attention. Many of the currently existing indicators for ecosystem “health” in the Great Lakes are likely valid but are biologically, chemically, and physically compartmentalized (e.g. [16 indicators recommended to the International Joint Commission (IJC)](
http://ijc.org/files/publications/Technical%20Report_Eco%20Indicators_2013.pdf). Certainly, these and other indicators reasonably reflect conditions and processes of interest to the Great Lakes community. However, two important aspects are lacking. First, as acknowledged by the IJC, there are no threshold levels recommended and the inherent variability of metrics of interest are often unknown. Without an ecological tipping point/measure of societal acceptance/economic cost or benefit associated with each metric, it is difficult to measure its utility and current status. Second, there is no relative weight to each of the indicators with regard to overall health. Some indicators will undoubtedly reflect human induced changes, environmental variability, ecological stability, biodiversity, and societal values to a greater or lesser degree. As such, an overarching analysis of available data to understand the dynamics and relative importance of measured variables and thereby develop an appropriate measure of overall health is of central importance.
> Water quality data for Lake Superior were obtained from the [Water Quality Portal](http://www.waterqualitydata.us) which houses a tremendous amount of data and is a central repository of data from numerous academic and governmental agencies within the Great Lakes Basin. The data sets were cleaned to some extent with regard to selection of metrics of interest
# Initial exploration
The data are available as a tab-separated text file, `/afs/cs.wisc.edu/p/stat/Data/MS.Exam/f14/Lake_Superior_Data.txt`
We begin by reading the file, using the `readr` package.
```{r preliminaries,cache=FALSE}
options(width=102)
library(readr)
suppressPackageStartupMessages(library(dplyr))
library(ggplot2)
```
```{r superior,cache=TRUE}
fnm <- "/afs/cs.wisc.edu/p/stat/Data/MS.exam/f14/Lake_Superior_Data.txt"
superior <- read_delim(fnm,delim="\t")
```
This approach is unsuccessful because some of the values in the `Detectio_1` column are floating point numbers but they do not occur in the first 1000 rows. The `read_delim` function scans only the first 1000 rows before deciding on the types in the columns. We will fall back on the read.delim function and convert from the data frame.
```{r superior2,cache=TRUE}
(superior <- tbl_df(read.delim(fnm)))
glimpse(superior)
summary(superior)
```
We can see that, like many datasets derived from spreadsheets, these data are very messy. All the rows have `Latitude`, `Longitude`, `Month`, `Day`, and `Year` recorded. Many other columns are inconsistently recorded (units of `feet` and `ft` for `ActivityDepth`; `ResultTemp` of `20 Deg C` and `20 deg C`). We may want to split according to `Characteri` to see if we can induce some consistency but first we should take care of the values coded as blanks and convert the information on dates to an `ISODate` object
## Replace blank values by NA`s and converting Dates
In a spreadsheet it makes sense to leave a value blank but really it indicates a missing value, or a case where the column is not meaningful. We see these in the columns `ActivityDe_Unit`,`ResultSamp`,`ResultTime`,`ResultTemp`,`DetectionQ`,`Detectio_2` and `Preparation`.
We'll keep a backup in case things go south and create a small function to perform the change
```{r NAsandDates}
blank2NA <- function(x) {
if (is.factor(x) && '' %in% levels(x)) {
x[x == ''] <- NA
droplevels(x)
}
x
}
sup <- lapply(superior,blank2NA) %>% data.frame %>% tbl_df %>% mutate(Date = ISOdate(Year,Month,Day), OBJECTID = factor(OBJECTID))
glimpse(sup)
summary(sup)
```
# The Ecoli measurements
```{r ecoli}
summary(ecoli <- droplevels(filter(sup,MeasCommon == 'Ecoli')))
```
The call to `droplevels` drops the unused levels in a factor. This is why several of the columns, such as `Characteri`, `ResultSamp`, `MeasCommon` and`MeasureQua`, have only one level in the reduced frame.
We see that the earliest measurements of E. Coli concentrations were made in April of 1984 and over 75% were made between 2005 and 2013. It would be good to check on location as well. First we give a reference map from Wikipedia
![Lake Superior](/home/bates/git/stat692/Lake_Superior_bathymetry_map.png)
```{r location}
locs <- unique(select(ecoli,LatitudeMe,LongitudeM))
#alt locs <- ecoli %>% select(LatitudeME,LongitudeM) %>% unique
nrow(locs)
```
```{r }
p <- ggplot(locs,aes(x=LongitudeM,y=LatitudeMe)) + xlab("Longitude") + ylab("Latitude") + coord_fixed(ratio=1.5)
p + geom_point()
```
The use of `coord_fixed` in the plot is to get the aspect ratio approximately the same as on the map.
To start to look at time trends we want to find the most frequently occurring location.
```{r ll}
latlong <- with(ecoli,paste(LatitudeMe,LongitudeM,sep=":"))
sort(xtabs(~latlong),decreasing=TRUE)
```
Notice that the four most frequently occurring locations are very close together, on the south shore.
For the most frequently occurring location a plot of the time series is
```{r pos1}
pos1 <- ecoli %>% filter(LongitudeM==-87.3909,LatitudeMe==46.5291) %>% select(ResultMe_1,Date,MonitorLocID) %>% arrange(Date) %>% droplevels
glimpse(pos1)
summary(pos1)
```
Notice that `ResultMe_1` is a factor. Convert it to numeric.
```{r pos1Result}
pos1$ResultMe_1 <- as.numeric(pos1$ResultMe_1)
p2 <- ggplot(pos1,aes(x=Date,y=ResultMe_1)) + ylab("E. Coli concentration (cfu/100ml)")
p2 + geom_point()
```
This plot over 13 years obscures the within-year pattern. Consider just the data in 2013
```{r pos1Result201314}
(pos113 <- filter(pos1,Date > ISOdate(2013,01,01)))
p3 <- ggplot(pos113,aes(x=Date,y=ResultMe_1))+ ylab("E. Coli concentration (cfu/100ml)")
p3 + geom_point()
```
Here the mysterious result is that often results on the same day vary widely
```{r ecoli20130603}
Jun032013 <- droplevels(filter(ecoli,Date == ISOdate(2013,06,03))) %>% arrange(MonitorLocID)
summary(Jun032013)
glimpse(Jun032013)
```
# General approach
We have been following the split-apply-combine approach although at this point all we have is the split part. Even at a single location it is difficult to see the time trends in the E. Coli measurements.