# Multivariate and Spatiotemporal Geostatistics {#sec-stgeostatistics}
```{r load16, eval = Sys.getenv("USER") == "edzer", echo=FALSE}
# try(load("data/ch16.rda"))
set.seed(123)
```
Building on the simple interpolation methods presented in
@sec-interpolation, this chapter continues with multivariate
and spatiotemporal geostatistics. The topic of multivariate
geostatistics, illustrated more extensively in @asdar, is only
briefly introduced. Spatiotemporal geostatistics is illustrated
with a worked-out case study of spatiotemporal interpolation,
using NO$_2$ air quality data with population density as a covariate.
## Preparing the air quality dataset {#sec-preparing}
The dataset we work with is an air quality dataset obtained from
the European Environment Agency (EEA). European member states report
air quality measurements to this agency. So-called _validated_
data are quality controlled by member states and are reported on a
yearly basis. They form the basis for policy compliance evaluations
and (counter)measures.
The EEA's [air quality
e-reporting](https://www.eea.europa.eu/data-and-maps/data/aqereporting-9)
website gives access to the data reported by European member states.
We decided to download hourly (time series) data, the form in which these data are primarily measured.
A web form helps convert simple selection criteria into an HTTP `GET` request. The
[URL](https://fme.discomap.eea.europa.eu/fmedatastreaming/AirQualityDownload/AQData_Extract.fmw?CountryCode=DE&CityName=&Pollutant=8&Year_from=2017&Year_to=2017&Station=&Samplingpoint=&Source=E1a&Output=TEXT&UpdateDate=)\footnote{\url{https://fme.discomap.eea.europa.eu/fmedatastreaming/AirQualityDownload/AQData_Extract.fmw?CountryCode=DE&CityName=&Pollutant=8&Year_from=2017&Year_to=2017&Station=&Samplingpoint=&Source=E1a&Output=TEXT&UpdateDate=}}
was created to select all validated (`Source=E1a`) NO$_2$
(`Pollutant=8`) data for 2017 (`Year_from`, `Year_to`) from Germany
(`CountryCode=DE`). It returns a text file with a set of URLs to CSV
files, each containing the hourly values for the whole period for a
single measurement station. These files were downloaded and converted
to the right encoding using the `dos2unix` command line utility.
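The download step itself could be scripted from R; here is a minimal
sketch (not evaluated; the file name `aq/urls.txt` holding the returned
URL list is hypothetical):
```{r eval=FALSE}
# hypothetical file with one CSV URL per line, as returned by the request:
urls <- readLines("aq/urls.txt")
for (u in urls)
    download.file(u, file.path("aq", basename(u)))
```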
In the following, we will read all the files into a list, except
for the single file with the station metadata:
```{r readallcsvfiles}
files <- list.files("aq", pattern = "*.csv", full.names = TRUE)
files <- setdiff(files, "aq/AirBase_v8_stations.csv") # metadata file
r <- lapply(files, function(f) read.csv(f))
```
Next we convert the time variable into a `POSIXct` variable and put the records in time order by
```{r fixtimes}
Sys.setenv(TZ = "UTC") # don't use local time zone
r <- lapply(r, function(f) {
f$t = as.POSIXct(f$DatetimeBegin)
f[order(f$t), ]
}
)
```
We remove smaller sub-datasets, which for this dataset have no hourly data:
```{r deselect_smaller_datasets}
r <- r[sapply(r, nrow) > 1000]
names(r) <- sapply(r,
function(f) unique(f$AirQualityStationEoICode))
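# check that the station codes used as names are unique: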
length(r) == length(unique(names(r)))
```
\index{xts}
and then merge all files using `xts::cbind`, so that records are combined based on matching times:
```{r converttoxts}
library(xts) |> suppressPackageStartupMessages()
r <- lapply(r, function(f) xts(f$Concentration, f$t))
aq <- do.call(cbind, r)
```
A usual further selection for this dataset is to keep only stations
for which at least 75\% of the measured hourly values are valid, i.e., to
drop stations with more than 25\% missing hourly values.
Knowing that `mean(is.na(x))` gives the _fraction_ of missing values
in a vector `x`, we can apply this function to the columns (stations):
```{r removestationswithmissingvalues}
sel <- apply(aq, 2, function(x) mean(is.na(x)) < 0.25)
aqsel <- aq[, sel]
```
Next, the [station metadata](http://ftp.eea.europa.eu/www/AirBase_v8/AirBase_v8_stations.zip) was read
and filtered for rural background stations in Germany (`"DE"`) by
```{r filterruralbackground}
library(tidyverse) |> suppressPackageStartupMessages()
read.csv("aq/AirBase_v8_stations.csv", sep = "\t") |>
as_tibble() |>
filter(country_iso_code == "DE",
station_type_of_area == "rural",
type_of_station == "Background") -> a2
```
These station records contain coordinates, and an `sf` object with (static) station metadata is created by
```{r importstationcoords}
library(sf) |> suppressPackageStartupMessages()
a2.sf <- st_as_sf(a2, crs = 'OGC:CRS84',
coords = c("station_longitude_deg", "station_latitude_deg"))
```
We now subset the air quality measurements to include only the
rural background stations whose metadata we saved in `a2`:
```{r subseteuropean}
sel <- colnames(aqsel) %in% a2$station_european_code
aqsel <- aqsel[, sel]
dim(aqsel)
```
We can compute station means and join these to station locations by
```{r fig-computestationmeans, message = FALSE}
tb <- tibble(NO2 = apply(aqsel, 2, mean, na.rm = TRUE),
station_european_code = colnames(aqsel))
crs <- st_crs('EPSG:32632')
right_join(a2.sf, tb) |> st_transform(crs) -> no2.sf
read_sf("data/de_nuts1.gpkg") |> st_transform(crs) -> de
```
Station mean NO$_2$ concentrations, along with country borders, are shown in @fig-plotDE.
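The code for that figure is not shown here; a minimal sketch of such a
map, using **ggplot2** (not evaluated), could look like
```{r eval=FALSE}
library(ggplot2)
# plot state borders, with station points coloured by mean NO2:
ggplot() + geom_sf(data = de) +
    geom_sf(data = no2.sf, aes(col = NO2))
```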
## Multivariable geostatistics {#sec-cokriging}
\index{kriging!multivariable}
\index{kriging!cokriging}
\index{cokriging}
Multivariable geostatistics involves the _joint_ modelling, prediction,
and simulation of multiple variables,
$$Z_1(s) = X_1 \beta_1 + e_1(s)$$
$$\vdots$$
$$Z_n(s) = X_n \beta_n + e_n(s).$$
In addition to having observations, trend models, and variograms
for each variable, the _cross_-variogram for each pair of residual
variables, describing the covariance of $e_i(s), e_j(s+h)$,
is required. If this cross-covariance is non-zero, knowledge of
$e_j(s+h)$ may help predict (or simulate) $e_i(s)$. This is
especially true if $Z_j(s)$ is more densely sampled than $Z_i(s)$.
Prediction and simulation under this model are called cokriging
and cosimulation. Examples using **gstat** can be run with the
demo scripts
```{r eval=FALSE}
library(gstat)
demo(cokriging)
demo(cosimulation)
```
and are further illustrated and discussed in @asdar.
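As a flavour of what these demos do, here is a minimal sketch of a
bivariate model fit with **gstat**, using the meuse dataset shipped with
package **sp**; the variable choice and variogram starting values are
illustrative, not those of the demos:
```{r eval=FALSE}
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- ~x+y
# build a gstat object holding two (correlated) variables:
g <- gstat(NULL, "logZn", log(zinc) ~ 1, meuse)
g <- gstat(g, "logPb", log(lead) ~ 1, meuse)
v <- variogram(g) # direct and cross-variograms
# fit a linear model of coregionalization, starting from a spherical model:
g <- fit.lmc(v, g, vgm(1, "Sph", 900, 1))
plot(v, model = g$model)
```
Prediction with `predict(g, newdata)` would then yield cokriging
predictions along with prediction error (co)variances for both variables.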
\index{variogram!cross}
\index{cross-variogram}
When the different variables considered are observed at
the same set of locations, for instance different air quality
parameters, the statistical _gain_ from using cokriging
as opposed to direct (univariable) kriging is often modest,
if not negligible. A gain may however be that the prediction
is truly multivariable: in addition to the prediction vector
$\hat{Z}(s_0)=(\hat{Z}_1(s_0),...,\hat{Z}_n(s_0))$, we get the full
covariance matrix of the prediction errors [@ver1993multivariable].
Using these prediction error covariances, we can compute the standard
error of any linear combination of $\hat{Z}(s_0)$, such as the
difference $\hat{Z}_2(s_0) - \hat{Z}_1(s_0)$.
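Writing $\Sigma$ for the prediction error covariance matrix of
$\hat{Z}(s_0)$, the standard error of this difference follows by
standard error propagation:
$$\text{se}\left(\hat{Z}_2(s_0) - \hat{Z}_1(s_0)\right) =
\sqrt{\Sigma_{11} + \Sigma_{22} - 2\Sigma_{12}}.$$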
Although sets of direct and cross-variograms can be computed
and fitted automatically, multivariable geostatistical modelling
quickly becomes hard to manage when the number of variables $n$ gets
large, because the number of direct and cross-variograms required
is $n(n+1)/2$.
In case the different variables refer to the same quantity taken at
different time steps, one could use a multivariable (cokriging)
prediction approach, but this would not allow for interpolation
between two time steps. For this, and for handling the case of
data observed at many time instances, one can also model the
variation as a function of continuous space _and_ time, as in
$Z(s,t)$, which we will do in the next section.
## Spatiotemporal geostatistics {#sec-stgeos}
\index{kriging!spatiotemporal}
Spatiotemporal geostatistical processes are modelled as variables
having a value everywhere in space and time, $Z(s,t)$, with $s$ and
$t$ the continuous space and time indexes. Given observations
$Z(s_i,t_j)$ and a variogram (covariance) model $\gamma(s,t)$ we can
predict $Z(s_0,t_0)$ at arbitrary space/time locations $(s_0,t_0)$
using standard Gaussian process theory.
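For ordinary kriging, for instance, the predictor is a weighted mean
of the observations, $\hat{Z}(s_0,t_0) = \sum_k \lambda_k Z(s_k,t_k)$,
with weights solving the familiar system
$$
\begin{bmatrix} \Gamma & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{bmatrix}
\begin{bmatrix} \lambda \\ \mu \end{bmatrix} =
\begin{bmatrix} \gamma_0 \\ 1 \end{bmatrix}
$$
where $\Gamma$ contains the variogram values between all pairs of
observation locations, $\gamma_0$ the variogram values between the
observations and $(s_0,t_0)$, and $\mu$ is a Lagrange multiplier;
the only difference with purely spatial kriging is that $\gamma$ is
now evaluated over space _and_ time lags.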
Several books have been written recently about modern approaches
to handling and modelling spatiotemporal geostatistical data,
including @wikle2019spatio and @blangiardo2015spatial. Here,
we will follow @RJ-2016-014 and give some simple examples using the
dataset also used in the previous chapter.
### A spatiotemporal variogram model
\index{variogram!spatiotemporal}
Starting with the spatiotemporal matrix of NO$_2$ data in `aq`
constructed at the beginning of this chapter, we selected the (at
least 75\% complete) records taken at rural background stations into `aqsel`.
We can select the spatial locations for these 74 stations by
```{r}
sfc <- st_geometry(a2.sf)[match(colnames(aqsel),
a2.sf$station_european_code)] |>
st_transform(crs)
```
and finally build a `stars` vector data cube with time and station as dimensions:
```{r}
library(stars)
st_as_stars(NO2 = as.matrix(aqsel)) |>
st_set_dimensions(names = c("time", "station")) |>
st_set_dimensions("time", index(aqsel)) |>
st_set_dimensions("station", sfc) -> no2.st
no2.st
```
From this, we can compute the spatiotemporal variogram using
```{r echo=FALSE, eval = file.exists("data/vst.RData")}
load(file = "data/vst.RData")
```
```{r}
library(gstat)
```
```{r computespacetimesamplevariogram, eval = !exists("v.st")}
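# use the first month (31 days) of hourly data, and time lags up to 48 hours: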
v.st <- variogramST(NO2~1, no2.st[,1:(24*31)], tlags = 0:48,
cores = getOption("mc.cores", 2))
```
```{r echo=FALSE}
save(list = "v.st", file = "data/vst.RData")
```
which is shown in @fig-plotvariograms.
```{r fig-plotvariograms, echo=!knitr::is_latex_output()}
#| fig.cap: "Spatiotemporal sample variogram for hourly NO$_2$ concentrations at rural background stations in Germany over 2027; in the right-hand side plot colour corresponds to time lag (yellow is later); distance in m"
#| code-fold: true
#| out.width: 100%
v1 <- plot(v.st)
v2 <- plot(v.st, map = FALSE, legend = list())
print(v1, split = c(1,1,2,1), more = TRUE)
print(v2, split = c(2,1,2,1), more = FALSE)
```
To this sample variogram, we can fit a variogram model. One relatively
flexible model we try here is the product-sum model [@RJ-2016-014].
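In covariance form, the product-sum model combines a spatial covariance
$C_s(h)$ and a temporal covariance $C_t(u)$ as
$$C(h,u) = k \, C_s(h) \, C_t(u) + C_s(h) + C_t(u)$$
with $k > 0$ an additional parameter [@RJ-2016-014]. It is fitted to
the sample variogram by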
```{r fitprodsummodel}
# product-sum
prodSumModel <- vgmST("productSum",
space = vgm(150, "Exp", 200000, 0),
time = vgm(20, "Sph", 6, 0),
k = 2)
#v.st$dist = v.st$dist / 1000
StAni <- estiStAni(v.st, c(0,200000))
(fitProdSumModel <- fit.StVariogram(v.st, prodSumModel,
fit.method = 7, stAni = StAni, method = "L-BFGS-B",
control = list(parscale = c(1,100000,1,1,0.1,1,10)),
lower = rep(0.0001, 7)))
```
The fitted model is shown in @fig-prodsummodelplot, and can also
be plotted as a wireframe, shown in @fig-modelwire. Fitting this
model is rather sensitive to the chosen initial parameter values,
which may be caused by the relatively small number (74) of monitoring
network stations available.
```{r fig-prodsummodelplot, echo=!knitr::is_latex_output()}
#| fig.cap: "Product-sum model, fitted to the spatiotemporal sample variogram"
#| code-fold: true
plot(v.st, fitProdSumModel, wireframe = FALSE, all = TRUE,
scales = list(arrows = FALSE), zlim = c(0, 150))
```
```{r fig-modelwire, echo = !knitr::is_latex_output()}
#| fig.cap: "Wireframe plot of the fitted spatiotemporal variogram model"
#| code-fold: true
plot(v.st, model = fitProdSumModel, wireframe = TRUE, all = TRUE,
scales = list(arrows = FALSE), zlim = c(0, 185))
```
Hints about the fitting strategy and alternative models for
spatiotemporal variograms are given in @RJ-2016-014.
\newpage
With this fitted model, and given the observations, we can
carry out kriging or simulation at arbitrary points in space and
time. For instance, we could estimate (or simulate) values in the
time series that are now missing: this occurs regularly, and in
@sec-kriging we computed time series means while simply ignoring
up to 25% of the observations. Substituting missing values with
estimates or simulations based on neighbouring (in space and time)
observations before computing yearly mean values seems a more
reasonable approach.
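As a sketch (not evaluated here), the missing hours at the first
station could be estimated by kriging at that station's location for
exactly those time instances, mirroring the `krigeST` calls used below:
```{r eval=FALSE}
miss <- which(is.na(aqsel[, 1])) # missing hours at the first station
# placeholder attribute values (1); only the dimensions matter here:
st_as_stars(list(pts = matrix(1, length(miss), 1))) |>
    st_set_dimensions(names = c("time", "station")) |>
    st_set_dimensions("time", index(aqsel)[miss]) |>
    st_set_dimensions("station", sfc[1]) -> new_pt1
filled <- krigeST(NO2~1, data = no2.st["NO2"], newdata = new_pt1,
    nmax = 50, stAni = StAni, modelList = fitProdSumModel)
```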
More generally, we can estimate at arbitrary locations and time
points; we will illustrate this by predicting time series at
particular locations and by predicting spatial slices [@RJ-2016-014].
We can create a `stars` object for two randomly selected spatial points
and all time instances by
```{r createstarsobject}
set.seed(1331)
pt <- st_sample(de, 2)
t <- st_get_dimension_values(no2.st, 1)
st_as_stars(list(pts = matrix(1, length(t), length(pt)))) |>
st_set_dimensions(names = c("time", "station")) |>
st_set_dimensions("time", t) |>
st_set_dimensions("station", pt) -> new_pt
```
and we obtain the spatiotemporal predictions at these two points using `krigeST` by
```{r echo = FALSE, eval = file.exists("data/new_ts.RData")}
load("data/new_ts.RData")
```
```{r spacetimekriging, eval = !exists("new_ts")}
no2.st <- st_transform(no2.st, crs)
new_ts <- krigeST(NO2~1, data = no2.st["NO2"], newdata = new_pt,
nmax = 50, stAni = StAni, modelList = fitProdSumModel,
progress = FALSE)
```
```{r echo=FALSE}
save(list = "new_ts", file = "data/new_ts.RData")
```
The results are shown in @fig-plotxts.
```{r fig-plotxts, echo = !knitr::is_latex_output()}
#| fig.cap: "Time series plot of spatiotemporal predictions for two points"
#| code-fold: true
plot(as.xts(new_ts[2]))
```
Alternatively, we can obtain spatiotemporal predictions on a set of time-stamped
raster maps, evenly spaced over the year 2017, created by
```{r spatiotemporalpredictionsgrid}
st_bbox(de) |>
st_as_stars(dx = 10000) |>
st_crop(de) -> grd
d <- dim(grd)
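# four time instances, one near the middle of each quarter (90 days apart):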
t4 <- t[(1:4 - 0.5) * (3*24*30)]
st_as_stars(pts = array(1, c(d[1], d[2], time = length(t4)))) |>
st_set_dimensions("time", t4) |>
st_set_dimensions("x", st_get_dimension_values(grd, "x")) |>
st_set_dimensions("y", st_get_dimension_values(grd, "y")) |>
st_set_crs(crs) -> grd.st
```
and the subsequent predictions are obtained by
```{r echo = FALSE, eval = file.exists("data/new_int.RData")}
load("data/new_int.RData")
```
```{r spatiotemporalpredictions, eval = !exists("new_int") }
new_int <- krigeST(NO2~1, data = no2.st["NO2"], newdata = grd.st,
nmax = 200, stAni = StAni, modelList = fitProdSumModel,
progress = FALSE)
names(new_int)[2] = "NO2"
```
```{r echo=FALSE}
save(list = "new_int", file = "data/new_int.RData")
```
and shown in @fig-stpredictions.
```{r fig-stpredictions, echo=!knitr::is_latex_output(), message=FALSE}
#| fig.cap: "Spatiotemporal predictions for four selected time slices"
#| code-fold: true
library(viridis)
library(viridisLite)
library(ggplot2)
g <- ggplot() + coord_equal() +
scale_fill_viridis() +
theme_void() +
scale_x_discrete(expand=c(0,0)) +
scale_y_discrete(expand=c(0,0))
g + geom_stars(data = new_int, aes(fill = NO2, x = x, y = y)) +
facet_wrap(~as.Date(time)) +
geom_sf(data = st_cast(de, "MULTILINESTRING")) +
geom_sf(data = no2.sf, col = 'grey', cex = .5) +
coord_sf(lims_method = "geometry_bbox")
```
A larger value for `nmax` was needed here to decrease the visible
disturbance (sharp edges) caused by discrete neighbourhood
selections, which are now done in space _and_ time.
### Irregular space-time data
When observations are collected at locations that vary
constantly, or at fixed locations but without a common time basis,
`stars` objects (vector data cubes) do not represent them well. Such
irregular space-time observations can be represented by `sftime`
objects, provided by package **sftime** [@R-sftime], which are
essentially `sf` objects with a specified time column. An example of
its use is found in `demo(sftime)`, provided in package **gstat**.
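A minimal sketch of constructing such an object, assuming package
**sftime** is installed and that its constructor `st_sftime` works
analogously to `sf::st_sf` with an additional time column (the NO$_2$
values here are made up):
```{r eval=FALSE}
library(sf)
library(sftime)
pts <- st_sfc(st_point(c(7, 52)), st_point(c(8, 53)), st_point(c(9, 54)),
    crs = 'OGC:CRS84')
# three observations, each with its own location and time stamp:
x <- st_sftime(NO2 = c(25, 30, 20), geometry = pts,
    time = as.POSIXct("2017-01-01", tz = "UTC") + (0:2) * 3600)
```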
## Exercises
1. Which fraction of the stations is removed in @sec-preparing when
the criterion is applied that a station must be at least 75\% complete?
1. From the hourly time series in `no2.st`, compute daily mean concentrations
using `aggregate`, and compute the spatiotemporal variogram of the daily means. How does
it compare to the variogram of the hourly values?
1. Carry out a spatiotemporal interpolation for daily mean values for the days
corresponding to those shown in @fig-stpredictions, and
compare the results.
1. Following the example in the demo scripts pointed at in @sec-cokriging, carry out a cokriging on the daily mean station data for the four days shown in @fig-stpredictions.
1. What are the differences between this latter approach and spatiotemporal kriging?
```{r echo=FALSE}
save(list = ls(), file = "ch13.RData")
```