The vapour package provides access to the basic read functions available in GDAL for both raster and vector data sources.
The functions are deliberately lower-level than these data models and provide access to the component entities independently.
For vector data:
- read access to feature attributes.
- read access to raw binary geometry.
- read access to geometry in text forms (GeoJSON, WKT, GML, KML).
- read access to the extent, or bounding box, of feature geometries.
- helper functions to summarize feature identity and geometry status
All vector/feature read tasks can optionally apply an arbitrary limit to
the maximum number of features read or queried, and all allow execution
of OGRSQL to a layer prior
to data extraction. In combination with a SQL query a bounding box
spatial filter
can be applied via the extent
argument.
For raster data:
- read access to the list of available rasters within a collection source (subdatasets).
- read access to structural metadata for individual raster sources.
- read access for raw data using GDAL’s RasterIO framework and its dynamic image decimation / replication resampling algorithms.
The workflows available are intended to support development of applications in R for these vector and raster data without being constrained to any particular data model.
The package can be installed from CRAN.
## install.packages("remotes")
remotes::install_cran("vapour")
The development version can be installed from Github.
remotes::install_github("hypertidy/vapour")
You will need development tools for building R packages.
On Linux and MacOS building also requires an available GDAL installation, but on Windows the ROpenSci rwinlib tools are used and the required GDAL will be downloaded and used when building the package. This installation is self-contained and only affects the use of R, it can be used alongside other applications using GDAL.
The goal of vapour is to provide a basic GDAL API package for R. The key functions provide vector geometry or attributes and raster data and raster metadata.
The priority is to give low-level access to key functionality rather
than comprehensive coverage of the library. The real advantage of
vapour
is the flexibility of a modular workflow, not the outright
efficiency.
A parallel goal is to be freed from the powerful but sometimes limiting high-level data models of GDAL itself, specifically these are simple features and affine-based regular rasters composed of 2D slices. (GDAL will possibly remove these limitations over time but still there will always be value in having modularity in an ecosystem of tools.)
GDAL’s dynamic resampling of arbitrary raster windows is also very useful for interactive tools on local data, and seems under-utilized in favour of less accessible online image services.
This partly draws on work done in the sf
package and in packages rgdal
and
rgdal2
. I’m amazed that something as powerful and general as GDAL is
still only available through these lenses, but recent improvements make
things much easier to use and share. Specifically Rcpp
means that
access to external libs is simplified, easier to learn and easier to get
started and make progress. The other part is that cross-platform support
is now much better, with more consistency on the libraries available on
the CRAN machines and in other contexts.
You might burn limbs off.
It’s possible to give problematic “SELECT” statements via the sql
argument. Note that the geometry readers vapour_read_geometry
,
vapour_read_geometry_text
, and vapour_read_extent
will strip out the
SELECT ... FROM
clause and replace it with SELECT * FROM
to ensure
that the geometry is accessible, though the attributes are ignored. This
means we can allow the user or dplyr
to create any SELECT
statement.
The function vapour_read_geometry
will return a list of NULLs, in this
case.
I haven’t tried this against a real database, I’m not sure if we need
AsBinary()
around EWKB geoms, for example - but at any rate these can
be ingested by sf
.
The package documentation page gives an overview of available functions.
help("vapour-package")
See the vignettes and documentation for examples WIP.
The following concrete example illustrates the motivation for vapour
,
through a timing benchmark for one standard operation: extracting
feature geometries from a data set within a user-defined bounding box.
The data set is the one used throughout the book Geocomputation in
R by Robin Lovelace, Jakub
Nowosad, and Jannes Muenchow, and can be obtained with the following
code.
url <- file.path ("http://www.naturalearthdata.com/http//www.naturalearthdata.com",
"download/10m/cultural/ne_10m_parks_and_protected_lands.zip")
download.file (url = url, destfile = "USA_parks.zip")
unzip (zipfile = "USA_parks.zip", exdir = "usa_parks")
fname <- "usa_parks/ne_10m_parks_and_protected_lands_area.shp"
That last fname
is the file we’re interested in, which contains
polygons for all United States parks. We now construct a timing
benchmark for three ways of extracting the data within a pre-defined
bounding box of:
library (magrittr)
bb <- c (-120, 20, -100, 40)
names (bb) <- c ("xmin", "ymin", "xmax", "ymax")
bb_sf <- sf::st_bbox (bb, crs = sf::st_crs (4326)) %>%
sf::st_as_sfc ()
First, we define a function to do the desired extraction using the sf
package, comparing both
st_crop
and the sf::[
sub-selection operator:
f_sf1 <- function (fname)
{
usa_parks <- sf::st_read (fname, quiet = TRUE)
suppressMessages (suppressWarnings (
parks2 <- sf::st_crop (usa_parks, bb_sf)
))
}
f_sf2 <- function (fname)
{
usa_parks <- sf::st_read (fname, quiet = TRUE)
suppressMessages (suppressWarnings (
parks2 <- usa_parks [bb_sf, ]
))
}
Then three approaches using vapour
, in each case extracting equivalent
data to sf
- that is, both geometries and attributes - yet simply
leaving them separate here. The three approaches are:
- Read geometry as
WKB
, sub-select after reading, and convert tosfc
lists; - Read geometry as
WKB
pre-selected via an SQL statement, and convert tosfc
lists; and - Read geometry as
json
(text), pre-selected via SQL, and convert with thegeojsonsf
package
library (vapour)
f_va1 <- function (fname) # read all then sub-select
{
ext <- do.call (rbind, vapour_read_extent (fname)) # bboxes of each feature
indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
ext [, 3] > bb [2] & ext [, 4] < bb [4])
g <- vapour_read_geometry (fname) [indx] %>%
sf::st_as_sfc ()
a <- lapply (vapour_read_attributes (fname), function (i) i [indx])
}
f_va2 <- function (fname) # read selection only via SQL
{
ext <- do.call (rbind, vapour_read_extent (fname))
indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
ext [, 3] > bb [2] & ext [, 4] < bb [4])
n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
" WHERE FID in (", n, ")")
g <- vapour_read_geometry (fname, sql = stmt) %>%
sf::st_as_sfc ()
a <- vapour_read_attributes (fname, sql = stmt)
}
f_va3 <- function (fname) # convert json text via geojsonsf
{
ext <- do.call (rbind, vapour_read_extent (fname)) # bboxes of each feature
indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
ext [, 3] > bb [2] & ext [, 4] < bb [4])
n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
" WHERE FID in (", n, ")")
g <- vapour_read_geometry_text (fname, textformat = "json", sql = stmt)
g <- lapply (g, function (i) geojsonsf::geojson_sfc (i) [[1]]) %>%
sf::st_sfc ()
a <- vapour_read_attributes (fname, sql = stmt)
}
The benchmark timings - in particular the “relative” values - then
illustrate the advantages of vapour
:
rbenchmark::benchmark (
f_sf1 (fname),
f_sf2 (fname),
f_va1 (fname),
f_va2 (fname),
f_va3 (fname),
replications = 10)
#> test replications elapsed relative user.self sys.self user.child
#> 1 f_sf1(fname) 10 0.242 3.184 0.241 0.000 0
#> 2 f_sf2(fname) 10 0.178 2.342 0.177 0.000 0
#> 3 f_va1(fname) 10 0.076 1.000 0.060 0.016 0
#> 4 f_va2(fname) 10 0.086 1.132 0.082 0.004 0
#> 5 f_va3(fname) 10 0.227 2.987 0.214 0.012 0
#> sys.child
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
Reading geometries only, as opposed to the sf
reading of all
geometries and attributes, affords a speed increase of about 25%, while
utilizing the SQL capabilities of
ogr_sql
offers an
increase of around 75%.
Examples of packages that use vapour are in development,
RGDALSQL and
lazyraster. RGDALSQL
aims
to leverage the facilities of GDAL to provide data on-demand for many
sources as if they were databases. lazyraster
uses the
level-of-detail facility of GDAL to read just enough resolution from a
raster source using traditional window techniques.
Limitations, work-in-progress and other discussion are active here: hypertidy#4
We’ve kept a record of a minimal GDAL wrapper package here:
https://github.com/diminutive/gdalmin
Before those I had worked on getting sp and dplyr to at least work together https://github.com/dis-organization/sp_dplyrexpt and recently rgdal was updated to allow tibbles to be used, something that spbabel and spdplyr really needed to avoid friction.
Early exploration of allow non-geometry read with rgdal was tried here: https://github.com/r-gris/gladr
Thanks to Edzer Pebesma and Roger Bivand and Tim Keitt for prior art that I crib and copy from. Jeroen Ooms helped the R community hugely by providing an accessible build process for libraries on Windows. Mark Padgham helped kick me over a huge obstacle in using C++ libraries with R. Simon Wotherspoon and Ben Raymond have endured ravings about wanting this level of control for many years.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.