- Download this repository to local machine
- Set up AWS bucket and folders for processing
- Assemble static ancillary inputs (local, AWS)
- Resolve RIIS records to GBIF accepted taxa (local, copy to AWS S3)
- Subset GBIF data to BISON (AWS Redshift/Glue)
- Load BISON subset and ancillary inputs to Redshift
- Annotate BISON subset with regions and RIIS status (AWS Redshift)
- Summarize BISON subset by regions and RIIS status (AWS Redshift)
- Create Presence Absence Matrix (PAM) and compute statistics (local)
The `LmBISON repository <https://github.com/lifemapper/bison>`_ can be installed by
downloading it from GitHub. This code repository contains Python code, scripts for AWS
tools, Docker composition files, configuration files, and test data for creating the
outputs.
Type ``git`` at the command prompt to see if you have git installed. If you do not,
download and install git from https://git-scm.com/downloads .
Download the LmBISON repository, containing test data and configurations, by typing at the command line:

.. code-block::

   git clone https://github.com/lifemapper/bison
When the clone is complete, move to the top directory of the repository, ``bison``
(i.e. ``cd bison``). All hands-on commands will be executed in a command prompt
window from this directory location.
Create a virtual python environment for installing local python dependencies.
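A minimal sketch of creating and activating the environment (the same commands are listed again near the end of this document):

.. code-block::

   python3 -m venv venv
   . venv/bin/activate
   pip3 install -r requirements.txt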
Under the BISON bucket (i.e. bucket-us-east-1), create the following folders:
- annotated_records
- input_data
- lib
- log
- out_data
- scripts
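S3 has no real directories, but the folder prefixes can be created from the console or with the AWS CLI; a minimal sketch (``<bison-bucket>`` is a placeholder for your bucket name):

.. code-block::

   for folder in annotated_records input_data lib log out_data scripts; do
       aws s3api put-object --bucket <bison-bucket> --key ${folder}/
   done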
Use the most current version of the United States Register of Introduced and Invasive Species (US-RIIS)
- Year 4 data: https://doi.org/10.5066/P95XL09Q
- Year 5 data: TBA
The current file is named US-RIIS_MasterList_2021.csv, and is available in the data/input directory of this repository. Upload this file to s3:///input_data
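For reference, the upload can also be done with the AWS CLI (``<bison-bucket>`` is a placeholder for your bucket name):

.. code-block::

   aws s3 cp data/input/US-RIIS_MasterList_2021.csv s3://<bison-bucket>/input_data/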
- US Census 2021 cartographic boundaries from https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.2021.html#list-tab-B7KHMTDJCFECH4SSL2
- Upload the shapefile to s3:///input_data
Census data are in EPSG:4269 (NAD83), a geographic SRS very close to EPSG:4326 (WGS84). For two reasons, I did not reproject the census data:

- The difference is on the order of meters.
- The GBIF data usually do not specify a datum.
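If reprojection were ever needed, a GDAL/OGR one-liner like the following would do it (file names here are only illustrative):

.. code-block::

   ogr2ogr -t_srs EPSG:4326 cb_2021_us_county_500k_wgs84.shp cb_2021_us_county_500k.shp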
Annotate points with AIANNH regions for aggregation by species and RIIS status.
Data:
Upload the shapefile to s3:///input_data
I was unable to intersect these data with records because of the complexity of the shapefiles. Next time, I will try using AWS Redshift with a "flattened" version of the data. Options to try:
- PAD-US 3.0 Vector Analysis File https://www.sciencebase.gov/catalog/item/6196b9ffd34eb622f691aca7
- PAD-US 3.0 Raster Analysis File https://www.sciencebase.gov/catalog/item/6196bc01d34eb622f691acb5
These are "flattened" though spatial analysis prioritized by GAP Status Code (ie GAP 1 > GAP 2 > GAP > 3 > GAP 4), these are found on bottom of https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-data-download page.
The vector datasets are available only as ESRI Geodatabases. The raster datasets are Erdas Imagine format. It appears to contain integers between 0 and 92, but may have additional attributes for those classifications. Try both in AWS Redshift.
Upload the raster and vector flattened zip files (test which works best later) to s3:///input_data
Run this locally until it is converted to an AWS step. Make sure that the ``data/config/process_gbif.json`` file is present. From the bison repo top directory, with the virtual environment activated, run:
python process_gbif.py --config_file=data/config/process_gbif.json resolve
Upload the output file (e.g. ``data/input/US-RIIS_MasterList_2021_annotated_2024-02-01.csv``, with the current date string) to s3:///input_data
For all Redshift steps, do the following with the designated script:

- In the AWS Redshift console, open **Query Editor** and choose the **Script Editor** button.
- Open an existing script or create a new script (with +) and copy in the appropriate script.
- Update the date string for this processing step to the first day of the current month; for example, replace all occurrences of 2024_01_01 with 2024_02_01.
- Run the script.
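If you prefer to update the date string locally before pasting a script into the editor, a one-liner such as the following works (replace the placeholder path with the script you are editing):

.. code-block::

   sed -i 's/2024_01_01/2024_02_01/g' <path-to-script>.sql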
GBIF Input
- Use the Global Biodiversity Information Facility (GBIF) Species Occurrences on the AWS Open Data Registry (ODR) in S3. https://registry.opendata.aws/gbif/
- These data are updated on the first day of every month, with the date string in the S3 address.
- The date string is appended to all outputs, and referenced in the subset scripts (Redshift and Glue).
- The data are available in multiple AWS regions; stay within the same ODR region as the BISON bucket.
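To check which monthly snapshots are currently available, the ODR bucket can be listed with the AWS CLI; the bucket name below is the us-east-1 ODR bucket and should be verified against the registry page:

.. code-block::

   aws s3 ls s3://gbif-open-data-us-east-1/occurrence/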
- Perform Redshift steps, using script: ``aws_script/rs_subset_gbif``
Alternatively, perform the subset with AWS Glue:

- In the AWS Glue console, open **ETL jobs** and choose the **Script Editor** button.
- Open an existing script or upload the script ``aws_script/glue_subset_gbif.sql``.
- Run the script.
- If this method is used, the results must still be loaded into Redshift for steps 7 and 8.
- Perform Redshift steps, using script: ``aws_script/rs_load_ancillary_data.sql``
- Perform Redshift steps, using script: ``aws_script/rs_intersect_append.sql``
- Each intersection takes 1-3 minutes to build a temp table, plus 1-6 minutes to use the temp table to annotate the BISON subset.
- Perform Redshift steps, using script: ``aws_scripts/rs_aggregate_export``
- Outputs summary CSV files to the bucket/folder ``s3://bison-321942852011-us-east-1/out_data``:
- aiannh_lists__000.csv
- state_lists__000.csv
- county_lists__000.csv
- aiannh_counts__000.csv
- state_counts__000.csv
- county_counts__000.csv
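To retrieve or inspect the exported files locally, AWS CLI commands like the following can be used:

.. code-block::

   aws s3 ls s3://bison-321942852011-us-east-1/out_data/
   aws s3 cp s3://bison-321942852011-us-east-1/out_data/ . --recursive --exclude "*" --include "*__000.csv"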
On a local machine, with the virtual environment activated, run the script aws_scripts/bison_matrix_stats.py
python aws_scripts/bison_matrix_stats.py
- Amazon Web Services account with access to EC2, S3, Glue, and Redshift
- For local setup, development, and testing, create and activate a Python virtual
  environment to hold project dependencies from requirements.txt, and possibly
  requirements-test.txt:

  .. code-block::

     python3 -m venv venv
     . venv/bin/activate
     pip3 install -r requirements.txt
     pip3 install -r requirements-test.txt
- Auto-generate readthedocs: https://docs.readthedocs.io/en/stable/intro/getting-started-with-mkdocs.html

  (venv)$ pip3 install mkdocs

- Build documentation: https://docs.readthedocs.io/en/stable/intro/getting-started-with-sphinx.html
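A minimal sketch of building and previewing the documentation with mkdocs, assuming an ``mkdocs.yml`` configuration exists at the repository root:

.. code-block::

   (venv)$ mkdocs build
   (venv)$ mkdocs serve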