Processing scripts for The Microsetta Initiative
This repository houses the scripts that process Microsetta data. The output from processing can be served by the microsetta-public-api, allowing participants to access and explore their data. These outputs include alpha and beta diversity, as well as taxonomy details.
The processing scripts depend on a standard QIIME 2 environment and redbiom, and assume a Torque/PBS submission environment.
A set of environment variables controls the processing:
$QIIME_VERSION is used on conda activate (e.g., 2020.11). [REQUIRED]
$EMAIL is a contact email to specify on job submission. [OPTIONAL]
$TMI_NAME is a shortname for the processing run (e.g., tmi-gut-16S) [REQUIRED]
$TMI_TITLE is a human readable longname (e.g., "Microsetta fecal 16S data") [REQUIRED]
$TMI_DATATYPE specifies whether to process 16S or WGS [REQUIRED]
$STUDIES contains a dot delimited list of Qiita study IDs to process. For example, "10317.850" is the combination of the American Gut (Microsetta) data and the data from Yatsunenko et al Nature 2012. [REQUIRED]
$ENV_PACKAGE contains a dot delimited list of EBI env_package values to process (e.g., "human-gut") [REQUIRED]
$AG_DEBUG, if set to true, limits processing to 1000 samples [OPTIONAL]
$TMI_WEIGHTED_UNIFRAC, if set, compute weighted unifrac in addition to unweighted [OPTIONAL]
$TMI_SINGLE_SUBJECT, if set, provide various outputs over an individual subject rather than all samples [OPTIONAL]
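To make the dot delimited format concrete, here is a minimal sketch of splitting such a value into individual entries. The helper name split_dot_list is hypothetical and is not taken from the processing scripts, which may parse these variables differently.

```shell
# Hypothetical helper (not from this repository): print each entry of a
# dot delimited value, such as $STUDIES or $ENV_PACKAGE, on its own line.
split_dot_list() {
    printf '%s\n' "$1" | tr '.' '\n'
}

# Example value from the description above.
STUDIES="10317.850"
for study in $(split_dot_list "$STUDIES"); do
    echo "Qiita study ID: ${study}"
done
```

Because the dot is the delimiter, an individual entry cannot itself contain a dot under this scheme.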
If running submit_all.sh directly, or one of the individual scripts, it is necessary to specify the above required environment variables.
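As an illustration, the required variables could be exported and checked before submission roughly as follows. The check_required_vars function is a hypothetical pre-flight check, not something the repository provides, and the example values are the ones quoted in the variable descriptions (the exact string accepted for $TMI_DATATYPE is assumed to be 16S).

```shell
# Hypothetical pre-flight check (not part of this repository).
# Reports each required variable that is unset or empty on stderr and
# returns non-zero if any are missing.
check_required_vars() {
    _missing=0
    for _name in QIIME_VERSION TMI_NAME TMI_TITLE TMI_DATATYPE STUDIES ENV_PACKAGE; do
        # Indirectly read the variable whose name is stored in $_name.
        eval "_value=\${$_name:-}"
        if [ -z "$_value" ]; then
            echo "missing required variable: \$$_name" >&2
            _missing=1
        fi
    done
    return "$_missing"
}

# Example values drawn from the variable descriptions.
export QIIME_VERSION=2020.11
export TMI_NAME=tmi-gut-16S
export TMI_TITLE="Microsetta fecal 16S data"
export TMI_DATATYPE=16S   # assumed spelling of the 16S option
export STUDIES=10317.850
export ENV_PACKAGE=human-gut

# Once the check passes, submit_all.sh can be run in the same shell.
check_required_vars && echo "environment looks complete"
```

Failing early on a missing variable is cheaper than discovering it after jobs have been queued.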
Alternatively, use the reprocess.sh script, which sets many of the variables above and then executes submit_all.sh. Even when using reprocess.sh, $QIIME_VERSION must still be set.
The columns/ directory contains two types of files: .txt files that describe the variables to retain, and .json files that manage normalizations.
So why does this exist and what is it?
We limit which variables we keep because the total number of variables is massive, and we’ve observed high resource needs when representing large numbers of variables on microsetta-public-api.
The entries here represent the subset of columns needed for this meta-analysis and for current and future use of results from the public-api.
If an entry for a meta-analysis (e.g. lifestage) isn’t provided, the processing defaults to a set of general columns.
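The fallback behavior can be pictured with the sketch below. It is only an illustration: the function select_columns_file and the file names <analysis>.txt and general.txt are assumptions, and the actual scripts may name and locate these files differently.

```shell
# Hypothetical sketch (file naming is assumed, not taken from the repository):
# choose the column list for a meta-analysis, falling back to a general list
# when no analysis-specific file exists.
select_columns_file() {
    _columns_dir=$1
    _analysis=$2
    if [ -f "${_columns_dir}/${_analysis}.txt" ]; then
        echo "${_columns_dir}/${_analysis}.txt"
    else
        echo "${_columns_dir}/general.txt"
    fi
}

# Demonstration against a throwaway directory.
demo_dir=$(mktemp -d)
touch "${demo_dir}/general.txt" "${demo_dir}/lifestage.txt"
select_columns_file "$demo_dir" lifestage   # prints the lifestage.txt path
select_columns_file "$demo_dir" diet        # no diet.txt, prints the general.txt path
rm -r "$demo_dir"
```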
The .json files in the columns directory describe how to normalize variables. The upstream data resources (qiita/redbiom) do not ensure a standard representation of variables across studies (this is a well-known, hard, and long-running problem), so we normalize on the fly for the studies we currently use.