We use Jupyter notebooks to integrate the metadata sources. This allows us to iterate quickly in a transparent and interactive manner as new metadata sources become available.
We use `venv` to create a virtual environment.
To create a virtual environment, run the command `python3 -m venv <environment name>`.
I typically name my environment `.env` and configure `.gitignore` to ignore `.env` files. This prevents the environment libraries from being uploaded to the repository.
After the environment is created, run the command `source <environment name>/bin/activate` to enter the environment.
Once in the environment, run the command `pip install -r requirements.txt` to install the necessary Python libraries.
To exit the environment, run the command `deactivate` in the terminal.
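Because the notebooks depend on the libraries installed from requirements.txt, it can help to confirm that the notebook kernel is actually running inside the virtual environment before starting a translation run. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical and not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Running inside virtual environment: {sys.prefix}")
    else:
        print("No virtual environment is active; activate it first.")
```

Placing a check like this in the first cell of a notebook makes a missing activation step visible early, rather than surfacing later as an import error.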
The translation pipeline notebook translates metadata to JSON. Operationally, it executes the following notebooks in order:
The final output is saved to the directory `./output/nmdc-json/`.
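After a run, a quick way to sanity-check the results is to list the JSON files in the output directory and count the top-level items in each. The sketch below assumes only that the directory contains `*.json` files; the function name `summarize_json_output` is hypothetical, and the actual file names and their internal structure are not specified here.

```python
import json
from pathlib import Path


def summarize_json_output(output_dir: str = "./output/nmdc-json/") -> dict:
    """Map each JSON file name in output_dir to the number of top-level items it holds."""
    summary = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Lists and dicts report their length; any other JSON value counts as one item.
        summary[path.name] = len(data) if isinstance(data, (list, dict)) else 1
    return summary


if __name__ == "__main__":
    for name, count in summarize_json_output().items():
        print(f"{name}: {count} top-level items")
```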
A high-level overview of the translation process is depicted below. At each step, metadata and the NMDC schema are input into the translation notebook, and JSON files are created. The output of the last step is forwarded to the web-development team for ingestion and display on the NMDC pilot site.
Metadata from GOLD and mapping information are input into the translation process notebook. The output of this initial step consists of a set of JSON files with metadata about NMDC studies, omics processings, and biosamples.
The metadata files for this step are contained in:
- nmdc-version2.zip: contains GOLD's metadata.
- JGI-EMSL-FICUS-proposals.fnl.xlsx: contains mappings between GOLD's studies and EMSL proposals.
Metadata from JAMO are input into the translation notebook. Omics processing metadata is updated to include links between each omics processing and its outputs (i.e., data objects).
The metadata files for this step are contained in:
- ficus_project_fastq.tsv (sequencing metadata)
- ficus_project_fna.tsv (nucleotide assembly metadata)
- ficus_project_faa.tsv (amino acid assembly metadata)
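The linking described in this step can be sketched as a group-by over the rows of a TSV file. This is an illustration only, not the notebook's actual code: the column names `project_id` and `file_id`, the record key `id`, and the `has_output` field name are all assumptions made for the example, and the real TSV files may use different headers.

```python
import csv
import io
from collections import defaultdict


def link_outputs(omics_records, tsv_text, project_col="project_id", file_col="file_id"):
    """Return copies of omics_records with a has_output list of data object IDs."""
    outputs = defaultdict(list)
    # Group data object IDs by the project they belong to (hypothetical columns).
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        outputs[row[project_col]].append(row[file_col])
    # Build new dicts so the input records are left unmodified.
    return [{**rec, "has_output": outputs.get(rec["id"], [])} for rec in omics_records]


if __name__ == "__main__":
    records = [{"id": "P1"}, {"id": "P2"}]
    tsv = "project_id\tfile_id\nP1\tF1\nP1\tF2\n"
    print(link_outputs(records, tsv))
```

Records with no matching rows receive an empty list, which keeps the output shape uniform for downstream steps.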
Metadata from EMSL are input into the translation notebook. Omics processing and data object metadata are updated to include links between omics processings and studies, and between omics processings and their outputs (i.e., data objects). The final output of this step is a set of JSON files that are ingested by the NMDC pilot site.
The metadata files for this step are contained in:
- EMSL_FICUS_project_process_data_export.xlsx: contains EMSL experiment metadata.
- FICUS - JGI-EMSL Proposal - Gold Study - ID mapping and PI.xlsx: contains mappings between EMSL experiment metadata and GOLD studies.
Develop a more automated ETL pipeline. This may (or may not) include making use of Papermill to execute batch runs of notebooks.
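If Papermill is adopted, a batch run could look roughly like the sketch below. It is a possible direction rather than a settled design: the helper names `plan_runs` and `run_pipeline`, the `./output/executed` directory, and the notebook file names in the usage example are all hypothetical. The only Papermill call used is `papermill.execute_notebook(input_path, output_path)`.

```python
from pathlib import Path


def plan_runs(notebooks, executed_dir="./output/executed"):
    """Pair each input notebook with the path where its executed copy is written."""
    out = Path(executed_dir)
    return [(str(nb), str(out / Path(nb).name)) for nb in notebooks]


def run_pipeline(notebooks, executed_dir="./output/executed"):
    """Execute the notebooks one after another, stopping at the first failure."""
    import papermill as pm  # imported lazily; requires `pip install papermill`

    Path(executed_dir).mkdir(parents=True, exist_ok=True)
    for input_path, output_path in plan_runs(notebooks, executed_dir):
        # Papermill raises an exception if a cell fails, which ends the loop.
        pm.execute_notebook(input_path, output_path)


if __name__ == "__main__":
    # Hypothetical notebook names, listed in the order they should run.
    for pair in plan_runs(["first_step.ipynb", "second_step.ipynb"]):
        print(pair)
```

Keeping the executed copies separate from the source notebooks preserves a record of each run without modifying the notebooks under version control.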