Notebooks

We use Jupyter notebooks to integrate the metadata sources. This allows us to iterate quickly in a transparent and interactive manner as new metadata sources become available.

Notebook setup

We use venv to create a virtual environment.

To create a virtual environment, run the command python3 -m venv <environment name>.
I typically name my environment .env and configure .gitignore to ignore .env. This prevents the environment's libraries from being uploaded to the repository.

After the environment is created, run the command source <environment name>/bin/activate to enter the environment.
Once in the environment, run the command pip install -r requirements.txt to install the necessary Python libraries.

You exit the environment by executing the command deactivate in the terminal.

Translation workflow

The translation pipeline notebook translates metadata to JSON. Operationally, it executes the following notebooks in order:

  1. Translate GOLD study, project, biosample
  2. Translate GOLD data objects
  3. Translate EMSL data

The final output is saved to the directory ./output/nmdc-json/.

A high-level overview of the translation process is depicted below. At each step, metadata and the NMDC schema are input into the translation notebook, and JSON files are created. The output of the last step is forwarded to the web-development team for ingestion and display on the NMDC pilot site.

img

Translate GOLD study, project, biosample

Metadata from GOLD and mapping information are input into the translation process notebook. The output of this initial step consists of a set of JSON files with metadata about NMDC studies, omics processings, and biosamples.
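As a rough illustration of how mapping information can drive this kind of translation, the sketch below renames source fields to target fields and serializes the result as JSON. The column names, the mapping, and the helper translate_record are hypothetical and are not taken from the notebooks; the real translation also relies on the NMDC schema.

```python
import json


def translate_record(record, mapping):
    """Rename source fields to target fields, skipping missing or empty values."""
    return {
        target: record[source]
        for source, target in mapping.items()
        if record.get(source) not in (None, "")
    }


# Hypothetical source record and mapping, for illustration only.
gold_study = {"gold_id": "Gs0000001", "study_name": "Example study", "notes": ""}
study_mapping = {"gold_id": "id", "study_name": "name", "notes": "description"}

translated = translate_record(gold_study, study_mapping)

# Serialize a list of translated records, as one JSON file would contain.
study_json = json.dumps([translated], indent=2)
```

Keeping the mapping as plain data rather than code means a new metadata source could, in principle, be supported by adding a mapping instead of editing the translation logic.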

The metadata files for this step are contained in:

img

Translate GOLD data objects

Metadata from JAMO are input into the translation notebook. Omics processing metadata is updated to include links between each omics processing and its outputs (i.e., data objects).
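The linking step might look roughly like the following sketch, which attaches the ids of data objects to the omics processing that produced them. The field names id, project_id, and has_output, and the function link_outputs, are assumptions made for illustration; the notebook's actual field names and logic may differ.

```python
from collections import defaultdict


def link_outputs(omics_processings, data_objects):
    """Return copies of the omics processing records with their outputs attached."""
    # Group data object ids by the (hypothetical) "project_id" key.
    outputs_by_project = defaultdict(list)
    for obj in data_objects:
        outputs_by_project[obj["project_id"]].append(obj["id"])

    linked = []
    for proc in omics_processings:
        updated = dict(proc)  # shallow copy so the input records are left untouched
        updated["has_output"] = outputs_by_project.get(proc["id"], [])
        linked.append(updated)
    return linked


# Hypothetical records, for illustration only.
omics = [{"id": "proj:1", "name": "Example sequencing project"}]
objects = [
    {"id": "obj:a", "project_id": "proj:1"},
    {"id": "obj:b", "project_id": "proj:1"},
]
linked = link_outputs(omics, objects)
```

An omics processing with no matching data objects receives an empty list rather than being dropped, so every record from the earlier step is carried forward.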

The metadata files for this step are contained in:

img

Translate EMSL data

Metadata from EMSL are input into the translation notebook. Omics processing and data object metadata are updated to include links between omics processings and studies, and between omics processings and their outputs (i.e., data objects). The final output of this step is a set of JSON files that are ingested by the NMDC pilot site.

The metadata files for this step are contained in:

img

Future work

Develop a more automated ETL pipeline. This may (or may not) include making use of Papermill to execute batch runs of notebooks.
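One possible shape for such a batch run is sketched below. It assumes Papermill is installed (pip install papermill) and uses its execute_notebook function; the notebook file names, the executed output directory, and the run_pipeline helper are hypothetical placeholders, not names from this repository.

```python
from pathlib import Path

# Hypothetical notebook file names; the real names in the repository may differ.
NOTEBOOKS = [
    "translate-GOLD-study-project-biosample.ipynb",
    "translate-GOLD-data-objects.ipynb",
    "translate-EMSL-data.ipynb",
]


def run_with_papermill(input_path, output_path):
    """Execute one notebook with Papermill and save the executed copy."""
    import papermill as pm  # imported lazily so this module loads without it

    pm.execute_notebook(str(input_path), str(output_path))


def run_pipeline(notebooks, executed_dir, execute=run_with_papermill):
    """Run the notebooks in order, saving each executed copy in executed_dir."""
    executed_dir = Path(executed_dir)
    executed_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for name in notebooks:
        output_path = executed_dir / Path(name).name
        execute(Path(name), output_path)  # an exception here stops later steps
        outputs.append(output_path)
    return outputs
```

Calling run_pipeline(NOTEBOOKS, "executed") would run the three translation steps in sequence. Because each step depends on the output of the previous one, the sketch deliberately lets a failure in one notebook stop the run instead of continuing.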