archive
data
images
lib
output
README.md
gold-biosample-null-value-analysis.ipynb
requirements.txt
translate-EMSL-data.ipynb
translate-GOLD-data-objects.ipynb
translate-GOLD-study-project-biosample.ipynb
translation-pipeline.ipynb

README.md

Notebooks

We use Jupyter notebooks to integrate the metadata sources. This allows us to iterate quickly in a transparent and interactive manner as new metadata sources become available.

Notebook setup

We use venv to create a virtual environment.

To create a virtual environment, run the command python3 -m venv <environment name>.
I typically name my environment .env and configure .gitignore to ignore .env. This prevents the environment's libraries from being uploaded to the repository.

After the environment is created, run the command source <environment name>/bin/activate to enter the environment.
Once in the environment, run the command pip install -r requirements.txt to install the necessary Python libraries.

To exit the environment, run the command deactivate in the terminal.

Translation workflow

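The translation pipeline notebook (translation-pipeline.ipynb) translates metadata to JSON. Operationally, it executes the following notebooks in order:

translate-GOLD-study-project-biosample.ipynb
translate-GOLD-data-objects.ipynb
translate-EMSL-data.ipynb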

The final output is saved to the directory ./output/nmdc-json/.
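
The ordered execution described above could be scripted roughly as follows. This is a minimal sketch, not the pipeline notebook's actual code: the notebook order is assumed to follow the order of the sections in this README, and running the notebooks through jupyter nbconvert is an assumption as well. The function only builds the commands, so nothing is executed until each command is passed to subprocess.run.

```python
# Assumed order: mirrors the order of the translation sections in this README.
NOTEBOOKS = [
    "translate-GOLD-study-project-biosample.ipynb",
    "translate-GOLD-data-objects.ipynb",
    "translate-EMSL-data.ipynb",
]


def build_commands(notebooks=NOTEBOOKS):
    """Return one `jupyter nbconvert` command per notebook, in run order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb]
        for nb in notebooks
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # To execute, replace print(...) with subprocess.run(cmd, check=True).
    for cmd in build_commands():
        print(" ".join(cmd))
```

Building the commands separately from running them keeps the order easy to review and lets a failed step stop the run (check=True) before later notebooks read stale output.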

A high-level overview of the translation process is depicted below. At each step, metadata and the NMDC schema are input into the translation notebook, and JSON files are created. The output of the last step is forwarded to the web-development team for ingestion and display on the NMDC pilot site.

Translate GOLD study, project, biosample

Metadata from GOLD and mapping information are input into the translation notebook. The output of this initial step is a set of JSON files with metadata about NMDC studies, omics processings, and biosamples.

The metadata files for this step are contained in:

nmdc-version2.zip: contains GOLD's metadata.
JGI-EMSL-FICUS-proposals.fnl.xlsx: contains mappings between GOLD's studies and EMSL proposals.

Translate GOLD data objects

Metadata from JAMO are input into the translation notebook. The omics processing metadata is updated to link each omics processing to its outputs (i.e., data objects).

The metadata files for this step are contained in:

ficus_project_fastq.tsv (sequencing metadata)
ficus_project_fna.tsv (nucleotide assembly metadata)
ficus_project_faa.tsv (amino acid assembly metadata)
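
The linking step can be illustrated with a small sketch. The column names (project_id, file_name), the record layout, and the has_output field are hypothetical placeholders chosen for illustration; the actual TSV headers and NMDC schema slots may differ.

```python
import csv
import io


def link_data_objects(omics_processings, tsv_text):
    """Attach data objects listed in a TSV to their omics processing records.

    omics_processings: dict mapping a project ID to a record (a dict).
    tsv_text: TSV content with hypothetical columns `project_id` and `file_name`.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        record = omics_processings.get(row["project_id"])
        if record is None:
            continue  # skip rows whose project was not translated earlier
        outputs = record.setdefault("has_output", [])
        if row["file_name"] not in outputs:  # avoid duplicate links
            outputs.append(row["file_name"])
    return omics_processings


# Made-up example data, for illustration only.
example_tsv = (
    "project_id\tfile_name\n"
    "Gp1\treads.fastq.gz\n"
    "Gp1\tcontigs.fna\n"
    "Gp9\torphan.faa\n"
)
example_records = link_data_objects({"Gp1": {"id": "Gp1"}}, example_tsv)
```

In this example the two Gp1 rows are attached to the Gp1 record, while the Gp9 row is skipped because no matching omics processing exists.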

Translate EMSL data

Metadata from EMSL are input into the translation notebook. Omics processing and data object metadata are updated to include links between omics processing and studies and the outputs (i.e., data objects) of omics processings. The final output of this step is a set of JSON files that are ingested by the NMDC pilot site.

The metadata files for this step are contained in:

EMSL_FICUS_project_process_data_export.xlsx: contains EMSL experiment metadata.
FICUS - JGI-EMSL Proposal - Gold Study - ID mapping and PI.xlsx: contains mappings between EMSL experiment metadata and GOLD studies.

Future work

Develop a more automated ETL pipeline. This may (or may not) include making use of Papermill to execute batch runs of notebooks.
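
One possible shape for such a batch run is sketched below. It is an assumption about how Papermill might be used, not an existing part of this repository; the runs directory name and the injectable execute parameter are hypothetical. papermill.execute_notebook(input_path, output_path) is Papermill's entry point for running a notebook, and it is imported lazily here so the function can be exercised with a stand-in when Papermill is not installed.

```python
from pathlib import Path


def run_batch(notebooks, run_dir="runs", execute=None):
    """Execute notebooks in order, writing each executed copy into run_dir."""
    if execute is None:
        # Imported lazily; requires `pip install papermill`.
        import papermill

        execute = papermill.execute_notebook

    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for nb in notebooks:
        out_path = run_dir / Path(nb).name
        # An exception raised here (e.g. a failing cell) stops the batch.
        execute(str(nb), str(out_path))
        executed.append(out_path)
    return executed
```

Writing executed copies to a separate directory leaves the source notebooks unchanged, which keeps repository diffs small and preserves a record of each run.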
