A data validation tool.
The abacus
repository includes scripts and tools that facilitate various forms
of validation between datasets and their data dictionaries (data expectations).
- Create and activate a virtual environment (recommended):

  See here for more on virtual environments.

  ```shell
  # Step 1: cd into the directory where the venv will be stored.
  # Step 2: run this code. It will create a virtual env named abacus_venv in the current directory.
  python3 -m venv abacus_venv
  # Step 3: run this code. It will activate the abacus_venv environment.
  source abacus_venv/bin/activate
  # On Windows: abacus_venv\Scripts\activate
  # You are ready for installations!
  # If you want to deactivate the venv, run:
  deactivate
  ```
- Install the package and dependencies:

  ```shell
  pip install git+https://github.com/NIH-NCPI/abacus.git
  ```
- Run a command/action:
validate_csv

Runs Cerberus validation on a data dictionary/dataset pair and returns the results of the validation in the terminal. See data expectations here.

```shell
validate_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset; choose one (e.g. NA, na, null, ...)}

# example
validate_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA
```
summarize_csv

Returns aggregates and attributes of the provided dataset, which are exported as a YAML file. See data expectations here.

```shell
summarize_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset; choose one (e.g. NA, na, null, ...)} -e {export/filepath/summary.yaml}

# example
summarize_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA -e data/output/summary.yaml
```
validate_linkml

Runs LinkML validation on a data dictionary/dataset pair and returns the results of the validation in the terminal. Run it from the directory that contains the data files (data dictionary, dataset, AND IMPORTS-adjoining data dictionaries). See data expectations here.

```shell
validate_linkml -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -dc {data class - linkml tree_root}

# example
validate_linkml -dd data/input/assay.yaml -dt data/input/assay_data.yaml -dc Assay
```
Visit this link for more in-depth specs.
Datasets should be CSVs, follow the format described by the data dictionary, and have consistent notation of missing data [NULL, NA, etc.].
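Because the `-m` flag takes a single missing-value format, it can help to confirm the dataset uses one notation consistently before validating. Below is a minimal, stdlib-only sketch of such a pre-check; it is not part of abacus, and the token set and function name are hypothetical:

```python
import csv
import io

# Hypothetical set of notations that might mark missing data.
MISSING_TOKENS = {"NA", "na", "null", "NULL", ""}

def missing_tokens_used(csv_text: str) -> set:
    """Return the set of missing-value tokens that appear in the data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return {cell for row in reader for cell in row if cell in MISSING_TOKENS}

sample = "studyCode,age\nStudy1,NA\nStudy2,null\n"
print(sorted(missing_tokens_used(sample)))  # ['NA', 'null'] -> mixed notation
```

If more than one token comes back, normalize the dataset to a single notation before passing it with `-m`.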
Data dictionaries should be a YAML file formatted for LinkML and contain all dataset expectations for validation. Validation requires that all data dictionaries referenced in the `imports` section be present in the same file location. Imports beginning with `linkml:` can be ignored. Example seen below.

```yaml
id: https://w3id.org/include/assay
imports:
- linkml:types
- include_core
- include_participant
- include_study
```
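The import-resolution rule above can be sketched as a quick pre-flight check. This is illustrative only (not an abacus function), and it assumes each non-`linkml:` import resolves to a `<name>.yaml` file in the directory that holds the data files:

```python
from pathlib import Path

def missing_imports(imports, data_dir):
    """Report imported data dictionaries that are not present in data_dir.

    Imports beginning with "linkml:" are skipped, since they do not need
    to be present locally.
    """
    needed = [name for name in imports if not name.startswith("linkml:")]
    # Assumption: each import resolves to "<name>.yaml" in data_dir.
    return [name for name in needed if not (data_dir / f"{name}.yaml").exists()]

imports = ["linkml:types", "include_core", "include_participant", "include_study"]
print(missing_imports(imports, Path("data/input")))
```

Any names it reports would need to be copied into the same location before validation can succeed.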
Datasets should be a YAML, JSON, or CSV file formatted for LinkML, follow the format described by the data dictionary, and have consistent notation of missing data [NULL, NA, etc.].

If the dataset is a CSV, multivalue fields should have pipe separators. See examples below.

```yaml
# YAML file representation
# Instances of the Biospecimen class
- studyCode: "Study1"
  participantGlobalId: "PID123"
  ...
- studyCode: "Study1"
  participantGlobalId: "PID123"
```

CSV representation:

```csv
studyCode,studyTitle,program
study_code,Study of Cancer,program1|program2
```
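For reference, a pipe-separated CSV row like the one above can be expanded into lists after parsing. A stdlib-only sketch (not part of abacus; `split_multivalue` and the choice of multivalue fields are hypothetical):

```python
import csv
import io

def split_multivalue(row, multivalue_fields):
    """Split pipe-separated multivalue fields of a parsed CSV row into lists."""
    return {
        key: value.split("|") if key in multivalue_fields else value
        for key, value in row.items()
    }

csv_text = "studyCode,studyTitle,program\nstudy_code,Study of Cancer,program1|program2\n"
reader = csv.DictReader(io.StringIO(csv_text))
rows = [split_multivalue(row, {"program"}) for row in reader]
print(rows[0]["program"])  # ['program1', 'program2']
```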
If working on a new feature, it is possible to install a package version from a remote or local branch:

```shell
# remote
pip install git+https://github.com/NIH-NCPI/abacus.git@{branch_name}

# local
pip install -e .
```