A data validation tool.
The abacus repository includes scripts and tools that facilitate various forms
of validation between datasets and their data dictionaries (data expectations).
- Run me:

    pip install git+https://github.com/NIH-NCPI/abacus.git

- Commands here
- Create and activate a virtual environment (SKIP if installing as a package):
  If you want to run the scripts locally, it is recommended that you use a virtual environment to keep the imports siloed. This can reduce future import issues.
  Here for more on virtual environments.

    # Step 1: cd into the directory to store the venv
    # Step 2: run this code. It will create the virtual env named abacus_venv in the current directory.
    python3 -m venv abacus_venv
    # Step 3: run this code. It will activate the abacus_venv environment
    source abacus_venv/bin/activate  # On Windows: abacus_venv\Scripts\activate
    # You are ready for installations!
    # If you want to deactivate the venv run:
    deactivate
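To confirm that the environment is actually active before installing anything, a short check can be run with the interpreter on your path. This is a minimal sketch using only the standard library; the function name `in_virtualenv` is chosen for illustration and is not part of abacus.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line prints `False`, the `source abacus_venv/bin/activate` step has probably not been run in the current shell.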
- Install the package and dependencies:
  If you have cloned the repo and are running locally, run this command from the root of the repository.

    pip install git+https://github.com/NIH-NCPI/abacus.git
- Run a command/action
  NOTE: If you have cloned the repo and are running locally, run these commands from abacus/src/abacus.
validate_csv runs Cerberus validation on a data dictionary/dataset pair and returns the validation results in the terminal.
See data expectations here

    validate_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {Format of missing values in the dataset, choose one (e.g. NA, na, null, ...)}
    # example
    validate_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA

summarize_csv returns aggregates and attributes of the provided dataset, which are exported as a yaml file.
See data expectations here

    summarize_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {Format of missing values in the dataset, choose one (e.g. NA, na, null, ...)} -e {export/filepath/summary.yaml}
    # example
    summarize_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA -e data/output/summary.yaml

validate_linkml runs LinkML validation on a data dictionary/dataset pair and returns the validation results in the terminal. Run it from the directory that contains the data files (data dictionary, dataset, AND imports: adjoining data dictionaries).
See data expectations here

    validate_linkml -dd {path/to/datadictionary.yaml} -dt {path/to/dataset.csv} -dc {data class - linkml tree_root}
    # example
    validate_linkml -dd data/input/assay.yaml -dt data/input/assay_data.yaml -dc Assay

validate_dd validates a data dictionary CSV against a named schema (provided by the pre-pipeline-validator package). Each row in the data dictionary is checked for correct formatting, required fields, valid data types, and numeric constraint usage. Results are written to a CSV report.

    validate_dd <tgt_schema> <data_dictionary_path> <output_csv_path>
    # example
    validate_dd example_data_dictionary data/input/my_data_dictionary.csv data/output/dd_validation_results.csv

validate_df validates a data file CSV against a data dictionary. The schema is built dynamically from the data dictionary and checks column presence, data types, allowed values, numeric ranges, and required field completeness.

    validate_df <data_dictionary_path> <datafile_path> <output_csv_path>
    # example
    validate_df data/input/my_data_dictionary.csv data/input/my_datafile.csv data/output/df_validation_results.csv

Note: Data files should not be validated against a failing data dictionary.
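To make the kinds of checks listed for validate_df concrete (column presence, data types, allowed values, numeric ranges, required field completeness), here is a rough standard-library sketch. It is not how abacus or the pre-pipeline-validator package implements them: the dictionary keys (`column`, `type`, `required`, `allowed`, `min`, `max`), the sample data, and the function names are all hypothetical.

```python
import csv
import io

# Hypothetical data dictionary, one entry per dataset column.
# These keys are illustrative only, not the format abacus expects.
DATA_DICTIONARY = [
    {"column": "participant_id", "type": "string", "required": True},
    {"column": "age", "type": "integer", "required": True, "min": 0, "max": 120},
    {"column": "sex", "type": "string", "required": False, "allowed": ["F", "M"]},
]

# Map the illustrative type names to Python casting functions.
CASTS = {"string": str, "integer": int, "number": float}

# Made-up data file content used by run_example().
SAMPLE = """participant_id,age,sex
P1,34,F
P2,NA,M
P3,abc,X
P4,130,F
"""


def check_rows(dictionary, rows, header, missing="NA"):
    """Return a list of (row_number, column, message) problems."""
    problems = []
    # Column presence: every dictionary column must exist in the data file.
    for field in dictionary:
        if field["column"] not in header:
            problems.append((0, field["column"], "column missing from data file"))
    for row_number, row in enumerate(rows, start=1):
        for field in dictionary:
            name = field["column"]
            if name not in header:
                continue
            raw = row[name]
            # Required field completeness.
            if raw == "" or raw == missing:
                if field.get("required"):
                    problems.append((row_number, name, "required value is missing"))
                continue
            # Data type.
            try:
                value = CASTS[field["type"]](raw)
            except ValueError:
                problems.append((row_number, name, f"expected {field['type']}, got {raw!r}"))
                continue
            # Allowed values.
            if "allowed" in field and value not in field["allowed"]:
                problems.append((row_number, name, f"{raw!r} is not an allowed value"))
            # Numeric range.
            if "min" in field and value < field["min"]:
                problems.append((row_number, name, f"{raw} is below minimum {field['min']}"))
            if "max" in field and value > field["max"]:
                problems.append((row_number, name, f"{raw} is above maximum {field['max']}"))
    return problems


def run_example():
    """Run the checks against the made-up SAMPLE data."""
    reader = csv.DictReader(io.StringIO(SAMPLE))
    rows = list(reader)
    return check_rows(DATA_DICTIONARY, rows, reader.fieldnames)


if __name__ == "__main__":
    for problem in run_example():
        print(problem)
```

The sketch also shows why the note about failing data dictionaries matters: every check is derived from the dictionary, so an invalid dictionary entry would make the results for the data file unreliable.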
Visit this link for more in-depth specs
Datasets should be csv files, follow the format described by the data dictionary, and have consistent notation of missing data [NULL, NA, etc.].
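Because the commands accept a single missing-value format through `-m`, it can help to check beforehand that a dataset uses only one notation. The following is a minimal sketch using the standard library; the candidate token list and the function name `missing_tokens_used` are assumptions made for illustration, not part of abacus.

```python
import csv
import io
from collections import Counter

# Tokens commonly used to mark missing data. This list is an assumption;
# adjust it to whatever notations your datasets might contain.
CANDIDATE_TOKENS = {"NA", "na", "N/A", "NULL", "null", "None", "NaN"}


def missing_tokens_used(csv_text):
    """Count how often each candidate missing-value token appears in the data cells."""
    counts = Counter()
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        for cell in row:
            token = cell.strip()
            if token in CANDIDATE_TOKENS:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    sample = "id,age,sex\n1,NA,F\n2,null,M\n3,40,NA\n"
    counts = missing_tokens_used(sample)
    print(dict(counts))
    if len(counts) > 1:
        print("inconsistent missing-value notation:", sorted(counts))
```

More than one distinct token in the result suggests the dataset should be cleaned to a single notation before validation.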
Data dictionaries should be yaml files formatted for LinkML and contain all dataset expectations for validation. Validation requires all data dictionaries referenced in the imports section to be present in the same file location. Imports beginning with linkml: can be ignored.
Example seen below.

    id: https://w3id.org/include/assay
    imports:
      - linkml:types
      - include_core
      - include_participant
      - include_study

Datasets should be yaml, json, or csv files formatted for LinkML, follow the format described by the data dictionary, and have consistent notation of missing data [NULL, NA, etc.].
If the dataset is a csv, multivalue fields should have pipe separators.
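As an illustration of the pipe-separator convention, the sketch below reads a CSV and splits the named multivalue columns into lists. It uses only the standard library; the function name `read_rows` is hypothetical, and treating `program` as the multivalue column is an assumption based on the sample row from this README's CSV representation example.

```python
import csv
import io


def read_rows(csv_text, multivalue_columns, separator="|"):
    """Parse CSV text and split the given multivalue columns on the separator."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in multivalue_columns:
            raw = row.get(column) or ""
            # Drop empty fragments so "a||b" or a trailing "|" do not yield blanks.
            row[column] = [part.strip() for part in raw.split(separator) if part.strip()]
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = "studyCode,studyTitle,program\nstudy_code,Study of Cancer,program1|program2\n"
    for row in read_rows(sample, multivalue_columns=["program"]):
        print(row)
```

Which columns count as multivalue would normally come from the data dictionary rather than being hard-coded as in this sketch.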
See examples below.

    # Yaml file representation
    # Instances of Biospecimen class
    - studyCode: "Study1"
      participantGlobalId: "PID123"
      ...
      ...
      ...
    - studyCode: "Study1"
      participantGlobalId: "PID123"

CSV representation

    studyCode,studyTitle,program
    study_code,Study of Cancer,program1|program2

If working on a new feature, it is possible to install a package version from the remote or local branch. These commands should be run from the project root.

    # remote
    pip install git+https://github.com/NIH-NCPI/abacus.git@{branch_name}
    # local
    pip install -e .
    # handy troubleshooting commands when unsure of version.
    pip install --upgrade abacus
    pip install --upgrade abacus==2.0.0
    pip uninstall abacus -y