Abacus

A data validation tool.

Overview

The abacus repository includes scripts and tools for validating datasets against their data dictionaries (data expectations).

TLDR/Quick start:

Run me: pip install git+https://github.com/NIH-NCPI/abacus.git
Commands here

Installation

Create and activate a virtual environment (SKIP if installing as a package):
If you want to run the scripts locally, it is recommended that you use a virtual environment to keep the installed dependencies isolated. This can reduce future import issues.
Here for more on virtual environments.

# Step 1: cd into the directory to store the venv

# Step 2: run this code. It will create the virtual env named abacus_venv in the current directory.
python3 -m venv abacus_venv

# Step 3: run this code. It will activate the abacus_venv environment
source abacus_venv/bin/activate # On Windows: abacus_venv\Scripts\activate

# You are ready for installations! 
# If you want to deactivate the venv run:
deactivate

Install the package and dependencies:

If you have cloned the repo and are running it locally, run this command from the root of the repository.
```
pip install git+https://github.com/NIH-NCPI/abacus.git
```

Run a command/action

Available actions:
Commands

NOTE: If you have cloned the repo and are running it locally, run these commands from abacus/src/abacus.

validate_csv

validate_csv runs cerberus validation on a data dictionary/dataset pair and prints the validation results in the terminal.
See data expectations here
```
validate_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset; choose one (e.g. NA, na, null, ...)}

# example
validate_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA 
```
summarize_csv

summarize_csv computes aggregates and attributes of the provided dataset and exports them as a yaml file.
See data expectations here
```
summarize_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset; choose one (e.g. NA, na, null, ...)} -e {export/filepath/summary.yaml}

# example 
summarize_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA -e data/output/summary.yaml
```
validate_linkml

validate_linkml runs linkml validation on a data dictionary/dataset pair and prints the validation results in the terminal. Run it from the directory that contains the data files (the data dictionary, the dataset, AND any adjoining data dictionaries listed under imports).
See data expectations here
```
validate_linkml -dd {path/to/datadictionary.yaml} -dt {path/to/dataset.yaml} -dc {data class - linkml tree_root}

# example 
validate_linkml -dd data/input/assay.yaml -dt data/input/assay_data.yaml -dc Assay
```
validate_dd

validate_dd validates a data dictionary CSV against a named schema (provided by the pre-pipeline-validator package). Each row in the data dictionary is checked for correct formatting, required fields, valid data types, and numeric constraint usage. Results are written to a CSV report.
```
validate_dd <tgt_schema> <data_dictionary_path> <output_csv_path>

# example
validate_dd example_data_dictionary data/input/my_data_dictionary.csv data/output/dd_validation_results.csv
```
validate_df

validate_df validates a data file CSV against a data dictionary. The schema is built dynamically from the data dictionary and checks column presence, data types, allowed values, numeric ranges, and required field completeness.
```
validate_df <data_dictionary_path> <datafile_path> <output_csv_path>

# example
validate_df data/input/my_data_dictionary.csv data/input/my_datafile.csv data/output/df_validation_results.csv
```
Note: Data files should not be validated against a failing data dictionary.

Data Expectations

csv - validation(cerberus) and summary

data dictionary format:

Visit this link for more in-depth specs

dataset format:

Datasets should be csv files, follow the format described by the data dictionary, and use a consistent notation for missing data [NULL, NA, etc.].
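
One way to check a dataset for consistent missing-data notation before choosing a value for -m is to count which common missing-value tokens appear in it. The sketch below is not part of abacus: the function name `count_missing_tokens` and the candidate token list are assumptions for illustration, and the sample data is made up.

```python
import csv
import io
from collections import Counter

# Candidate tokens are an assumption; extend the set to match your data.
CANDIDATE_TOKENS = {"NA", "na", "N/A", "NULL", "null", "None", ""}


def count_missing_tokens(csv_text, candidates=CANDIDATE_TOKENS):
    """Count how often each candidate missing-value token appears in the data rows."""
    counts = Counter()
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        for cell in row:
            value = cell.strip()
            if value in candidates:
                counts[value] += 1
    return counts


# Made-up sample that mixes two notations ("NA" and "null").
sample = "id,age,sex\n1,34,NA\n2,null,F\n3,NA,M\n"
found = count_missing_tokens(sample)
print(dict(found))
if len(found) > 1:
    print("Mixed missing-value notation; pick one before passing -m.")
```

If more than one token shows up, normalizing the dataset to a single notation first keeps the -m argument unambiguous.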

yaml/json - validation(linkml)

data dictionary format:

Data dictionaries should be yaml files formatted for linkml and contain all dataset expectations for validation. Validation requires every data dictionary referenced in the imports section to be present in the same directory. Imports beginning with linkml: can be ignored.
See the example below.
```
id: https://w3id.org/include/assay
imports:
- linkml:types
- include_core
- include_participant
- include_study
```
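
Because validation needs the imported data dictionaries to sit next to the one being validated, a small pre-flight check can list any that are missing. This is a minimal sketch, not abacus code. It assumes each import name maps to a file called `<name>.yaml` in the same directory, and it reads the `imports:` list with a simple line scan instead of a full YAML parser. The function names `read_local_imports` and `missing_imports` are hypothetical.

```python
from pathlib import Path
import tempfile


def read_local_imports(schema_text):
    """Return import names from a simple block-style `imports:` list, skipping `linkml:` ones."""
    names = []
    in_imports = False
    for line in schema_text.splitlines():
        stripped = line.strip()
        if stripped == "imports:":
            in_imports = True
        elif in_imports and stripped.startswith("- "):
            name = stripped[2:].strip().strip("'\"")
            if not name.startswith("linkml:"):
                names.append(name)
        elif in_imports and stripped:
            break  # first non-list line ends the imports block
    return names


def missing_imports(schema_path):
    """List imported data dictionaries not found next to schema_path (assumes <name>.yaml)."""
    schema_path = Path(schema_path)
    names = read_local_imports(schema_path.read_text())
    return [n for n in names if not (schema_path.parent / f"{n}.yaml").exists()]


# Demo in a temporary directory where only one of the three local imports exists.
with tempfile.TemporaryDirectory() as tmp:
    schema = Path(tmp) / "assay.yaml"
    schema.write_text(
        "id: https://w3id.org/include/assay\n"
        "imports:\n"
        "- linkml:types\n"
        "- include_core\n"
        "- include_participant\n"
        "- include_study\n"
    )
    (Path(tmp) / "include_core.yaml").write_text("id: core\n")
    missing = missing_imports(schema)
    print(missing)
```

An empty list means every non-linkml import was found beside the data dictionary under the naming assumption above.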
dataset format:

Datasets should be yaml, json or csv files formatted for linkml, follow the format described by the data dictionary, and use a consistent notation for missing data [NULL, NA, etc.].

If the dataset is a csv, multivalue fields should use pipe (|) separators.
See examples below.
```
# Yaml file representation
# Instances of Biospecimen class
- studyCode: "Study1"
  participantGlobalId: "PID123"
  ...
  ...
  ...
- studyCode: "Study1"
  participantGlobalId: "PID123"
```
CSV representation
```
studyCode,studyTitle,program
study_code,Study of Cancer,program1|program2
```
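
To illustrate how a pipe-separated multivalue field reads once split, the sketch below turns the `program` column from the CSV example into a list. It is an illustration only, not how abacus or linkml loads data. The helper name `split_multivalue` is hypothetical, and the caller is assumed to know which columns are multivalued.

```python
import csv
import io


def split_multivalue(row, multivalue_columns, sep="|"):
    """Return a copy of row with the named columns split on sep into lists."""
    out = dict(row)
    for col in multivalue_columns:
        raw = out.get(col) or ""  # treat a missing column as empty
        out[col] = [part.strip() for part in raw.split(sep) if part.strip()]
    return out


csv_text = "studyCode,studyTitle,program\nstudy_code,Study of Cancer,program1|program2\n"
rows = [split_multivalue(r, ["program"]) for r in csv.DictReader(io.StringIO(csv_text))]
print(rows[0]["program"])
```

Single-valued columns such as `studyTitle` are left untouched, so only the columns named by the caller become lists.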
Working on a branch?

If you are working on a new feature, you can install the package version from a remote or local branch. Run these commands from the project root.
```
# remote
pip install git+https://github.com/NIH-NCPI/abacus.git@{branch_name}

# local
pip install -e .

# handy troubleshooting commands when unsure of version.
pip install --upgrade abacus
pip install --upgrade abacus==2.0.0
pip uninstall abacus -y
```
