A Go-based version control system designed specifically for tracking changes in datasets (CSV, JSON, binary blobs) with content deduplication and metadata-aware diffs. Built for reproducibility in scientific and machine learning research. Version control for datasets, not code.

BaseMax/go-dataset-vcs

Dataset Version Control System (DSVCS)

A Go-based version control system designed specifically for tracking changes in datasets (CSV, JSON, binary blobs) with content deduplication and metadata-aware diffs. Built for reproducibility in scientific and machine learning research.

Features

Core Capabilities

  • Content-Addressable Storage: SHA-256 based storage with automatic deduplication
  • Format-Aware Diffs: Intelligent diffing for CSV, JSON, and binary data
  • Metadata Tracking: Full metadata support for reproducibility
  • Version History: Complete versioning with parent tracking
  • Lightweight: No database required - filesystem-based storage

Reproducibility Features

  • SHA-256 checksums for data integrity
  • Timestamp tracking for all versions
  • Custom metadata fields (experiment ID, researcher, parameters, etc.)
  • Parent-child version relationships
  • Deterministic content addressing
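As an illustration of how these pieces could fit together, here is a minimal sketch of a version record that carries a checksum, a timestamp, a parent link, and custom metadata, plus an integrity check against the recorded checksum. The `VersionRecord` type, its field names, and the JSON keys are assumptions made for this example; they are not taken from the DSVCS source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// VersionRecord is a hypothetical version record. Field names and JSON keys
// are assumptions for illustration, not the actual DSVCS schema.
type VersionRecord struct {
	ID        string            `json:"id"`
	Hash      string            `json:"hash"`             // SHA-256 of the content, hex-encoded
	Parent    string            `json:"parent,omitempty"` // empty for the first version
	Timestamp time.Time         `json:"timestamp"`
	Author    string            `json:"author,omitempty"`
	Message   string            `json:"message"`
	Metadata  map[string]string `json:"metadata,omitempty"`
}

// checksum returns the hex-encoded SHA-256 digest of content.
func checksum(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// verify reports whether content still matches the checksum recorded in v.
func verify(v VersionRecord, content []byte) bool {
	return checksum(content) == v.Hash
}

func main() {
	content := []byte("id,value\n1,10\n")
	v := VersionRecord{
		ID:        "v1",
		Hash:      checksum(content),
		Timestamp: time.Now().UTC(),
		Author:    "Alice",
		Message:   "Initial import",
		Metadata:  map[string]string{"experiment": "exp001"},
	}

	out, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))

	// The unchanged content verifies; modified content does not.
	fmt.Println(verify(v, content), verify(v, []byte("tampered")))
}
```

Because the checksum depends only on the bytes, the same content always yields the same hash, which is what makes the addressing deterministic.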

Installation

go install github.com/BaseMax/go-dataset-vcs/cmd/dsvcs@latest

Or build from source:

git clone https://github.com/BaseMax/go-dataset-vcs.git
cd go-dataset-vcs
go build -o dsvcs ./cmd/dsvcs

Quick Start

Initialize a Repository

dsvcs init

This creates a .dsvcs directory in your current folder.

Add a Dataset

dsvcs add experiment1 data/results.csv csv

Commit Changes

dsvcs commit experiment1 data/updated_results.csv "Added new samples" "Dr. Smith"

View History

dsvcs log experiment1

Compare Versions

dsvcs diff experiment1 <hash1> <hash2>

Checkout a Version

dsvcs checkout <hash> output.csv

Usage Examples

Example 1: Tracking ML Experiment Results

# Initialize repository
dsvcs init

# Add initial training results
dsvcs add model_accuracy training_results_v1.csv csv

# After hyperparameter tuning
dsvcs commit model_accuracy training_results_v2.csv "Tuned learning rate to 0.001" "alice@lab.edu"

# View changes
dsvcs log model_accuracy

Example 2: Managing Configuration Files

# Track model configuration
dsvcs add config model_config.json json

# Update configuration
dsvcs commit config updated_config.json "Increased layers to 5" "bob@research.org"

# Compare configurations
dsvcs diff config <old_hash> <new_hash>

Example 3: Binary Data Tracking

# Track model weights
dsvcs add weights model.pkl binary

# Update after training
dsvcs commit weights model_epoch_10.pkl "Epoch 10 checkpoint" "researcher"

CLI Commands

init [path]

Initialize a new repository at path (default: .dsvcs)

add <name> <file> [format]

Add a new dataset to track

  • name: Dataset identifier
  • file: Path to the data file
  • format: csv, json, or binary (auto-detected if omitted)
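The README does not say how the format is auto-detected. One plausible approach is to look at the file extension and fall back to binary for anything unrecognized. The sketch below shows that idea only; `detectFormat` is a hypothetical helper and the real detection logic may differ (for example, it might inspect the file contents).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// detectFormat guesses a dataset format from the file extension.
// Hypothetical helper: the rule "unknown extension means binary" is an
// assumption, not documented DSVCS behavior.
func detectFormat(path string) string {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".csv":
		return "csv"
	case ".json":
		return "json"
	default:
		return "binary"
	}
}

func main() {
	for _, p := range []string{"data/results.csv", "model_config.json", "model.pkl"} {
		fmt.Printf("%s -> %s\n", p, detectFormat(p))
	}
}
```

Passing the format explicitly, as in the examples above, avoids any dependence on how detection is implemented.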

commit <name> <file> <message> [author]

Commit a new version of a dataset

  • name: Dataset identifier
  • file: Path to updated data file
  • message: Commit message
  • author: Author name (optional)

diff <name> <hash1> <hash2>

Show differences between two versions

  • Displays added/deleted/modified rows
  • Format-specific diff output

checkout <hash> [output]

Retrieve a specific version

  • If output is omitted, prints to stdout

log <name>

Show version history for a dataset

  • Displays version ID, hash, author, timestamp, and message
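Since each version records its parent, a history listing can be produced by following parent links from the newest version back to the first. The sketch below illustrates that traversal under stated assumptions: the `Version` type, its fields, and the in-memory map are hypothetical and stand in for whatever DSVCS reads from its version metadata.

```go
package main

import "fmt"

// Version is a hypothetical version record; field names are assumptions.
type Version struct {
	ID      string
	Parent  string // empty for the first version
	Message string
}

// history follows Parent links from head back to the root and returns
// the versions newest first. It stops on a missing record or a cycle.
func history(versions map[string]Version, head string) []Version {
	var out []Version
	seen := make(map[string]bool)
	id := head
	for id != "" && !seen[id] {
		v, ok := versions[id]
		if !ok {
			break
		}
		seen[id] = true
		out = append(out, v)
		id = v.Parent
	}
	return out
}

func main() {
	versions := map[string]Version{
		"v1": {ID: "v1", Message: "Initial results"},
		"v2": {ID: "v2", Parent: "v1", Message: "Added new samples"},
		"v3": {ID: "v3", Parent: "v2", Message: "Tuned learning rate"},
	}
	for _, v := range history(versions, "v3") {
		fmt.Printf("%s  %s\n", v.ID, v.Message)
	}
}
```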

list

List all tracked datasets

status

Show repository status with all datasets and their versions

version

Display version information

Architecture

Storage Structure

.dsvcs/
├── config.json           # Repository configuration
├── objects/              # Content-addressable storage
│   └── ab/              # First 2 chars of hash
│       └── cdef...      # Remaining hash chars
├── refs/                # Dataset references
│   └── dataset_name     # Current state of each dataset
└── versions/            # Version metadata
    └── dataset_name/
        └── version_id.json

Deduplication

DSVCS uses content-addressable storage with SHA-256 hashing. Identical data is stored only once, regardless of how many datasets or versions reference it. This avoids redundant copies when:

  • Tracking datasets with small incremental changes
  • Managing multiple experiments with similar base data
  • Storing checkpoints that share common data
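The deduplication idea can be sketched with a small in-memory store that keys content by its SHA-256 hash and writes an object only if that hash has not been seen before. `Store`, `NewStore`, and `Put` are hypothetical names for this example; the real implementation is filesystem-based and may differ in detail.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Store is a hypothetical in-memory content-addressable store.
type Store struct {
	objects map[string][]byte
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{objects: make(map[string][]byte)}
}

// Put stores content under its SHA-256 hash. It reports the hash and
// whether a new object was actually written.
func (s *Store) Put(content []byte) (hash string, written bool) {
	sum := sha256.Sum256(content)
	hash = hex.EncodeToString(sum[:])
	if _, ok := s.objects[hash]; ok {
		return hash, false // identical bytes already stored
	}
	s.objects[hash] = append([]byte(nil), content...) // store a private copy
	return hash, true
}

func main() {
	s := NewStore()
	base := []byte("id,value\n1,10\n")

	h1, w1 := s.Put(base)
	h2, w2 := s.Put(base) // same bytes, e.g. referenced by another dataset
	_, w3 := s.Put([]byte("id,value\n1,11\n"))

	fmt.Println(h1 == h2, w1, w2, w3, len(s.objects)) // true true false true 2
}
```

Note that this sketch deduplicates whole objects only: changing a single byte produces a different hash and therefore a separate object. Whether DSVCS deduplicates at a finer granularity is not stated here.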

Format-Aware Diffs

CSV Diffs

  • Row-by-row comparison
  • Tracks additions, deletions, and modifications
  • Maintains row position information
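A positional row-by-row comparison of this kind can be sketched with the standard `encoding/csv` package. The `RowChange` type and the `diffCSV` function are hypothetical and only illustrate the approach; the actual DSVCS diff output and matching strategy may differ.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// RowChange describes one differing row. Hypothetical type for this sketch.
type RowChange struct {
	Row      int    // zero-based row position
	Kind     string // "added", "deleted", or "modified"
	Old, New []string
}

// readRows parses CSV text into rows of fields.
func readRows(text string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(text))
	r.FieldsPerRecord = -1 // allow rows with differing field counts
	return r.ReadAll()
}

// equalRow reports whether two rows have the same fields in the same order.
func equalRow(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// diffCSV compares rows at the same position in both inputs.
func diffCSV(oldText, newText string) ([]RowChange, error) {
	oldRows, err := readRows(oldText)
	if err != nil {
		return nil, err
	}
	newRows, err := readRows(newText)
	if err != nil {
		return nil, err
	}
	var changes []RowChange
	for i := 0; i < len(oldRows) || i < len(newRows); i++ {
		switch {
		case i >= len(oldRows):
			changes = append(changes, RowChange{Row: i, Kind: "added", New: newRows[i]})
		case i >= len(newRows):
			changes = append(changes, RowChange{Row: i, Kind: "deleted", Old: oldRows[i]})
		case !equalRow(oldRows[i], newRows[i]):
			changes = append(changes, RowChange{Row: i, Kind: "modified", Old: oldRows[i], New: newRows[i]})
		}
	}
	return changes, nil
}

func main() {
	changes, err := diffCSV("id,value\n1,10\n2,20\n", "id,value\n1,11\n2,20\n3,30\n")
	if err != nil {
		panic(err)
	}
	for _, c := range changes {
		fmt.Printf("row %d %s: %v -> %v\n", c.Row, c.Kind, c.Old, c.New)
	}
}
```

A purely positional comparison reports every following row as modified when a row is inserted in the middle of a file; a key-based or longest-common-subsequence comparison would handle that case better, at the cost of more work.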

JSON Diffs

  • Recursive object and array comparison
  • Detects added, removed, and modified keys
  • Handles nested structures
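The recursive comparison described above can be sketched with `encoding/json` by decoding both documents into generic values and walking them in parallel. The function names and the path notation (`$.key[index]`) are assumptions for this example and do not reflect the actual DSVCS diff format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// diffJSON walks two decoded JSON values and appends one entry to out for
// every added, removed, or modified path. Hypothetical helper.
func diffJSON(path string, a, b any, out *[]string) {
	am, aIsMap := a.(map[string]any)
	bm, bIsMap := b.(map[string]any)
	if aIsMap && bIsMap {
		for k, av := range am {
			if bv, ok := bm[k]; ok {
				diffJSON(path+"."+k, av, bv, out)
			} else {
				*out = append(*out, "removed "+path+"."+k)
			}
		}
		for k := range bm {
			if _, ok := am[k]; !ok {
				*out = append(*out, "added "+path+"."+k)
			}
		}
		return
	}
	as, aIsArr := a.([]any)
	bs, bIsArr := b.([]any)
	if aIsArr && bIsArr {
		for i := 0; i < len(as) || i < len(bs); i++ {
			p := fmt.Sprintf("%s[%d]", path, i)
			switch {
			case i >= len(as):
				*out = append(*out, "added "+p)
			case i >= len(bs):
				*out = append(*out, "removed "+p)
			default:
				diffJSON(p, as[i], bs[i], out)
			}
		}
		return
	}
	if !reflect.DeepEqual(a, b) {
		*out = append(*out, "modified "+path)
	}
}

// compareJSON decodes both documents and returns the sorted list of changes.
func compareJSON(oldDoc, newDoc []byte) ([]string, error) {
	var a, b any
	if err := json.Unmarshal(oldDoc, &a); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(newDoc, &b); err != nil {
		return nil, err
	}
	var out []string
	diffJSON("$", a, b, &out)
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out, nil
}

func main() {
	oldDoc := []byte(`{"layers": 3, "opt": {"lr": 0.01}, "tags": ["a"]}`)
	newDoc := []byte(`{"layers": 5, "opt": {"lr": 0.01, "decay": 0.1}, "tags": ["a", "b"]}`)
	changes, err := compareJSON(oldDoc, newDoc)
	if err != nil {
		panic(err)
	}
	for _, c := range changes {
		fmt.Println(c)
	}
}
```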

Binary Diffs

  • Byte-level comparison
  • Size change tracking
  • Identity checks for efficiency
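For binary data, the three points above can be sketched as a cheap identity check followed by a byte-level scan. The `BinaryDiff` type and `diffBinary` function are hypothetical. In a content-addressable store, the identity check could instead compare the two hashes without reading the data; this sketch uses `bytes.Equal` to stay self-contained.

```go
package main

import (
	"bytes"
	"fmt"
)

// BinaryDiff summarizes how two blobs differ. Hypothetical type.
type BinaryDiff struct {
	Identical bool
	SizeDelta int // len(new) - len(old)
	FirstDiff int // offset of the first differing byte, -1 if identical
}

// diffBinary compares two blobs byte by byte.
func diffBinary(oldData, newData []byte) BinaryDiff {
	// Identity check first: identical blobs need no further scanning.
	if bytes.Equal(oldData, newData) {
		return BinaryDiff{Identical: true, FirstDiff: -1}
	}
	d := BinaryDiff{SizeDelta: len(newData) - len(oldData)}

	n := len(oldData)
	if len(newData) < n {
		n = len(newData)
	}
	// If the common prefix matches, the blobs differ where the shorter one ends.
	d.FirstDiff = n
	for i := 0; i < n; i++ {
		if oldData[i] != newData[i] {
			d.FirstDiff = i
			break
		}
	}
	return d
}

func main() {
	fmt.Printf("%+v\n", diffBinary([]byte{1, 2, 3}, []byte{1, 2, 3}))
	fmt.Printf("%+v\n", diffBinary([]byte{1, 2, 3}, []byte{1, 9, 3, 4}))
}
```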

Use Cases

Scientific Research

  • Track experimental datasets over time
  • Maintain reproducibility with metadata
  • Compare results across different runs
  • Document data provenance

Machine Learning

  • Version training datasets
  • Track model hyperparameters
  • Compare experiment results
  • Checkpoint model weights with metadata

Data Analysis

  • Track data cleaning steps
  • Version control for preprocessed data
  • Compare analysis outputs
  • Maintain audit trail

API Usage

You can also use DSVCS as a Go library:

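import (
    "log"

    "github.com/BaseMax/go-dataset-vcs/pkg/dataset"
    "github.com/BaseMax/go-dataset-vcs/pkg/repository"
)

// Initialize repository
repo, err := repository.Init(".dsvcs")
if err != nil {
    log.Fatal(err)
}

// Add dataset together with its reproducibility metadata
data := []byte("your,csv,data")
metadata := map[string]string{
    "experiment": "exp001",
    "researcher": "Alice",
}
ds, err := repo.Add("mydata", dataset.FormatCSV, data, metadata)
if err != nil {
    log.Fatal(err)
}

// Commit new version
newData := []byte("updated,csv,data")
version, err := repo.Commit("mydata", "Updated values", "Alice", newData, metadata)
if err != nil {
    log.Fatal(err)
}

// Get diff. oldHash and newHash are the hashes of two existing versions,
// for example as listed by "dsvcs log mydata".
diff, err := repo.Diff("mydata", oldHash, newHash)
if err != nil {
    log.Fatal(err)
}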

Testing

Run the test suite:

go test ./...

Run with coverage:

go test ./... -cover

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.

Acknowledgments

Designed for reproducibility in scientific and ML research workflows.
