A Go-based version control system designed specifically for tracking changes in datasets (CSV, JSON, binary blobs) with content deduplication and metadata-aware diffs. Built for reproducibility in scientific and machine learning research.
- Content-Addressable Storage: SHA-256 based storage with automatic deduplication
- Format-Aware Diffs: Intelligent diffing for CSV, JSON, and binary data
- Metadata Tracking: Full metadata support for reproducibility
- Version History: Complete versioning with parent tracking
- Lightweight: No database required - filesystem-based storage
- SHA-256 checksums for data integrity
- Timestamp tracking for all versions
- Custom metadata fields (experiment ID, researcher, parameters, etc.)
- Parent-child version relationships
- Deterministic content addressing
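The metadata items above can be pictured as one small record per version. The following is a minimal sketch, assuming a hypothetical `Version` struct and a hypothetical `lineage` helper; the field names and JSON layout are illustrative and are not taken from the DSVCS source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Version is a hypothetical metadata record. The field names are
// illustrative and are not taken from the DSVCS source.
type Version struct {
	Hash      string            `json:"hash"`             // SHA-256 of the content
	Parent    string            `json:"parent,omitempty"` // hash of the previous version, empty for the first
	Timestamp time.Time         `json:"timestamp"`
	Metadata  map[string]string `json:"metadata,omitempty"`
}

// hashOf returns the hex-encoded SHA-256 digest of data.
func hashOf(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// lineage follows parent links from head back to the first version and
// returns the hashes visited, newest first. It assumes the links are acyclic.
func lineage(versions map[string]Version, head string) []string {
	var chain []string
	for h := head; h != ""; {
		v, ok := versions[h]
		if !ok {
			break // parent not recorded; stop rather than guess
		}
		chain = append(chain, v.Hash)
		h = v.Parent
	}
	return chain
}

func main() {
	meta := map[string]string{"experiment": "exp001", "researcher": "Alice"}
	t0 := time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC)

	v1 := Version{Hash: hashOf([]byte("a,b\n1,2\n")), Timestamp: t0, Metadata: meta}
	v2 := Version{Hash: hashOf([]byte("a,b\n1,3\n")), Parent: v1.Hash, Timestamp: t0.Add(time.Hour), Metadata: meta}
	versions := map[string]Version{v1.Hash: v1, v2.Hash: v2}

	out, err := json.MarshalIndent(v2, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println("history:", lineage(versions, v2.Hash))
}
```

Because the hash depends only on the content, hashing the same bytes twice always yields the same identifier, which is what makes the addressing deterministic.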
go install github.com/BaseMax/go-dataset-vcs/cmd/dsvcs@latest
Or build from source:
git clone https://github.com/BaseMax/go-dataset-vcs.git
cd go-dataset-vcs
go build -o dsvcs ./cmd/dsvcs
dsvcs init
This creates a .dsvcs directory in your current folder.
dsvcs add experiment1 data/results.csv csv
dsvcs commit experiment1 data/updated_results.csv "Added new samples" "Dr. Smith"
dsvcs log experiment1
dsvcs diff experiment1 <hash1> <hash2>
dsvcs checkout <hash> output.csv
# Initialize repository
dsvcs init
# Add initial training results
dsvcs add model_accuracy training_results_v1.csv csv
# After hyperparameter tuning
dsvcs commit model_accuracy training_results_v2.csv "Tuned learning rate to 0.001" "alice@lab.edu"
# View changes
dsvcs log model_accuracy
# Track model configuration
dsvcs add config model_config.json json
# Update configuration
dsvcs commit config updated_config.json "Increased layers to 5" "bob@research.org"
# Compare configurations
dsvcs diff config <old_hash> <new_hash>
# Track model weights
dsvcs add weights model.pkl binary
# Update after training
dsvcs commit weights model_epoch_10.pkl "Epoch 10 checkpoint" "researcher"
Initialize a new repository (default: .dsvcs)
Add a new dataset to track
- name: Dataset identifier
- file: Path to the data file
- format: csv, json, or binary (auto-detected if omitted)
Commit a new version of a dataset
- name: Dataset identifier
- file: Path to updated data file
- message: Commit message
- author: Author name (optional)
Show differences between two versions
- Displays added/deleted/modified rows
- Format-specific diff output
Retrieve a specific version
- If output is omitted, prints to stdout
Show version history for a dataset
- Displays version ID, hash, author, timestamp, and message
List all tracked datasets
Show repository status with all datasets and their versions
Display version information
.dsvcs/
├── config.json          # Repository configuration
├── objects/             # Content-addressable storage
│   └── ab/              # First 2 chars of hash
│       └── cdef...      # Remaining hash chars
├── refs/                # Dataset references
│   └── dataset_name     # Current state of each dataset
└── versions/            # Version metadata
    └── dataset_name/
        └── version_id.json
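As an illustration of the objects/ layout shown above, the sketch below maps a hex digest to a path by splitting off its first two characters. The `objectPath` helper is hypothetical and is not part of the DSVCS API; the real implementation may build paths differently.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// objectPath maps a hex digest to its location under <root>/objects,
// using the first two characters as a subdirectory name.
// This helper is hypothetical and only mirrors the documented layout.
func objectPath(root, hash string) (string, error) {
	if len(hash) < 3 {
		return "", fmt.Errorf("hash %q is too short to split", hash)
	}
	return filepath.Join(root, "objects", hash[:2], hash[2:]), nil
}

func main() {
	p, err := objectPath(".dsvcs", "abcdef0123456789")
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```

Splitting on a two-character prefix is a common way to keep any single directory from accumulating too many entries.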
DSVCS uses content-addressable storage keyed by SHA-256 hashes. Identical data is stored only once, regardless of how many datasets or versions reference it. This avoids redundant copies when:
- Tracking datasets with small incremental changes
- Managing multiple experiments with similar base data
- Storing checkpoints that share common data
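The deduplication idea can be sketched with an in-memory map keyed by digest. The `Store` type and its methods below are assumptions made for illustration; DSVCS itself stores objects on the filesystem and its internal API may look different.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Store is an in-memory stand-in for an on-disk object store.
// It is a hypothetical type used only to illustrate deduplication.
type Store struct {
	objects map[string][]byte
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{objects: make(map[string][]byte)}
}

// Put stores data under its SHA-256 digest. It reports whether a new
// object was written; identical content is never stored twice.
func (s *Store) Put(data []byte) (hash string, added bool) {
	sum := sha256.Sum256(data)
	hash = hex.EncodeToString(sum[:])
	if _, exists := s.objects[hash]; exists {
		return hash, false
	}
	// Copy so later changes to the caller's slice cannot alter stored content.
	s.objects[hash] = append([]byte(nil), data...)
	return hash, true
}

// Get returns the content stored under hash, if any.
func (s *Store) Get(hash string) ([]byte, bool) {
	data, ok := s.objects[hash]
	return data, ok
}

// Len returns the number of distinct objects held.
func (s *Store) Len() int { return len(s.objects) }

func main() {
	s := NewStore()
	h1, added1 := s.Put([]byte("id,score\n1,0.9\n"))
	h2, added2 := s.Put([]byte("id,score\n1,0.9\n")) // same bytes, different dataset
	fmt.Println(h1 == h2, added1, added2, s.Len())   // prints "true true false 1"
}
```

Note that this sketch deduplicates whole objects only: two files that differ by a single byte hash to different keys and are stored separately.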
CSV diffs:
- Row-by-row comparison
- Tracks additions, deletions, and modifications
- Maintains row position information
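A positional, row-by-row CSV comparison can be sketched as follows. The `RowChange` type and `diffCSV` function are hypothetical names chosen for this example, and the real DSVCS diff may match rows differently.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// RowChange describes one differing row. Row is the zero-based position.
// This type is hypothetical and not taken from the DSVCS source.
type RowChange struct {
	Kind string // "added", "deleted", or "modified"
	Row  int
	Old  []string
	New  []string
}

// parseCSV reads all records, allowing rows of differing width.
func parseCSV(s string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(s))
	r.FieldsPerRecord = -1
	return r.ReadAll()
}

// sameRow reports whether two records hold identical fields.
func sameRow(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// diffCSV compares two CSV documents position by position.
func diffCSV(oldCSV, newCSV string) ([]RowChange, error) {
	oldRows, err := parseCSV(oldCSV)
	if err != nil {
		return nil, fmt.Errorf("old version: %w", err)
	}
	newRows, err := parseCSV(newCSV)
	if err != nil {
		return nil, fmt.Errorf("new version: %w", err)
	}
	n := len(oldRows)
	if len(newRows) > n {
		n = len(newRows)
	}
	var changes []RowChange
	for i := 0; i < n; i++ {
		switch {
		case i >= len(oldRows):
			changes = append(changes, RowChange{Kind: "added", Row: i, New: newRows[i]})
		case i >= len(newRows):
			changes = append(changes, RowChange{Kind: "deleted", Row: i, Old: oldRows[i]})
		case !sameRow(oldRows[i], newRows[i]):
			changes = append(changes, RowChange{Kind: "modified", Row: i, Old: oldRows[i], New: newRows[i]})
		}
	}
	return changes, nil
}

func main() {
	oldCSV := "id,score\n1,0.90\n2,0.85\n"
	newCSV := "id,score\n1,0.92\n2,0.85\n3,0.80\n"
	changes, err := diffCSV(oldCSV, newCSV)
	if err != nil {
		panic(err)
	}
	for _, c := range changes {
		fmt.Printf("%-8s row %d: %v -> %v\n", c.Kind, c.Row, c.Old, c.New)
	}
}
```

A purely positional comparison like this one reports every row after a mid-file insertion as modified; a key-based or longest-common-subsequence diff would avoid that at the cost of more work.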
JSON diffs:
- Recursive object and array comparison
- Detects added, removed, and modified keys
- Handles nested structures
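The recursive comparison of nested JSON can be illustrated with a short sketch built on `encoding/json`. The `diffJSON` and `diffValue` functions and the `$.key[index]` path notation are assumptions made for this example, not the DSVCS output format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// unionKeys returns the sorted set of keys present in either map.
func unionKeys(a, b map[string]interface{}) []string {
	seen := make(map[string]struct{}, len(a)+len(b))
	for k := range a {
		seen[k] = struct{}{}
	}
	for k := range b {
		seen[k] = struct{}{}
	}
	keys := make([]string, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// diffValue walks two decoded JSON values and appends one line per difference.
func diffValue(path string, a, b interface{}, out *[]string) {
	switch av := a.(type) {
	case map[string]interface{}:
		bv, ok := b.(map[string]interface{})
		if !ok {
			*out = append(*out, "modified "+path)
			return
		}
		for _, k := range unionKeys(av, bv) {
			child := path + "." + k
			x, inOld := av[k]
			y, inNew := bv[k]
			switch {
			case !inNew:
				*out = append(*out, "removed "+child)
			case !inOld:
				*out = append(*out, "added "+child)
			default:
				diffValue(child, x, y, out)
			}
		}
	case []interface{}:
		bv, ok := b.([]interface{})
		if !ok {
			*out = append(*out, "modified "+path)
			return
		}
		n := len(av)
		if len(bv) > n {
			n = len(bv)
		}
		for i := 0; i < n; i++ {
			child := fmt.Sprintf("%s[%d]", path, i)
			switch {
			case i >= len(bv):
				*out = append(*out, "removed "+child)
			case i >= len(av):
				*out = append(*out, "added "+child)
			default:
				diffValue(child, av[i], bv[i], out)
			}
		}
	default: // scalars and null
		if !reflect.DeepEqual(a, b) {
			*out = append(*out, "modified "+path)
		}
	}
}

// diffJSON decodes both documents and returns the list of differences.
func diffJSON(oldJSON, newJSON []byte) ([]string, error) {
	var a, b interface{}
	if err := json.Unmarshal(oldJSON, &a); err != nil {
		return nil, fmt.Errorf("old version: %w", err)
	}
	if err := json.Unmarshal(newJSON, &b); err != nil {
		return nil, fmt.Errorf("new version: %w", err)
	}
	var out []string
	diffValue("$", a, b, &out)
	return out, nil
}

func main() {
	oldCfg := []byte(`{"layers":3,"optimizer":{"lr":0.01},"tags":["a"]}`)
	newCfg := []byte(`{"layers":5,"optimizer":{"lr":0.01,"decay":0.1},"tags":["a","b"]}`)
	changes, err := diffJSON(oldCfg, newCfg)
	if err != nil {
		panic(err)
	}
	for _, c := range changes {
		fmt.Println(c)
	}
}
```

Keys are visited in sorted order so the report is stable from run to run, since Go map iteration order is not.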
Binary diffs:
- Byte-level comparison
- Size change tracking
- Identity checks for efficiency
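A byte-level comparison with an identity shortcut and size tracking might look like the sketch below. The `BinaryDiff` struct and `diffBinary` function are hypothetical; in a content-addressed store the identity check could also be done by comparing the two stored hashes without reading the data at all.

```go
package main

import (
	"bytes"
	"fmt"
)

// BinaryDiff summarizes how two blobs differ.
// This type is hypothetical and not taken from the DSVCS source.
type BinaryDiff struct {
	Identical bool
	OldSize   int
	NewSize   int
	SizeDelta int // NewSize - OldSize
	FirstDiff int // offset of the first difference, -1 if identical
	Changed   int // differing bytes within the common length
}

// diffBinary compares two blobs byte by byte.
func diffBinary(oldData, newData []byte) BinaryDiff {
	d := BinaryDiff{
		OldSize:   len(oldData),
		NewSize:   len(newData),
		SizeDelta: len(newData) - len(oldData),
		FirstDiff: -1,
	}
	// Identity check: skip the per-byte walk when nothing changed.
	if bytes.Equal(oldData, newData) {
		d.Identical = true
		return d
	}
	common := len(oldData)
	if len(newData) < common {
		common = len(newData)
	}
	for i := 0; i < common; i++ {
		if oldData[i] != newData[i] {
			if d.FirstDiff < 0 {
				d.FirstDiff = i
			}
			d.Changed++
		}
	}
	if d.FirstDiff < 0 {
		// One blob is a prefix of the other; they diverge where the shorter ends.
		d.FirstDiff = common
	}
	return d
}

func main() {
	oldBlob := []byte{0x01, 0x02, 0x03, 0x04}
	newBlob := []byte{0x01, 0x09, 0x03, 0x04, 0x05}
	fmt.Printf("%+v\n", diffBinary(oldBlob, newBlob))
}
```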
Use cases:
- Track experimental datasets over time
- Maintain reproducibility with metadata
- Compare results across different runs
- Document data provenance
- Version training datasets
- Track model hyperparameters
- Compare experiment results
- Checkpoint model weights with metadata
- Track data cleaning steps
- Version control for preprocessed data
- Compare analysis outputs
- Maintain audit trail
You can also use DSVCS as a Go library:
import (
	"github.com/BaseMax/go-dataset-vcs/pkg/repository"
	"github.com/BaseMax/go-dataset-vcs/pkg/dataset"
)

// Initialize repository (check err after every call; omitted here for brevity)
repo, err := repository.Init(".dsvcs")

// Add dataset
data := []byte("your,csv,data")
metadata := map[string]string{
	"experiment": "exp001",
	"researcher": "Alice",
}
ds, err := repo.Add("mydata", dataset.FormatCSV, data, metadata)

// Commit new version
newData := []byte("updated,csv,data")
version, err := repo.Commit("mydata", "Updated values", "Alice", newData, metadata)

// Get diff (oldHash and newHash are the hashes of the two versions to compare)
diff, err := repo.Diff("mydata", oldHash, newHash)
Run the test suite:
go test ./...
Run with coverage:
go test ./... -cover
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
Designed for reproducibility in scientific and ML research workflows.