This directory contains examples demonstrating the integration of Model Cards with Datasheets for Datasets.
Complete model card using Pattern 1: External References
This example shows the recommended approach for Phase 1 implementation:
- Uses the current Model Cards schema (
modelcards.yaml) - References external datasheets instead of importing the datasheets schema
- Avoids naming conflicts while providing comprehensive documentation
Key Features:
- ✅ Backward compatible with existing model cards schema
- ✅ No schema conflicts (Task, language, etc.)
- ✅ Clean separation: models documented in model cards, datasets in datasheets
- ✅ Single source of truth: datasets documented once, referenced many times
Integration Approach:
# In model card: reference external datasheet
dataset_documentation:
training_datasets:
- id: "imdb-sentiment-v1"
datasheet_url: "https://datasheets.example.org/imdb-sentiment-v1.yaml"
datasheet_format: "datasheets-for-datasets-v1.0"Complete dataset documentation using Datasheets for Datasets schema
This example demonstrates comprehensive dataset documentation following the Datasheets for Datasets framework.
Sections Documented:
- Motivation: Purpose, tasks, creators, funding
- Composition: 50,000 instances, balanced positive/negative, subsets
- Collection: Web scraping methodology, sampling strategy, timeframes
- Ethics: Public data, no PII, ethical review notes
- Preprocessing: HTML cleaning, label derivation, normalization
- Uses: Existing uses (1000+ citations), discouraged uses, impact analysis
- Distribution: Format, dates, licensing
- Maintenance: Maintainers, update policy, version access
- Variables: review_text, sentiment, review_id descriptions
Key Features:
- ✅ Comprehensive: 60+ fields vs model cards' simple 7-field
dataSetclass - ✅ Standardized: Follows established "Datasheets for Datasets" framework
- ✅ Reusable: Multiple models can reference this single datasheet
- ✅ Ethics-focused: Detailed privacy, consent, and ethical documentation
Pros:
- No schema conflicts
- Works with current tooling
- Clean separation of concerns
- Datasets documented once, referenced many times
Cons:
- Requires maintaining separate files
- Less convenient for quick prototyping
Use When:
- Production deployment
- Multiple models using same datasets
- Need for comprehensive dataset documentation
Pros:
- Single file
- Quick and simple
- Backward compatible
Cons:
- Limited dataset documentation (only 7 fields)
- Duplication across model cards
- No standardized format
Use When:
- Quick prototyping
- Simple use cases
- Datasets are not reused
Pros:
- Full schema integration
- Type checking and validation
- Rich IDE support
Cons:
- Requires resolving naming conflicts
- More complex tooling
- Not yet fully implemented
Use When:
- After naming conflicts resolved (Phase 2+)
- Need for strict validation
- Building advanced tooling
# View the model card
cat sentiment-classifier-with-datasheet-refs.yaml
# Key sections to review:
# - model_details: Basic model information
# - model_parameters.data: Minimal dataset info with datasheet reference
# - dataset_documentation: External datasheet references
# - quantitative_analysis: Performance metrics
# - considerations: Ethics, limitations, use cases# View the datasheet
cat imdb-sentiment-datasheet-v1.yaml
# Key sections to review:
# - motivation: Why the dataset was created
# - composition: What's in the dataset (50K reviews, balanced)
# - collection: How data was collected (web scraping)
# - ethics: Ethical considerations (public data, no PII)
# - uses: Known uses, discouraged uses, impact analysis-
Create Dataset Datasheet (once per dataset):
# Copy template cp imdb-sentiment-datasheet-v1.yaml my-dataset-datasheet.yaml # Edit to document your dataset # Fill in all sections: motivation, composition, collection, etc.
-
Reference in Model Card (for each model):
dataset_documentation: training_datasets: - id: "my-dataset-v1" datasheet_url: "https://datasheets.example.org/my-dataset-v1.yaml" datasheet_format: "datasheets-for-datasets-v1.0"
-
Publish Both Files:
- Model card → Model repository/registry
- Datasheet → Dataset repository/registry
These examples follow the current Model Cards schema and Datasheets schema respectively:
# Validate model card (uses modelcards.yaml schema)
linkml-validate -s src/linkml/modelcards.yaml \
src/data/examples/harmonized/sentiment-classifier-with-datasheet-refs.yaml
# Validate datasheet (would use datasheets schema - not shown here)
linkml-validate -s /path/to/data-sheets-schema/schema/data_sheets_schema_all.yaml \
src/data/examples/harmonized/imdb-sentiment-datasheet-v1.yamlBefore (current simple approach):
model_parameters:
data:
- name: "IMDb"
link: "https://example.com"
description: "Movie reviews"After (with datasheet reference):
model_parameters:
data:
- name: "IMDb Movie Reviews"
link: "https://ai.stanford.edu/~amaas/data/sentiment/"
description: "50,000 highly polar movie reviews"
dataset_documentation:
training_datasets:
- id: "imdb-sentiment-v1"
datasheet_url: "https://datasheets.example.org/imdb-sentiment-v1.yaml"Then: Create the corresponding datasheet file with comprehensive documentation.
- Comprehensive Documentation: 60+ fields vs 7 fields
- Reusability: Document dataset once, reference in many model cards
- Standards Compliance: Follows established Datasheets framework
- No Breaking Changes: Works with existing model cards schema
- Governance: Clear audit trail for datasets
- Compliance: Better support for GDPR, CCPA, ethics requirements
- Efficiency: Reduce documentation duplication
- Transparency: Comprehensive dataset documentation for stakeholders
- Reproducibility: Detailed methodology documentation
- Attribution: Proper creator attribution with ORCID
- Impact Analysis: Documented uses, limitations, ethical considerations
- Interoperability: Standard format for dataset documentation
See INTEGRATION_GUIDE.md in the repository root for:
- Detailed integration patterns
- Naming conflict resolution strategy
- Phase-by-phase implementation roadmap
- Migration tools and utilities
- Integration Guide:
/INTEGRATION_GUIDE.md - Alignment Analysis:
/ALIGNMENT_ANALYSIS.md - Model Cards Schema:
/src/linkml/modelcards.yaml - Harmonized Schema (conceptual):
/src/linkml/modelcards_harmonized.yaml - Datasheets Schema: https://github.com/bridge2ai/data-sheets-schema
See CLAUDE.md in the repository root for comprehensive guidance on working with this repository.