
VOC Data – IIIF Manifests with Transcriptions

This repository contains a Jupyter notebook for processing IIIF manifests from the NL-HaNA VOC Archives, augmented with page-level transcription text. The notebook downloads transcription TXT files and corresponding IIIF JSON manifests, merges the text into each manifest's canvases, and saves per-document outputs as well as a combined manifest.

Contents

  • VOC_Data_Downloader.ipynb: Batch processor that downloads, merges, and exports manifests.
  • upload_to_huggingface.py: Script to upload generated datasets to Hugging Face.
  • globalise_transcriptions_v2_txt.tab: Source file with transcription URLs.

Accessing the Dataset

The processed dataset (19 GB of IIIF manifests with transcriptions) is hosted on Hugging Face Datasets for free, fast access:

🤗 Hugging Face: huggingface.co/datasets/adoistic/voc-data

Download the dataset:

```python
import json

from huggingface_hub import hf_hub_download

# Download the combined manifest (7 GB)
file = hf_hub_download(
    repo_id="adoistic/voc-data",
    filename="combined_all_manifests.json",
    repo_type="dataset"
)

# Or download an individual manifest file
file = hf_hub_download(
    repo_id="adoistic/voc-data",
    filename="merged_1111.json",
    repo_type="dataset"
)

# Load and use the data
with open(file, 'r') as f:
    manifest = json.load(f)
```

List all available files:

```python
from huggingface_hub import list_repo_files

files = list_repo_files("adoistic/voc-data", repo_type="dataset")
print(f"Available files: {len(files)}")
```
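Individual inventories follow the merged_{id}.json naming scheme, so their ids can be picked out of the file list by name. A minimal sketch, using an illustrative file list in place of a live list_repo_files() call:

```python
import re

# Illustrative file list, standing in for the output of list_repo_files()
files = [
    ".gitattributes",
    "combined_all_manifests.json",
    "merged_1111.json",
    "merged_1112.json",
]

# Individual inventories follow the merged_{id}.json naming scheme
pattern = re.compile(r"merged_(\d+)\.json$")
inventory_ids = sorted(m.group(1) for f in files if (m := pattern.match(f)))
print(inventory_ids)  # → ['1111', '1112']
```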

Data source and license

  • Manifests: https://data.globalise.huygens.knaw.nl/manifests/inventories/{id}.json
  • Images: IIIF Image API services referenced within each canvas
  • Rights: https://creativecommons.org/publicdomain/mark/1.0/ (as present in manifests)
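The manifest URL template above can be filled in directly; a small sketch (the inventory id is chosen for illustration):

```python
MANIFEST_TEMPLATE = "https://data.globalise.huygens.knaw.nl/manifests/inventories/{id}.json"

def manifest_url(inventory_id: str) -> str:
    """Build the IIIF manifest URL for a given inventory id."""
    return MANIFEST_TEMPLATE.format(id=inventory_id)

print(manifest_url("1111"))
# → https://data.globalise.huygens.knaw.nl/manifests/inventories/1111.json
```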

Notebook overview

The notebook performs these steps:

  1. Read a .tab file listing transcription TXT URLs (one per line).
  2. For each URL:
    • Download the TXT file; infer the document identifier (e.g., 1111) from the filename.
    • Construct the corresponding IIIF manifest URL via a template.
    • Parse the TXT into sections keyed by canvas label IDs (e.g., NL-HaNA_1.04.02_1111_0001).
    • Merge the matched text into the manifest by setting a text field on each canvas.
    • Save as merged_{id}.json under merged_manifests/.
  3. Create combined_all_manifests.json that concatenates all canvases from the successfully processed manifests.
  4. Zip the merged_manifests/ folder for convenience.
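The parse-and-merge core of step 2 can be sketched as below. The exact TXT layout is an assumption here: sections are assumed to begin with a line carrying the canvas id (e.g. NL-HaNA_1.04.02_1111_0001), and parse_sections/merge_text are illustrative names, not the notebook's actual functions:

```python
import re

def parse_sections(txt: str) -> dict:
    """Split transcription text into {canvas_id: text} sections.

    Assumes each section starts with a line containing an id like
    NL-HaNA_1.04.02_1111_0001 (an assumption about the TXT layout).
    """
    sections, current_id, lines = {}, None, []
    header = re.compile(r"(NL-HaNA_[\d.]+_\d+_\d+)")
    for line in txt.splitlines():
        m = header.search(line)
        if m:
            if current_id:
                sections[current_id] = "\n".join(lines).strip()
            current_id, lines = m.group(1), []
        else:
            lines.append(line)
    if current_id:
        sections[current_id] = "\n".join(lines).strip()
    return sections

def merge_text(manifest: dict, sections: dict) -> dict:
    """Attach matched transcription text to each canvas (empty if no match)."""
    for canvas in manifest.get("items", []):
        label = canvas.get("label", {}).get("en", [""])[0]
        canvas["text"] = sections.get(label, "")
    return manifest
```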

Output schema (IIIF Presentation 3 with additions)

Each merged_*.json is an IIIF Manifest with a non-standard addition: a text field on each canvas. Key fields:

  • Manifest
    • @context: http://iiif.io/api/presentation/3/context.json
    • id, type: Manifest
    • label, metadata, rights
    • items: Array of canvases
  • Canvas (per page)
    • id, type: Canvas
    • label.en[0]: Canvas identifier, e.g., NL-HaNA_1.04.02_1111_0001
    • height, width
    • items[0].items[0].body: IIIF Image resource
    • text: String of merged transcription for this canvas (may be empty if no match)

Example (truncated):

```json
{
  "type": "Manifest",
  "label": {"en": ["Inventory 1111"]},
  "items": [
    {
      "type": "Canvas",
      "label": {"en": ["NL-HaNA_1.04.02_1111_0001"]},
      "items": [
        {
          "type": "AnnotationPage",
          "items": [
            {"type": "Annotation", "motivation": "painting", "body": {"type": "Image", "format": "image/jpeg"}}
          ]
        }
      ],
      "text": "... merged transcription text ..."
    }
  ]
}
```

combined_all_manifests.json is also a Manifest whose items array is the concatenation of the items from all successful merged_*.json files. Its label and metadata summarize the collection.
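Given this schema, the per-canvas text can be pulled out with plain dictionary access. A small sketch on an inline example manifest (the helper name canvas_texts is illustrative):

```python
def canvas_texts(manifest: dict) -> dict:
    """Map each canvas label to its merged transcription text."""
    return {
        canvas.get("label", {}).get("en", [""])[0]: canvas.get("text", "")
        for canvas in manifest.get("items", [])
    }

# Inline example standing in for a loaded merged_*.json
manifest = {
    "type": "Manifest",
    "items": [
        {"label": {"en": ["NL-HaNA_1.04.02_1111_0001"]}, "text": "folio text"},
        {"label": {"en": ["NL-HaNA_1.04.02_1111_0002"]}},  # no match: no text
    ],
}
texts = canvas_texts(manifest)
print(texts["NL-HaNA_1.04.02_1111_0001"])  # → folio text
```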

How to Generate the Dataset Yourself

If you want to generate the dataset from scratch:

  1. Install dependencies:

```shell
pip install requests huggingface_hub
```

  2. Run the notebook: Open VOC_Data_Downloader.ipynb in Jupyter and run all cells, or call main(...):

```python
main('globalise_transcriptions_v2_txt.tab')
```

Default concurrency is 15 workers; adjust via max_workers if needed:

```python
main('globalise_transcriptions_v2_txt.tab', max_workers=20)
```

Outputs will be written to:

  • merged_manifests/ (individual manifests and combined_all_manifests.json)
  • merged_manifests.zip
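The worker pool behind max_workers presumably resembles a concurrent.futures setup; a hedged stdlib sketch, where process_url is a stand-in for the notebook's per-URL download-and-merge work:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_url(url: str) -> str:
    """Stand-in for the notebook's download-and-merge work for one URL."""
    return f"processed {url}"

def run_all(urls, max_workers=15):
    """Process all URLs concurrently, collecting results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_url, u): u for u in urls}
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

results = run_all(["https://example.org/a.txt", "https://example.org/b.txt"])
print(len(results))  # → 2
```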
  3. Upload to Hugging Face (optional):

To share your generated dataset:

a. Create a free account at huggingface.co
b. Get your API token from Settings > Access Tokens
c. Run the upload script:

```shell
python3 upload_to_huggingface.py \
    --token YOUR_HF_TOKEN \
    --repo-id your-username/voc-data
```

This uploads all files in merged_manifests/ to your Hugging Face dataset repository.

Environment requirements

The notebook depends on standard Python 3 libraries plus requests. Minimal setup:

```shell
pip install requests
```

Recommended: Run inside a virtual environment. The notebook uses network access to fetch TXT and JSON files.

Notes and caveats

  • Canvas text is added locally and is not part of the IIIF Presentation 3 standard; downstream tools should treat it as a custom extension.
  • TXT-to-canvas matching relies on labels like NL-HaNA_1.04.02_{id}_{page}; if labels or TXT formatting differ, some canvases may have empty text.
  • combined_all_manifests.json may be large; use streaming parsers for programmatic consumption.

Citation

If you use this data, please credit the NL-HaNA VOC Archives and the GLOBALISE project data services. Include a link to the manifest endpoints and the public domain mark indicated in the manifests.
