Context
Upstream linkml maintainers (cmungall, sneakers-the-rat) closed linkml/linkml#3295 with clear guidance: the WL-based serialization logic should live in a separate RDF library, not inside linkml. Meanwhile, linkml/linkml#1943 shows upstream converging on pyoxigraph RDFC-1.0 for canonicalization — but not yet addressing diff stability (small schema change → small output diff).
Our fork's deterministic_turtle() + _wl_signatures() solve both problems via a hybrid pipeline. Extracting that pipeline into a standalone package:
- Directly addresses cmungall's request: "The implementation should live elsewhere, where others can take advantage of it"
- Positions the diff-stability layer as the answer when upstream discovers RDFC-1.0's cascading-renumbering problem
- Makes our fork's --deterministic flag a thin wrapper around an import
What the package should contain
Core API
from rdflib_stable_turtle import deterministic_turtle, wl_signatures
# Full pipeline: RDFC-1.0 → WL hashing → idiomatic rdflib Turtle
ttl: str = deterministic_turtle(graph)
# Lower-level: just the WL signatures (for custom pipelines)
signatures: dict[str, str] = wl_signatures(quads)
Implementation (extract from generator.py)
- _wl_signatures() → public wl_signatures()
- deterministic_turtle() → public, same name
- The _to_rdflib() helper and prefix-filtering logic
What stays in linkml
- deterministic_json(): JSON-specific, no RDF dependency
- Collection sorting (owl:oneOf, sh:in, sh:ignoredProperties): generator-specific fixes
- --deterministic CLI flag: imports from the package
- _deterministic_context_json(): JSON-LD context-specific ordering
Design considerations
1. RDFC-1.0 round-trip optimization
Current pipeline does 3 serialize/parse cycles (rdflib→NT→pyoxigraph→WL→rdflib→Turtle). Consider:
- Computing WL directly on rdflib Graph for the common no-collision case
- Only invoking pyoxigraph RDFC-1.0 for collision tiebreaking (automorphic nodes)
- This would eliminate the N-Triples→pyoxigraph→rdflib round-trip overhead
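A minimal sketch of the "WL first, RDFC-1.0 only on collision" idea, for discussion only. It is not the code in generator.py: triples are modelled as plain string tuples with `_:`-prefixed blank nodes, and the names `wl_signatures_sketch` and `colliding_groups`, as well as the fixed round count, are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object) as plain strings


def _is_bnode(term: str) -> bool:
    return term.startswith("_:")


def _digest(text: str) -> str:
    # 48-bit truncation: the first 12 hex characters of SHA-256.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def wl_signatures_sketch(triples: list[Triple], rounds: int = 4) -> dict[str, str]:
    """Weisfeiler-Lehman style refinement over blank nodes only (hypothetical helper)."""
    bnodes = {t for s, _, o in triples for t in (s, o) if _is_bnode(t)}
    sig = {b: _digest("bnode") for b in bnodes}  # every blank node starts identical
    for _ in range(rounds):
        edges: dict[str, list[str]] = defaultdict(list)
        for s, p, o in triples:
            # IRIs and literals contribute their own text; blank nodes
            # contribute their signature from the previous round.
            if _is_bnode(s):
                edges[s].append(f"out|{p}|{sig.get(o, o)}")
            if _is_bnode(o):
                edges[o].append(f"in|{p}|{sig.get(s, s)}")
        sig = {b: _digest(sig[b] + "\n" + "\n".join(sorted(edges[b]))) for b in bnodes}
    return sig


def colliding_groups(sig: dict[str, str]) -> list[list[str]]:
    """Blank nodes that share a signature; only these would need an RDFC-1.0 tiebreak."""
    groups: dict[str, list[str]] = defaultdict(list)
    for bnode, value in sig.items():
        groups[value].append(bnode)
    return [sorted(members) for members in groups.values() if len(members) > 1]
```

If `colliding_groups()` returns an empty list, the signatures alone identify every blank node and the pyoxigraph round-trip could be skipped for that graph.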
2. pyoxigraph as optional dependency
- pyoxigraph >= 0.4.0 (for Dataset.canonicalize())
- Import lazily — raise ImportError with install instructions
- Test suite should skip gracefully when absent
- Consider: pip install rdflib-stable-turtle[fast] for pyoxigraph
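One possible shape for the lazy import, as a sketch. The helper name `require_optional` is hypothetical, and the `[fast]` extra is only the proposal from the list above, not an existing published extra.

```python
import importlib
from types import ModuleType


def require_optional(
    module: str, extra: str, package: str = "rdflib-stable-turtle"
) -> ModuleType:
    """Import an optional dependency on first use (hypothetical helper).

    Raises ImportError with install instructions if the module is missing.
    """
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f"{module} is required for this feature; install it with "
            f"`pip install {package}[{extra}]`"
        ) from exc
```

The canonicalization path would call `require_optional("pyoxigraph", "fast")` only when it is actually needed. On the test side, `pytest.importorskip("pyoxigraph")` is one way to skip the affected tests when the dependency is absent.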
3. Diff stability properties to test and document
- Adding one blank node must NOT renumber existing nodes
- Removing a blank node must NOT renumber unrelated nodes
- Modifying a literal must only affect the containing subject block
- Benchmark: single-description-change diff ≤ 20 lines
- Benchmark: signal-to-noise ratio ≥ 5x vs non-deterministic (currently 13-344x)
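The two benchmarks could be measured roughly as below. This is a sketch using only `difflib`; the function names are hypothetical, and it assumes the signal-to-noise ratio means "changed lines in the non-deterministic output divided by changed lines in the deterministic output" for the same schema change, which the list above does not define precisely.

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count lines removed from `before` plus lines added in `after`."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def noise_ratio(unstable_changed: int, stable_changed: int) -> float:
    """Assumed metric: how many times larger the unstable diff is than the stable one."""
    if stable_changed == 0:
        raise ValueError("stable diff is empty; ratio is undefined")
    return unstable_changed / stable_changed
```

A test for the "≤ 20 lines" property would serialize a schema before and after a single description change and assert `changed_line_count(before, after) <= 20`.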
4. Collection sorting is NOT part of this package
Collection sorting changes graph structure (different rdf:first/rdf:rest triples). This is a generator-level concern, not a serialization concern. The package should serialize faithfully.
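To illustrate why this is a graph change rather than a formatting change, here is a small sketch that spells out the rdf:first/rdf:rest triples of an RDF collection as plain tuples. The helper `rdf_list_triples` and the `_:l` blank-node labels are made up for the example.

```python
def rdf_list_triples(items: list[str], prefix: str = "_:l") -> set[tuple[str, str, str]]:
    """Expand an ordered collection into its rdf:first / rdf:rest triples."""
    triples = set()
    for i, item in enumerate(items):
        node = f"{prefix}{i}"
        rest = f"{prefix}{i + 1}" if i + 1 < len(items) else "rdf:nil"
        triples.add((node, "rdf:first", item))
        triples.add((node, "rdf:rest", rest))
    return triples


as_generated = rdf_list_triples(["ex:b", "ex:a"])
as_sorted = rdf_list_triples(sorted(["ex:b", "ex:a"]))

# The head cell points at a different member, so the triple sets differ.
assert as_generated != as_sorted
```

A serializer that reordered the members would therefore emit a different graph than the one it was given.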
5. Collision handling
- 48-bit truncated SHA-256: ~0.002% collision probability at 100K blank nodes
- Counter-based tiebreaker via sorted(bnode_ids) from RDFC-1.0 canonical ordering
- WL is provably complete for tree-structured BNodes (LinkML's case)
- Known failures (Cai-Fürer-Immerman symmetric graphs) don't occur in LinkML output
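The collision figure and the tiebreaker can be sketched as follows. The birthday approximation is standard; the label format (`<signature>_<counter>`) and the function names are assumptions for illustration, and `canonical_ids` stands in for whatever blank-node labels RDFC-1.0 canonicalization produced.

```python
import math
from collections import defaultdict


def collision_probability(n: int, bits: int = 48) -> float:
    """Birthday approximation: p ≈ 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


def assign_stable_labels(
    signatures: dict[str, str], canonical_ids: dict[str, str]
) -> dict[str, str]:
    """Use the WL signature as the label; break ties by canonical order (hypothetical)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for bnode, sig in signatures.items():
        groups[sig].append(bnode)
    labels: dict[str, str] = {}
    for sig, members in groups.items():
        if len(members) == 1:
            labels[members[0]] = sig
            continue
        # Colliding nodes get a counter, ordered by their RDFC-1.0 canonical ID.
        ordered = sorted(members, key=lambda b: canonical_ids[b])
        for counter, bnode in enumerate(ordered):
            labels[bnode] = f"{sig}_{counter}"
    return labels
```

`collision_probability(100_000)` evaluates to roughly 1.8e-5, which matches the ~0.002% figure quoted above.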
6. Package metadata
- Name: rdflib-stable-turtle (follows rdflib-* convention)
- License: Apache-2.0
- Deps: rdflib >= 7.0.0, optional pyoxigraph >= 0.4.0
- Python: >= 3.10
7. Upstream strategy
References