feat: extract deterministic_turtle into standalone PyPI package (rdflib-stable-turtle) #6

@jdsika

Context

Upstream linkml maintainers (cmungall, sneakers-the-rat) closed linkml/linkml#3295 with clear guidance: the Weisfeiler-Lehman (WL) based serialization logic should live in a separate RDF library, not inside linkml. Meanwhile, linkml/linkml#1943 shows upstream converging on pyoxigraph RDFC-1.0 for canonicalization, which does not yet address diff stability (a small schema change should produce a small output diff).

Our fork's deterministic_turtle() + _wl_signatures() addresses both canonicalization and diff stability via a hybrid pipeline (RDFC-1.0 followed by WL hashing). Extracting it into a standalone package:

  1. Directly addresses cmungall's request: "The implementation should live elsewhere, where others can take advantage of it"
  2. Positions the diff-stability layer as the answer when upstream discovers RDFC-1.0's cascading-renumbering problem
  3. Makes our fork's --deterministic flag a thin wrapper around an import

What the package should contain

Core API

from rdflib_stable_turtle import deterministic_turtle, wl_signatures

# Full pipeline: RDFC-1.0 → WL hashing → idiomatic rdflib Turtle
ttl: str = deterministic_turtle(graph)

# Lower-level: just the WL signatures (for custom pipelines)
signatures: dict[str, str] = wl_signatures(quads)

Implementation (extract from generator.py)

  • _wl_signatures() → public wl_signatures()
  • deterministic_turtle() → public, same name
  • The _to_rdflib() helper and prefix-filtering logic

What stays in linkml

  • deterministic_json() — JSON-specific, no RDF dependency
  • Collection sorting (owl:oneOf, sh:in, sh:ignoredProperties) — generator-specific fixes
  • --deterministic CLI flag — imports from the package
  • _deterministic_context_json() — JSON-LD context-specific ordering

Design considerations

1. RDFC-1.0 round-trip optimization

The current pipeline does 3 serialize/parse cycles (rdflib→NT→pyoxigraph→WL→rdflib→Turtle). Consider:

  • Computing WL directly on rdflib Graph for the common no-collision case
  • Only invoking pyoxigraph RDFC-1.0 for collision tiebreaking (automorphic nodes)
  • This would eliminate the N-Triples→pyoxigraph→rdflib round-trip overhead
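To make the "WL directly on the graph" idea concrete, here is a minimal sketch of WL-style signature refinement for blank nodes. It is not the fork's _wl_signatures(): the function name wl_signatures_sketch, the plain (subject, predicate, object) string tuples, the "_:" prefix test for blank nodes, and the fixed round count are all assumptions made for illustration. A real version would iterate over an rdflib Graph instead.

```python
import hashlib
from collections import defaultdict

Triple = tuple[str, str, str]


def is_bnode(term: str) -> bool:
    # Assumption: blank nodes are written with the N-Triples "_:" prefix.
    return term.startswith("_:")


def wl_signatures_sketch(triples: list[Triple], rounds: int = 4) -> dict[str, str]:
    """Return a 48-bit (12 hex digit) signature for each blank node label."""
    bnodes = sorted({t for s, _, o in triples for t in (s, o) if is_bnode(t)})
    # Round 0: all blank nodes share one colour, so labels never leak in.
    colour = {b: "" for b in bnodes}
    for _ in range(rounds):
        neighbourhood = defaultdict(list)
        for s, p, o in triples:
            if is_bnode(s):
                # Outgoing edge: predicate plus the object's colour or IRI/literal.
                neighbourhood[s].append(("out", p, colour[o] if is_bnode(o) else o))
            if is_bnode(o):
                # Incoming edge: predicate plus the subject's colour or IRI.
                neighbourhood[o].append(("in", p, colour[s] if is_bnode(s) else s))
        # Refine: hash the previous colour together with the sorted neighbourhood.
        colour = {
            b: hashlib.sha256(
                repr((colour[b], sorted(neighbourhood[b]))).encode("utf-8")
            ).hexdigest()[:12]
            for b in bnodes
        }
    return colour
```

Because a signature depends only on a node's neighbourhood, renaming blank nodes or adding an unrelated one leaves existing signatures unchanged. Structurally identical (automorphic) nodes receive the same signature, which is the case where a canonical ordering would still be needed as a tiebreaker.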

2. pyoxigraph as optional dependency

  • pyoxigraph >= 0.4.0 (for Dataset.canonicalize())
  • Import lazily; when it is missing, raise ImportError with install instructions
  • Test suite should skip gracefully when absent
  • Consider: pip install rdflib-stable-turtle[fast] for pyoxigraph
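One possible shape for the lazy import is sketched below. The helper name require_optional and the exact wording of the message are hypothetical; the extra name "fast" is taken from the suggestion above.

```python
import importlib
from types import ModuleType


def require_optional(
    module_name: str,
    extra: str,
    package: str = "rdflib-stable-turtle",
) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with install instructions, keeping the original cause.
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc
```

Code paths that need canonicalization would call require_optional("pyoxigraph", "fast") at the point of use rather than at module import time. On the test side, pytest.importorskip("pyoxigraph") is one way to skip the affected tests when the dependency is absent.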

3. Diff stability properties to test and document

  • Adding one blank node must NOT renumber existing nodes
  • Removing a blank node must NOT renumber unrelated nodes
  • Modifying a literal must only affect the containing subject block
  • Benchmark: single-description-change diff ≤ 20 lines
  • Benchmark: signal-to-noise ratio ≥ 5x vs non-deterministic (currently 13-344x)
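The line-count benchmarks above need a consistent way to measure diff size. The sketch below counts removed plus added lines between two serializations using the standard library's difflib; the function name changed_line_count and the sample Turtle snippets are illustrative assumptions, not output from the actual generator.

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count lines removed from `before` plus lines added in `after`."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A "replace" contributes both its removed and its added lines.
            changed += (i2 - i1) + (j2 - j1)
    return changed


# Hypothetical serializations differing only in one description literal.
BEFORE = """ex:Person a sh:NodeShape ;
    sh:description "A person" ;
    sh:property [ sh:path ex:name ] ."""
AFTER = BEFORE.replace('"A person"', '"A human being"')

assert changed_line_count(BEFORE, AFTER) <= 20
```

A signal-to-noise figure could then be computed as the changed-line count of the non-deterministic output divided by that of the deterministic output for the same schema change.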

4. Collection sorting is NOT part of this package

Collection sorting changes graph structure (different rdf:first/rdf:rest triples). This is a generator-level concern, not a serialization concern. The package should serialize faithfully.

5. Collision handling

  • 48-bit truncated SHA-256: ~0.002% probability of any collision among 100K blank nodes (birthday bound)
  • Counter-based tiebreaker via sorted(bnode_ids) from RDFC-1.0 canonical ordering
  • WL is provably complete for tree-structured BNodes (LinkML's case)
  • Known failures (Cai-Fürer-Immerman symmetric graphs) don't occur in LinkML output
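The collision estimate and the tiebreaker can be checked with a short sketch. The birthday approximation is standard; assign_labels is a hypothetical illustration of a counter-based tiebreaker, and its label format ("b" plus signature, with a numeric suffix for ties) and the canonical_ids mapping are assumptions rather than the fork's actual scheme.

```python
import math
from collections import defaultdict


def birthday_collision_probability(n_items: int, bits: int) -> float:
    """Approximate probability that any two of n_items uniform hashes collide."""
    space = 2 ** bits
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


def assign_labels(
    signatures: dict[str, str],
    canonical_ids: dict[str, str],
) -> dict[str, str]:
    """Label nodes by signature; break ties with a counter in canonical order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for node, sig in signatures.items():
        groups[sig].append(node)
    labels: dict[str, str] = {}
    for sig, nodes in groups.items():
        if len(nodes) == 1:
            labels[nodes[0]] = f"b{sig}"
            continue
        # Colliding or automorphic nodes: order by their canonical identifier.
        ordered = sorted(nodes, key=canonical_ids.__getitem__)
        for counter, node in enumerate(ordered):
            labels[node] = f"b{sig}_{counter}"
    return labels


p = birthday_collision_probability(100_000, 48)
print(f"{p:.2e}")  # about 1.8e-05, i.e. roughly 0.002%
```

Only nodes that share a signature depend on the canonical ordering, so uniquely signed nodes keep their labels when unrelated parts of the graph change.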

6. Package metadata

  • Name: rdflib-stable-turtle (follows rdflib-* convention)
  • License: Apache-2.0
  • Deps: rdflib >= 7.0.0, optional pyoxigraph >= 0.4.0
  • Python: >= 3.10

7. Upstream strategy

References
