feat(generators): add --deterministic flag with hybrid RDFC-1.0 + rdflib serialization #1
🔍 Adversarial Review: PR #1

Summary

A well-engineered feature with strong documentation and benchmark data. The three-phase pipeline (RDFC-1.0 → WL → rdflib) is architecturally sound but introduces significant complexity. I found 2 bugs (dead code shipped as functional features), 1 algorithmic concern in collision handling, and several design/test gaps worth addressing before merge.

🐛 Bugs & Issues

1. Dead code: The
    # Added to Generator dataclass but never checked anywhere:
    normalize_prefixes: bool = False

2. Dead code: The function is defined in

3. WL collision counter assignment depends on c14n ordering, not structure

In
    for bid in sorted(bnode_ids):  # sorted by c14n ID, NOT by structure
        digest = hashlib.sha256(sig[bid].encode("utf-8")).hexdigest()[:12]
        count = seen_hashes.get(digest, 0)
        seen_hashes[digest] = count + 1
        label = f"b{digest}" if count == 0 else f"b{digest}_{count}"

Adding an unrelated triple can change RDFC-1.0 numbering, which changes which colliding node gets the base label vs
    for bid in sorted(bnode_ids, key=lambda b: (sig[b], b)):
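
The ordering concern in item 3 can be reproduced with a small standalone sketch. It is not the PR's code: `assign_labels`, `labels_by_signature`, the `structural` switch, and the `digest_len` parameter are hypothetical names introduced for illustration, and `digest_len=0` is an artificial way to force every node into the same hash bucket so that a collision always occurs.

```python
import hashlib


def assign_labels(sig, structural=False, digest_len=12):
    """Map blank-node IDs to hash labels; collisions get _1, _2, ... suffixes.

    sig maps a canonical blank-node ID to its structural signature string.
    structural=False iterates in ID order (the behaviour under review);
    structural=True iterates in (signature, ID) order (the suggested fix).
    """
    key = (lambda b: (sig[b], b)) if structural else None
    seen_hashes = {}
    labels = {}
    for bid in sorted(sig, key=key):
        digest = hashlib.sha256(sig[bid].encode("utf-8")).hexdigest()[:digest_len]
        count = seen_hashes.get(digest, 0)
        seen_hashes[digest] = count + 1
        labels[bid] = f"b{digest}" if count == 0 else f"b{digest}_{count}"
    return labels


def labels_by_signature(sig, **kwargs):
    """Re-key the labels by signature so two runs can be compared directly."""
    labels = assign_labels(sig, **kwargs)
    return {sig[bid]: label for bid, label in labels.items()}


# Two runs over the same structures; only the canonical numbering differs,
# as could happen after an unrelated triple is added.
run_a = {"c14n0": "sig-A", "c14n1": "sig-B"}
run_b = {"c14n0": "sig-B", "c14n1": "sig-A"}

# ID ordering: the base label follows the numbering, so it moves between runs.
by_id_a = labels_by_signature(run_a, digest_len=0)
by_id_b = labels_by_signature(run_b, digest_len=0)

# (signature, ID) ordering: the base label stays with the same structure.
by_sig_a = labels_by_signature(run_a, structural=True, digest_len=0)
by_sig_b = labels_by_signature(run_b, structural=True, digest_len=0)
```

When two colliding nodes have identical signatures, the `(sig[b], b)` key still falls back to the canonical ID, so the fix only stabilises collisions between structurally distinct nodes.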
…lib serialization

Add a --deterministic / --no-deterministic CLI flag (default off) to OWL, SHACL, JSON-LD Context, and JSON-LD generators that produces byte-identical output across invocations.

Three-phase hybrid pipeline for Turtle generators:
1. RDFC-1.0 canonicalization (W3C Recommendation) via pyoxigraph
2. Weisfeiler-Lehman structural hashing for diff-stable blank node IDs
3. Hybrid rdflib re-serialization for idiomatic Turtle (inline blank nodes, collection syntax, prefix filtering)

JSON generators use deterministic_json() with recursive deep-sort and JSON-LD-aware key ordering that preserves conventional @context structure.

Collection items (owl:oneOf, sh:in, sh:ignoredProperties) are sorted when --deterministic is set to ensure reproducible RDF list order.

pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used. Tests skip gracefully when pyoxigraph is unavailable.

Refs: linkml#1847
Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
Signed-off-by: jdsika <carlo.van-driesten@bmw.de>
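
Phase 2 of the pipeline can be illustrated with a rough, standard-library-only sketch of Weisfeiler-Lehman style hashing. This is not the PR's implementation: `wl_labels`, the representation of triples as string tuples, the `_:` prefix test for blank nodes, and the fixed number of refinement rounds are all assumptions made for illustration. The point it shows is that a label built only from predicates, named terms, and neighbour hashes does not change when blank nodes are renumbered.

```python
import hashlib


def is_bnode(term):
    """Assumption: blank nodes are written as strings starting with '_:'."""
    return term.startswith("_:")


def wl_labels(triples, rounds=2):
    """Return {blank-node ID: content-based label} for a list of (s, p, o) tuples."""
    bnodes = {t for s, _, o in triples for t in (s, o) if is_bnode(t)}
    # Every blank node starts with the same empty colour, so the first pass
    # sees only predicates and named terms, never the blank-node names.
    colour = {b: "" for b in bnodes}
    for _ in range(rounds + 1):
        refined = {}
        for b in bnodes:
            parts = []
            for s, p, o in triples:
                if s == b:
                    parts.append(("out", p, colour[o] if is_bnode(o) else o))
                if o == b:
                    parts.append(("in", p, colour[s] if is_bnode(s) else s))
            # Sorting makes the signature independent of triple order.
            signature = repr(sorted(parts))
            refined[b] = hashlib.sha256(signature.encode("utf-8")).hexdigest()
        colour = refined
    return {b: "b" + h[:12] for b, h in colour.items()}


# The same shape serialized twice with the canonical numbering swapped.
graph_1 = [
    ("ex:Person", "sh:property", "_:c14n0"),
    ("_:c14n0", "sh:path", "ex:name"),
    ("ex:Person", "sh:property", "_:c14n1"),
    ("_:c14n1", "sh:path", "ex:age"),
]
graph_2 = [
    ("ex:Person", "sh:property", "_:c14n0"),
    ("_:c14n0", "sh:path", "ex:age"),
    ("ex:Person", "sh:property", "_:c14n1"),
    ("_:c14n1", "sh:path", "ex:name"),
]

labels_1 = wl_labels(graph_1)
labels_2 = wl_labels(graph_2)
```

In this sketch the node carrying `ex:name` receives the same label in both graphs even though its canonical ID differs. Structurally identical blank nodes would still hash to the same value, which is the case the collision counter has to handle.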
Summary
Add a --deterministic flag to OWL, SHACL, and JSON-LD generators that produces byte-identical output across invocations, eliminating spurious diffs in version-controlled artifacts.
This is a review-ready fork of the approach discussed in upstream linkml/linkml#3295, rebuilt to address maintainer feedback.
Problem
Generated OWL and SHACL artifacts contain blank nodes whose identifiers change between runs due to Python dict ordering and rdflib serialization non-determinism. This makes version-controlled artifacts show massive diffs even when the underlying schema change is trivial.
Solution
Three-Phase Hybrid Pipeline (deterministic_turtle())
1. RDFC-1.0 canonicalization (W3C Recommendation) via pyoxigraph.
2. Weisfeiler-Lehman structural hashing replaces the _:c14nN identifiers with content-based hashes. These depend only on predicate IRIs, literal values, and named-node IRIs, not on blank-node numbering, so adding or removing a triple only affects directly involved blank nodes.
3. Hybrid rdflib re-serialization loads the result into an rdflib Graph and serializes with rdflib's native Turtle writer. This recovers idiomatic Turtle features that pyoxigraph cannot emit:
   - inline blank nodes ([ … ]) for singly-referenced blank nodes (Turtle §2.7)
   - collection syntax (( … )) for rdf:List chains (Turtle §2.8)
All triples from the source graph are preserved: the hybrid step only changes syntactic form, never semantic content. Plain string literals have their xsd:string datatype stripped per RDF 1.1 §2.5.1 (simple literals are syntactic sugar for xsd:string).
Additional Features
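- Collection sorting (gated behind --deterministic): owl:oneOf, sh:in, sh:ignoredProperties items are sorted when the flag is set
- deterministic_json(): recursive deep-sort with JSON-LD-aware key ordering that preserves conventional @context structure
Benchmark Results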
Tested on the Gaia-X Trust Framework ontology (~68K OWL / ~165K SHACL triples) and schema.org (~18K triples):
Semantic Equivalence
rdflib.compare.isomorphic(): True, True, True
Byte-Level Stability
Diff Quality (Signal-to-Noise Ratio)
Controlled mutations on a LinkML schema:
Output Size (Gaia-X Trust Framework)
The SHACL 18× size reduction comes from replacing 157,552 named _:bHASH blank nodes with inline [ … ] syntax and 77,358 explicit rdf:first/rdf:rest triples with ( … ) collection shorthand, matching the upstream Gaia-X registry convention.
Performance
Dependency
pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used. It is not a core dependency, avoiding conflict with morph-kgc's pin on pyoxigraph < 0.4.0. Tests skip gracefully when pyoxigraph >= 0.4.0 is unavailable.
Relationship to upstream linkml#3295
The original PR was closed after maintainer feedback requesting an established canonicalization standard. This PR:
Testing
- test_deterministic_output.py: 27 tests (stability, sorting, prefix format, enum ordering, kitchen_sink)
- test_deterministic_benchmark.py: 10 local + 4 network tests (schema.org equivalence, mutation diff quality, signal-to-noise assertions)
Benchmark Test Assertions
The benchmark enforces quantitative properties:
References