Skip to content

fix(generators): add --normalize-prefixes flag for well-known prefix names#4

Open
jdsika wants to merge 1 commit intomainfrom
fix/normalize-well-known-prefixes
Open

fix(generators): add --normalize-prefixes flag for well-known prefix names#4
jdsika wants to merge 1 commit intomainfrom
fix/normalize-well-known-prefixes

Conversation

@jdsika
Copy link
Copy Markdown

@jdsika jdsika commented Mar 26, 2026

Summary

Add an opt-in --normalize-prefixes flag to all generators that normalises non-standard prefix aliases to well-known names from a static, version-independent prefix map (derived from rdflib 7.x defaults, cross-checked against prefix.cc consensus).

Motivation

Schemas often declare custom prefix aliases for well-known namespaces (e.g. sdo for https://schema.org/, dce for http://purl.org/dc/elements/1.1/). When generated artifacts are consumed by tools that expect conventional prefix names, this causes friction. This flag allows opt-in normalization without changing default behavior.

Changes

generator.py — Static prefix map and shared helper

  • _WELL_KNOWN_PREFIX_MAP: Frozen dict literal of ~30 well-known namespace-to-prefix mappings. Eliminates rdflib version dependency (defaults changed between 6.x and 7.x). Includes both http://schema.org/ and https://schema.org/schema (linkml-runtime uses HTTP, W3C prefers HTTPS).
  • well_known_prefix_map(): Returns a copy of the static map (no longer creates Graph() at runtime).
  • normalize_graph_prefixes(graph, schema_prefixes): Shared helper that rebinds non-standard aliases in an rdflib Graph to their standard names. Uses graph.bind(..., override=True, replace=True) for clean rebinding.
  • normalize_prefixes field on Generator base class + CLI flag.

owlgen.py — Explicit post-bind normalisation

  • Binds schema prefixes normally (override=True, the default), then calls normalize_graph_prefixes() when flag is set.
  • Replaces previous override=False approach which made all 29 rdflib defaults sticky (not just well-known ones).

shaclgen.py — Same explicit approach

  • Same pattern as OWL: bind normally → post-process with normalize_graph_prefixes().

jsonldcontextgen.py — Three-case remap logic

  • Initialises _prefix_remap in __post_init__() (replaces defensive getattr).
  • Handles three cases during remap:
    1. Standard prefix not yet bound → rebind from old to new.
    2. Standard prefix bound to different URI (stale binding) → remove stale, then rebind.
    3. Standard prefix bound to same URI (duplicate) → just drop the alias.
  • Applies remap to both prefix declarations and CURIE generation in _build_element_id().

Test Coverage

New test_normalize_prefixes.py with 18 tests across 5 test classes:

Class Tests What it verifies
TestOwlNormalizePrefixes 5 sdo→schema, flag-off, dce→dc (graph bindings), custom prefix, http variant
TestShaclNormalizePrefixes 5 Same coverage for SHACL generator
TestContextNormalizePrefixes 1 http://schema.org/ edge case
TestWellKnownPrefixMap 6 Static map integrity, copy safety, rdflib superset check
TestCrossGeneratorConsistency 1 All three generators agree on sdo→schema

Plus 3 existing context-gen tests retained unchanged.

Backward Compatibility

The flag defaults to off — existing behavior is fully preserved. No snapshot updates required.

@jdsika jdsika force-pushed the fix/normalize-well-known-prefixes branch 3 times, most recently from 42e007b to 71db97f Compare March 26, 2026 17:16
@jdsika jdsika changed the title fix(generators): normalize well-known prefixes in output fix(generators): add --normalize-prefixes flag for well-known prefix names Mar 26, 2026
@jdsika jdsika force-pushed the fix/normalize-well-known-prefixes branch 8 times, most recently from 60f62c4 to 1b36dc0 Compare April 2, 2026 09:38
@jdsika
Copy link
Copy Markdown
Author

jdsika commented Apr 2, 2026

🔍 Adversarial Review — PR #4

Summary

Solid feature with thorough test coverage, but contains a confirmed race condition between Phase 1 and Phase 2 of normalize_graph_prefixes() that can silently swap namespace URIs (HTTPS → HTTP) for schema.org. The cross-PR conflict with PR #1 (well_known_prefix_map() defined differently in both) also needs a merge strategy.

🐛 Bugs & Issues

1. Phase 2 overwrites Phase 1 normalization for HTTP/HTTPS variants (Confirmed)

When a schema declares sdo: https://schema.org/ and the metamodel runtime injects schema: http://schema.org/, the following sequence occurs:

  • Phase 1 correctly rebinds sdoschema: https://schema.org/ (user's HTTPS)
  • After Phase 1, the graph still contains schema1: http://schema.org/ (orphaned runtime HTTP binding)
  • Phase 2 sees schema1: http://schema.org/, looks up wk["http://schema.org/"] = "schema", checks collision ("schema" not in schema_prefixes — user declared sdo, not schema), and rebinds schema: http://schema.org/ with override=True, replace=True
  • This silently replaces the Phase 1 HTTPS binding with the HTTP variant

Reproduced locally with rdflib 7.x:

from rdflib import Graph, Namespace, URIRef
g = Graph()
g.bind('schema', Namespace('http://schema.org/'))      # metamodel
g.namespace_manager.bind('sdo', URIRef('https://schema.org/'))  # user

# Phase 1
g.bind('schema', Namespace('https://schema.org/'), override=True, replace=True)
# → schema: https://schema.org/ ✅, schema1: http://schema.org/ (orphan)

# Phase 2
g.bind('schema', Namespace('http://schema.org/'), override=True, replace=True)
# → schema: http://schema.org/ ❌  (HTTPS binding silently lost!)

Fix: Phase 2 should check whether std_pfx is already bound in the graph (not just schema_prefixes) before rebinding:

# In Phase 2, before rebinding:
current_ns_for_std = dict(graph.namespaces()).get(std_pfx)
if current_ns_for_std and str(current_ns_for_std) != ns_str:
    continue  # std_pfx already claimed by Phase 1 for a different URI

2. test_sdo_normalised_to_schema may be passing for the wrong reason

The test asserts pfx["schema"] == "https://schema.org/", but if Phase 2 overwrites this with HTTP, the test should fail. If it passes, it may be because the metamodel's emit_prefixes don't include schema for the test schema, meaning the HTTP binding is never injected — hiding the bug in production schemas that DO trigger both bindings. The test should explicitly set up both HTTP and HTTPS bindings to exercise Phase 2 properly.

⚠️ Concerns

3. Cross-PR conflict with PR #1well_known_prefix_map() defined differently

PR #1 PR #4
Approach Dynamic: {str(ns): str(pfx) for pfx, ns in RdfGraph().namespaces()} Static: MappingProxyType({...}) with ~30 entries
rdflib dependency Changes with rdflib version Frozen, version-independent
HTTP/HTTPS variants Only what rdflib provides Explicitly includes both

PR #4's static approach is explicitly designed to solve a problem with PR #1's dynamic approach (rdflib defaults changed between 6.x and 7.x). These will conflict on merge. Recommendation: merge PR #4 first, or rebase PR #1 to use PR #4's static map.

4. Cross-PR conflict — flag forwarding in jsonldgen.py

PR #4 forwards both normalize_prefixes AND deterministic to the embedded ContextGenerator:

for flag in ("normalize_prefixes", "deterministic"):
    if hasattr(self, flag):
        context_kwargs.setdefault(flag, getattr(self, flag))

PR #1 adds deterministic handling in a different location (post-serialization in end_schema). PR #4's forwarding of deterministic creates a dependency on PR #1's field existing. The hasattr guard prevents a crash, but semantically PR #4 shouldn't forward a flag it doesn't define. Either:

5. PR title says fix: but commit says feat: — which is it?

The commit message uses feat(generators): while the PR title uses fix(generators):. This is a new opt-in flag, which is a feature — feat: is correct per conventional commits.

6. Static map maintenance burden

The _WELL_KNOWN_PREFIX_MAP needs manual updates when rdflib adds new curated defaults. The test_matches_rdflib_defaults test catches additions but not removals. Consider adding a reverse check: entries in the static map that are NOT in rdflib's defaults could indicate stale entries.

🧪 Test Coverage Assessment

Strong coverage (18 tests across 7 classes):

  • ✅ OWL, SHACL, JSON-LD context normalization (sdo→schema, dce→dc)
  • ✅ HTTP/HTTPS schema.org variants
  • ✅ Collision detection with warnings (caplog)
  • ✅ Cross-generator consistency
  • ✅ Static map integrity and copy safety
  • ✅ JSONLDGenerator flag forwarding
  • ✅ Flag-off backward compatibility

Gaps:

  • ❌ No test for normalize_prefixes + exclude_imports interaction
  • ❌ No test for empty schema (zero prefixes) — is _prefix_remap.clear() safe?
  • ❌ No test that explicitly sets up both HTTP and HTTPS runtime bindings to exercise Phase 2 overwrite (the scenario from Bug feat(generators): add --deterministic flag with hybrid RDFC-1.0 + rdflib serialization #1)
  • ❌ No test for add_prefix being called before __post_init__ completes (defensive, but _prefix_remap access would crash)
  • ❌ CURIE remapping test (test_normalize_prefixes_curie_remapping) only checks that Person @id starts with schema: — doesn't verify full_name slot's @id uses schema:name instead of sdo:name

📋 Fix Plan

  1. Fix Phase 2 overwrite bug — add a graph-binding check before rebinding in Phase 2 to prevent overwriting Phase 1's normalization
  2. Add regression test — explicitly set up a graph with both schema: http://schema.org/ and sdo: https://schema.org/, run normalize, and assert the HTTPS binding survives
  3. Remove "deterministic" from forwarding loop — or gate it behind hasattr only (already done) but add a comment explaining the cross-PR dependency
  4. Fix PR title — change fix: to feat: for consistency with the commit
  5. Add empty-schema test — verify no crash when schema has zero prefixes
  6. Add exclude_imports interaction test — verify normalization works when imports are excluded
  7. Strengthen CURIE remapping test — assert that slot URIs (not just class URIs) are remapped in the context

✅ What's Good

  • Excellent documentation: the 3-case handling in visit_schema is well-commented with case numbers
  • Defensive _prefix_remap initialization: correctly placed before super().__post_init__() with clear comment explaining why
  • Collision detection: warning on skip is the right behavior — no silent data loss
  • MappingProxyType for the static map: prevents accidental mutation
  • well_known_prefix_map() returns a copy: callers can't corrupt the canonical map
  • Phase 2 concept is sound: cleaning up runtime-injected orphan bindings is the right idea, it just needs the overwrite guard
  • Cross-generator consistency test: ensures OWL, SHACL, and JSON-LD agree — excellent
  • Backward compatibility: flag defaults to off, existing behavior preserved

… names

Add an opt-in --normalize-prefixes flag to OWL, SHACL, and JSON-LD
Context generators that normalises non-standard prefix aliases to
well-known names from a static prefix map (derived from rdflib 7.x
defaults, cross-checked against prefix.cc consensus).

Key design decisions:
- Static frozen map (MappingProxyType) instead of runtime
  Graph().namespaces() lookup eliminates rdflib version dependency
- Both http://schema.org/ and https://schema.org/ map to 'schema'
- Shared normalize_graph_prefixes() helper used by OWL and SHACL
- Two-phase graph normalisation: Phase 1 normalises schema-declared
  prefixes, Phase 2 cleans up runtime-injected bindings
- Collision detection: skip with warning when standard prefix name
  is already user-declared for a different namespace
- Phase 2 guard prevents overwriting HTTPS bindings with HTTP variants

The flag defaults to off, preserving existing behaviour.

Tests cover OWL, SHACL, and context generators with sdo->schema,
dce->dc, http/https edge case, custom prefix preservation, flag-off
backward compatibility, cross-generator consistency, prefix collision
detection, schema1 regression prevention, Phase 2 HTTPS guard, empty
schema edge case, and static map integrity.

Signed-off-by: jdsika <carlo.van-driesten@bmw.de>
@jdsika jdsika force-pushed the fix/normalize-well-known-prefixes branch from d9fb738 to dd0ce0c Compare April 2, 2026 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant