An auditable safety layer for any LLM. SAT-verifiable Boolean rules and counterfactual evidence recourse via dual-graph Tsetlin Machines. Built for EU AI Act Article 13/14 compliance.
University of Agder (UiA).
Work in progress. Implementation is complete and all 44 unit and integration tests pass. The 5-seed GPU training run on FEVER, HaluEval, FActScore, and MedHallBench is not yet executed; the numbers in the results section are planned targets, not measurements. Expected wall-clock on a single Tesla V100-SXM3-32GB is approximately one week (~57 GPU-hours per seed × 5 seeds, with overlap). The paper is in draft.
| Component | State |
|---|---|
Architecture frozen (docs/ARCHITECTURE.md) |
done |
Novelty audit (docs/novelty_audit.md) |
done |
Research plan (docs/research_plan.md) |
done |
| Core implementation (8 modules) | done |
| Triple extractor (regex + Qwen2.5-1.5B) | done |
| Dual-graph builder | done |
| Graph Tsetlin Machine verifier | done |
| Counterfactual recourse module | done |
| SAT-verifiable clause receipts | done |
| Adversarial robustness harness | done |
| 44 unit + integration tests | passing |
Package installs cleanly (pip install -e .) |
done |
| Triple extraction at scale on FEVER train | not started |
| 5-seed FEVER training run | not started, ~1 week V100 |
| Zero-shot transfer to HaluEval / FActScore / MedHallBench | not started |
| Adversarial robustness evaluation | not started |
| Auditor user study (n ≥ 10) | not started |
| Paper submission | in draft (PAPER.md) |
Large language models hallucinate. Every major deployment incident on record is a verification failure:
| Incident | Cost / outcome |
|---|---|
| Mata v. Avianca (S.D.N.Y., 2023) | $5,000 sanction; bar-discipline referral; ChatGPT-fabricated citations |
| Moffatt v. Air Canada (BC CRT, 2024) | Airline held legally liable for chatbot misrepresentation |
| Stanford SDLP, Legal RAG Hallucinations (2025) | 17-33% citation hallucination in dedicated legal-RAG products |
| PubMed PMC12518350 (2024) | 28% citation hallucination on GPT-4 medical queries |
The existing guardrails are themselves black boxes. From August 2026, EU AI Act Article 13 (transparency) and Article 14 (effective human oversight) become enforceable for high-risk Annex-III deployers (medical, legal, finance, biometrics, HR, education). No current guardrail satisfies them, because a deployer cannot effectively oversee a 12 B-parameter neural classifier whose reasoning is opaque.
ClauseGuard is the first guardrail whose every decision is a finite, enumerable, SAT-verifiable propositional formula learned from data.
ClauseGuard
+-----------------------+
LLM output ──► | decompose into |
(any model) | atomic (s,r,o) |
| claim triples |
+-----------+-----------+
│
KG / RAG ──► +-----------+-----------+
context | build dual graph |
| (claim + evidence |
| + cross-graph edges) |
+-----------+-----------+
│
+-----------+-----------+
| graph-walking HGTM | ── outputs ──► Support / Refute / NEI
| (dual-graph, | +
| depth-5, VSA-bound) | firing clauses (rules)
+-----------+-----------+ +
│ SAT-verifiable receipt
│ (replayable in ~50 ms)
│ +
└─── for refuted claims ──► counterfactual
evidence-graph edit
("if you added this
triple, the verdict
would flip")
flowchart LR
L["LLM output<br/>(any model)"] --> D["Atomic-claim<br/>decomposer"]
D --> CT["Claim triples"]
CT --> CG["Claim graph"]
K["Knowledge graph<br/>or RAG corpus"] --> ET["Evidence triples"]
ET --> EG["Evidence graph"]
CG --> DG["Dual graph<br/>+ cross-graph edges"]
EG --> DG
DG --> V["Dual-graph<br/>GraphTM"]
V --> Y["Support / Refute /<br/>NotEnoughInfo"]
V --> R["Firing clauses<br/>+ SAT receipt"]
V --> C["Counterfactual<br/>evidence recourse"]
The claim graph and the evidence graph share an entity vocabulary and are joined by three typed cross-graph edges:
| Cross-graph edge | Fires when |
|---|---|
align:entity |
the same canonical entity surface form appears on both sides |
align:relation |
a relation between two aligned entities matches across sides |
contradict:negation |
the same (s, r, o) triple appears with opposite polarity on the two sides |
Clauses learn over these patterns directly. No prior GraphTM walks two graphs.
sequenceDiagram
participant LLM
participant CG as ClauseGuard
participant KG as Evidence KG
participant SAT as Glucose 4 (auditor)
LLM->>CG: "Ibuprofen treats active GI bleed."
CG->>CG: Decompose into atomic claims
CG->>CG: Extract (ibuprofen, treats, gi_bleed)
CG->>KG: Lookup
KG-->>CG: (ibuprofen, contraindicated_for, gi_bleed)
CG->>CG: Build dual graph
Note over CG: contradict:negation edge fires
CG->>CG: GraphTM emits "Refute"
CG-->>LLM: Refute<br/>+ firing clauses<br/>+ signed receipt
Note over CG: For Article 14 oversight
CG->>CG: Greedy minimum-edit recourse
CG-->>LLM: "Add triple (ibuprofen, treats, X) where X ≠ gi_bleed"
SAT->>CG: Verify receipt
CG-->>SAT: CNF + class scores + literal vector
SAT-->>SAT: Solve in ~50 ms
SAT-->>SAT: Recorded label is consistent
flowchart TB
subgraph M1[M1 extraction]
D[Atomic-claim<br/>decomposer]
T[Triple extractor]
end
subgraph M2[M2 graphs]
CG[Claim graph]
EG[Evidence graph]
DG[Dual graph]
end
subgraph M3[M3 tm]
V[GraphTMVerifier]
K[K-seed ensemble]
DI[LLM-judge distillation]
end
subgraph M4[M4 recourse]
CN[Candidate generator]
SE[Greedy search]
end
subgraph M5[M5 verify]
SAT[SAT encoder]
REC[Signed receipt]
end
subgraph M6[M6 data]
FE[FEVER]
HA[HaluEval]
FA[FActScore]
MH[MedHallBench]
end
subgraph M7[M7 eval]
ME[Metrics]
AD[Adversarial harness]
LG[JSONL logger]
end
subgraph M8[M8 utils]
SD[Seeding]
IO[Canonical IO]
end
M1 --> M2 --> M3 --> M4
M3 --> M5
M6 --> M1
M3 --> M7
M5 --> M7
A frozen interface contract lives in
docs/ARCHITECTURE.md. Five invariants are
checked by the integration tests:
- Triple round-trip determinism.
- Graph isomorphism under triple permutation.
- CUDA vs CPU reference parity (within 1 vote unit).
- No silent CPU fallback when the user requests CUDA.
- Clause receipts replay bit-for-bit.
| Capability | NeMo Guardrails | Llama Guard 3 | ConceptGuard | AlignScore | Semantic Entropy | ClauseGuard |
|---|---|---|---|---|---|---|
| Programmable rule layer | Yes (Colang) | No | No | No | No | Yes (learned) |
| Output is human-readable | Partial | No | No | No | No | Yes (Boolean clauses) |
| Per-decision SAT-verifiable receipt | No | No | No | No | No | Yes (≤ 50 ms) |
| Counterfactual recourse | No | No | No | No | No | Yes (≤ 3 evidence edits) |
| Detects KG contradictions natively | No | No | No | No | No | Yes |
| Article 13 transparency artifact | No | No | No | No | No | Yes |
| Article 14 effective oversight | No | No | No | No | No | Yes (recourse) |
| Inference cost vs the LLM itself | small | 8-12 B LLM | LLM + SAE | LLM | LLM × N samples | TM only |
| Open weights / reproducible | Yes | Yes | Yes | Yes | Yes | Yes |
Verified empty literature intersection (Tsetlin Machine ∪ KG fact
verification ∪ SAT-verifiable receipt ∪ counterfactual evidence
recourse) per
docs/novelty_audit.md.
Honest, unmeasured targets across the four benchmarks. The accuracy target is competitive, not state-of-the-art: ClauseGuard's pitch is the audit layer, not the headline number. Numbers will be locked after the 5-seed runs complete.
| Benchmark | Metric | Strongest baseline | ClauseGuard target |
|---|---|---|---|
| FEVER blind | label acc / FEVER score | DeBERTa-large 78.3 / 70.0 | ≥ 75 / ≥ 64 |
| HaluEval-QA | AUROC | AlignScore 79.5; Semantic Entropy 79.1 | ≥ 75 with ≤ 500 clauses |
| FActScore | F1 | GPT-4-judge 0.84 | ≥ 0.74 with rule export |
| MedHallBench | AUROC | GPT-4-judge 0.79 | ≥ 0.70 with medical-rule trace |
| Adversarial paraphrase | robust acc | DeBERTa 51% | ≥ 70% |
| Clause-set size for 95% recall | n_clauses | n/a (no competitor exports rules) | ≤ 500 |
| Median clause length | literals | n/a | ≤ 6 |
| SAT-verification latency | p50 ms | n/a | ≤ 50 ms |
| Auditor reproducibility | n ≥ 10 study | n/a | ≥ 80% predict-from-receipt |
gantt
title ClauseGuard roadmap
dateFormat YYYY-MM-DD
axisFormat %b
section Done
Architecture freeze :done, arch, 2026-05-20, 2d
Core implementation :done, impl, 2026-05-22, 1d
44 unit + integration tests :done, test, 2026-05-23, 1d
Novelty audit :done, nov, 2026-05-23, 1d
section Next, single V100
Triple extraction at scale :active, ext, 2026-05-24, 2d
5-seed FEVER training :fev, after ext, 7d
Zero-shot HaluEval, FActScore, MedHall :zs, after fev, 2d
Adversarial robustness :adv, after zs, 2d
SAT-receipt latency study :sat, after adv, 1d
section Paper
Lock numbers + figures :lock, after sat, 3d
Auditor user study :aud, after lock, 14d
Submission :sub, after aud, 7d
# 1. Install (editable mode)
pip install -e .
# 2. Run the test suite (no GPU required)
pytest tests/ -q
# 3. Download the four benchmarks (FEVER, HaluEval, FActScore, MedHallBench)
bash scripts/download_datasets.sh
# 4. Extract triples on a small slice (regex extractor, no GPU)
python scripts/extract_triples.py --dataset fever --split train --max-samples 5000
# 5. Train the dual-graph GraphTM on FEVER train (one seed)
python experiments/train_fever.py --seed 42 --max-train 10000 --epochs 25
# 6. Run the full 5-seed protocol (matches paper-a / paper-b / paper-c protocol)
bash scripts/run_all_seeds.sh
python scripts/aggregate_results.pyFor the production extractor (Qwen2.5-1.5B-Instruct, open weights),
add --extractor qwen to any of the above commands.
clauseguard/
├── README.md, LICENSE, CITATION.cff, .gitignore
├── pyproject.toml, requirements.txt
├── PAPER.md paper draft (targets, not measurements)
├── docs/
│ ├── ARCHITECTURE.md frozen interface contract
│ ├── research_plan.md experiment phases + compute budget
│ └── novelty_audit.md novelty proof, ~25 citations
├── src/clauseguard/
│ ├── extraction/ atomic-claim decomposer + (s, r, o) triple extractor
│ ├── graphs/ claim graph, evidence graph, dual graph
│ ├── builders/ pluggable evidence-graph back ends (default, subword_dep)
│ ├── tm/ graph-walking HGTM verifier + distillation + ensemble
│ ├── recourse/ candidate gen + greedy minimum-edit search + report
│ ├── verify/ SAT encoder (Glucose 4) + signed clause receipts
│ ├── data/ FEVER, HaluEval, FActScore, MedHallBench loaders
│ ├── eval/ metrics + adversarial harness + JSONL logger
│ └── utils/ seeding + canonical IO + hashing
├── experiments/ train_fever, eval_halueval, eval_factscore,
│ eval_medhall, recourse_eval, adversarial_eval,
│ sat_receipt_demo
├── scripts/ download_datasets.sh, extract_triples.py,
│ save_verifier.py, run_all_seeds.sh,
│ aggregate_results.py
├── tests/ 6 test files, 44 passing
├── configs/ base.yaml + per-experiment overrides
└── results/ per-seed JSON, aggregated CSV, paper figures
| Reused from | Component | Role in ClauseGuard |
|---|---|---|
paper-a-subword-dep-graphtm |
typed-edge subword graph builder | claim graph / evidence graph construction |
paper-b-attention-distill-graphtm |
BERT-attention-as-topology distillation | optional dense-edge variant; ablation baseline |
paper-c-tm-robustness |
TextAttack adapter + per-sample logger | adversarial stress-test of the audit layer |
decoder-attention-distill-graphtm |
Qwen / LLaMA attention extraction | LLM-side claim decomposition + attention prior |
graphtm-cbr |
graph-walking HGTM + CUDA-C kernels + greedy recourse | the verification engine + the recourse module (repurposed for evidence-graph edits) |
ClauseGuard ships with two evidence-graph back ends, selectable through
a single evidence_builder toggle in configs/base.yaml:
evidence_builder |
Node type | Edge types |
|---|---|---|
default |
one canonical entity per evidence triple | rel:<canonical_relation> and inverses |
subword_dep |
BPE subword tokens of the rendered evidence text | seq_next, seq_prev, dep:<reltype>, dep:<reltype>_inv |
The subword_dep builder is ported from paper-a (arXiv 2510.XXXXX,
subword-dependency GraphTM). The motivation is finer-grained linguistic
structure on the evidence side: where the default builder collapses the
sentence "Drug X treats disease Y" into the single triple
(drug_x, treats, disease_y), the subword-dep builder keeps the
sub-token sequence and the dependency tree available to the GraphTM
clauses.
Whether finer-grained linguistic structure on the evidence side actually improves verification accuracy is an open empirical question. It will be answered by the planned 5-seed FEVER GPU run; the README will be updated with measured numbers once that run completes.
Code lives in src/clauseguard/builders/subword_dep_evidence.py. The
public interface mirrors the default evidence-graph builder, so
existing callers see no behaviour change while evidence_builder is
left at default.
- Accuracy ceiling. Continuous-embedding methods (semantic entropy, AlignScore) may retain a small AUROC edge on HaluEval. ClauseGuard's pitch is not "highest number"; it is "the only guardrail an EU AI Act notified body or an FDA reviewer can audit." Where regulation forces auditability, ClauseGuard is the only option; where it doesn't, the market may keep the black-box detector. This is documented openly.
- Triple-extraction bottleneck. A poorly-extracted (s, r, o) tuple makes any downstream rule wrong. The default extractor is Qwen2.5-1.5B-Instruct (open weights, reproducible); every extraction call is logged with the model SHA, the prompt, and the raw output. Extraction precision/recall is reported separately so any verification error can be attributed.
- Open-world entity vocabulary. Wikidata / UMLS span millions of entities. The first release starts with top-N most frequent per benchmark and reports how clause-set size scales with N. If the vocab blow-up makes rule sets uninterpretable in practice (> 10⁴ clauses, length > 20), the auditability claim is re-scoped.
- Single-author repository. All code, docs, and paper drafts in this repository are by Anwar (University of Agder).
@misc{anwar2026clauseguard,
author = {Anwar},
title = {ClauseGuard: A Dual-Graph Tsetlin Machine Audit Layer for LLM Output Verification with SAT-Verifiable Clause Receipts and Counterfactual Evidence Recourse},
year = {2026},
howpublished = {\url{https://github.com/AnwarDebes/clauseguard}},
note = {University of Agder. Work in progress.}
}MIT. Code, configs, and result JSONs are free for academic and
commercial use. See LICENSE.
Anwar, University of Agder (UiA).