Dualify

Dualify is a research framework for bidirectional verification:

Spec-to-Logic (LLM extracts formal logic from informal specs)
Code-to-Logic (LLM extracts formal logic from implementation)
SMT-Checking (Z3 compares formulas and finds counterexamples)
Refinement (LLM suggests spec/code improvements from mismatches)

Pipeline overview

Quick start

Install all dependencies and checks (includes Ollama install/start check and model pull):
```
./setup.sh
```
Optional model override:
```
DUALIFY_MODEL=qwen2.5:3b-instruct ./setup.sh
```

Run full experiment:

poetry run python scripts/run_experiment.py --model qwen2.5:3b-instruct --benchmark synthetic

Run against OpenAI-compatible API (e.g. vLLM /v1/completions):

poetry run dualify-run \
  --provider openai \
  --base-url http://10.100.30.241:8800 \
  --api-key API_KEY \
  --model Qwen/Qwen3-Coder-Next-FP8 \
  --benchmark synthetic

Run mismatch demo benchmark (intentionally inconsistent specs):

poetry run python scripts/run_experiment.py --model qwen2.5:3b-instruct --benchmark mismatch

Scan a real Python repository (iterative inconsistency search):
```
poetry run dualify-run --model qwen2.5:3b-instruct --repo-path ./path/to/repo --iterations 2
```
- Interactive mode (default): for each function, Dualify shows SMT comparison, asks you to choose an action (p04), then executes selected action prompt (p05).
- Non-interactive mode:
```
poetry run dualify-run --model qwen2.5:3b-instruct \
  --repo-path ./path/to/repo --iterations 2 --non-interactive
```

Record, replay, and resume LLM calls

Every run hits an LLM. That makes runs slow, costly, and non-deterministic across model versions. Dualify can wrap the live LLM client so each call is written to a JSONL transcript on disk; later runs can replay that transcript instead of calling the API. Useful for:

Artifact reproducibility — commit a transcript, anyone can re-derive the report without an API key.
Offline CI — drive the full pipeline from a fixed transcript.
Bisecting — tweak the code, replay the same transcript, see exactly which case flipped.

Three flags, mutually exclusive:

Flag	What it does
`--record-transcript PATH`	Run live; append every call to `PATH` (truncates if it exists).
`--replay PATH`	Serve responses from `PATH`; no live LLM, no API key needed.
`--resume-transcript PATH`	Serve cached prefix from `PATH`, fall through to the live LLM for any remaining calls, and append the new ones to the same file. Use this to continue after an interrupted recording.

Worked example: record, interrupt, resume

# Initial recording -- run against the live API.
poetry run dualify-run \
  --repo-path ./repos/python-barcode --non-interactive \
  --record-transcript transcripts/barcode.jsonl
# ... ctrl-C halfway through.

# Resume -- replays what's already in the file, calls the live LLM for the rest.
poetry run dualify-run \
  --repo-path ./repos/python-barcode --non-interactive \
  --resume-transcript transcripts/barcode.jsonl

# Once complete, anyone can re-derive the report offline:
poetry run dualify-run \
  --repo-path ./repos/python-barcode --non-interactive \
  --replay transcripts/barcode.jsonl

Transcripts are line-buffered, so each record is on disk the moment its call returns — safe to tail -f and safe to resume even after a hard kill.

Drift detection

Every record carries the SHA-256 of its prompt. On --replay and on the cached prefix of --resume-transcript:

TranscriptPromptMismatchError — a prompt template changed since the transcript was recorded. Re-record.
TranscriptExhaustedError (replay only) — the current code path issues more LLM calls than the transcript holds (e.g. you widened the case filter). Re-record, or narrow the input back to the recorded scope.

Both errors fire at the exact call that drifted, never silently.

Benchmark input format (no JSON required)

Put Python files into benchmark/synthetic/.
Add a short natural-language description in comments directly above each function.
Add optional Context: comment lines above the function for assumptions/preconditions.
Keep explicit type annotations (int/bool) in function signature.

Example:

# Return True when x is positive.
def is_positive(x: int) -> bool: ...

Output artifacts

Single run report per launch:
- results/<benchmark>_<yyyy_mm_dd_hh_mm_ss>.json

Development quality checks

Lint: poetry run ruff check .
Format: poetry run ruff format .
Type-check: poetry run mypy
Tests: poetry run pytest
Install git hooks: poetry run pre-commit install

CI runs the same checks on push/PR via .github/workflows/ci.yml.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github/workflows		.github/workflows
benchmark		benchmark
conferences		conferences
scripts		scripts
src/dualify		src/dualify
tests		tests
tools/gold_editor		tools/gold_editor
.env.example		.env.example
.env.local-ollama.example		.env.local-ollama.example
.env.remote-api.example		.env.remote-api.example
.env.remote-vllm.example		.env.remote-vllm.example
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
poetry.toml		poetry.toml
pyproject.toml		pyproject.toml
scheme.drawio.png		scheme.drawio.png
scheme.drawio.xml		scheme.drawio.xml
setup.sh		setup.sh
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dualify

Pipeline overview

Quick start

Record, replay, and resume LLM calls

Worked example: record, interrupt, resume

Drift detection

Benchmark input format (no JSON required)

Output artifacts

Development quality checks

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Dualify

Pipeline overview

Quick start

Record, replay, and resume LLM calls

Worked example: record, interrupt, resume

Drift detection

Benchmark input format (no JSON required)

Output artifacts

Development quality checks

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages