feat: test harness with LLM judge, replacing smartest framework #562
Draft
yanekyuk wants to merge 10 commits into indexnetwork:dev
…st to test-harness
Replace SMARTEST_VERIFIER_MODEL/SMARTEST_GENERATOR_MODEL with DATABASE_TEST_URL and TEST_JUDGE_MODEL.
…rface
- assertLLMEvaluate returns passing no-op result instead of throwing when OPENROUTER_API_KEY is not set (prevents CI failures)
- TestHarness.cache typed as Cache interface, not RedisCacheAdapter
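The no-op behavior described in this commit could look roughly like the guard below. This is a minimal sketch, not the PR's code: `SkipResult` and `skipWhenNoApiKey` are hypothetical names, and the environment is passed in as a plain object so the example stays self-contained.

```typescript
// Hypothetical result shape; the PR's real return type is not shown on this page.
interface SkipResult {
  passed: boolean;
  skipped: boolean;
  reason: string;
}

// Returns a passing no-op result when the key is absent, or undefined when
// the caller should go on and call the LLM judge.
function skipWhenNoApiKey(
  env: Record<string, string | undefined>,
  warn: (message: string) => void = console.warn,
): SkipResult | undefined {
  if (!env["OPENROUTER_API_KEY"]) {
    // Warn instead of throwing so CI without the secret does not fail.
    warn("OPENROUTER_API_KEY is not set; skipping LLM evaluation");
    return { passed: true, skipped: true, reason: "missing OPENROUTER_API_KEY" };
  }
  return undefined;
}
```

A passing no-op can hide a test that never really ran, which is presumably why a later commit recommends describe.skipIf for tests where the LLM judgment is the primary assertion.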
…p docs
- Add console.warn when criterion matching falls back to positional index, making unreliable LLM judge responses visible in test output
- Remove stale path comment from judge.prompt.ts
- Expand TSDoc on assertLLMEvaluate documenting the no-op skip behavior and recommending describe.skipIf for LLM-primary tests
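For readers unfamiliar with the positional fallback mentioned above, here is one way it might work. This is a sketch under assumptions: `JudgeScore`, `MatchedScore`, and `matchCriteria` are illustrative names, and matching by exact criterion name first is a guess at the primary strategy.

```typescript
// Hypothetical shapes for the judge's per-criterion output.
interface JudgeScore {
  name: string;
  score: number;
}

interface MatchedScore {
  name: string;
  score: number | undefined;
  matchedBy: "name" | "index" | "none";
}

// Match each expected criterion to a judge score by name; if the judge renamed
// or paraphrased it, fall back to the same list position and warn about it.
function matchCriteria(
  expected: readonly string[],
  returned: readonly JudgeScore[],
  warn: (message: string) => void = console.warn,
): MatchedScore[] {
  return expected.map((name, index): MatchedScore => {
    const byName = returned.find((r) => r.name === name);
    if (byName) {
      return { name, score: byName.score, matchedBy: "name" };
    }
    const byIndex = returned[index];
    if (byIndex) {
      warn(`criterion "${name}" not found by name; using position ${index} ("${byIndex.name}")`);
      return { name, score: byIndex.score, matchedBy: "index" };
    }
    return { name, score: undefined, matchedBy: "none" };
  });
}
```

The warning is what makes the fallback visible: a run that relied on positions still completes, but the test output shows that the judge response did not line up with the requested criteria.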
Summary
- protocol/src/lib/test-harness/ library that replaces the smartest framework with a leaner approach that feels like standard bun tests
- createTestHarness() factory wires real adapters (Drizzle DB, EmbedderAdapter, RedisCacheAdapter) against a test database with setup/reset/teardown lifecycle
- assertMatchesSchema(value, zodSchema) for deterministic Zod validation
- assertLLMEvaluate(value, config) for scored semantic criteria via LLM judge; supports per-criterion required flag, per-criterion min threshold, and overall minScore
- google/gemini-2.5-flash via OpenRouter with structured output (Zod response format)
- … OPENROUTER_API_KEY is missing (warns, doesn't fail)
New Features
- createTestHarness(): real adapter injection for integration tests
- assertMatchesSchema(): Zod schema validation assertion
- assertLLMEvaluate(): scored semantic criteria with LLM judge
- callJudge(): core LLM judge function with structured output
Refactors
- opportunity.evaluator.smartest.spec.ts → stress test block in opportunity.evaluator.spec.ts
- opportunity.graph.direct-connection.smartest.spec.ts → opportunity.graph.direct-connection.spec.ts
Documentation
- .env.example with DATABASE_TEST_URL and TEST_JUDGE_MODEL (replacing SMARTEST_VERIFIER_MODEL/SMARTEST_GENERATOR_MODEL)
Deferred
- … (lib/smartest/) stays until remaining 7 test files are migrated (follow-up)
- queue and graphs properties on TestHarness deferred until graph-level integration tests need them
Test plan
- bun test src/lib/test-harness/tests/judge.spec.ts: 2 tests (LLM scoring, missing API key)
- bun test src/lib/test-harness/tests/assertions.spec.ts: 4 tests (schema valid/invalid, LLM pass/fail)
- bun test src/lib/test-harness/tests/harness.spec.ts: 3 tests (db connection, embedder, reset)
- bun test src/lib/protocol/agents/tests/opportunity.evaluator.spec.ts: verify stress test block
- bun test src/lib/protocol/graphs/tests/opportunity.graph.direct-connection.spec.ts: verify migrated test
- tsc --noEmit passes
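
The Summary mentions a per-criterion required flag, a per-criterion min threshold, and an overall minScore for assertLLMEvaluate, but this page does not show how they combine. The sketch below is one plausible reading, not the PR's implementation. It assumes scores lie in the range 0 to 1, the overall score is the plain mean, a missing score counts as 0, and a criterion without its own min is held to minScore. `Criterion`, `Verdict`, and `aggregateScores` are hypothetical names.

```typescript
// Hypothetical config and result shapes (assumptions, not the PR's types).
interface Criterion {
  name: string;
  required?: boolean; // a required criterion below its threshold fails the assertion
  min?: number;       // per-criterion threshold; falls back to minScore when omitted
}

interface Verdict {
  passed: boolean;
  overall: number;
  failures: string[];
}

function aggregateScores(
  criteria: readonly Criterion[],
  scores: Readonly<Record<string, number | undefined>>,
  minScore: number,
): Verdict {
  const failures: string[] = [];
  let total = 0;
  for (const criterion of criteria) {
    const score = scores[criterion.name] ?? 0; // assumption: missing score counts as 0
    total += score;
    const threshold = criterion.min ?? minScore;
    // Only required criteria can fail the assertion on their own.
    if (criterion.required && score < threshold) {
      failures.push(`${criterion.name}: ${score} is below ${threshold}`);
    }
  }
  // Guard against dividing by zero when no criteria are given.
  const overall = criteria.length > 0 ? total / criteria.length : 0;
  if (overall < minScore) {
    failures.push(`overall: ${overall} is below ${minScore}`);
  }
  return { passed: failures.length === 0, overall, failures };
}
```

Under this reading, a non-required criterion can score poorly and only drag the mean down, while a required one acts as a hard gate even when the overall score clears minScore.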