feat(specs): SDK spec support matrix (genspecs pipeline + first judged dataset)#441

Draft
aaron-zeisler wants to merge 13 commits into main from feat/spec-support-matrix

Conversation

@aaron-zeisler
Contributor

Summary

Adds a new genspecs pipeline to sdk-meta that produces a high-level "does this SDK support this spec?" matrix across all 35 SDKs and the top-level specs in launchdarkly/sdk-specs. This is an experimental data product — input for an upcoming project that automates SDK work based on each SDK's spec coverage.

The pipeline consists of five composable subcommands plus orchestration:

| Stage | Command | Output |
| --- | --- | --- |
| 1 | `genspecs sync-repos` | clones missing SDK repos, ff-pulls existing ones |
| 2 | `genspecs catalog` | parses sdk-specs READMEs → `products/specs.json` |
| 3 | `genspecs harness` | parses sdk-test-harness capabilities + per-SDK `testharness-suppressions*.txt` → `products/harness_signals.json` |
| 4 | `genspecs judge` | applies-to filter + LLM judge per applicable cell → `products/spec_support.json` |
| 5 | `genspecs html` | renders `_site/spec-support.html` (matrix heatmap) and `_site/spec-support-by-sdk.html` (per-SDK detail) |

End-to-end via make spec-support or scripts/generate-spec-support.sh.

LLM judge

  • Providers: AWS Bedrock (AWS_BEARER_TOKEN_BEDROCK, default), direct Anthropic (ANTHROPIC_API_KEY), or noop for placeholder runs.
  • Caching: prompt input pack is SHA-256 hashed into judged_against; re-runs skip cells whose inputs haven't changed.
  • Reliability: transient failures (connection reset, 408/429/5xx) are retried with exponential backoff + jitter, honoring Retry-After. Auth errors (401/403) and context cancellation are not retried. Unit tests in judge_retry_test.go.

First judged dataset (products/spec_support.json)

Generated against:

  • sdk-specs@1b103d72
  • sdk-test-harness@c457afb8
  • us.anthropic.claude-sonnet-4-5-20250929-v1:0 (via Bedrock, us-east-2, Development account)
  • prompt_version: v1

State distribution across all (SDK × spec) cells:

| State | Count |
| --- | --- |
| supported | 361 |
| partial | 70 |
| not-supported | 334 |
| not-applicable | 460 |

Schemas

  • New: schemas/specs.json, schemas/harness_signals.json, schemas/spec_support.json. All wired into scripts/ci/check-json-schemas.sh.
  • specs.json#/specs/status is intentionally permissive — real-world spec metadata uses APPROVED, CURRENT, and versioned values like v1:DRAFT.
  • spec_support.json#/$defs/Evidence/kind is a free-form string with documented canonical values. The LLM emits descriptive kinds beyond the canonical set (spec_metadata, sdk_metadata, harness_participation, sdk_features); capturing them verbatim is more useful than rejecting otherwise-valid judgments.

Other touches

  • Updates tool/cmd/genhtml/templates/by-{feature,sdk}.html nav tabs to link to the new spec-support pages.
  • Adds tool/specs/.judge-cache/ and the compiled genspecs binary to .gitignore.

How to verify

make spec-html              # rerender HTML from existing JSON
open _site/spec-support.html

# Full pipeline (requires AWS_BEARER_TOKEN_BEDROCK + AWS_REGION=us-east-2):
make spec-support

# Schema validation:
bash scripts/ci/check-json-schemas.sh

# Retry unit tests:
cd tool && go test ./cmd/genspecs/...

Test plan

  • make spec-html renders both pages and they look reasonable
  • bash scripts/ci/check-json-schemas.sh reports all products valid
  • cd tool && go test ./cmd/genspecs/... passes (retry logic)
  • Spot-check a handful of partial and not-supported cells in products/spec_support.json against what you know — does the rationale + evidence pass the smell test?
  • Spot-check the matrix view: do the obvious cases (e.g. server-side specs marked not-applicable on client SDKs) line up?

Notes for reviewers

  • This is experimental: the goal was a repeatable pipeline + a credible first dataset, not a polished production artifact. Expect some judgments to be wrong; the rationale + evidence + cache let us iterate on the prompt without redoing work.
  • The judge cache lives at tool/specs/.judge-cache/ (gitignored). Deleting it forces a full re-judgment.
  • For Bedrock setup specifics (Development account, us-east-2, 12-hour token TTL), see bedrockJudge's doc comment in tool/cmd/genspecs/judge.go.

via LD Research 🤖

Made with Cursor

aaron-zeisler and others added 13 commits May 11, 2026 15:17
…son, spec_support.json)

Adds genspecs, a Go tool that classifies how well each LaunchDarkly SDK
supports each top-level spec from launchdarkly/sdk-specs.

Subcommands:
  - sync-repos: clone any missing SDK repo (plus sdk-specs and
    sdk-test-harness) and fast-forward existing checkouts.
  - catalog: walk sdk-specs and emit products/specs.json (id, status,
    applies-to, requirement_count, versions, sub-specs).
  - harness: extract Capability* constants and top-level test groups from
    sdk-test-harness, plus per-SDK testharness-suppressions* files (with
    inline comments preserved), into products/harness_signals.json.
  - judge: apply a deterministic applies-to filter, then for every
    remaining (sdk, spec) cell call an LLM (Anthropic or noop) with the
    spec README, the SDK metadata, the SDK's features, the harness
    signals, and a depth-limited repo listing. Output goes to
    products/spec_support.json. Caches by SHA-256 of the prompt input
    pack so re-runs only hit the LLM for cells whose inputs changed.
  - html: render _site/spec-support.html (filterable matrix) and
    _site/spec-support-by-sdk.html (per-SDK detail with rationale).
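The SHA-256 cache-keying scheme the judge subcommand uses can be sketched like this; the struct fields and function name are illustrative stand-ins, not the actual genspecs types:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// inputPack is a hypothetical stand-in for the prompt inputs the judge
// hashes per (sdk, spec) cell; the real struct may differ.
type inputPack struct {
	SpecReadme    string `json:"spec_readme"`
	SDKMetadata   string `json:"sdk_metadata"`
	HarnessHints  string `json:"harness_hints"`
	PromptVersion string `json:"prompt_version"`
}

// cacheKey derives a stable key from the pack: any change to any input
// (including the prompt version) produces a new key, so re-runs only
// hit the LLM for cells whose inputs changed.
func cacheKey(p inputPack) string {
	b, _ := json.Marshal(p) // field order is fixed for a flat struct
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}
```

Bumping `prompt_version` is then the natural way to force a full re-judgment without deleting the cache directory.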

Also wires the three new schemas into scripts/ci/check-json-schemas.sh
and adds Makefile targets (spec-sync-repos, spec-catalog, spec-harness,
spec-judge, spec-html, spec-support) plus scripts/generate-spec-support.sh
that runs the whole pipeline end-to-end.

The committed products/spec_support.json was generated with --provider=noop
so it's a placeholder where every applicable cell is "unknown". Re-run
`make spec-judge` (or scripts/generate-spec-support.sh) with
ANTHROPIC_API_KEY set to populate it with real classifications.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ne script

`go run` (and `go -C dir run`) inherits cwd from the shell, so when the
Makefile targets ran `cd tool && go run ./cmd/genspecs ...` the binary
saw `tool/` as cwd, and its default `products/specs.json` flag value
resolved to the non-existent `tool/products/specs.json`.

Fixes both invocation sites by:
- Adding explicit ../-prefixed paths to every input/output flag in the
  spec-* Makefile targets (matching the convention already used by the
  existing `html` target).
- Doing the same in scripts/generate-spec-support.sh, wrapping each
  go-run in a (cd tool && ...) subshell so the script's cwd stays at
  the repo root.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
LaunchDarkly accesses Anthropic models through AWS Bedrock, not the
direct Anthropic API. Adds a third judge provider, 'bedrock', that
authenticates with the short-term bearer token AWS exposes via
"Bedrock console -> API Keys -> Generate short-term API keys" (12-hour
expiry).

Provider selection is automatic based on env:
  AWS_BEARER_TOKEN_BEDROCK -> bedrock  (preferred — LD's setup)
  ANTHROPIC_API_KEY        -> anthropic
  (neither set)            -> noop

The Bedrock branch posts the same Anthropic Messages API body that the
direct branch sends, with two adjustments per Anthropic's Bedrock docs:
  - anthropic_version is the literal "bedrock-2023-05-31"
  - the model id is in the URL path, not the body

Default model on bedrock is us.anthropic.claude-sonnet-4-5-20250929-v1:0
(cross-region inference profile, same model as the Anthropic-direct
default for easy comparison). AWS_REGION overrides the endpoint region;
defaults to us-east-1.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…gion in startup log

When the bedrock judge fails with a generic 'Authentication failed' or
'CallWithBearerToken' AccessDeniedException, the cause is almost always
that the bearer token was generated in the wrong AWS account or for a
different region than AWS_REGION points to. Two small changes to make
that obvious:

1. The judge startup log line now reads:
     "Judging N cells with provider=bedrock region=us-east-2 model=..."
   so any region mismatch is visible without flipping on debug output.

2. The bedrockJudge doc comment now spells out the LD-specific facts
   confirmed in #proj-building-with-ai and during a 2026-05-12 debug:
   - Generate from the Development account, not SDK (PowerUser there
     lacks bedrock:CallWithBearerToken).
   - Tokens are scoped to the account+region they were issued in.
     us-east-2 is known-good in Development; AWS_REGION must match.
   - 12-hour TTL.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ial backoff

The judge had no retry logic, so a single TCP "connection reset by peer"
(seen in the wild when laptops sleep / Wi-Fi roams mid-run) would mark
the cell as judge_failed and skip it for the rest of the batch.

Adds a retryingJudge wrapper around the bedrock and anthropic providers:
  - Up to 4 attempts with 750ms * 2^(attempt-1) backoff (capped at 16s)
    plus ±25% jitter
  - Honors server-supplied Retry-After (mainly useful for 429s)
  - Retries on: net timeouts, EOF, "connection reset", "connection
    refused", "broken pipe", "i/o timeout", "no such host", and HTTP
    408/429/5xx
  - Does NOT retry on: context cancellation, 4xx (auth/config errors
    that won't resolve themselves), or non-retryable errors

Both providers now return a typed *retryableHTTPError instead of an
opaque fmt.Errorf string so the wrapper can inspect status + headers
via errors.As.

Includes unit tests for the classifier and the wrapper itself
(transient recovery, max-attempts cap, non-retryable short-circuit,
context cancellation).

This is purely additive: no behavior change on the success path; cache
keys are unaffected.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…-sonnet-4.5)

First full pass of the genspecs judge across all 35 SDKs against the
top-level spec catalog. Counts: 361 supported, 70 partial, 334
not-supported, 460 not-applicable.

Also relaxes schemas/spec_support.json#/$defs/Evidence/kind from a
fixed enum to a free-form string. The LLM emits descriptive kinds
beyond our recommended canonical set (e.g. spec_metadata,
sdk_metadata, harness_participation, sdk_features). Capturing them
verbatim is more valuable than rejecting otherwise-valid judgments;
the canonical values are now documented in the schema description.

Generated with:
  specs_commit:   1b103d72 (sdk-specs)
  harness_commit: c457afb8 (sdk-test-harness)
  model:          us.anthropic.claude-sonnet-4-5-20250929-v1:0
  prompt_version: v1

Co-authored-by: Cursor <cursoragent@cursor.com>
…ionale/notes

Drops per-cell `source`, `evidence`, `judged_at`, and `judged_against`.
Audit metadata (judge model, commit hashes, evidence list) is still
preserved on disk in the judge cache at tool/specs/.judge-cache/, so
nothing is permanently lost — but the public product is now ~4x
smaller and easier to consume.

Before: 30,282 lines, 2.0M
After:  7,429 lines, 548K

Touched:
- schemas/spec_support.json: dropped fields + the Evidence \$def
- tool/cmd/genspecs/types.go: slim Cell, drop JudgedAgainst/Source*
- tool/cmd/genspecs/judge.go: stop writing the dropped fields; cache
  hits still work because gob/json silently ignore extra fields in
  previously-cached cells
- spec-support.html / spec-support-by-sdk.html: drop the source pill
  and the evidence list; render notes_for_human instead
- .gitignore: ignore the stray tool/genspecs build artifact
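The cache-compatibility claim above relies on encoding/json's default behavior of discarding unknown keys; a minimal sketch, with hypothetical field names standing in for the slimmed Cell type:

```go
package main

import "encoding/json"

// slimCell mirrors the trimmed shape: older cache entries still carry
// the dropped fields (evidence, judged_at, ...), but json.Unmarshal
// silently ignores keys with no matching struct field, so old cache
// hits decode cleanly into the new type. Field names are illustrative.
type slimCell struct {
	State     string `json:"state"`
	Rationale string `json:"rationale"`
}

func decodeCached(raw []byte) (slimCell, error) {
	var c slimCell
	err := json.Unmarshal(raw, &c)
	return c, err
}
```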

The LLM judge prompt is unchanged — we still ask the model to cite
evidence, because chain-of-thought ("show your work") tends to
produce better answers. We just no longer persist the citations.

Co-authored-by: Cursor <cursoragent@cursor.com>
Companion to spec_support.json: where spec_support is one row per
(sdk, spec) at rollup granularity, this is one block per (sdk, spec)
with a per-requirement breakdown plus a rollup that the consumer
(Spectre) overwrites the high-level row with.

This first iteration covers a single pair — go-server-sdk x PLUGIN —
generated by a manual retroactive check during the hackathon
(judge: "human (azeisler + assistant)"). The 17 requirements come
from the post-renumbering PLUGIN spec (depends on
launchdarkly/sdk-specs#167, which renumbers the duplicate `1.2.3`
heading to `1.2.10`).

Block structure per (sdk, spec):

- Provenance: generated_at, judge, prompt_version, spec_path,
  spec_sha, spec_renumber_pr, sdk_repo, sdk_repo_commit, sdk_branch.
- evidence_sources_considered: every file/dir we looked at.
- rollup: { applies, veto_reason, state, complexity, rationale,
  supported_since, supported_since_date, supported_since_evidence,
  counts }. The state here ("partial") is what Spectre promotes into
  sdk_spec_support.
- requirements[]: per-requirement entries with id, severity, state,
  evidence types, rationale, findings, notes_for_human.

Schema ergonomics:

- spec_version is an empty string ("") rather than "v1" so the
  primary key matches the (sdk_name, spec_id, spec_version, kind)
  convention already used by spec_support.json — avoids a join /
  alias dance on the Spectre side.
- evidence_sources_considered, supported_since_evidence, and the
  per-requirement findings live here so a downstream Spectre
  workflow can pick them up as an artifact when it re-judges the
  pair, but they intentionally do NOT propagate into the durable
  Spectre tables (those are re-derived per workflow run).

Headline finding for go-server-sdk x PLUGIN: "partial" rollup, 10
full / 1 partial / 2 not_supported / 4 not_applicable. The two MUST-
severity gaps are 1.1.5 (no onPluginsReady method on the Plugin
interface) and 1.2.6 (no registration-complete callback dispatch).

via LD Research 🤖

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds 21 spec entries to the go-server-sdk × spec matrix on top of the
existing PLUGIN entry, covering every spec in sdk-specs that genspecs
classified as applicable to a server SDK plus the four previously-
unclassified specs (AIDOC, FDV2PL, FDV2REL, CSFDV2). Total: 99
per-requirement rows and 22 rollups.

Breakdown:
- 8 not-applicable rollups (AUTOCONFIG, CSFDV2, CSI, CSSE, EM, RKSTM,
  RPENDPOINTS, SPEC) with veto rationales tying back to the spec's
  own applies-to list.
- 8 brief rollup-only entries (AIDOC + AISDK + AUTOENVATTR + CODES +
  EXAM + FDV2PL + FDV2REL + STACK) where the per-requirement walk is
  deferred but the high-level state is grounded in code/repo evidence.
- 5 deep dives with per-requirement findings:
    * ARCO       (rollup-only, supported)
    * ATREF      (4/4 full,        supported, since v6.0.0)
    * BIGSEG     (65 full + 3 partial,  supported, since v5.5.0)
    * CLM        (7 partial + 2 not_supp + 1 NA, partial, since v2.0.0)
    * CONTEXT    (rollup-only, supported)

Each finding cites the implementing files plus the dependency SHA the
behavior actually lives in (go-sdk-common@v3.5.0/3727dba,
go-server-sdk-evaluation@v3.4.0/v3.4.1, go-sdk-events@v3.6.0).

Generator: research/artifacts/spec-analysis/build_json.py.
Spectre seeding: separate alembic migration g7b8c9d0e1f2 in spectre.

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the 17 batch-2 spec entries for go-server-sdk to
products/spec_requirement_support.json, completing the v7-spec scan.

Roster (17): CSPE, DATASYSTEM, DIAG, ENVFILTER, EVENTS, FLGDM, FLGEA,
FLGERM, FLGMES, HOOK, MIGRATIONS, OTEL, PS, RELEASE, SCMP, TDS, TXNS.

Each entry follows the same shape as PLUGIN/batch1 — per-requirement
state, severity, rationale, code findings (file:line + kind), and a
rollup. Four prose-based specs (FLGDM/FLGEA/FLGERM/FLGMES) carry
empty requirements[] arrays as rollup-only entries because the spec
READMEs lack numbered requirements.

Notable findings (corrections to bulk-seed bedrock-claude-sonnet-4-5
LLM judgments after deep dive):
- SCMP not_supported (no X-LaunchDarkly-InstanceID or
  X-LaunchDarkly-PollingIntervalMs polling header).
- ENVFILTER partial (filter-key regex validation missing).
- DATASYSTEM partial (Initializer.Fetch doesn't surface
  X-LD-FD-Fallback; Basis lacks RevertToFDv1; case-sensitive
  comparison).
- RELEASE partial (no first-party Hello App; FDv2 surfaces unstable).
- DIAG.1.6.3.1 partial (spec README marks samplingInterval as TODO).

Generated by artifacts/spec-analysis/build_json.py in launchdarkly/research.

Spectre seed PR: launchdarkly/spectre#42 (depends on #41 which depends on #34).

via LD Research

Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces three previously rollup-only spec rows for go-server-sdk with
full per-requirement deep-dive analyses, matching the SCMP-style
treatment requested for the remaining "awaiting deep-dive" backlog.

Per-requirement coverage added (39 new requirement entries):

  AISDK    →  2 reqs   (1.2.1, 1.2.2 — both `full`; rollup remains
                        `partial` because §1.1's six listed sub-spec
                        components — AICONF, AITRACK, AIGRAPH, AIRUNNER,
                        AIGRAPHTRACK, AIEVALS — aren't all implemented
                        in `ldai/`, and the package itself is pre-1.0)
  FDV2PL   → 16 reqs   (11 full / 1 partial / 2 not_supported / 1 N/A
                        / 1 unknown). Surfaces two real bugs in the
                        FDv2 streaming source: §3.3.5 and §3.3.6 log
                        `goodbye` and `error` events at error level
                        with non-spec text, when the spec mandates
                        info level with prescribed text. Trivial fix.
                        Also flags §3.4.1 as `partial` (single-payload
                        struct shape isn't strictly future-proof for
                        multi-payload server-intents) and §3.3.4 as
                        `unknown` because the spec README has an empty
                        Requirement 3.3.4 heading.
  FDV2REL  → 21 reqs   ALL `not_applicable`. Every requirement is
                        phrased "the relay proxy MUST..." and binds
                        to the Relay Proxy implementation, not the
                        SDK. Bulk LLM judge had this as `partial`;
                        deep-dive corrects rollup to `not_applicable`.
                        Recommend narrowing applies-to in the spec.

This closes out the three "awaiting deep-dive" specs — go-server-sdk
now has full per-requirement coverage for every spec where it makes
sense to have it (20 deep-dive entries, 4 prose-only rollups, 7 N/A).

via LD Research 🤖

Co-Authored-By: claude-opus-4.7 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Audit (rollup state vs per-requirement counts) surfaced five (sdk, spec)
rows where the rollup didn't match the data underneath it:

  AISDK   partial  -> supported  (2/2 numbered reqs are full)
  BIGSEG  supported -> partial   (3 partial reqs)
  DIAG    supported -> partial   (2 partial reqs)
  OTEL    supported -> partial   (1 partial req)
  TDS     supported -> partial   (3 not_supported MAY-tagged reqs)

Per-requirement entries are unchanged; only rollup.state and the
rationale text are touched. The rule applied: if any req is
not_supported or partial, rollup is at most partial; if all reqs are
full, rollup is supported. Confirmed against TXNS (4/4 full) which
already correctly rolled up to supported.
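The consistency rule stated above translates directly into code; a minimal sketch (state strings and the function name are illustrative, and a fuller version would likely distinguish an all-not_supported case from partial):

```go
package main

// rollupState applies the audit rule: if any requirement is
// not_supported or partial, the rollup is at most partial; if all
// requirements are full, the rollup is supported. Rows where every
// requirement is not_applicable roll up to not_applicable.
func rollupState(reqStates []string) string {
	anyGap, anyFull := false, false
	for _, s := range reqStates {
		switch s {
		case "not_supported", "partial":
			anyGap = true
		case "full":
			anyFull = true
		}
	}
	switch {
	case anyGap:
		return "partial"
	case anyFull:
		return "supported"
	default:
		return "not_applicable"
	}
}
```

Applied to the rows above: AISDK (2/2 full) yields supported, TDS (3 not_supported) yields partial, and TXNS (4/4 full) stays supported, matching the corrections in this commit.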

Companion change in spectre adds j0e1f2g3h4i5_fix_go_sdk_rollup_
consistency.py to apply these UPDATEs to sdk_spec_support.

via LD Research 🤖

Co-Authored-By: claude-opus-4.7 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
AIDOC was added to spec_requirement_support.json in batch 1 as a brief
not_supported block, but the spec only exists on a local sdk-specs
branch — `git ls-tree -r origin/main | grep AIDOC` returns nothing.
Tracking specs that aren't upstream pollutes the matrix and creates
false debt for the SDK ("28 unsatisfied requirements" against a spec
no SDK could be expected to satisfy because it doesn't exist yet).

Removes the AIDOC NEW_BLOCKS assignment in build_json.py (along with
its entry in the counts-zeroing loop), adds a defensive REMOVED_SPECS
sweep in main() so any pre-existing AIDOC entry in the JSON is dropped
on regeneration, and updates an incidental mention in AISDK's evidence
sources to reflect that AISDK is the only AI-family spec on main.

Companion migration in spectre PR #43 deletes the corresponding row
from sdk_spec_support.

go-server-sdk now has 38 spec entries (was 39).

Co-authored-by: Cursor <cursoragent@cursor.com>