feat(specs): SDK spec support matrix (genspecs pipeline + first judged dataset) #441
Draft
aaron-zeisler wants to merge 13 commits into
…son, spec_support.json)
Adds genspecs, a Go tool that classifies how well each LaunchDarkly SDK
supports each top-level spec from launchdarkly/sdk-specs.
Subcommands:
- sync-repos: clone any missing SDK repo (plus sdk-specs and
sdk-test-harness) and fast-forward existing checkouts.
- catalog: walk sdk-specs and emit products/specs.json (id, status,
applies-to, requirement_count, versions, sub-specs).
- harness: extract Capability* constants and top-level test groups from
sdk-test-harness, plus per-SDK testharness-suppressions* files (with
inline comments preserved), into products/harness_signals.json.
- judge: apply a deterministic applies-to filter, then for every
remaining (sdk, spec) cell call an LLM (Anthropic or noop) with the
spec README, the SDK metadata, the SDK's features, the harness
signals, and a depth-limited repo listing. Output goes to
products/spec_support.json. Caches by SHA-256 of the prompt input
pack so re-runs only hit the LLM for cells whose inputs changed.
- html: render _site/spec-support.html (filterable matrix) and
_site/spec-support-by-sdk.html (per-SDK detail with rationale).
Also wires the three new schemas into scripts/ci/check-json-schemas.sh
and adds Makefile targets (spec-sync-repos, spec-catalog, spec-harness,
spec-judge, spec-html, spec-support) plus scripts/generate-spec-support.sh
that runs the whole pipeline end-to-end.
The committed products/spec_support.json was generated with --provider=noop
so it's a placeholder where every applicable cell is "unknown". Re-run
`make spec-judge` (or scripts/generate-spec-support.sh) with
ANTHROPIC_API_KEY set to populate it with real classifications.
Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
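The judge step above is described as caching by the SHA-256 of the prompt input pack. A minimal sketch of that idea follows; the `inputPack` struct, its fields, and the `cacheKey` helper are hypothetical names chosen for illustration and are not taken from genspecs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// inputPack is a hypothetical stand-in for everything sent to the judge
// for one (sdk, spec) cell. The real field set is not shown in this PR.
type inputPack struct {
	SDK           string   `json:"sdk"`
	Spec          string   `json:"spec"`
	SpecReadme    string   `json:"spec_readme"`
	RepoListing   []string `json:"repo_listing"`
	PromptVersion string   `json:"prompt_version"`
}

// cacheKey returns the hex-encoded SHA-256 of the pack's JSON encoding.
// encoding/json writes struct fields in declaration order, so equal packs
// always yield equal keys.
func cacheKey(p inputPack) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := inputPack{SDK: "go-server-sdk", Spec: "PLUGIN", PromptVersion: "v1"}
	b := a
	b.SpecReadme = "changed" // any input change should produce a new key
	ka, _ := cacheKey(a)
	kb, _ := cacheKey(b)
	fmt.Println(ka == kb) // false
}
```

Keying on the full input pack, rather than on the (sdk, spec) pair alone, is what lets a re-run skip unchanged cells while still re-judging a cell whose spec README or repo listing moved.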
…ne script
`go run` (and `go -C dir run`) inherits cwd from the shell, so when the
Makefile targets ran `cd tool && go run ./cmd/genspecs ...` the binary saw
`tool/` as cwd, and its default `products/specs.json` flag value resolved to
the non-existent `tool/products/specs.json`.
Fixes both invocation sites by:
- Adding explicit ../-prefixed paths to every input/output flag in the
spec-* Makefile targets (matching the convention already used by the
existing `html` target).
- Doing the same in scripts/generate-spec-support.sh, wrapping each go-run
in a (cd tool && ...) subshell so the script's cwd stays at the repo root.
Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
LaunchDarkly accesses Anthropic models through AWS Bedrock, not the direct
Anthropic API. Adds a third judge provider, 'bedrock', that authenticates
with the short-term bearer token AWS exposes via "Bedrock console -> API
Keys -> Generate short-term API keys" (12-hour expiry).
Provider selection is automatic based on env:
AWS_BEARER_TOKEN_BEDROCK -> bedrock (preferred; LD's setup)
ANTHROPIC_API_KEY -> anthropic
(neither set) -> noop
The Bedrock branch posts the same Anthropic Messages API body that the
direct branch sends, with two adjustments per Anthropic's Bedrock docs:
- anthropic_version is the literal "bedrock-2023-05-31"
- the model id is in the URL path, not the body
Default model on bedrock is us.anthropic.claude-sonnet-4-5-20250929-v1:0
(cross-region inference profile, same model as the Anthropic-direct default
for easy comparison). AWS_REGION overrides the endpoint region; defaults to
us-east-1.
Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
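The env-based provider selection and the region default described in this commit can be sketched as follows. The function names `pickProvider` and `bedrockRegion` are hypothetical; only the environment variable names, the precedence order, and the `us-east-1` default come from the commit message.

```go
package main

import "fmt"

// pickProvider mirrors the precedence described in the commit message:
// Bedrock token first, then the direct Anthropic key, otherwise noop.
// getenv is injected so the logic can be exercised without touching the
// process environment; in real use it would be os.Getenv.
func pickProvider(getenv func(string) string) string {
	switch {
	case getenv("AWS_BEARER_TOKEN_BEDROCK") != "":
		return "bedrock"
	case getenv("ANTHROPIC_API_KEY") != "":
		return "anthropic"
	default:
		return "noop"
	}
}

// bedrockRegion returns AWS_REGION when set and us-east-1 otherwise.
func bedrockRegion(getenv func(string) string) string {
	if r := getenv("AWS_REGION"); r != "" {
		return r
	}
	return "us-east-1"
}

func main() {
	// Example values only; both keys set, so bedrock wins.
	env := map[string]string{
		"AWS_BEARER_TOKEN_BEDROCK": "example-token",
		"ANTHROPIC_API_KEY":        "example-key",
	}
	get := func(k string) string { return env[k] }
	fmt.Println(pickProvider(get), bedrockRegion(get))
}
```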
…gion in startup log
When the bedrock judge fails with a generic 'Authentication failed' or
'CallWithBearerToken' AccessDeniedException, the cause is almost always
that the bearer token was generated in the wrong AWS account or for a
different region than AWS_REGION points to. Two small changes to make
that obvious:
1. The judge startup log line now reads:
"Judging N cells with provider=bedrock region=us-east-2 model=..."
so any region mismatch is visible without flipping on debug output.
2. The bedrockJudge doc comment now spells out the LD-specific facts
confirmed in #proj-building-with-ai and during a 2026-05-12 debug:
- Generate from the Development account, not SDK (PowerUser there
lacks bedrock:CallWithBearerToken).
- Tokens are scoped to the account+region they were issued in.
us-east-2 is known-good in Development; AWS_REGION must match.
- 12-hour TTL.
Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ial backoff
The judge had no retry logic, so a single TCP "connection reset by peer"
(seen in the wild when laptops sleep / Wi-Fi roams mid-run) would mark
the cell as judge_failed and skip it for the rest of the batch.
Adds a retryingJudge wrapper around the bedrock and anthropic providers:
- Up to 4 attempts with 750ms * 2^(attempt-1) backoff (capped at 16s)
plus ±25% jitter
- Honors server-supplied Retry-After (mainly useful for 429s)
- Retries on: net timeouts, EOF, "connection reset", "connection
refused", "broken pipe", "i/o timeout", "no such host", and HTTP
408/429/5xx
- Does NOT retry on: context cancellation, 4xx (auth/config errors
that won't resolve themselves), or non-retryable errors
Both providers now return a typed *retryableHTTPError instead of an
opaque fmt.Errorf string so the wrapper can inspect status + headers
via errors.As.
Includes unit tests for the classifier and the wrapper itself
(transient recovery, max-attempts cap, non-retryable short-circuit,
context cancellation).
This is purely additive: no behavior change on the success path; cache
keys are unaffected.
Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…-sonnet-4.5)
First full pass of the genspecs judge across all 35 SDKs against the
top-level spec catalog. Counts: 361 supported, 70 partial, 334
not-supported, 460 not-applicable.
Also relaxes schemas/spec_support.json#/$defs/Evidence/kind from a fixed
enum to a free-form string. The LLM emits descriptive kinds beyond our
recommended canonical set (e.g. spec_metadata, sdk_metadata,
harness_participation, sdk_features). Capturing them verbatim is more
valuable than rejecting otherwise-valid judgments; the canonical values are
now documented in the schema description.
Generated with:
specs_commit: 1b103d72 (sdk-specs)
harness_commit: c457afb8 (sdk-test-harness)
model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
prompt_version: v1
Co-authored-by: Cursor <cursoragent@cursor.com>
…ionale/notes
Drops per-cell `source`, `evidence`, `judged_at`, and `judged_against`.
Audit metadata (judge model, commit hashes, evidence list) is still
preserved on disk in the judge cache at tool/specs/.judge-cache/, so
nothing is permanently lost — but the public product is now ~4x
smaller and easier to consume.
Before: 30,282 lines, 2.0M
After: 7,429 lines, 548K
Touched:
- schemas/spec_support.json: dropped fields + the Evidence \$def
- tool/cmd/genspecs/types.go: slim Cell, drop JudgedAgainst/Source*
- tool/cmd/genspecs/judge.go: stop writing the dropped fields; cache
hits still work because gob/json silently ignore extra fields in
previously-cached cells
- spec-support.html / spec-support-by-sdk.html: drop the source pill
and the evidence list; render notes_for_human instead
- .gitignore: ignore the stray tool/genspecs build artifact
The LLM judge prompt is unchanged — we still ask the model to cite
evidence, because chain-of-thought ("show your work") tends to
produce better answers. We just no longer persist the citations.
Co-authored-by: Cursor <cursoragent@cursor.com>
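The claim that cache hits keep working after the fields were dropped rests on `encoding/json` ignoring object keys that have no matching struct field. A small sketch of that behavior, using a hypothetical `slimCell` struct and hypothetical JSON key names:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// slimCell is a hypothetical reduced cell that keeps only the state and
// rationale. The real Cell type in types.go is not shown in this PR.
type slimCell struct {
	State     string `json:"state"`
	Rationale string `json:"rationale"`
}

// decodeCell unmarshals a cached cell. By default json.Unmarshal skips
// keys with no matching field, so older entries that still carry
// "source" or "evidence" decode without error.
func decodeCell(data []byte) (slimCell, error) {
	var c slimCell
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	old := []byte(`{"state":"partial","rationale":"r","source":"llm","evidence":[{"kind":"sdk_features"}]}`)
	c, err := decodeCell(old)
	fmt.Println(c.State, err)
}
```

This leniency would disappear if the decoder were configured with `Decoder.DisallowUnknownFields`, so the cache-compatibility argument assumes the default decoding mode.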
Companion to spec_support.json: where spec_support is one row per (sdk, spec)
at rollup granularity, this is one block per (sdk, spec) with a
per-requirement breakdown plus a rollup that the consumer (Spectre)
overwrites the high-level row with.
This first iteration covers a single pair, go-server-sdk x PLUGIN, generated
by a manual retroactive check during the hackathon (judge: "human (azeisler +
assistant)"). The 17 requirements come from the post-renumbering PLUGIN spec
(depends on launchdarkly/sdk-specs#167, which renumbers the duplicate `1.2.3`
heading to `1.2.10`).
Block structure per (sdk, spec):
- Provenance: generated_at, judge, prompt_version, spec_path, spec_sha,
spec_renumber_pr, sdk_repo, sdk_repo_commit, sdk_branch.
- evidence_sources_considered: every file/dir we looked at.
- rollup: { applies, veto_reason, state, complexity, rationale,
supported_since, supported_since_date, supported_since_evidence, counts }.
The state here ("partial") is what Spectre promotes into sdk_spec_support.
- requirements[]: per-requirement entries with id, severity, state, evidence
types, rationale, findings, notes_for_human.
Schema ergonomics:
- spec_version is an empty string ("") rather than "v1" so the primary key
matches the (sdk_name, spec_id, spec_version, kind) convention already used
by spec_support.json, which avoids a join / alias dance on the Spectre side.
- evidence_sources_considered, supported_since_evidence, and the
per-requirement findings live here so a downstream Spectre workflow can pick
them up as an artifact when it re-judges the pair, but they intentionally do
NOT propagate into the durable Spectre tables (those are re-derived per
workflow run).
Headline finding for go-server-sdk x PLUGIN: "partial" rollup, 10 full / 1
partial / 2 not_supported / 4 not_applicable. The two MUST-severity gaps are
1.1.5 (no onPluginsReady method on the Plugin interface) and 1.2.6 (no
registration-complete callback dispatch).
via LD Research 🤖
Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds 21 spec entries to the go-server-sdk × spec matrix on top of the
existing PLUGIN entry, covering every spec in sdk-specs that genspecs
classified as applicable to a server SDK plus the four previously-
unclassified specs (AIDOC, FDV2PL, FDV2REL, CSFDV2). Total: 99
per-requirement rows and 22 rollups.
Breakdown:
- 8 not-applicable rollups (AUTOCONFIG, CSFDV2, CSI, CSSE, EM, RKSTM,
RPENDPOINTS, SPEC) with veto rationales tying back to the spec's
own applies-to list.
- 8 brief rollup-only entries (AIDOC + AISDK + AUTOENVATTR + CODES +
EXAM + FDV2PL + FDV2REL + STACK) where the per-requirement walk is
deferred but the high-level state is grounded in code/repo evidence.
- 5 deep dives with per-requirement findings:
* ARCO (rollup-only, supported)
* ATREF (4/4 full, supported, since v6.0.0)
* BIGSEG (65 full + 3 partial, supported, since v5.5.0)
* CLM (7 partial + 2 not_supp + 1 NA, partial, since v2.0.0)
* CONTEXT (rollup-only, supported)
Each finding cites the implementing files plus the dependency SHA the
behavior actually lives in (go-sdk-common@v3.5.0/3727dba,
go-server-sdk-evaluation@v3.4.0/v3.4.1, go-sdk-events@v3.6.0).
Generator: research/artifacts/spec-analysis/build_json.py.
Spectre seeding: separate alembic migration g7b8c9d0e1f2 in spectre.
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the 17 batch-2 spec entries for go-server-sdk to
products/spec_requirement_support.json, completing the v7-spec scan.
Roster (17): CSPE, DATASYSTEM, DIAG, ENVFILTER, EVENTS, FLGDM, FLGEA, FLGERM,
FLGMES, HOOK, MIGRATIONS, OTEL, PS, RELEASE, SCMP, TDS, TXNS.
Each entry follows the same shape as PLUGIN/batch1: per-requirement state,
severity, rationale, code findings (file:line + kind), and a rollup. Four
prose-based specs (FLGDM/FLGEA/FLGERM/FLGMES) carry empty requirements[]
arrays as rollup-only entries because the spec READMEs lack numbered
requirements.
Notable findings (corrections to bulk-seed bedrock-claude-sonnet-4-5 LLM
judgments after deep dive):
- SCMP not_supported (no X-LaunchDarkly-InstanceID or
X-LaunchDarkly-PollingIntervalMs polling header).
- ENVFILTER partial (filter-key regex validation missing).
- DATASYSTEM partial (Initializer.Fetch doesn't surface X-LD-FD-Fallback;
Basis lacks RevertToFDv1; case-sensitive comparison).
- RELEASE partial (no first-party Hello App; FDv2 surfaces unstable).
- DIAG.1.6.3.1 partial (spec README marks samplingInterval as TODO).
Generated by artifacts/spec-analysis/build_json.py in launchdarkly/research.
Spectre seed PR: launchdarkly/spectre#42 (depends on #41 which depends on
#34).
via LD Research
Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces three previously rollup-only spec rows for go-server-sdk with
full per-requirement deep-dive analyses, matching the SCMP-style
treatment requested for the remaining "awaiting deep-dive" backlog.
Per-requirement coverage added (39 new requirement entries):
AISDK → 2 reqs (1.2.1, 1.2.2 — both `full`; rollup remains
`partial` because §1.1's six listed sub-spec
components — AICONF, AITRACK, AIGRAPH, AIRUNNER,
AIGRAPHTRACK, AIEVALS — aren't all implemented
in `ldai/`, and the package itself is pre-1.0)
FDV2PL → 16 reqs (11 full / 1 partial / 2 not_supported / 1 N/A
/ 1 unknown). Surfaces two real bugs in the
FDv2 streaming source: §3.3.5 and §3.3.6 log
`goodbye` and `error` events at error level
with non-spec text, when the spec mandates
info level with prescribed text. Trivial fix.
Also flags §3.4.1 as `partial` (single-payload
struct shape isn't strictly future-proof for
multi-payload server-intents) and §3.3.4 as
`unknown` because the spec README has an empty
Requirement 3.3.4 heading.
FDV2REL → 21 reqs ALL `not_applicable`. Every requirement is
phrased "the relay proxy MUST..." and binds
to the Relay Proxy implementation, not the
SDK. Bulk LLM judge had this as `partial`;
deep-dive corrects rollup to `not_applicable`.
Recommend narrowing applies-to in the spec.
This closes out the three "awaiting deep-dive" specs — go-server-sdk
now has full per-requirement coverage for every spec where it makes
sense to have it (20 deep-dive entries, 4 prose-only rollups, 7 N/A).
via LD Research 🤖
Co-Authored-By: claude-opus-4.7 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Audit (rollup state vs per-requirement counts) surfaced five (sdk, spec) rows
where the rollup didn't match the data underneath it:
AISDK partial -> supported (2/2 numbered reqs are full)
BIGSEG supported -> partial (3 partial reqs)
DIAG supported -> partial (2 partial reqs)
OTEL supported -> partial (1 partial req)
TDS supported -> partial (3 not_supported MAY-tagged reqs)
Per-requirement entries are unchanged; only rollup.state and the rationale
text are touched. The rule applied: if any req is not_supported or partial,
rollup is at most partial; if all reqs are full, rollup is supported.
Confirmed against TXNS (4/4 full) which already correctly rolled up to
supported.
Companion change in spectre adds
j0e1f2g3h4i5_fix_go_sdk_rollup_consistency.py to apply these UPDATEs to
sdk_spec_support.
via LD Research 🤖
Co-Authored-By: claude-opus-4.7 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
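The rollup rule stated in this commit can be expressed as a small consistency check. The function name `rollupConsistent` is hypothetical, and how `not_applicable` or `unknown` requirements should affect the rollup is not stated in the commit, so this sketch treats those mixes as out of scope and never flags them.

```go
package main

import "fmt"

// rollupConsistent checks a rollup state against per-requirement states
// using the rule from the commit message:
//   - any partial or not_supported requirement: rollup is at most partial
//   - every requirement full: rollup must be supported
// Entries with no requirements (rollup-only) and mixes that involve only
// not_applicable/unknown alongside full are not covered by the stated
// rule and are accepted as-is (an assumption of this sketch).
func rollupConsistent(rollup string, reqs []string) bool {
	if len(reqs) == 0 {
		return true
	}
	gaps, full := 0, 0
	for _, s := range reqs {
		switch s {
		case "partial", "not_supported":
			gaps++
		case "full":
			full++
		}
	}
	if gaps > 0 {
		return rollup != "supported"
	}
	if full == len(reqs) {
		return rollup == "supported"
	}
	return true
}

func main() {
	// Mirrors two of the audited rows before the fix.
	fmt.Println(rollupConsistent("partial", []string{"full", "full"}))      // AISDK-style mismatch
	fmt.Println(rollupConsistent("supported", []string{"full", "partial"})) // OTEL-style mismatch
}
```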
AIDOC was added to spec_requirement_support.json in batch 1 as a brief
not_supported block, but the spec only exists on a local sdk-specs
branch — `git ls-tree -r origin/main | grep AIDOC` returns nothing.
Tracking specs that aren't upstream pollutes the matrix and creates
false debt for the SDK ("28 unsatisfied requirements" against a spec
no SDK could be expected to satisfy because it doesn't exist yet).
Removes the AIDOC NEW_BLOCKS assignment in build_json.py (along with
its entry in the counts-zeroing loop), adds a defensive REMOVED_SPECS
sweep in main() so any pre-existing AIDOC entry in the JSON is dropped
on regeneration, and updates an incidental mention in AISDK's evidence
sources to reflect that AISDK is the only AI-family spec on main.
Companion migration in spectre PR #43 deletes the corresponding row
from sdk_spec_support.
go-server-sdk now has 38 spec entries (was 39).
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Adds a new `genspecs` pipeline to `sdk-meta` that produces a high-level "does this SDK support this spec?" matrix across all 35 SDKs and the top-level specs in `launchdarkly/sdk-specs`. This is an experimental data product: input for an upcoming project that automates SDK work based on each SDK's spec coverage.
The pipeline consists of five composable subcommands plus orchestration:
- `genspecs sync-repos`
- `genspecs catalog`: `sdk-specs` READMEs → `products/specs.json`
- `genspecs harness`: `sdk-test-harness` capabilities + per-SDK `testharness-suppressions*.txt` → `products/harness_signals.json`
- `genspecs judge`: `products/spec_support.json`
- `genspecs html`: `_site/spec-support.html` (matrix heatmap) and `_site/spec-support-by-sdk.html` (per-SDK detail)
End-to-end via `make spec-support` or `scripts/generate-spec-support.sh`.
LLM judge
- …`AWS_BEARER_TOKEN_BEDROCK`, default), direct Anthropic (`ANTHROPIC_API_KEY`), or `noop` for placeholder runs.
- …`judged_against`; re-runs skip cells whose inputs haven't changed.
- …`Retry-After`. Auth errors (401/403) and context cancellation are not retried. Unit tests in `judge_retry_test.go`.
First judged dataset (`products/spec_support.json`)
Generated against:
- `sdk-specs@1b103d72`
- `sdk-test-harness@c457afb8`
- `us.anthropic.claude-sonnet-4-5-20250929-v1:0` (via Bedrock, `us-east-2`, Development account)
- `prompt_version: v1`
State distribution across all (SDK × spec) cells:
361 supported, 70 partial, 334 not-supported, 460 not-applicable.
Schemas
- `schemas/specs.json`, `schemas/harness_signals.json`, `schemas/spec_support.json`. All wired into `scripts/ci/check-json-schemas.sh`.
- `specs.json#/specs/status` is intentionally permissive: real-world spec metadata uses `APPROVED`, `CURRENT`, versioned values like `v1:DRAFT`.
- `spec_support.json#/$defs/Evidence/kind` is a free-form string with documented canonical values. The LLM emits descriptive kinds beyond the canonical set (`spec_metadata`, `sdk_metadata`, `harness_participation`, `sdk_features`); capturing them verbatim is more useful than rejecting otherwise-valid judgments.
Other touches
- …`tool/cmd/genhtml/templates/by-{feature,sdk}.html` nav tabs to link to the new spec-support pages.
- …`tool/specs/.judge-cache/` and the compiled `genspecs` binary to `.gitignore`.
How to verify
Test plan
- `make spec-html` renders both pages and they look reasonable
- `bash scripts/ci/check-json-schemas.sh` reports all products valid
- `cd tool && go test ./cmd/genspecs/...` passes (retry logic)
- …`partial` and `not-supported` cells in `products/spec_support.json` against what you know: does the rationale + evidence pass the smell test?
- …`not-applicable` on client SDKs) line up?
Notes for reviewers
- …`tool/specs/.judge-cache/` (gitignored). Deleting it forces a full re-judgment.
- …`us-east-2`, 12-hour token TTL), see `bedrockJudge`'s doc comment in `tool/cmd/genspecs/judge.go`.
via LD Research 🤖
Made with Cursor