feat(specs): SDK spec support matrix (genspecs pipeline + first judged dataset)#441

Draft
aaron-zeisler wants to merge 13 commits into main from feat/spec-support-matrix

Conversation

@aaron-zeisler
Contributor

Summary

Adds a new genspecs pipeline to sdk-meta that produces a high-level "does this SDK support this spec?" matrix across all 35 SDKs and the top-level specs in launchdarkly/sdk-specs. This is an experimental data product — input for an upcoming project that automates SDK work based on each SDK's spec coverage.

The pipeline consists of five composable subcommands plus orchestration:

| Stage | Command | Output |
| --- | --- | --- |
| 1 | `genspecs sync-repos` | clones missing SDK repos, ff-pulls existing ones |
| 2 | `genspecs catalog` | parses sdk-specs READMEs → `products/specs.json` |
| 3 | `genspecs harness` | parses sdk-test-harness capabilities + per-SDK `testharness-suppressions*.txt` → `products/harness_signals.json` |
| 4 | `genspecs judge` | applies-to filter + LLM judge per applicable cell → `products/spec_support.json` |
| 5 | `genspecs html` | renders `_site/spec-support.html` (matrix heatmap) and `_site/spec-support-by-sdk.html` (per-SDK detail) |

End-to-end via make spec-support or scripts/generate-spec-support.sh.

LLM judge

  • Providers: AWS Bedrock (AWS_BEARER_TOKEN_BEDROCK, default), direct Anthropic (ANTHROPIC_API_KEY), or noop for placeholder runs.
  • Caching: prompt input pack is SHA-256 hashed into judged_against; re-runs skip cells whose inputs haven't changed.
  • Reliability: transient failures (connection reset, 408/429/5xx) are retried with exponential backoff + jitter, honoring Retry-After. Auth errors (401/403) and context cancellation are not retried. Unit tests in judge_retry_test.go.

First judged dataset (products/spec_support.json)

Generated against:

  • sdk-specs@1b103d72
  • sdk-test-harness@c457afb8
  • us.anthropic.claude-sonnet-4-5-20250929-v1:0 (via Bedrock, us-east-2, Development account)
  • prompt_version: v1

State distribution across all (SDK × spec) cells:

| State | Count |
| --- | --- |
| supported | 361 |
| partial | 70 |
| not-supported | 334 |
| not-applicable | 460 |

Schemas

  • New: schemas/specs.json, schemas/harness_signals.json, schemas/spec_support.json. All wired into scripts/ci/check-json-schemas.sh.
  • specs.json#/specs/status is intentionally permissive — real-world spec metadata uses APPROVED, CURRENT, and versioned values like v1:DRAFT.
  • spec_support.json#/$defs/Evidence/kind is a free-form string with documented canonical values. The LLM emits descriptive kinds beyond the canonical set (spec_metadata, sdk_metadata, harness_participation, sdk_features); capturing them verbatim is more useful than rejecting otherwise-valid judgments.

Other touches

  • Updates tool/cmd/genhtml/templates/by-{feature,sdk}.html nav tabs to link to the new spec-support pages.
  • Adds tool/specs/.judge-cache/ and the compiled genspecs binary to .gitignore.

How to verify

make spec-html              # rerender HTML from existing JSON
open _site/spec-support.html

# Full pipeline (requires AWS_BEARER_TOKEN_BEDROCK + AWS_REGION=us-east-2):
make spec-support

# Schema validation:
bash scripts/ci/check-json-schemas.sh

# Retry unit tests:
cd tool && go test ./cmd/genspecs/...

Test plan

  • make spec-html renders both pages and they look reasonable
  • bash scripts/ci/check-json-schemas.sh reports all products valid
  • cd tool && go test ./cmd/genspecs/... passes (retry logic)
  • Spot-check a handful of partial and not-supported cells in products/spec_support.json against what you know — does the rationale + evidence pass the smell test?
  • Spot-check the matrix view: do the obvious cases (e.g. server-side specs marked not-applicable on client SDKs) line up?

Notes for reviewers

  • This is experimental: the goal was a repeatable pipeline + a credible first dataset, not a polished production artifact. Expect some judgments to be wrong; the rationale + evidence + cache let us iterate on the prompt without redoing work.
  • The judge cache lives at tool/specs/.judge-cache/ (gitignored). Deleting it forces a full re-judgment.
  • For Bedrock setup specifics (Development account, us-east-2, 12-hour token TTL), see bedrockJudge's doc comment in tool/cmd/genspecs/judge.go.

via LD Research 🤖

Made with Cursor

aaron-zeisler and others added 13 commits May 11, 2026 15:17
…son, spec_support.json)

Adds genspecs, a Go tool that classifies how well each LaunchDarkly SDK
supports each top-level spec from launchdarkly/sdk-specs.

Subcommands:
  - sync-repos: clone any missing SDK repo (plus sdk-specs and
    sdk-test-harness) and fast-forward existing checkouts.
  - catalog: walk sdk-specs and emit products/specs.json (id, status,
    applies-to, requirement_count, versions, sub-specs).
  - harness: extract Capability* constants and top-level test groups from
    sdk-test-harness, plus per-SDK testharness-suppressions* files (with
    inline comments preserved), into products/harness_signals.json.
  - judge: apply a deterministic applies-to filter, then for every
    remaining (sdk, spec) cell call an LLM (Anthropic or noop) with the
    spec README, the SDK metadata, the SDK's features, the harness
    signals, and a depth-limited repo listing. Output goes to
    products/spec_support.json. Caches by SHA-256 of the prompt input
    pack so re-runs only hit the LLM for cells whose inputs changed.
  - html: render _site/spec-support.html (filterable matrix) and
    _site/spec-support-by-sdk.html (per-SDK detail with rationale).
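The SHA-256 cache-keying scheme the judge subcommand uses can be sketched like this; the struct fields and function name are illustrative stand-ins, not the actual genspecs types:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// inputPack is a hypothetical stand-in for the prompt inputs the judge
// hashes per (sdk, spec) cell; the real struct may differ.
type inputPack struct {
	SpecReadme    string `json:"spec_readme"`
	SDKMetadata   string `json:"sdk_metadata"`
	HarnessHints  string `json:"harness_hints"`
	PromptVersion string `json:"prompt_version"`
}

// cacheKey derives a stable key from the pack: any change to any input
// (including the prompt version) produces a new key, so re-runs only
// hit the LLM for cells whose inputs changed.
func cacheKey(p inputPack) string {
	b, _ := json.Marshal(p) // field order is fixed for a flat struct
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}
```

Bumping `prompt_version` is then the natural way to force a full re-judgment without deleting the cache directory.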

Also wires the three new schemas into scripts/ci/check-json-schemas.sh
and adds Makefile targets (spec-sync-repos, spec-catalog, spec-harness,
spec-judge, spec-html, spec-support) plus scripts/generate-spec-support.sh
that runs the whole pipeline end-to-end.

The committed products/spec_support.json was generated with --provider=noop
so it's a placeholder where every applicable cell is "unknown". Re-run
`make spec-judge` (or scripts/generate-spec-support.sh) with
ANTHROPIC_API_KEY set to populate it with real classifications.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ne script

`go run` (and `go -C dir run`) inherits cwd from the shell, so when the
Makefile targets ran `cd tool && go run ./cmd/genspecs ...` the binary
saw `tool/` as cwd, and its default `products/specs.json` flag value
resolved to the non-existent `tool/products/specs.json`.

Fixes both invocation sites by:
- Adding explicit ../-prefixed paths to every input/output flag in the
  spec-* Makefile targets (matching the convention already used by the
  existing `html` target).
- Doing the same in scripts/generate-spec-support.sh, wrapping each
  go-run in a (cd tool && ...) subshell so the script's cwd stays at
  the repo root.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
LaunchDarkly accesses Anthropic models through AWS Bedrock, not the
direct Anthropic API. Adds a third judge provider, 'bedrock', that
authenticates with the short-term bearer token AWS exposes via
"Bedrock console -> API Keys -> Generate short-term API keys" (12-hour
expiry).

Provider selection is automatic based on env:
  AWS_BEARER_TOKEN_BEDROCK -> bedrock  (preferred — LD's setup)
  ANTHROPIC_API_KEY        -> anthropic
  (neither set)            -> noop

The Bedrock branch posts the same Anthropic Messages API body that the
direct branch sends, with two adjustments per Anthropic's Bedrock docs:
  - anthropic_version is the literal "bedrock-2023-05-31"
  - the model id is in the URL path, not the body

Default model on bedrock is us.anthropic.claude-sonnet-4-5-20250929-v1:0
(cross-region inference profile, same model as the Anthropic-direct
default for easy comparison). AWS_REGION overrides the endpoint region;
defaults to us-east-1.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…gion in startup log

When the bedrock judge fails with a generic 'Authentication failed' or
'CallWithBearerToken' AccessDeniedException, the cause is almost always
that the bearer token was generated in the wrong AWS account or for a
different region than AWS_REGION points to. Two small changes to make
that obvious:

1. The judge startup log line now reads:
     "Judging N cells with provider=bedrock region=us-east-2 model=..."
   so any region mismatch is visible without flipping on debug output.

2. The bedrockJudge doc comment now spells out the LD-specific facts
   confirmed in #proj-building-with-ai and during a 2026-05-12 debug:
   - Generate from the Development account, not SDK (PowerUser there
     lacks bedrock:CallWithBearerToken).
   - Tokens are scoped to the account+region they were issued in.
     us-east-2 is known-good in Development; AWS_REGION must match.
   - 12-hour TTL.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ial backoff

The judge had no retry logic, so a single TCP "connection reset by peer"
(seen in the wild when laptops sleep / Wi-Fi roams mid-run) would mark
the cell as judge_failed and skip it for the rest of the batch.

Adds a retryingJudge wrapper around the bedrock and anthropic providers:
  - Up to 4 attempts with 750ms * 2^(attempt-1) backoff (capped at 16s)
    plus ±25% jitter
  - Honors server-supplied Retry-After (mainly useful for 429s)
  - Retries on: net timeouts, EOF, "connection reset", "connection
    refused", "broken pipe", "i/o timeout", "no such host", and HTTP
    408/429/5xx
  - Does NOT retry on: context cancellation, 4xx (auth/config errors
    that won't resolve themselves), or non-retryable errors

Both providers now return a typed *retryableHTTPError instead of an
opaque fmt.Errorf string so the wrapper can inspect status + headers
via errors.As.

Includes unit tests for the classifier and the wrapper itself
(transient recovery, max-attempts cap, non-retryable short-circuit,
context cancellation).

This is purely additive: no behavior change on the success path; cache
keys are unaffected.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…-sonnet-4.5)

First full pass of the genspecs judge across all 35 SDKs against the
top-level spec catalog. Counts: 361 supported, 70 partial, 334
not-supported, 460 not-applicable.

Also relaxes schemas/spec_support.json#/$defs/Evidence/kind from a
fixed enum to a free-form string. The LLM emits descriptive kinds
beyond our recommended canonical set (e.g. spec_metadata,
sdk_metadata, harness_participation, sdk_features). Capturing them
verbatim is more valuable than rejecting otherwise-valid judgments;
the canonical values are now documented in the schema description.

Generated with:
  specs_commit:   1b103d72 (sdk-specs)
  harness_commit: c457afb8 (sdk-test-harness)
  model:          us.anthropic.claude-sonnet-4-5-20250929-v1:0
  prompt_version: v1

Co-authored-by: Cursor <cursoragent@cursor.com>
…ionale/notes

Drops per-cell `source`, `evidence`, `judged_at`, and `judged_against`.
Audit metadata (judge model, commit hashes, evidence list) is still
preserved on disk in the judge cache at tool/specs/.judge-cache/, so
nothing is permanently lost — but the public product is now ~4x
smaller and easier to consume.

Before: 30,282 lines, 2.0M
After:  7,429 lines, 548K

Touched:
- schemas/spec_support.json: dropped fields + the Evidence \$def
- tool/cmd/genspecs/types.go: slim Cell, drop JudgedAgainst/Source*
- tool/cmd/genspecs/judge.go: stop writing the dropped fields; cache
  hits still work because gob/json silently ignore extra fields in
  previously-cached cells
- spec-support.html / spec-support-by-sdk.html: drop the source pill
  and the evidence list; render notes_for_human instead
- .gitignore: ignore the stray tool/genspecs build artifact
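The cache-compatibility claim above relies on encoding/json's default behavior of discarding unknown keys; a minimal sketch, with hypothetical field names standing in for the slimmed Cell type:

```go
package main

import "encoding/json"

// slimCell mirrors the trimmed shape: older cache entries still carry
// the dropped fields (evidence, judged_at, ...), but json.Unmarshal
// silently ignores keys with no matching struct field, so old cache
// hits decode cleanly into the new type. Field names are illustrative.
type slimCell struct {
	State     string `json:"state"`
	Rationale string `json:"rationale"`
}

func decodeCached(raw []byte) (slimCell, error) {
	var c slimCell
	err := json.Unmarshal(raw, &c)
	return c, err
}
```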

The LLM judge prompt is unchanged — we still ask the model to cite
evidence, because chain-of-thought ("show your work") tends to
produce better answers. We just no longer persist the citations.

Co-authored-by: Cursor <cursoragent@cursor.com>
Companion to spec_support.json: where spec_support is one row per
(sdk, spec) at rollup granularity, this is one block per (sdk, spec)
with a per-requirement breakdown plus a rollup that the consumer
(Spectre) overwrites the high-level row with.

This first iteration covers a single pair — go-server-sdk x PLUGIN —
generated by a manual retroactive check during the hackathon
(judge: "human (azeisler + assistant)"). The 17 requirements come
from the post-renumbering PLUGIN spec (depends on
launchdarkly/sdk-specs#167, which renumbers the duplicate `1.2.3`
heading to `1.2.10`).

Block structure per (sdk, spec):

- Provenance: generated_at, judge, prompt_version, spec_path,
  spec_sha, spec_renumber_pr, sdk_repo, sdk_repo_commit, sdk_branch.
- evidence_sources_considered: every file/dir we looked at.
- rollup: { applies, veto_reason, state, complexity, rationale,
  supported_since, supported_since_date, supported_since_evidence,
  counts }. The state here ("partial") is what Spectre promotes into
  sdk_spec_support.
- requirements[]: per-requirement entries with id, severity, state,
  evidence types, rationale, findings, notes_for_human.

Schema ergonomics:

- spec_version is an empty string ("") rather than "v1" so the
  primary key matches the (sdk_name, spec_id, spec_version, kind)
  convention already used by spec_support.json — avoids a join /
  alias dance on the Spectre side.
- evidence_sources_considered, supported_since_evidence, and the
  per-requirement findings live here so a downstream Spectre
  workflow can pick them up as an artifact when it re-judges the
  pair, but they intentionally do NOT propagate into the durable
  Spectre tables (those are re-derived per workflow run).

Headline finding for go-server-sdk x PLUGIN: "partial" rollup, 10
full / 1 partial / 2 not_supported / 4 not_applicable. The two MUST-
severity gaps are 1.1.5 (no onPluginsReady method on the Plugin
interface) and 1.2.6 (no registration-complete callback dispatch).

via LD Research 🤖

Co-Authored-By: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds 21 spec entries to the go-server-sdk × spec matrix on top of the
existing PLUGIN entry, covering every spec in sdk-specs that genspecs
classified as applicable to a server SDK plus the four previously-
unclassified specs (AIDOC, FDV2PL, FDV2REL, CSFDV2). Total: 99
per-requirement rows and 22 rollups.

Breakdown:
- 8 not-applicable rollups (AUTOCONFIG, CSFDV2, CSI, CSSE, EM, RKSTM,
  RPENDPOINTS, SPEC) with veto rationales tying back to the spec's
  own applies-to list.
- 8 brief rollup-only entries (AIDOC + AISDK + AUTOENVATTR + CODES +
  EXAM + FDV2PL + FDV2REL + STACK) where the per-requirement walk is
  deferred but the high-level state is grounded in code/repo evidence.
- 5 deep dives with per-requirement findings:
    * ARCO       (rollup-only, supported)
    * ATREF      (4/4 full,        supported, since v6.0.0)
    * BIGSEG     (65 full + 3 partial,  supported, since v5.5.0)
    * CLM        (7 partial + 2 not_supp + 1 NA, partial, since v2.0.0)
    * CONTEXT    (rollup-only, supported)

Each finding cites the implementing files plus the dependency SHA the
behavior actually lives in (go-sdk-common@v3.5.0/3727dba,
go-server-sdk-evaluation@v3.4.0/v3.4.1, go-sdk-events@v3.6.0).

Generator: research/artifacts/spec-analysis/build_json.py.
Spectre seeding: separate alembic migration g7b8c9d0e1f2 in spectre.

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the 17 batch-2 spec entries for go-server-sdk to
products/spec_requirement_support.json, completing the v7-spec scan.

Roster (17): CSPE, DATASYSTEM, DIAG, ENVFILTER, EVENTS, FLGDM, FLGEA,
FLGERM, FLGMES, HOOK, MIGRATIONS, OTEL, PS, RELEASE, SCMP, TDS, TXNS.

Each entry follows the same shape as PLUGIN/batch1 — per-requirement
state, severity, rationale, code findings (file:line + kind), and a
rollup. Four prose-based specs (FLGDM/FLGEA/FLGERM/FLGMES) carry
empty requirements[] arrays as rollup-only entries because the spec
READMEs lack numbered requirements.

Notable findings (corrections to bulk-seed bedrock-claude-sonnet-4-5
LLM judgments after deep dive):
- SCMP not_supported (no X-LaunchDarkly-InstanceID or
  X-LaunchDarkly-PollingIntervalMs polling header).
- ENVFILTER partial (filter-key regex validation missing).
- DATASYSTEM partial (Initializer.Fetch doesn't surface
  X-LD-FD-Fallback; Basis lacks RevertToFDv1; case-sensitive
  comparison).
- RELEASE partial (no first-party Hello App; FDv2 surfaces unstable).
- DIAG.1.6.3.1 partial (spec README marks samplingInterval as TODO).

Generated by artifacts/spec-analysis/build_json.py in launchdarkly/research.

Spectre seed PR: launchdarkly/spectre#42 (depends on #41 which depends on #34).

via LD Research

Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces three previously rollup-only spec rows for go-server-sdk with
full per-requirement deep-dive analyses, matching the SCMP-style
treatment requested for the remaining "awaiting deep-dive" backlog.

Per-requirement coverage added (39 new requirement entries):

  AISDK    →  2 reqs   (1.2.1, 1.2.2 — both `full`; rollup remains
                        `partial` because §1.1's six listed sub-spec
                        components — AICONF, AITRACK, AIGRAPH, AIRUNNER,
                        AIGRAPHTRACK, AIEVALS — aren't all implemented
                        in `ldai/`, and the package itself is pre-1.0)
  FDV2PL   → 16 reqs   (11 full / 1 partial / 2 not_supported / 1 N/A
                        / 1 unknown). Surfaces two real bugs in the
                        FDv2 streaming source: §3.3.5 and §3.3.6 log
                        `goodbye` and `error` events at error level
                        with non-spec text, when the spec mandates
                        info level with prescribed text. Trivial fix.
                        Also flags §3.4.1 as `partial` (single-payload
                        struct shape isn't strictly future-proof for
                        multi-payload server-intents) and §3.3.4 as
                        `unknown` because the spec README has an empty
                        Requirement 3.3.4 heading.
  FDV2REL  → 21 reqs   ALL `not_applicable`. Every requirement is
                        phrased "the relay proxy MUST..." and binds
                        to the Relay Proxy implementation, not the
                        SDK. Bulk LLM judge had this as `partial`;
                        deep-dive corrects rollup to `not_applicable`.
                        Recommend narrowing applies-to in the spec.

This closes out the three "awaiting deep-dive" specs — go-server-sdk
now has full per-requirement coverage for every spec where it makes
sense to have it (20 deep-dive entries, 4 prose-only rollups, 7 N/A).

via LD Research 🤖

Co-Authored-By: claude-opus-4.7 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Audit (rollup state vs per-requirement counts) surfaced five (sdk, spec)
rows where the rollup didn't match the data underneath it:

  AISDK   partial  -> supported  (2/2 numbered reqs are full)
  BIGSEG  supported -> partial   (3 partial reqs)
  DIAG    supported -> partial   (2 partial reqs)
  OTEL    supported -> partial   (1 partial req)
  TDS     supported -> partial   (3 not_supported MAY-tagged reqs)

Per-requirement entries are unchanged; only rollup.state and the
rationale text are touched. The rule applied: if any req is
not_supported or partial, rollup is at most partial; if all reqs are
full, rollup is supported. Confirmed against TXNS (4/4 full) which
already correctly rolled up to supported.
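The consistency rule stated above translates directly into code; a minimal sketch (state strings and the function name are illustrative, and a fuller version would likely distinguish an all-not_supported case from partial):

```go
package main

// rollupState applies the audit rule: if any requirement is
// not_supported or partial, the rollup is at most partial; if all
// requirements are full, the rollup is supported. Rows where every
// requirement is not_applicable roll up to not_applicable.
func rollupState(reqStates []string) string {
	anyGap, anyFull := false, false
	for _, s := range reqStates {
		switch s {
		case "not_supported", "partial":
			anyGap = true
		case "full":
			anyFull = true
		}
	}
	switch {
	case anyGap:
		return "partial"
	case anyFull:
		return "supported"
	default:
		return "not_applicable"
	}
}
```

Applied to the rows above: AISDK (2/2 full) yields supported, TDS (3 not_supported) yields partial, and TXNS (4/4 full) stays supported, matching the corrections in this commit.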

Companion change in spectre adds j0e1f2g3h4i5_fix_go_sdk_rollup_
consistency.py to apply these UPDATEs to sdk_spec_support.

via LD Research 🤖

Co-Authored-By: claude-opus-4.7 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
AIDOC was added to spec_requirement_support.json in batch 1 as a brief
not_supported block, but the spec only exists on a local sdk-specs
branch — `git ls-tree -r origin/main | grep AIDOC` returns nothing.
Tracking specs that aren't upstream pollutes the matrix and creates
false debt for the SDK ("28 unsatisfied requirements" against a spec
no SDK could be expected to satisfy because it doesn't exist yet).

Removes the AIDOC NEW_BLOCKS assignment in build_json.py (along with
its entry in the counts-zeroing loop), adds a defensive REMOVED_SPECS
sweep in main() so any pre-existing AIDOC entry in the JSON is dropped
on regeneration, and updates an incidental mention in AISDK's evidence
sources to reflect that AISDK is the only AI-family spec on main.

Companion migration in spectre PR #43 deletes the corresponding row
from sdk_spec_support.

go-server-sdk now has 38 spec entries (was 39).

Co-authored-by: Cursor <cursoragent@cursor.com>