continuum transfunctioner: TSS peer mesh with keygen, signing, resharing #796
marcopeereboom wants to merge 125 commits into
	bcast = 0x01
}

data := make([]byte, 2+len(wireData))
Check failure
Code scanning / CodeQL
Size computation for allocation may overflow High
Copilot Autofix
AI 2 months ago
In general, the way to fix this kind of problem is to ensure that any arithmetic used to compute allocation sizes cannot overflow. That typically means (a) enforcing an upper bound on the length of incoming or constructed data; and/or (b) checking that len + constant does not exceed a chosen maximum before performing the allocation. Using a constant, protocol‑appropriate size limit also prevents memory exhaustion from oversized messages.
Here, the best low‑impact fix is to introduce a maximum allowed serialized TSS message size (in bytes) and validate len(wireData) against that limit before computing 2 + len(wireData). We can choose a conservative limit that is safely below math.MaxInt even on 32‑bit platforms (for example, 16 MB or 32 MB). We then:
- Add a package-level constant in tss_round.go, e.g. const maxWireDataLen = 16 * 1024 * 1024.
- In sendReshareRound, after obtaining wireData, check:
  - If len(wireData) > maxWireDataLen, return an error (or handle according to your policy).
  - Optionally, also check if len(wireData) > math.MaxInt-2 to be mathematically complete, but with a much smaller maxWireDataLen this is redundant.
- Only if the check passes, perform data := make([]byte, 2+len(wireData)) as before.
We should also update the imports in tss_round.go if we decide to use math for math.MaxInt, but if we simply choose a fixed, safe upper bound we can avoid extra imports, keeping the change minimal. All logic and behavior for valid‑sized messages stays the same; we just fail early on pathological or maliciously large values, eliminating the overflow risk.
@@ -11,6 +11,11 @@
 	"github.com/hemilabs/x/tss-lib/v3/tss"
 )

+// maxWireDataLen bounds the size of a marshaled TSS message payload.
+// This prevents integer overflow in size computations and guards
+// against pathological memory usage from oversized messages.
+const maxWireDataLen = 16 * 1024 * 1024 // 16 MiB
+
 // msgBuf accumulates inbound messages across rounds. Messages that
 // arrive early (from a faster peer in a later round) are kept in
 // the buffer until the local node reaches that round.
@@ -179,6 +184,9 @@
 	if err != nil {
 		return fmt.Errorf("marshal content: %w", err)
 	}
+	if len(wireData) > maxWireDataLen {
+		return fmt.Errorf("marshal content: wire data too large (%d bytes)", len(wireData))
+	}

 	// Build committee flags.
 	var cflags byte
Added explicit overflow guard. The sum is wireHeaderLen(2) + len(wireData) which is bounded to 16 MiB on line 214 — cannot overflow on any 64-bit platform. CodeQL just cannot track constant bounds. See a57739f.
Remove local filesystem replace directive — CI has no access to /home/marco/Documents/src/x/tss-lib. Resolve to the pushed commit on origin/max/tss_changes (30339d0b0ce1). Bump go directive from 1.25.0 to 1.26.0 to match main (577d577). CI runs GOTOOLCHAIN=local with go 1.25.4 which refuses modules requiring >= 1.26. Remove stale nolint:prealloc directive — golangci-lint v2 dropped the prealloc linter. Add missing trailing newline to preparams.json fixture files. Add CHANGELOG entry for #796.
Convert tssImpl.Reshare from channel-based tss-lib LocalParty instances to explicit round-function calls (ReshareRound1-5), completing the pattern established by keygen/sign in 45d1762.
Production code:
- ceremony struct: remove party, outCh, errCh, oldParty, oldKeyToID, newKeyToID; ceremony lifecycle uses ctx/cancel derived from caller context (no termination channels)
- Reshare(): 5-round driver with msgBuf.collect gated on committee membership (old-only nodes skip new->new message collection)
- HandleMessage(ctx, ...): ctx threaded through interface and all callers; channel sends select on ctx.Done() + c.ctx.Done()
- sendReshareRound(): new helper encodes committee flags from MessageRouting and routes to both committee PID sets
- Delete handleReshareMessage() and pumpReshareMessages()
- FillBytes for pubkey encoding (X/Y padded to 32 bytes)
Server fixes:
- handle(): goroutine watches sessionCtx.Done() and closes transport to unblock ReadEnvelope on shutdown
- deleteSession/deleteAllSessions: demote close errors to Debug (double-close during shutdown is expected)
- connectRandom: dial gap-many shuffled candidates per maintain cycle instead of one random pick (fixes 100-node convergence)
Tests:
- Delete tss_transport_test.go (channel-based, redundant with RPC)
- Delete rpc_integration_test.go; port 3 unique error-path tests and 2 fuzz tests to rpc_tss_test.go
- Rewrite rpc_tss_test.go: test nodes use production tssImpl via rpcTransportAdapter over encrypted TCP; all 11 tests preserved
- All context.Background() in test code replaced with t.Context()
- All ceremony struct literals in tests carry ctx/cancel
- TestHundredNodeMesh: set InitialPingTimeout=30s, increase convergence timeouts to 60s (prevents chain link kills under CPU contention)
- .golangci.yaml: replace-local: true for tss-lib fork
Update all imports from tss-lib/v2 to tss-lib/v3. The v3 module deletes the channel-based Party/Round/BaseUpdate API and retains only the pure round function API that continuum already uses. tss_examples: move old v2 channel-based examples to testdata/v2_channel_reference/ as documentation (does not compile against v3). Add v3_reference_test.go demonstrating keygen+sign using the round function API.
Wire format byte 0 (message type) and byte 1 (committee flags) were sharing the wireFlag prefix and colliding at 0x01. Split into two namespaces: msgTypeP2P/msgTypeBroadcast for byte 0, cflagToOld/cflagToNew/cflagFromNew for byte 1. Add maxWireDataLen (16 MiB) bounds check before the allocation in sendReshareRound (CodeQL integer-overflow finding). Name remaining bare literals: dialTimeout, promPollInterval in continuum.go; secp256k1KeySize, handshakeTimeout in protocol.go. Update all production code and test files.
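A rough sketch of the two-namespace header described in this commit, with the bounds check placed before the allocation. The constant names come from the commit message; their numeric values and the helper name encodeFrame are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	wireHeaderLen  = 2
	maxWireDataLen = 16 * 1024 * 1024 // 16 MiB

	// Byte 0: message type. Values are illustrative, not taken from the PR.
	msgTypeP2P       byte = 0x00
	msgTypeBroadcast byte = 0x01

	// Byte 1: committee flags as a bitmask. Values are illustrative.
	cflagToOld   byte = 1 << 0
	cflagToNew   byte = 1 << 1
	cflagFromNew byte = 1 << 2
)

var errWireDataTooLarge = errors.New("wire data too large")

// encodeFrame prepends the 2-byte header to wireData. The length is checked
// first, so wireHeaderLen+len(wireData) cannot overflow.
func encodeFrame(msgType, cflags byte, wireData []byte) ([]byte, error) {
	if len(wireData) > maxWireDataLen {
		return nil, fmt.Errorf("%w (%d bytes)", errWireDataTooLarge, len(wireData))
	}
	data := make([]byte, wireHeaderLen+len(wireData))
	data[0] = msgType
	data[1] = cflags
	copy(data[wireHeaderLen:], wireData)
	return data, nil
}

func main() {
	frame, err := encodeFrame(msgTypeBroadcast, cflagToOld|cflagToNew, []byte("round1"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("type=%#02x cflags=%#02x payload=%q\n", frame[0], frame[1], frame[2:])
}
```

Because the two bytes draw from separate constant groups, a type value of 0x01 and a flag value of 0x01 no longer share a name prefix, which is the collision the commit describes.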
runtime.Caller(0) does not resolve in CI test binaries, causing loadTestPreParams and loadPreParams to silently fall back to live Paillier generation (~30s per node, exceeds test timeout). Embed tss_examples/preparams.json via go:embed into preparams_test.go. Both tss_test.go and rpc_tss_test.go now call testPreParams() which fails hard on missing or corrupt fixture data.
The race detector adds ~10x overhead to goroutine scheduling. With 100 nodes on a CI runner, maintain cycles fire before handshakes complete, causing duplicate-identity rejections and convergence timeout. The test validates gossip scaling, not concurrency correctness — the smaller mesh tests already cover race safety.
The tss_examples sub-package existed to hold v2/v3 reference implementations and pre-computed Paillier fixtures. The v2 channel reference is dead code (v3 replaced it entirely) and the v3 reference test is redundant with the x repo's own example tests. Move preparams.json to testdata/ (used by go:embed in preparams_test.go). Delete everything else: v3_reference_test.go, v2_channel_reference/, README. -2,912 lines.
Suppress G118 false positive in registerCeremony — cancel is stored in CeremonyInfo and called on ceremony completion. Eliminate G115 int-to-uint64 conversion in election shuffle by keeping remaining as int. Annotate safe test conversions with nolint:gosec.
Replace bytes.Equal with subtle.ConstantTimeCompare at four sites where attacker-controlled input is compared against security-critical values: signature identity verification, payload hash verification, and both DNS identity checks. Leave bytes.Equal for zero-sentinel checks (ZeroChallenge, zeroKey) where the compared value is a public constant.
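For illustration, a constant-time identity check might look like the sketch below. The helper name identityMatches is hypothetical; subtle.ConstantTimeCompare is the standard-library call named in the commit.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// identityMatches reports whether the claimed identity equals the expected
// one. For equal-length inputs the comparison time does not depend on where
// the first differing byte is; a length mismatch returns immediately.
func identityMatches(claimed, expected []byte) bool {
	return subtle.ConstantTimeCompare(claimed, expected) == 1
}

func main() {
	expected := []byte{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(identityMatches([]byte{0xde, 0xad, 0xbe, 0xef}, expected))
	fmt.Println(identityMatches([]byte{0xde, 0xad, 0xbe, 0xee}, expected))
}
```

The early return on a length mismatch leaks only the length, which is usually public for fixed-size hashes and identities.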
HashTSSMessage: add "continuum-tss-msg-v1" domain separator and 4-byte length prefix before data. Prevents cross-protocol signature replay and ambiguous field boundaries.
Transport.Close: zero encryptKey, decryptKey, and nonce key on session teardown. Nil the ephemeral private key. Limits key material exposure in swap files and core dumps.
Handshake challenge: add "continuum-challenge-v1" domain separator to Hash256(challenge || ETP) on both signing and verification sides. Prevents cross-protocol challenge-response replay.
maintainConnections: replace math/rand/v2 Shuffle with crypto/rand Fisher-Yates. Remove math/rand/v2 import from production code.
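The HashTSSMessage change can be pictured roughly as follows. This is a sketch only: the use of SHA-256, the big-endian length encoding, and the field order (separator, length, data) are assumptions, and hashTSSMessage here is a stand-in rather than the PR's function.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const tssMsgDomain = "continuum-tss-msg-v1"

// hashTSSMessage hashes domain || len(data) || data. The hash function and
// the byte order of the length prefix are assumptions for illustration.
func hashTSSMessage(data []byte) [32]byte {
	var lenPrefix [4]byte
	binary.BigEndian.PutUint32(lenPrefix[:], uint32(len(data)))

	h := sha256.New()
	h.Write([]byte(tssMsgDomain))
	h.Write(lenPrefix[:])
	h.Write(data)

	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	msg := []byte("round 1 payload")
	separated := hashTSSMessage(msg)
	raw := sha256.Sum256(msg)
	// The domain-separated digest is expected to differ from the raw hash.
	fmt.Println(separated != raw)
}
```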
TestVerifyRejectsWrongIdentity: exercises subtle.ConstantTimeCompare in Verify(), tests correct/wrong/bit-flipped identity paths.
TestHashTSSMessageDomainSeparation: known-answer test proving the domain separator is present, verifies it differs from raw hash.
TestHashTSSMessageLengthPrefix: different data lengths produce different hashes, determinism check.
TestTransportCloseZerosKeys: asserts encryptKey, decryptKey, and nonce.key are zeroed after Close(), ephemeral private key is nil.
TestChallengeHashDomainSeparation: proves domain-separated challenge hash differs from unseparated.
TestSealBoxOpenBoxRoundTrip: e2e encryption round trip, positive path and wrong-sender-key rejection.
Fix TestConnKeyExchange: move clientTransport.Close() after key assertions since Close() now zeros keys.
Strip internal document references from comments.
Wire-initiated ceremony requests (KeygenRequest, SignRequest, ReshareRequest) are now only processed when built with the continuum_debug tag. Production binaries compile debug_off.go which returns nil from serverDebugInit(); debug builds compile debug_on.go which returns a debugInitiator. Previously newDebugInitiator() was called unconditionally in NewServer(), making the nil-checks in dispatch.go dead code. Any peer could trigger a ceremony over the wire. Add noopInitiator for production ceremonyLoop — blocks on nil channel until blockchain watcher is wired in. Tests wire up debug initiation explicitly in newTestServer().
Replace cleartext 3-byte size prefix with two-phase secretbox framing. Phase 1 is a fixed 44-byte encrypted header containing the body size. Phase 2 is the encrypted payload.
Wire format (v2):
    [24-byte nonce_h][secretbox(4-byte body_size)]  <- 44 bytes
    [24-byte nonce_p][secretbox(payload)]           <- body_size bytes
An attacker corrupting any byte of the header causes secretbox.Open to fail. The receiver never trusts an unauthenticated length.
TransportVersion bumped from 1 to 2. TransportMaxSize reduced from 16 MB to 1 MB (sufficient for 100-party TSS keygen).
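The 44-byte figure follows from the secretbox layout: a 24-byte nonce, a 16-byte Poly1305 tag, and the 4-byte size field. The sketch below works through that arithmetic and the range check a receiver could apply once the header has been authenticated. The big-endian size field, the reading of "1 MB" as 1 MiB, and the helper name checkBodySize are assumptions; the actual encryption calls are omitted to keep the example standard-library only.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	nonceSize      = 24 // XSalsa20 nonce used by NaCl secretbox
	tagSize        = 16 // Poly1305 tag added by secretbox
	sizeFieldLen   = 4
	frameHeaderLen = nonceSize + tagSize + sizeFieldLen // 44 bytes

	// transportMaxSize assumes "1 MB" means 1 MiB and that the limit
	// applies to body_size; both are guesses.
	transportMaxSize = 1 << 20
)

var errBodySize = errors.New("body size out of range")

// checkBodySize validates the 4-byte size field after the header has been
// authenticated and decrypted. Big-endian encoding is an assumption.
func checkBodySize(field [sizeFieldLen]byte) (int, error) {
	n := binary.BigEndian.Uint32(field[:])
	// The body must at least hold its own nonce and authentication tag.
	if n < nonceSize+tagSize || n > transportMaxSize {
		return 0, fmt.Errorf("%w: %d", errBodySize, n)
	}
	return int(n), nil
}

func main() {
	var field [sizeFieldLen]byte
	binary.BigEndian.PutUint32(field[:], 1024)
	n, err := checkBodySize(field)
	fmt.Println(frameHeaderLen, n, err)
}
```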
Replace static sender NaCl key with per-message ephemeral X25519
keypair (sealed-box pattern). Sender generates fresh keypair,
encrypts with nacl.box to the recipient's static X25519 key,
ships ephemeral public key in EncryptedPayload, destroys private
key immediately.
Sender authentication via secp256k1 compact signature over
SHA256("continuum-e2e-sig-v1" || EphemeralPub || Nonce ||
Ciphertext). Receiver verifies signature against Sender identity
before opening the box. Prevents forged payloads from anyone who
knows the recipient's gossip-advertised X25519 public key.
Provides sender-side forward secrecy: compromising a sender after
the fact cannot recover past ephemeral keys.
SealBox takes *Secret (signs envelope), OpenBox unchanged.
EncryptedPayload adds EphemeralPub and Signature fields.
decryptPayload verifies signature before decrypting.
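The digest that the sender signs can be sketched as below, following the formula in the commit message. The fixed sizes (32-byte X25519 public key, 24-byte NaCl nonce) match those primitives; the function name e2eSigDigest is hypothetical, and the secp256k1 signing step is left out.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const e2eSigDomain = "continuum-e2e-sig-v1"

// e2eSigDigest computes SHA256(domain || EphemeralPub || Nonce || Ciphertext).
// The first two fields are fixed-length, so only the trailing ciphertext
// varies in size and the field boundaries stay unambiguous.
func e2eSigDigest(ephemeralPub [32]byte, nonce [24]byte, ciphertext []byte) [32]byte {
	h := sha256.New()
	h.Write([]byte(e2eSigDomain))
	h.Write(ephemeralPub[:])
	h.Write(nonce[:])
	h.Write(ciphertext)

	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	var pub [32]byte
	var nonce [24]byte
	d1 := e2eSigDigest(pub, nonce, []byte("ciphertext"))
	pub[0] ^= 1 // a different ephemeral key should change the digest
	d2 := e2eSigDigest(pub, nonce, []byte("ciphertext"))
	fmt.Println(d1 != d2)
}
```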
Mix both parties' ephemeral public keys into the HKDF salt in canonical order (server first, client second).
Salt: "continuum-hkdf-salt-v2" || serverPub || clientPub
Public keys are fixed-length per curve, no length delimiters needed. Validated against the curve's actual key size. The caller provides them based on Transport.isServer.
Zero the ECDH shared secret after key derivation. Go 1.26 runtime.ZeroMemory will provide a proper guarantee; for now we zero the slice contents but cannot prevent GC copies.
Eliminates the static salt shared across all sessions. Ephemeral ECDH already guarantees unique shared secrets but session-specific salt prevents theoretical cross-session key derivation collision.
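A small sketch of the canonical ordering, showing that both ends derive the same salt regardless of which side calls it. The helper name sessionSalt and the hard-coded 32-byte X25519 key size are assumptions; the HKDF call itself is omitted.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

const (
	hkdfSaltPrefix = "continuum-hkdf-salt-v2"
	x25519KeySize  = 32
)

var errKeySize = errors.New("unexpected public key size")

// sessionSalt builds prefix || serverPub || clientPub. Each side passes its
// own key as localPub and says whether it is the server, so both sides end
// up with the same byte string.
func sessionSalt(localPub, remotePub []byte, isServer bool) ([]byte, error) {
	if len(localPub) != x25519KeySize || len(remotePub) != x25519KeySize {
		return nil, errKeySize
	}
	serverPub, clientPub := remotePub, localPub
	if isServer {
		serverPub, clientPub = localPub, remotePub
	}
	salt := make([]byte, 0, len(hkdfSaltPrefix)+2*x25519KeySize)
	salt = append(salt, hkdfSaltPrefix...)
	salt = append(salt, serverPub...)
	salt = append(salt, clientPub...)
	return salt, nil
}

func main() {
	server := bytes.Repeat([]byte{0x01}, x25519KeySize)
	client := bytes.Repeat([]byte{0x02}, x25519KeySize)
	a, _ := sessionSalt(server, client, true)  // computed on the server
	b, _ := sessionSalt(client, server, false) // computed on the client
	fmt.Println(bytes.Equal(a, b), len(a))
}
```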
Add use-after-close guard: all store operations return ErrStoreClosed after Close(). Previously Close() zeroed the key and subsequent encrypts silently produced unrecoverable ciphertext with no error.
Add keyID binding: encrypt prepends a length-prefixed keyID to plaintext before sealing. decrypt verifies the bound keyID matches the expected keyID. Prevents file-swap attacks where an attacker with filesystem access renames key files.
Add atomic writes: writeAtomic uses temp file + fsync + rename. A crash at any point leaves either the old or new file, never a partial write.
Add ErrEmptyKeyID validation on all Save/Load/Delete paths. Zero the encryption key copy after use in encrypt/decrypt. Copy encKey under mutex to avoid holding the lock during secretbox operations.
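The temp file + fsync + rename pattern named in this commit can be sketched with the standard library as below. This is one common way to write it, not necessarily the PR's writeAtomic; the signature and the roundTrip helper are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAtomic writes data to a temp file in the destination directory,
// flushes it to disk, then renames it over path. Creating the temp file in
// the same directory keeps the rename on one filesystem.
func writeAtomic(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmp.Close() // harmless if already closed
			os.Remove(tmp.Name())
		}
	}()
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Chmod(perm); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil {
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes to a scratch directory and reads the file back.
func roundTrip(data []byte) ([]byte, error) {
	dir, err := os.MkdirTemp("", "writeatomic")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "share.key")
	if err := writeAtomic(path, data, 0o600); err != nil {
		return nil, err
	}
	return os.ReadFile(path)
}

func main() {
	got, err := roundTrip([]byte("sealed key share"))
	fmt.Printf("%q %v\n", got, err)
}
```

On some filesystems the rename itself is only durable after the parent directory is also synced; the sketch leaves that step out.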
TSS transport falls back to SendEncrypted when no direct session exists, enabling ceremony completion across sparse meshes where committee members lack direct TCP sessions.
Link-state routing via gossip topology: PeerRecord carries session adjacency, generation-gated BFS routing table rebuilt lazily on topology changes. SendTo and forward use route table with flood fallback when stale.
Admin listener on dedicated port bypasses PeersWanted capacity limits. No gossip, no ping lifecycle; ceremony injection only. handle() takes isAdmin flag; admin sessions skip gossip exchange and rate limiting.
Transport DDoS mitigations: per-session rate limiter drops messages exceeding messageRate, read deadlines on all I/O, reconnection cooldown for rejected peers. notifyAllPeers no longer closes transports on write failure; dead sessions are reaped by pingExpired instead.
PrivateKeyHex neutered in release builds; test code uses DebugPrivateKeyHex (build-tagged continuum_debug).
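One way to read "generation-gated BFS routing table rebuilt lazily" is sketched below: a topology generation counter is bumped on every adjacency update, and the next-hop table is recomputed only when a lookup finds it out of date. All type and method names here are hypothetical, string peer IDs stand in for real identities, and the flood fallback is not shown.

```go
package main

import "fmt"

// routeTable maps each reachable destination to the direct neighbor that
// lies on a shortest path to it.
type routeTable struct {
	self     string
	adj      map[string][]string // peer -> peers it has sessions with
	topoGen  uint64              // bumped on every topology change
	builtGen uint64              // generation the table was built from
	nextHop  map[string]string
}

// setAdjacency records a peer's advertised sessions and marks the table stale.
func (r *routeTable) setAdjacency(peer string, neighbors []string) {
	r.adj[peer] = neighbors
	r.topoGen++
}

// route returns the next hop toward dst, rebuilding the table only if the
// topology changed since the last build.
func (r *routeTable) route(dst string) (string, bool) {
	if r.nextHop == nil || r.builtGen != r.topoGen {
		r.rebuild()
	}
	hop, ok := r.nextHop[dst]
	return hop, ok
}

// rebuild runs a breadth-first search from self. Every node inherits the
// first hop of the node it was discovered from.
func (r *routeTable) rebuild() {
	next := make(map[string]string)
	visited := map[string]bool{r.self: true}
	var queue []string
	for _, n := range r.adj[r.self] {
		if !visited[n] {
			visited[n] = true
			next[n] = n
			queue = append(queue, n)
		}
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, n := range r.adj[cur] {
			if visited[n] {
				continue
			}
			visited[n] = true
			next[n] = next[cur]
			queue = append(queue, n)
		}
	}
	r.nextHop, r.builtGen = next, r.topoGen
}

func main() {
	rt := &routeTable{self: "a", adj: map[string][]string{}}
	rt.setAdjacency("a", []string{"b"})
	rt.setAdjacency("b", []string{"a", "c"})
	rt.setAdjacency("c", []string{"b", "d"})
	hop, ok := rt.route("d")
	fmt.Println(hop, ok)
}
```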
Spins up 10 daemons with PeersWanted=3 (sparse mesh) in chain topology, runs keygen, sign, reshare, post-reshare sign, and second sign. Forces multi-hop encrypted envelope delivery for TSS messages between non-adjacent committee members. Build-tagged continuum_debug; uses admin listener for ceremony injection.
Transport.encrypt() reads encryptKey and nonce.key without holding t.mtx. Concurrent Close() zeroes those fields under the lock, causing a data race detected by -race in TestRPCTSSKeygenCorruptPostSign. Move lock acquisition in write() above the encrypt() call so the entire encrypt+write sequence is synchronized with Close(). Reorder Close() to close the conn before zeroing key material so that in-flight readers blocked in readExact unblock with an I/O error before decrypt/decryptFrameHeader can touch zeroed keys.
Go function signatures must be on a single line. Godoc requires it. Wrapped parameters are not idiomatic Go. Flatten 9 functions across tss.go, tss_round.go, rpc_tss_test.go, and continuum_e2e_test.go.
Unexport SendTo to sendTo — all TSS traffic uses SendEncrypted, sendTo is internal delivery for already-encrypted envelopes. Consolidate scattered const declarations into the main const block. Use reflect.TypeFor instead of reflect.TypeOf((*X)(nil)) in dispatch table and registration. Use early-continue in forward and forwardBroadcast instead of if/else on error. Unwrap three if statements in handle() for readability. Move spew.Sdump calls to Tracef to avoid evaluation when trace logging is disabled. Short-circuit isHostname behind DNS config check. Invert preparams file logic for readability: try open first, fall through to create on ErrNotExist. Use json.NewEncoder instead of MarshalIndent to avoid buffering. Fix SetIndent prefix. Simplify TTL cache initialization — direct field assignment. Simplify initPaillierPrimes call. Unwrap if in hemictl continuumStatus. Remove resolved XXX in continuum_ceremony.go.
Convert all four e2e polling loops to ticker + t.Context() checks. Fix e2e preparams path to use testdata/preparams.json. Use reflect.TypeFor in dispatch test. Merge TestDispatchMapCompleteness and signature test into single test function. Use json.NewEncoder in continuum_test.go preparams helper.
A TSSMessage must never carry a routing header (Destination != nil). Legitimate cleartext TSS is one-hop only (Destination == nil, sent via Write between direct peers). Multi-hop TSS must be wrapped in an EncryptedPayload. A routed cleartext TSSMessage means the sending peer is either buggy or actively leaking TSS round data to the mesh. Both intermediaries and destinations now reject it: the check runs in the handle() loop before forwarding or dispatch, and the offending peer is disconnected immediately (handle returns, triggering session cleanup).
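The rule can be expressed as a small predicate run before forwarding or dispatch. The sketch below uses placeholder types; the real envelope, TSSMessage, and EncryptedPayload definitions in the PR are not shown in this conversation, so every field here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder payload types; the real definitions are not reproduced here.
type TSSMessage struct{ Data []byte }
type EncryptedPayload struct{ Ciphertext []byte }

// envelope is a hypothetical stand-in for the routed message wrapper.
type envelope struct {
	Destination *string // nil means one-hop delivery to a direct peer
	Payload     any
}

var errRoutedCleartextTSS = errors.New("routed cleartext TSSMessage")

// checkEnvelope rejects a cleartext TSSMessage that carries a routing
// header. A caller would disconnect the sending peer on this error.
func checkEnvelope(e envelope) error {
	if _, isTSS := e.Payload.(TSSMessage); isTSS && e.Destination != nil {
		return errRoutedCleartextTSS
	}
	return nil
}

func main() {
	dst := "peer-7"
	fmt.Println(checkEnvelope(envelope{Payload: TSSMessage{}}))
	fmt.Println(checkEnvelope(envelope{Destination: &dst, Payload: EncryptedPayload{}}))
	fmt.Println(checkEnvelope(envelope{Destination: &dst, Payload: TSSMessage{}}))
}
```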
wireHeaderLen is 2, wireData is bounded to 16 MiB, and keyID is a short identifier — none of these sums can overflow on a 64-bit int. CodeQL cannot track constant bounds across variables, so add explicit overflow checks to silence the false positives.
Transport.Close() zeroed decryptKey under mtx.Lock while the read loop accessed it via NaCl decrypt without the mutex. Change mtx to sync.RWMutex; read path takes RLock around decrypt calls (not during blocking I/O), Close takes Lock which waits for in-flight decrypts to finish before zeroing keys. Replace post-arithmetic overflow guard in encryptKeyShare with pre-arithmetic input bounds (maxEncryptSize = 4 MiB). CodeQL flagged the sum as potentially overflowing before the guard could catch it.
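The locking scheme can be sketched as follows: readers hold the read lock only while touching the key, and Close takes the write lock, which waits for in-flight readers before zeroing. The closed flag and the callback-style withDecryptKey helper are additions for illustration; the connection handling and blocking I/O are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTransportClosed = errors.New("transport closed")

// transport is a reduced stand-in holding only the key and its lock.
type transport struct {
	mtx        sync.RWMutex
	closed     bool
	decryptKey [32]byte
}

// withDecryptKey runs fn under the read lock so Close cannot zero the key
// while fn is using it. Blocking reads should happen outside this call.
func (t *transport) withDecryptKey(fn func(key *[32]byte) error) error {
	t.mtx.RLock()
	defer t.mtx.RUnlock()
	if t.closed {
		return errTransportClosed
	}
	return fn(&t.decryptKey)
}

// Close takes the write lock, which blocks until all readers have released
// their read locks, then zeroes the key material.
func (t *transport) Close() {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	t.closed = true
	clear(t.decryptKey[:])
}

func main() {
	t := &transport{decryptKey: [32]byte{1, 2, 3}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = t.withDecryptKey(func(key *[32]byte) error {
				_ = key[0] // stands in for a decrypt call
				return nil
			})
		}()
	}
	t.Close()
	wg.Wait()
	fmt.Println(t.withDecryptKey(func(*[32]byte) error { return nil }))
}
```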
Implements the continuum TSS service end-to-end: a peer mesh network that runs threshold ECDSA/EdDSA ceremonies (keygen, signing, resharing) over encrypted RPC transport.
Architecture
Peer mesh — TCP transport with X25519 ECDH key agreement and NaCl secretbox encryption. Peers discover each other via DNS seeding with forward verification, maintain connections through gossip and liveness pings, and track idle/stale peers via TTL-based eviction. The mesh targets PeersWanted total connections (inbound + outbound) and fills gaps each maintenance cycle by dialing shuffled candidates.
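The gap-filling step can be sketched as: compute how many connections are missing, shuffle the known candidates with a cryptographic random source, and take that many. The Fisher-Yates shuffle over crypto/rand follows an earlier commit message; the function names and the use of plain strings for peer addresses are assumptions.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// shuffle performs an in-place Fisher-Yates shuffle using crypto/rand.
func shuffle(candidates []string) error {
	for i := len(candidates) - 1; i > 0; i-- {
		// rand.Int returns a uniform value in [0, i+1).
		n, err := rand.Int(rand.Reader, big.NewInt(int64(i+1)))
		if err != nil {
			return err
		}
		j := int(n.Int64())
		candidates[i], candidates[j] = candidates[j], candidates[i]
	}
	return nil
}

// pickCandidates returns up to (peersWanted - connected) shuffled candidates
// to dial in one maintenance cycle. The input slice is not modified.
func pickCandidates(known []string, connected, peersWanted int) ([]string, error) {
	gap := peersWanted - connected
	if gap <= 0 {
		return nil, nil
	}
	pool := append([]string(nil), known...)
	if err := shuffle(pool); err != nil {
		return nil, err
	}
	if gap > len(pool) {
		gap = len(pool)
	}
	return pool[:gap], nil
}

func main() {
	known := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
	picks, err := pickCandidates(known, 1, 3)
	fmt.Println(len(picks), err)
}
```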
Ceremony lifecycle — Coordinator election picks the peer with the lowest key hash. The elected coordinator dispatches ceremonies (keygen/sign/reshare) to participants, who execute TSS rounds and exchange messages over the encrypted mesh. Ceremony state is context-scoped with proper cancellation propagation. Results are persisted to a NaCl-encrypted key store (HKDF-derived storage key).
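Election by lowest key hash is deterministic: every peer that sees the same set of keys picks the same coordinator without extra messages. A simplified sketch follows; the use of SHA-256 over the raw public key is an assumption, and the PR's election.go may hash different inputs or apply further steps.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// electCoordinator returns the key whose hash is lexicographically lowest.
// The second result is false when there are no candidates.
func electCoordinator(peerKeys [][]byte) ([]byte, bool) {
	var (
		best     []byte
		bestHash [sha256.Size]byte
		found    bool
	)
	for _, key := range peerKeys {
		h := sha256.Sum256(key)
		if !found || bytes.Compare(h[:], bestHash[:]) < 0 {
			best, bestHash, found = key, h, true
		}
	}
	return best, found
}

func main() {
	peers := [][]byte{[]byte("peer-a"), []byte("peer-b"), []byte("peer-c")}
	winner, ok := electCoordinator(peers)
	fmt.Printf("%s %v\n", winner, ok)
}
```

Hashing the key before comparing keeps the outcome from favoring keys that merely sort low as raw bytes.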
TSS integration — Uses hemilabs/x tss-lib v3 channel-free round functions. Each ceremony is a loop over explicit round calls with message collection gated on committee membership. Resharing supports overlapping old/new committees. Wire format uses package-prefixed type discriminators for 32 message types (21 ECDSA + 11 EdDSA).
What's included
Core service (service/continuum/):
continuum.go — server lifecycle, peer tracking, gossip, maintenance, session management
protocol.go — RPC envelope format, handshake, message routing with hash verification
tss.go — TSS ceremony abstraction (tssImpl), Paillier precompute, key store with NaCl encryption
tss_round.go — round-function ceremony drivers for keygen/sign/reshare
tss_rpc.go — ceremony RPC message types and handlers
tss_wire.go — JSON wire format: marshal/unmarshal with type discriminators
ceremony.go — ceremony struct, context/cancel lifecycle
dispatch.go — type-switch dispatch map replacing monolithic handle()
election.go — coordinator election by lowest key hash
doc.go — package godoc with broadcast scaling analysis
Admin tooling:
cmd/hemictl/continuum.go — hemictl continuum subcommand: status, peers, key info
cmd/hemictl/continuum_ceremony.go — keygen, sign, reshare ceremony commands (gated behind continuum_debug build tag)
cmd/transfunctionerd/ — daemon entry point updates
docker/transfunctionerd/Dockerfile
Infrastructure:
Prometheus: metrics for ceremony counts, peer gauge, broadcast latency
Testing:
Integration tests (continuum_test.go, rpc_test.go, rpc_tss_test.go) — 5-node keygen with broadcast verification, full keygen→sign→reshare lifecycle, transport write/DNS/outbound verify paths, ceremony dispatch error paths, election fuzzing
Unit tests — dispatch map, wire format (38 round-trip + exhaustive type tests), TTL error paths, hemictl ceremony commands
Reference tests (tss_examples/v3_reference_test.go) — v3 round function API usage examples with pre-computed Paillier params
All test nodes use production tssImpl via rpcTransportAdapter over encrypted TCP
Zero time.Sleep in tests — all synchronization via context waits
Key fixes along the way
Unlock-before-cancel to prevent deadlock during broadcast I/O
handleCeremonyResult must not race SaveKeyShare
Sentinel errors and status constants for ceremony lifecycle
Transport payload hash verification (replay/tampering protection)
Session busy response instead of silent drop
Handshake semaphore to bound concurrent connection setup
Forward DNS verification as a policy gate (configurable, loopback exempt)