
Custom validators and selective start profiles #1

Open
lmcorbalan wants to merge 42 commits into main from feat/validator-projects-v2

@lmcorbalan commented on May 15, 2026

What changed

  • canton builder start --with/--without/--validators picks which built-ins boot. Default with no flags is SV + app-provider only.
  • New canton builder validator add/start/stop/rm/list/info. Each custom validator runs as its own Docker Compose project on a shared localnet network, joining the same local SV.
  • Registry at ~/.canton-builder/validators.json (atomic writes under flock) is the source of truth for what exists and what's running. start, stop, deploy, status all read from it.
  • deploy targets every running validator (excluding SV) by default; --validator <name> picks one.
  • reset --purge wipes only known runtime paths instead of rm -rf on the install dir.
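For reviewers, the registry write path can be pictured roughly as below. This is a minimal sketch, assuming flock(1) is on PATH; `registry_write` and its arguments are hypothetical names for illustration, not the functions in this branch.

```shell
# Hypothetical helper: replace the registry file atomically while holding an
# exclusive lock. Assumes flock(1) is installed.
registry_write() {
  local registry="$1" content="$2"
  local dir lock tmp
  dir="$(dirname "$registry")"
  lock="$dir/.registry.lock"
  mkdir -p "$dir"
  (
    flock -x 9                                      # serialize concurrent writers
    tmp="$(mktemp "$dir/validators.json.XXXXXX")"   # same filesystem as the target
    printf '%s\n' "$content" > "$tmp"
    mv -f "$tmp" "$registry"                        # rename swaps the file in one step
  ) 9>"$lock"
}
```

Readers never see a half-written file because the rename either happens or it does not; the lock only keeps two writers from racing each other.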

Why

Hackathon setting: each team wants their own validator joining a single local SV. The original three-validator setup works fine until you add a fourth participant, at which point you either fork the compose stack per team or push everyone to DevNet (which needs whitelisting, which is the whole reason this tool exists).

Selective profiles came up separately. SV plus one validator is enough for most "build against Canton" sessions and saves a few hundred MB of RAM, but the old static --profile sv --profile app-provider --profile app-user line didn't let us skip any.

All 73 bats unit tests pass. Verified on macOS with Splice bundle v0.5.18: start on both the all-profiles and cold-default paths, validator add, and deploy against multiple running validators.

lmcorbalan added 30 commits May 14, 2026 09:46
The wrapper dispatched to scripts/start.sh, scripts/stop.sh, scripts/status.sh,
and scripts/reset.sh without shifting past 'devrel <cmd>' or passing "$@", so
flags like --with, --without, --validators, and --purge were silently dropped.
Match the existing deploy/logs/validator cases that already shift 2 and forward.
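A minimal sketch of the dispatch shape this fix is after. `dispatch` and `run_script` are hypothetical names, and `run_script` is a stub that only echoes what it would run, so the snippet works outside the repo.

```shell
# Stub standing in for the real scripts/<cmd>.sh (assumption, for illustration).
run_script() {
  local name="$1"
  shift
  echo "scripts/${name}.sh $*"
}

dispatch() {
  # Invoked as: dispatch devrel <cmd> [flags...]
  local cmd="${2:-}"
  if [ "$#" -ge 2 ]; then shift 2; else shift "$#"; fi   # drop 'devrel <cmd>'
  case "$cmd" in
    start|stop|status|reset|deploy|logs|validator)
      run_script "$cmd" "$@"                              # forward remaining flags
      ;;
    *)
      echo "unknown command: ${cmd:-<none>}" >&2
      return 1
      ;;
  esac
}
```

Without the `shift 2` and the `"$@"`, a call such as `dispatch devrel reset --purge` would reach the script with no arguments at all, which is the silent drop described above.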
…ctor

The original COMPOSE_CMD always passed --profile sv/app-provider/app-user.
infra_compose_argv() dropped them and only exported *_PROFILE env vars, which
Docker Compose does not consult — so 'docker compose up' selected no services
and aborted with "no service selected".

Always emit --profile sv, then conditionally --profile app-provider /
--profile app-user based on the corresponding env var. Unset defaults to "on"
so non-start callers (stop, logs, reset) keep broad coverage.
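The flag selection can be sketched as follows. The variable names `APP_PROVIDER_PROFILE` and `APP_USER_PROFILE` and the literal value `on` are assumptions for illustration; the commit only says `*_PROFILE` env vars, and `infra_profile_args` is a hypothetical stand-in for the relevant part of `infra_compose_argv()`.

```shell
infra_profile_args() {
  # sv is always selected.
  local args="--profile sv"
  # Unset counts as "on" so stop/logs/reset still cover every service.
  if [ "${APP_PROVIDER_PROFILE:-on}" = "on" ]; then
    args="$args --profile app-provider"
  fi
  if [ "${APP_USER_PROFILE:-on}" = "on" ]; then
    args="$args --profile app-user"
  fi
  printf '%s\n' "$args"
}
```

The point of the design is that profile selection has to reach Docker Compose as CLI flags; exporting the variables alone changes nothing.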
…ation

The Splice LocalNet bundle's env/app-user-auth-on.env and
env/app-provider-auth-on.env set:
  APP_USER_PARTY_HINT=app_user_${PARTY_HINT}
  APP_PROVIDER_PARTY_HINT=app_provider_${PARTY_HINT}

Splice v0.5.18 validates these against <alphanumeric>-<alphanumeric>-<int>.
The literal 'app_user_' / 'app_provider_' prefixes contain underscores, so
the resulting hints always fail validation regardless of PARTY_HINT. Splice
exits 0 on init failure, and `restart: always` then loops it forever, which
looks like a healthcheck, dep-wait, or memory issue.

Add an overlay that sets APP_USER_PARTY_HINT and APP_PROVIDER_PARTY_HINT on
the splice service via environment: (which beats env_file:). This is a
bundle bug worth filing upstream; remove the overlay once it's fixed there.
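To see why the bundle defaults can never pass, here is a small check. The regex is a literal reading of `<alphanumeric>-<alphanumeric>-<int>` and is an assumption; the actual Splice validation may be stricter or looser. `is_valid_party_hint` is a hypothetical helper.

```shell
is_valid_party_hint() {
  # Assumed shape: <alphanumeric>-<alphanumeric>-<int>, no underscores anywhere.
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]+-[A-Za-z0-9]+-[0-9]+$'
}
```

Under this reading `is_valid_party_hint "acme-validator-1"` succeeds, while any value that starts with `app_user_` fails before PARTY_HINT is even considered.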
Two bugs:

1. Line 42 used "$CANTON_DEVREL_DIR…" where the trailing UTF-8 ellipsis got
   parsed as part of the variable name, so 'set -u' rejected it as unbound.
   Brace the expansion: "${CANTON_DEVREL_DIR}…".

2. --purge ran 'rm -rf "$CANTON_DEVREL_DIR"', which for a normal install
   wipes the scripts/binary and for a symlink install removes the symlink.
   Either way it leaves the user unable to run 'canton devrel start' until
   they reinstall. Target only known runtime-state paths: bundle/,
   validators/, nginx-customs/, validators.json, .registry.lock. Preserve
   the install files and .env.
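A sketch of the narrower purge, using the paths listed above. `purge_runtime_state` is a hypothetical name, and the function takes the install dir as an argument only so it can be tried against a scratch directory.

```shell
purge_runtime_state() {
  local root="$1" entry
  if [ -z "$root" ]; then
    echo "refusing to purge: empty root" >&2
    return 1
  fi
  # Only known runtime-state paths; install files and .env are never touched.
  for entry in bundle validators nginx-customs validators.json .registry.lock; do
    rm -rf -- "${root:?}/${entry}"
  done
}
```

The `${root:?}` guard is a second line of defence so an empty variable can never expand to `rm -rf /bundle`.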
'canton devrel status' rendered the wallet URL unconditionally, so a
profile that was off (e.g. app-user with the default active set) showed
http://wallet.localhost:2000 in the WALLET column with HEALTH=DOWN —
implying the link was live. Replace the URL with an em-dash when the
readyz check fails.
The 'LocalNet is up!' summary listed App User and App Provider wallet UIs
and JSON Ledger APIs unconditionally. With a partial-profile boot (e.g.
default 'app-provider only'), the App User URLs were dead links. Guard
each line with the corresponding *_PROFILE env var.
…aults

Three defaults that previously silently produced a broken validator:

- IMAGE_REPO defaulted to a DA-internal JFrog URL; switch to the same
  ghcr.io path the bundle's localnet uses.
- IMAGE_TAG defaulted to empty; resolve it from bundle/splice-node/VERSION
  (matches bundle/start.sh) when the caller hasn't set it.
- SPLICE_APP_UI_* keys were unset; the bundle's start.sh fetches them via
  the scan API, but the canton-devrel custom path bypasses that. Mirror the
  static values from localnet/env/common.env so wallet/ANS UIs render.
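The IMAGE_TAG fallback could look roughly like this. `resolve_image_tag` is a hypothetical name; the only detail taken from the commit is that the tag comes from `bundle/splice-node/VERSION` when the caller has not set IMAGE_TAG.

```shell
resolve_image_tag() {
  local bundle_dir="$1"
  local version_file="$bundle_dir/splice-node/VERSION"
  if [ -n "${IMAGE_TAG:-}" ]; then
    printf '%s\n' "$IMAGE_TAG"            # caller's choice wins
  elif [ -r "$version_file" ]; then
    tr -d '[:space:]' < "$version_file"   # strip the trailing newline
    echo
  else
    echo "cannot resolve IMAGE_TAG: $version_file not readable" >&2
    return 1
  fi
}
```

Failing loudly when neither source is available avoids rendering an image reference with an empty tag.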
lmcorbalan added 12 commits May 14, 2026 15:53
…ojects

The bundle's validator/compose.yaml only host-binds the per-validator nginx
(port 80), and attach-localnet.overlay.yaml unbinds that nginx to avoid :80
collisions across multiple validator-<name> projects. Nothing was rebinding
validator:5003, participant:5001, or participant:7575, so readyz polling and
the SDK ledger clients hit dead air. Re-bind them via the per-validator host
ports already produced by render_custom_env (port_base + {3, 1, 75}).
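The port arithmetic, as a sketch. `custom_host_ports` is a hypothetical helper that prints `host:container` pairs in Compose short syntax; the base value in the usage line is made up.

```shell
custom_host_ports() {
  local port_base="$1"
  echo "$((port_base + 3)):5003"    # validator, polled for readyz
  echo "$((port_base + 1)):5001"    # participant:5001
  echo "$((port_base + 75)):7575"   # participant:7575
}
```

For example, `custom_host_ports 4900` prints `4903:5003`, `4901:5001` and `4975:7575`, following the same offsets as the app-user readyz port 2903.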
Bundle's validator/.env ships PARTICIPANT_IDENTIFIER= empty and compose.yaml
passes it through as SPLICE_APP_VALIDATOR_PARTICIPANT_IDENTIFIER. With a blank
namespace, NodeInitializer's UniqueIdentifier.tryCreate throws
IllegalArgumentException("Daml-LF Party is empty") and crash-loops under
restart: always — masquerading as a slow boot / healthcheck timeout. Bundle's
start.sh defaults it to PARTY_HINT when no -P is given; mirror that.
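The defaulting rule in isolation, as a hedged sketch. `resolve_participant_identifier` is a hypothetical name; the real change lives in the rendered per-validator env.

```shell
resolve_participant_identifier() {
  # Empty or unset PARTICIPANT_IDENTIFIER falls back to PARTY_HINT.
  local id="${PARTICIPANT_IDENTIFIER:-${PARTY_HINT:-}}"
  if [ -z "$id" ]; then
    echo "PARTICIPANT_IDENTIFIER and PARTY_HINT are both empty" >&2
    return 1
  fi
  printf '%s\n' "$id"
}
```

Using `:-` rather than `-` matters here, because the bundle ships the key set but empty.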
…ability

The bundle's localnet SV ships sv-apps.sv.scan.{public-url,internal-url} both
as http://localhost:5012. Works for the SV's own loopback and the bundle's
co-located built-in validators, but unreachable from peer containers — a
custom validator on the shared localnet network gets ECONNREFUSED when
BftScanConnection refreshes its scan list from DsoRules.

Patch both URLs to http://splice:5012 via an overlay mount of sv/app.conf.
public-url is the one peers actually read from DsoRules (despite the name);
internal-url is rewritten alongside it to match the bundle's own peer-mode SV
setup at docker-compose/sv/compose.yaml. Host browsers reach scan via nginx
(scan.localhost:4000), not direct, so rewriting public-url is safe.

OVERLAYS_DIR is now exported by common.sh so docker compose interpolates the
absolute path for the volume mount.
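The substitution itself is small. The filter below only illustrates the rewrite; it is not necessarily how the overlay copy of sv/app.conf was produced, and `rewrite_scan_urls` is a hypothetical name.

```shell
rewrite_scan_urls() {
  # Reads config on stdin, writes the patched config on stdout.
  sed 's#http://localhost:5012#http://splice:5012#g'
}
```

A line such as `public-url = "http://localhost:5012"` (format assumed) comes out pointing at `http://splice:5012`, which peer containers on the shared network can resolve.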
…ndpoint

Bundle's SV hard-codes expected-validator-onboardings to just two pre-shared
secrets (APP_PROVIDER, APP_USER). A custom validator sending anything else —
including empty — gets HTTP 401 "Unknown secret" during onboarding. The fix
uses the documented DevNet self-service: POST /api/sv/v0/devnet/onboard/
validator/prepare returns a one-hour opaque token whose entire body is the
secret to send (a base64 JSON envelope the SV decodes server-side).

_fetch_onboarding_secret POSTs to http://sv.localhost:4000 (override via
SV_SPONSOR_HOST_URL) and validator_add calls it between the registry write
and render_custom_env, threading the body through as a new 4th param of
render_custom_env.
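A rough sketch of the fetch, assuming curl is available and that the endpoint needs no request body (the commit does not say). The function name drops the leading underscore to mark it as an illustration rather than the code in this branch.

```shell
fetch_onboarding_secret() {
  local sponsor="${SV_SPONSOR_HOST_URL:-http://sv.localhost:4000}"
  # -f turns HTTP errors into a non-zero exit; -sS hides progress but keeps errors.
  curl -fsS -X POST "${sponsor}/api/sv/v0/devnet/onboard/validator/prepare"
}
```

The response body is passed on untouched, since the whole body is the secret.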
Two compounding problems on the custom wallet route. Both fixed together:

(a) macOS Monterey+ binds port 5000 to AirPlay Receiver by default. Requests
    to wallet.<name>.localhost:5000 never reach docker — AirPlay intercepts
    and returns 403 with `Server: AirTunes/*`. Moved to :5500 (rendered conf
    listen, overlay publish, advertised URLs, lifecycle test).

(b) Bundle's nginx.conf has `include /etc/nginx/conf.d/*.conf`, a
    non-recursive glob. Our customs land in conf.d/customs/, so they were
    invisible — nginx would have returned 404/444 even past the AirPlay hop.
    Ship an extended nginx.conf overlay (bundle's content verbatim plus
    `include /etc/nginx/conf.d/customs/*.conf;`) and mount it via
    customs.overlay.yaml.
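The glob behaviour in (b) is easy to reproduce with a shell glob, which is non-recursive in the same way. `list_included_confs` is a hypothetical helper and the file names in the usage note are made up.

```shell
list_included_confs() {
  # Mirrors `include <dir>/*.conf`: only direct children of <dir> match.
  local dir="$1" f
  for f in "$dir"/*.conf; do
    if [ -f "$f" ]; then
      basename "$f"
    fi
  done
}
```

With `conf.d/default.conf` and `conf.d/customs/acme.conf` on disk, listing `conf.d` yields only `default.conf`; the customs file shows up only when `conf.d/customs` is included explicitly.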
…dump on rollback

Cold-booting a 4th custom validator on top of an already-running localnet
exercises DB migration, participant identity init, and the SV onboarding
handshake — comfortably over 90s on a memory-constrained Docker Desktop.
Bump _wait_for_custom_ready to 300s default (configurable via
CANTON_DEVREL_VALIDATOR_READY_TIMEOUT_S) and print a progress line every
~30s so the user knows it isn't hung.

When readyz times out and _undo rolls back validators/<name>/, the user has
nothing to debug from. Add _dump_failure_logs that captures docker ps +
docker logs for each container into $CANTON_DEVREL_DIR/last-validator-add-
failure-<name>.log (outside the rolled-back dir so it survives), and
.gitignore them in case the project tree symlink puts them in the repo.
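The wait loop, sketched under assumptions: `wait_for_ready` is a hypothetical name, the probe is whatever command is passed in, and the 2s poll interval is invented. Only the 300s default, the env var name, and the roughly 30s progress cadence come from the commit.

```shell
wait_for_ready() {
  local timeout_s="${CANTON_DEVREL_VALIDATOR_READY_TIMEOUT_S:-300}"
  local poll_s=2 elapsed=0 next_report=30
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0                       # probe succeeded
    fi
    sleep "$poll_s"
    elapsed=$((elapsed + poll_s))    # approximate: ignores time spent in the probe
    if [ "$elapsed" -ge "$next_report" ]; then
      echo "still waiting for readyz (${elapsed}s of ${timeout_s}s)" >&2
      next_report=$((next_report + 30))
    fi
  done
  echo "timed out after ${timeout_s}s" >&2
  return 1
}
```

A caller would pass the readyz probe as arguments, for example a `curl -fsS` against the validator's host port, and run the log dump only when this returns non-zero.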
The default deploy path hardcoded `app-provider + app-user` for both the
readyz precheck and the upload step. After validator-projects-v2 a cold
`canton devrel start` only brings `app-provider` up, and customs (e.g.
`acme`) are legitimate deploy targets — so the bare `canton devrel deploy
<dar>` would bail on port 2903 (app-user readyz) even when the user
intentionally has app-user down.

Resolve targets up-front from the registry (`running == true`, excluding
sv), then precheck + upload per actual target. `--validator <name>`
still selects a single target. Trailing summary now lists the JSON APIs
of the validators we actually uploaded to.
Pulls in main's "Canton DevRel" -> "Canton Builder Tool" rebrand
(canton devrel -> canton builder, ~/.canton-devrel -> ~/.canton-builder)
on top of the validator-projects feature work.

Resolution notes:
- Kept all feature semantics from this branch: --with/--without/--validators
  flags, registry-driven start/stop/deploy targets, custom validator
  lifecycle, --purge that only wipes known runtime paths.
- Adopted main's polish: dropped top-of-file purpose comments, decorative
  section bars, and the "Canton DevRel" naming.
- Propagated the rebrand through files main never touched (validator.sh,
  customenv.sh, registry.sh, resolve.sh, compose.sh, nginxcustom.sh,
  customs.overlay.yaml, .env.example, lifecycle bats).
- Kept env-var names unchanged (CANTON_DEVREL_DIR, DEVREL_DIR,
  CANTON_DEVREL_VALIDATOR_READY_TIMEOUT_S, DEVREL_COMMAND) matching
  main's choice to rename only user-facing surface.

Verification: 73/73 unit bats pass; `canton builder help` and
`canton builder version` work; `canton devrel <anything>` is now
rejected.
…VERSION/SPLICE_DB_*

docker compose disables auto-loading of the project-dir .env as soon as any
--env-file is passed. custom_compose_argv passed only the rendered per-validator
env, so NGINX_VERSION, SPLICE_DB_USER, and other bundle defaults evaluated to
empty — manifesting as `unable to get image 'nginx:': invalid reference format`
on validator add. Listing the bundle .env first (rendered env last so it wins)
matches the bundle's own start.sh behaviour.
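The ordering can be sketched as below. `custom_env_file_args` is a hypothetical stand-in for the relevant part of `custom_compose_argv`, and the paths in the usage note are placeholders. A string-based helper like this breaks on paths with spaces; a real implementation would more likely build an array.

```shell
custom_env_file_args() {
  local bundle_env="$1" rendered_env="$2"
  # Bundle defaults first, rendered per-validator env last so its values win.
  printf '%s %s %s %s\n' "--env-file" "$bundle_env" "--env-file" "$rendered_env"
}
```

For instance, `custom_env_file_args bundle/.env validators/acme/.env` prints `--env-file bundle/.env --env-file validators/acme/.env`, which restores NGINX_VERSION and the SPLICE_DB_* defaults while keeping the per-validator overrides.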
@lmcorbalan changed the title from "validators" to "Custom validators and selective start profiles" on May 15, 2026