feat: local-first function schedulers — declarative config + --image (VOL-271)#18
Conversation
There was a problem hiding this comment.
Pull request overview
Adds “local-first” function scheduler support to the CLI by extending the project manifest to declare schedulers and reconciling them during config deploy, while also enhancing localmode startup to support a locally-built server image via --image for end-to-end testing.
Changes:
- Extend
volcano-config.yamlfunction entries to supportschedulers:(with validation for required scheduler fields and duplicate names). - Add scheduler reconciliation to
config deploy(create/update/unchanged; non-destructive) and include scheduler counts in deploy output. - Add localmode
--imageflag with clear precedence and a fail-fast preflight when a non-default image is not present locally.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| internal/projectconfig/manifest.go | Adds scheduler manifest types and validation; relaxes function validation to allow schedulers-only entries. |
| internal/projectconfig/manifest_test.go | Adds/updates validation test cases for schedulers and schedulers-only function entries. |
| internal/projectconfig/deploy.go | Introduces scheduler reconciliation logic and scheduler counters in Summary. |
| internal/projectconfig/deploy_test.go | Adds fake scheduler reconciler + tests for create/update/unchanged/non-destructive/idempotency behavior. |
| internal/output/projectconfig.go | Prints scheduler reconciliation counts in config deploy summary output. |
| internal/localmode/start_test.go | Adds tests for image precedence and failing when a custom image is missing locally. |
| internal/localmode/service.go | Adds WithImage option and preflight check for overridden images before compose up. |
| internal/localmode/compose.go | Centralizes image selection logic (resolveImage) and adds local image existence check helper. |
| internal/function/function.go | Adds UpdateSchedulerByID to update schedulers in-place (preserving UUID). |
| internal/cmd/localmode/localmode.go | Wires --image flag into volcano start and volcano restart commands and updates help text. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| existingEnabled := false | ||
| if existing.Enabled != nil { | ||
| existingEnabled = *existing.Enabled | ||
| } |
There was a problem hiding this comment.
Fixed in ca8a247. schedulerNeedsUpdate now defaults a nil existing.Enabled to true, matching schedulerState's rendering (nil = enabled), so an omitted enabled no longer forces a spurious update on every deploy. Added a regression subtest (nil existing enabled treated as enabled).
swkeever
left a comment
There was a problem hiding this comment.
Found one scheduler reconciliation convergence issue.
| return api.FunctionSchedulerInput{ | ||
| Name: manifest.Name, | ||
| CronExpression: manifest.Cron, | ||
| Payload: manifest.Payload, |
There was a problem hiding this comment.
Because omitted payloads are treated as the desired empty-object default in schedulerNeedsUpdate, an existing scheduler with a non-empty payload and a manifest that omits payload will enter the update path. This input leaves Payload nil, and api.UpdateFunctionScheduler only serializes payload when input.Payload != nil, so the server preserves the old payload. The deploy then reports an update on every run and never converges. Could we send an explicit empty payload (or otherwise distinguish omitted vs empty) when the desired state is no payload?
There was a problem hiding this comment.
Fixed in c9de2cb. schedulerNeedsUpdate now only diffs payload when the manifest declares one. An omitted payload is treated as server-managed: api.UpdateFunctionScheduler won't serialize a nil payload (so we can't clear it) and the server keeps its value, so flagging an update could never converge. This mirrors how omitted regions are already handled. Added a regression subtest (omitted payload is server-managed).
marckong
left a comment
There was a problem hiding this comment.
Summary: Adds declarative schedulers: under functions in volcano-config.yaml with a non-destructive config deploy reconcile (create/update by name, never deletes), plus a --image flag for localmode that fails fast when a custom image isn't present locally.
Solid, well-tested work — the idempotency handling (canonical-JSON payload comparison, server-managed regions, nil-Enabled→true) is thoughtful and go build/vet/test all pass. One minor UX issue worth a look.
[P2] restart --image <missing> tears down the environment before the image preflight runs
internal/localmode/service.go — Restart calls Stop then Start, and the "image not found locally" preflight lives inside Start (after the stop has already happened):
func (s Service) Restart(ctx context.Context, w io.Writer) error {
if err := s.Stop(ctx, w, false); err != nil {
return err
}
fmt.Fprintln(w)
return s.Start(ctx, w)
}When a user runs volcano restart --image <ref> with a custom image that isn't built locally, the running environment is stopped first and only then does the command fail, leaving them worse off than before. This partially defeats the fail-fast intent described in the PR. Consider validating image existence before Stop on the restart path.
Overall correctness: correct. The finding above is a non-blocking UX nit; the core reconcile and image-resolution logic is sound and covered by tests.
…OL-271) Add schedulers as a first-class declarative resource so they join the declare -> test-local -> promote-to-cloud flow alongside buckets/functions. - manifest: SchedulerManifest nested under functions (name+cron required, optional enabled/payload/regions); validateFunctions relaxed to 'public OR schedulers'; structural-only validation (cron/region deferred to server) - function.Service.UpdateSchedulerByID: in-place full update preserving the scheduler UUID (no delete+recreate) - deploy: reconcileSchedulers diffs by name and creates/updates/leaves unchanged; NON-DESTRUCTIVE (never deletes undeclared/ad-hoc schedulers); idempotent payload (canonical JSON) and region (server-managed when omitted) comparison; Summary + output extended - guard reconcileFunctions against nil Public for schedulers-only entries Reuses the existing cloud scheduler REST API + client, so the same manifest applies to localmode and cloud.
Let the CLI knowingly run an unpublished, locally-built local-mode image: - start/restart accept --image <ref> (highest precedence: flag > VOLCANO_IMAGE env > project .env.local > bundled default kong/volcano:local-nightly) - Service.WithImage option + resolveImage() single source of precedence - pre-flight: when the resolved image is custom (differs from the default), it must already exist locally; the CLI never pulls it. Missing -> hard fail with an actionable build message instead of a confusing compose registry error - the published default is left to compose's normal pull-if-missing behavior Tests: resolveImage precedence (flag/env/.env.local/default + custom detection) and Start hard-fail when a custom image is absent.
…re announcing (review) - README: document the schedulers block in volcano-config.yaml, the non-destructive reconcile (never deletes; remove via imperative delete), and that omitted regions are server-managed. - start: run the custom-image local-presence check before printing 'Using Docker image' so we never announce an image we then reject.
…review) schedulerNeedsUpdate defaulted a nil existing.Enabled to false while schedulerState renders nil as enabled; an API that omits enabled would then force a spurious update every deploy. Default existingEnabled to true to keep reconcile idempotent; add a regression subtest.
CI 'check' (golangci-lint) flagged issues my local go vet/gofmt missed: - deploy.go: bytes.Equal instead of string(a)==string(b) (gocritic) - deploy_test.go fake: maps.Clone for payload copies (modernize), invert+ continue to reduce nesting (gocritic), errors.New over fmt.Errorf (perfsprint), and remove the dead duplicateNames flag + empty block (revive). Verified with 'go tool golangci-lint run ./...' (0 issues).
swkeever: an existing non-empty payload + a manifest that omits payload would enter the update path forever — api.UpdateFunctionScheduler doesn't serialize a nil payload, so the server keeps its value and the deploy never converges. Only diff/reconcile payload when the manifest declares one (mirrors omitted-regions handling). Adds a regression subtest.
f144375 to
c9de2cb
Compare
swkeever
left a comment
There was a problem hiding this comment.
Reviewed the scheduler reconcile + --image work. Solid overall — non-destructive reconcile, the payload/regions idempotency handling, and the fail-fast image preflight all look right, and test coverage is good. Two correctness findings left inline (one is the same non-convergence class as the payload fix in c9de2cb, applied to enabled). Not blocking the merge — leaving as comments.
Validation: reviewed the cumulative diff and the generated apiclient request/response types (UpdateFunctionSchedulerRequest, FunctionScheduler) plus api.UpdateFunctionScheduler/ListFunctionSchedulers to confirm the serialization assumptions. Did not run the test suite locally.
| if existing.Enabled != nil { | ||
| existingEnabled = *existing.Enabled | ||
| } | ||
| if existingEnabled != desiredEnabled { |
There was a problem hiding this comment.
Non-convergent update when an existing scheduler is disabled and the manifest omits enabled.
Here desiredEnabled defaults to true when the manifest omits enabled, but buildSchedulerInput (line 411) sends the raw manifest.Enabled (nil). The generated body is Enabled *bool json:"enabled,omitempty" and api.UpdateFunctionScheduler assigns body.Enabled = input.Enabled directly, so a nil Enabled is dropped from the request and the server leaves the field unchanged.
Failure mode: a scheduler that exists on the server as enabled=false and is re-declared in the manifest without an explicit enabled: false → existingEnabled=false, desiredEnabled=true → schedulerNeedsUpdate returns true on every deploy → the update is sent with enabled omitted → server stays disabled → next config deploy detects the same diff. It never converges and reports "1 updated" forever. This is the same class as the payload fix in c9de2cb, but enabled wasn't given the same treatment.
Suggested fix — mirror the payload/regions handling and only compare when the manifest declares it:
if desired.Enabled != nil {
existingEnabled := true
if existing.Enabled != nil {
existingEnabled = *existing.Enabled
}
if existingEnabled != *desired.Enabled {
return true
}
}(Alternatively, send the defaulted value from buildSchedulerInput so the update actually enforces enabled=true — but that silently re-enables a manually-disabled scheduler on every deploy, which seems contrary to the non-destructive stance. The skip-when-omitted approach is consistent with how payload and regions are handled just below.)
There was a problem hiding this comment.
Addressed in 07f6f5e — we went with the skip-when-omitted approach you suggested. The team's call was to follow the server's own contract: an omitted enabled means "keep existing" (the update handler does enabled := existing.Enabled when the field is absent). So buildSchedulerInput now forwards the manifest's enabled as-is (nil is dropped on the wire → server keeps state; create still defaults to true), and schedulerNeedsUpdate only diffs enabled when the manifest declares it — consistent with how payload and regions are handled, and it no longer re-enables a deliberately-disabled scheduler on deploy.
This supersedes the re-enable approach in e54ab31; I replaced that test with TestReconcileSchedulersLeavesDisabledWhenEnabledOmitted plus explicit enable/disable subtests in TestSchedulerNeedsUpdateIdempotency.
| // Build map of existing schedulers by Name | ||
| existingByName := make(map[string]apiclient.FunctionScheduler) | ||
| if listResp != nil { | ||
| for _, existing := range listResp.Data { |
There was a problem hiding this comment.
existingByName is built from a single page of schedulers. function.Service.ListSchedulers → api.Client.ListFunctionSchedulers calls ListFunctionSchedulersWithResponse with no page/limit and returns only the first page; the HasMore/Page/Total fields on FunctionSchedulerListResponse are ignored. reconcileFunctions paginates its function listing (loops until !HasMore), but this path doesn't.
If a function has more schedulers than the server's default page size, any declared scheduler living beyond page 1 won't be found in existingByName and will be treated as missing → CreateSchedulerByID → duplicate creation (or a server unique-name rejection). Worth paginating ListSchedulers (or guarding/documenting the limit). Lower severity — only bites functions with many schedulers.
There was a problem hiding this comment.
Addressed in 07f6f5e. The function-scoped scheduler list has no page parameter on the generated client and the server returns the full set (has_more=false) today, so rather than add pagination the CLI can't actually drive, reconcileSchedulers now fails fast if has_more is ever true — avoiding silent duplicate creation should the server start paginating. Added TestReconcileSchedulersAbortsOnPaginatedList. (True pagination would need a client/server change; worth a follow-up only if a single function ever exceeds one page of schedulers.)
marckong
left a comment
There was a problem hiding this comment.
Review: local-first function schedulers + --image
Summary: Adds declarative schedulers: under functions in volcano-config.yaml with a non-destructive config deploy reconcile (create/update, name-keyed, idempotent), plus a --image flag for local mode that runs a locally-built server image and fails fast if it isn't present. Well-tested, and the reconcile semantics for payload/regions are carefully thought through.
Solid, cohesive change — one correctness issue worth fixing before merge.
[P2] config deploy can't re-enable a disabled scheduler when the manifest omits enabled
internal/projectconfig/deploy.go — buildSchedulerInput / schedulerNeedsUpdate
schedulerNeedsUpdate defaults a desired enabled to true when the manifest omits it, so if the server scheduler is currently disabled the diff reports a needed update. But buildSchedulerInput forwards manifest.Enabled verbatim (nil), and UpdateFunctionScheduler serializes that as an omitted field (enabled,omitempty) — which, per this PR's own payload/regions reasoning, the server leaves unchanged. The scheduler stays disabled, the run reports Schedulers: … updated, and the next deploy repeats indefinitely (never converges). This only arises when a scheduler is disabled server-side while the manifest omits enabled. Defaulting the input to match the comparison fixes it:
func buildSchedulerInput(manifest SchedulerManifest) api.FunctionSchedulerInput {
enabled := true
if manifest.Enabled != nil {
enabled = *manifest.Enabled
}
return api.FunctionSchedulerInput{
Name: manifest.Name,
CronExpression: manifest.Cron,
Payload: manifest.Payload,
Regions: manifest.Regions,
Enabled: &enabled,
}
}Minor (non-blocking): Restart runs Stop then Start, and the new --image existence preflight lives inside Start — so restart --image <missing> tears the environment down before failing. Moving the check ahead of Stop would preserve the fail-fast intent, but this matches the pre-existing "Start can fail after Stop" behavior, so it's optional.
Overall correctness: not fully correct — the enabled non-convergence is a real (if narrow) idempotency bug.
Two correctness fixes from the PR review: config: when a manifest omits 'enabled', buildSchedulerInput forwarded a nil *bool, which UpdateFunctionScheduler drops on the wire (enabled,omitempty) and the server leaves unchanged. schedulerNeedsUpdate, however, defaults a desired enabled to true, so a server-disabled scheduler was reported as needing an update on every deploy without ever converging. Default the input to true (matching the comparison and the documented default) so the update actually re-enables the scheduler. localmode: move the custom-image existence preflight into ensureCustomImageAvailable and call it before teardown in Restart, so 'restart --image <missing>' fails fast while the running stack stays intact instead of being stopped first. Adds regression tests for re-enable convergence and restart preflight.
|
|
…uard paginated list Team decision: the manifest follows the server's own reconcile contract — declared fields are enforced; omitted fields are server-managed. - enabled: supersede the re-enable approach in e54ab31. buildSchedulerInput now forwards the manifest's enabled as-is (nil when omitted, which the update body drops so the server keeps the current state; create still defaults true), and schedulerNeedsUpdate only diffs enabled when the manifest declares it. This mirrors the server (omitted enabled = keep existing) and stops config deploy from re-enabling a deliberately-disabled scheduler — matching how payload and regions are already handled. Replaces the re-enable convergence test. - pagination (swkeever): the scheduler list is read as a single page and the client has no page param; fail fast if has_more is ever set rather than silently recreating schedulers beyond page 1. - README: document the single rule (declared=enforced, omitted=server-managed; deploy never deletes or disables; cron always required). Keeps marckong's localmode restart --image preflight from e54ab31 intact.
|
Heads up on 07f6f5e (builds on @swkeever's review + @marckong's e54ab31): The manifest now follows the server's own reconcile contract for schedulers — declared fields are enforced; omitted fields are server-managed. Concretely, an omitted This supersedes the re-enable approach for Known follow-ups (not in this PR): no server-side unique constraint on scheduler name-per-function (CLI guards defensively), and declarative deletion/pruning is intentionally out of scope (deploy is additive). |
swkeever
left a comment
There was a problem hiding this comment.
Reviewed — looks solid. The reconcile model (declared = enforced, omitted = server-managed) is applied consistently across enabled/payload/regions, and the tricky convergence cases (disabled scheduler left alone, explicit re-enable, omitted-payload no-churn) plus the pagination guard are all covered by tests. The --image local-only preflight and its placement before teardown in restart are a nice touch.
Approving. Two non-blocking nits inline.
| if existing.Id == nil { | ||
| return fmt.Errorf("function %q scheduler %q: existing scheduler has no ID", fnManifest.Name, desired.Name) | ||
| } | ||
| schedulerID, err := uuid.Parse(existing.Id.String()) |
There was a problem hiding this comment.
Non-blocking nit: existing.Id is already a *uuid.UUID (openapi_types.UUID is a type alias for uuid.UUID), so this stringify→reparse round-trip reduces to schedulerID := *existing.Id. That drops the redundant parse and its never-hit error branch — the existing.Id == nil guard just above already covers the only real failure mode.
| func (s Service) Restart(ctx context.Context, w io.Writer) error { | ||
| // Validate a custom image before tearing the environment down, so a bad | ||
| // --image leaves the running stack intact instead of stopped. | ||
| if err := s.ensureCustomImageAvailable(ctx); err != nil { |
There was a problem hiding this comment.
Non-blocking: this preflight runs before Start's checkDocker, so restart --image <custom> with the Docker daemon down fails with image "…" not found locally … build it … rather than a "Docker not available" message (docker image inspect errors → imageExistsLocally returns false). Low impact — building the image needs Docker anyway — but slightly misleading. A checkDocker here first would give the clearer message.
Summary
Local-first function schedulers on the CLI: declare schedulers in
volcano-config.yamland deploy them (to localmode or cloud), plus a--imageflag so localmode can run a locally-built server image for end-to-end testing.Jira: VOL-271 (declarative config). Depends on volcano-hosting VOL-270 (local execution).
Commits
feat(config): declarative function schedulers (VOL-271) —
f32e9e9schedulers:nested under functions (name + cron required; optional enabled/payload/regions).validateFunctionsrelaxed to "public OR schedulers"; structural-only validation (cron/region rules deferred to the server).function.Service.UpdateSchedulerByID: full in-place update preserving the scheduler UUID (no delete+recreate).config deployreconcile:reconcileSchedulersdiffs by name → create / update / unchanged. Non-destructive: never deletes undeclared/ad-hoc schedulers (no managed-by marker exists). Idempotent payload (canonical JSON) and region (server-managed when omitted) comparison.Summary+ output extended. GuardsreconcileFunctionsagainst nilpublicfor schedulers-only entries.feat(localmode):
--imageflag —5fc5b89volcano start/restart --image <ref>(precedence: flag >VOLCANO_IMAGE> project.env.local> default).docker image inspect; if missing → hard fail with an actionable build message instead of a confusing registry-pull error. The published default still pulls as before.Testing
go build ./...✓ ·go vet ./...clean ·gofmtclean ·go test ./...all packages ✓ — including reconcile (create/update/unchanged, non-destructive, duplicate-name), schedulers-only validation, idempotency,resolveImageprecedence, and the custom-image hard-fail.