Skip to content

feat: local-first function schedulers — declarative config + --image (VOL-271)#18

Merged
marckong merged 8 commits into
mainfrom
feat/local-scheduler
Jun 26, 2026
Merged

feat: local-first function schedulers — declarative config + --image (VOL-271)#18
marckong merged 8 commits into
mainfrom
feat/local-scheduler

Conversation

@tkkhq

@tkkhq tkkhq commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Local-first function schedulers on the CLI: declare schedulers in volcano-config.yaml and deploy them (to localmode or cloud), plus a --image flag so localmode can run a locally-built server image for end-to-end testing.

Jira: VOL-271 (declarative config). Depends on volcano-hosting VOL-270 (local execution).

Commits

feat(config): declarative function schedulers (VOL-271) — f32e9e9

  • manifest: schedulers: nested under functions (name + cron required; optional enabled/payload/regions). validateFunctions relaxed to "public OR schedulers"; structural-only validation (cron/region rules deferred to the server).
  • function.Service.UpdateSchedulerByID: full in-place update preserving the scheduler UUID (no delete+recreate).
  • config deploy reconcile: reconcileSchedulers diffs by name → create / update / unchanged. Non-destructive: never deletes undeclared/ad-hoc schedulers (no managed-by marker exists). Idempotent payload (canonical JSON) and region (server-managed when omitted) comparison. Summary + output extended. Guards reconcileFunctions against nil public for schedulers-only entries.
  • Reuses the existing cloud scheduler REST API + client, so the same manifest applies to localmode and cloud.

feat(localmode): --image flag — 5fc5b89

  • volcano start/restart --image <ref> (precedence: flag > VOLCANO_IMAGE > project .env.local > default).
  • A custom image (≠ bundled default) is treated as local-only: pre-flight docker image inspect; if missing → hard fail with an actionable build message instead of a confusing registry-pull error. The published default still pulls as before.

Testing

go build ./... ✓ · go vet ./... clean · gofmt clean · go test ./... all packages ✓ — including reconcile (create/update/unchanged, non-destructive, duplicate-name), schedulers-only validation, idempotency, resolveImage precedence, and the custom-image hard-fail.

@tkkhq tkkhq marked this pull request as ready for review June 25, 2026 17:52
@tkkhq tkkhq requested a review from a team as a code owner June 25, 2026 17:52
Copilot AI review requested due to automatic review settings June 25, 2026 17:52

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds “local-first” function scheduler support to the CLI by extending the project manifest to declare schedulers and reconciling them during config deploy, while also enhancing localmode startup to support a locally-built server image via --image for end-to-end testing.

Changes:

  • Extend volcano-config.yaml function entries to support schedulers: (with validation for required scheduler fields and duplicate names).
  • Add scheduler reconciliation to config deploy (create/update/unchanged; non-destructive) and include scheduler counts in deploy output.
  • Add localmode --image flag with clear precedence and a fail-fast preflight when a non-default image is not present locally.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
internal/projectconfig/manifest.go Adds scheduler manifest types and validation; relaxes function validation to allow schedulers-only entries.
internal/projectconfig/manifest_test.go Adds/updates validation test cases for schedulers and schedulers-only function entries.
internal/projectconfig/deploy.go Introduces scheduler reconciliation logic and scheduler counters in Summary.
internal/projectconfig/deploy_test.go Adds fake scheduler reconciler + tests for create/update/unchanged/non-destructive/idempotency behavior.
internal/output/projectconfig.go Prints scheduler reconciliation counts in config deploy summary output.
internal/localmode/start_test.go Adds tests for image precedence and failing when a custom image is missing locally.
internal/localmode/service.go Adds WithImage option and preflight check for overridden images before compose up.
internal/localmode/compose.go Centralizes image selection logic (resolveImage) and adds local image existence check helper.
internal/function/function.go Adds UpdateSchedulerByID to update schedulers in-place (preserving UUID).
internal/cmd/localmode/localmode.go Wires --image flag into volcano start and volcano restart commands and updates help text.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/projectconfig/deploy.go Outdated
Comment on lines +425 to +428
existingEnabled := false
if existing.Enabled != nil {
existingEnabled = *existing.Enabled
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ca8a247. schedulerNeedsUpdate now defaults a nil existing.Enabled to true, matching schedulerState's rendering (nil = enabled), so an omitted enabled no longer forces a spurious update on every deploy. Added a regression subtest (nil existing enabled treated as enabled).

@swkeever swkeever left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one scheduler reconciliation convergence issue.

return api.FunctionSchedulerInput{
Name: manifest.Name,
CronExpression: manifest.Cron,
Payload: manifest.Payload,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because omitted payloads are treated as the desired empty-object default in schedulerNeedsUpdate, an existing scheduler with a non-empty payload and a manifest that omits payload will enter the update path. This input leaves Payload nil, and api.UpdateFunctionScheduler only serializes payload when input.Payload != nil, so the server preserves the old payload. The deploy then reports an update on every run and never converges. Could we send an explicit empty payload (or otherwise distinguish omitted vs empty) when the desired state is no payload?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c9de2cb. schedulerNeedsUpdate now only diffs payload when the manifest declares one. An omitted payload is treated as server-managed: api.UpdateFunctionScheduler won't serialize a nil payload (so we can't clear it) and the server keeps its value, so flagging an update could never converge. This mirrors how omitted regions are already handled. Added a regression subtest (omitted payload is server-managed).

@marckong marckong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary: Adds declarative schedulers: under functions in volcano-config.yaml with a non-destructive config deploy reconcile (create/update by name, never deletes), plus a --image flag for localmode that fails fast when a custom image isn't present locally.

Solid, well-tested work — the idempotency handling (canonical-JSON payload comparison, server-managed regions, nil-Enabled→true) is thoughtful and go build/vet/test all pass. One minor UX issue worth a look.

[P2] restart --image <missing> tears down the environment before the image preflight runs

internal/localmode/service.goRestart calls Stop then Start, and the "image not found locally" preflight lives inside Start (after the stop has already happened):

func (s Service) Restart(ctx context.Context, w io.Writer) error {
	if err := s.Stop(ctx, w, false); err != nil {
		return err
	}
	fmt.Fprintln(w)
	return s.Start(ctx, w)
}

When a user runs volcano restart --image <ref> with a custom image that isn't built locally, the running environment is stopped first and only then does the command fail, leaving them worse off than before. This partially defeats the fail-fast intent described in the PR. Consider validating image existence before Stop on the restart path.

Overall correctness: correct. The finding above is a non-blocking UX nit; the core reconcile and image-resolution logic is sound and covered by tests.

tkkhq added 6 commits June 25, 2026 17:15
…OL-271)

Add schedulers as a first-class declarative resource so they join the
declare -> test-local -> promote-to-cloud flow alongside buckets/functions.

- manifest: SchedulerManifest nested under functions (name+cron required,
  optional enabled/payload/regions); validateFunctions relaxed to
  'public OR schedulers'; structural-only validation (cron/region deferred
  to server)
- function.Service.UpdateSchedulerByID: in-place full update preserving the
  scheduler UUID (no delete+recreate)
- deploy: reconcileSchedulers diffs by name and creates/updates/leaves
  unchanged; NON-DESTRUCTIVE (never deletes undeclared/ad-hoc schedulers);
  idempotent payload (canonical JSON) and region (server-managed when
  omitted) comparison; Summary + output extended
- guard reconcileFunctions against nil Public for schedulers-only entries

Reuses the existing cloud scheduler REST API + client, so the same manifest
applies to localmode and cloud.
Let the CLI knowingly run an unpublished, locally-built local-mode image:

- start/restart accept --image <ref> (highest precedence: flag > VOLCANO_IMAGE
  env > project .env.local > bundled default kong/volcano:local-nightly)
- Service.WithImage option + resolveImage() single source of precedence
- pre-flight: when the resolved image is custom (differs from the default), it
  must already exist locally; the CLI never pulls it. Missing -> hard fail with
  an actionable build message instead of a confusing compose registry error
- the published default is left to compose's normal pull-if-missing behavior

Tests: resolveImage precedence (flag/env/.env.local/default + custom detection)
and Start hard-fail when a custom image is absent.
…re announcing (review)

- README: document the schedulers block in volcano-config.yaml, the
  non-destructive reconcile (never deletes; remove via imperative delete),
  and that omitted regions are server-managed.
- start: run the custom-image local-presence check before printing
  'Using Docker image' so we never announce an image we then reject.
…review)

schedulerNeedsUpdate defaulted a nil existing.Enabled to false while
schedulerState renders nil as enabled; an API that omits enabled would then
force a spurious update every deploy. Default existingEnabled to true to keep
reconcile idempotent; add a regression subtest.
CI 'check' (golangci-lint) flagged issues my local go vet/gofmt missed:
- deploy.go: bytes.Equal instead of string(a)==string(b) (gocritic)
- deploy_test.go fake: maps.Clone for payload copies (modernize), invert+
  continue to reduce nesting (gocritic), errors.New over fmt.Errorf
  (perfsprint), and remove the dead duplicateNames flag + empty block (revive).
Verified with 'go tool golangci-lint run ./...' (0 issues).
swkeever: an existing non-empty payload + a manifest that omits payload would
enter the update path forever — api.UpdateFunctionScheduler doesn't serialize a
nil payload, so the server keeps its value and the deploy never converges. Only
diff/reconcile payload when the manifest declares one (mirrors omitted-regions
handling). Adds a regression subtest.
@tkkhq tkkhq force-pushed the feat/local-scheduler branch from f144375 to c9de2cb Compare June 25, 2026 21:23
@tkkhq tkkhq requested review from marckong and swkeever June 25, 2026 21:25

@swkeever swkeever left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the scheduler reconcile + --image work. Solid overall — non-destructive reconcile, the payload/regions idempotency handling, and the fail-fast image preflight all look right, and test coverage is good. Two correctness findings left inline (one is the same non-convergence class as the payload fix in c9de2cb, applied to enabled). Not blocking the merge — leaving as comments.

Validation: reviewed the cumulative diff and the generated apiclient request/response types (UpdateFunctionSchedulerRequest, FunctionScheduler) plus api.UpdateFunctionScheduler/ListFunctionSchedulers to confirm the serialization assumptions. Did not run the test suite locally.

Comment thread internal/projectconfig/deploy.go Outdated
if existing.Enabled != nil {
existingEnabled = *existing.Enabled
}
if existingEnabled != desiredEnabled {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-convergent update when an existing scheduler is disabled and the manifest omits enabled.

Here desiredEnabled defaults to true when the manifest omits enabled, but buildSchedulerInput (line 411) sends the raw manifest.Enabled (nil). The generated body is Enabled *bool json:"enabled,omitempty" and api.UpdateFunctionScheduler assigns body.Enabled = input.Enabled directly, so a nil Enabled is dropped from the request and the server leaves the field unchanged.

Failure mode: a scheduler that exists on the server as enabled=false and is re-declared in the manifest without an explicit enabled: falseexistingEnabled=false, desiredEnabled=trueschedulerNeedsUpdate returns true on every deploy → the update is sent with enabled omitted → server stays disabled → next config deploy detects the same diff. It never converges and reports "1 updated" forever. This is the same class as the payload fix in c9de2cb, but enabled wasn't given the same treatment.

Suggested fix — mirror the payload/regions handling and only compare when the manifest declares it:

if desired.Enabled != nil {
    existingEnabled := true
    if existing.Enabled != nil {
        existingEnabled = *existing.Enabled
    }
    if existingEnabled != *desired.Enabled {
        return true
    }
}

(Alternatively, send the defaulted value from buildSchedulerInput so the update actually enforces enabled=true — but that silently re-enables a manually-disabled scheduler on every deploy, which seems contrary to the non-destructive stance. The skip-when-omitted approach is consistent with how payload and regions are handled just below.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 07f6f5e — we went with the skip-when-omitted approach you suggested. The team's call was to follow the server's own contract: an omitted enabled means "keep existing" (the update handler does enabled := existing.Enabled when the field is absent). So buildSchedulerInput now forwards the manifest's enabled as-is (nil is dropped on the wire → server keeps state; create still defaults to true), and schedulerNeedsUpdate only diffs enabled when the manifest declares it — consistent with how payload and regions are handled, and it no longer re-enables a deliberately-disabled scheduler on deploy.

This supersedes the re-enable approach in e54ab31; I replaced that test with TestReconcileSchedulersLeavesDisabledWhenEnabledOmitted plus explicit enable/disable subtests in TestSchedulerNeedsUpdateIdempotency.

// Build map of existing schedulers by Name
existingByName := make(map[string]apiclient.FunctionScheduler)
if listResp != nil {
for _, existing := range listResp.Data {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

existingByName is built from a single page of schedulers. function.Service.ListSchedulersapi.Client.ListFunctionSchedulers calls ListFunctionSchedulersWithResponse with no page/limit and returns only the first page; the HasMore/Page/Total fields on FunctionSchedulerListResponse are ignored. reconcileFunctions paginates its function listing (loops until !HasMore), but this path doesn't.

If a function has more schedulers than the server's default page size, any declared scheduler living beyond page 1 won't be found in existingByName and will be treated as missing → CreateSchedulerByID → duplicate creation (or a server unique-name rejection). Worth paginating ListSchedulers (or guarding/documenting the limit). Lower severity — only bites functions with many schedulers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 07f6f5e. The function-scoped scheduler list has no page parameter on the generated client and the server returns the full set (has_more=false) today, so rather than add pagination the CLI can't actually drive, reconcileSchedulers now fails fast if has_more is ever true — avoiding silent duplicate creation should the server start paginating. Added TestReconcileSchedulersAbortsOnPaginatedList. (True pagination would need a client/server change; worth a follow-up only if a single function ever exceeds one page of schedulers.)

@marckong marckong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: local-first function schedulers + --image

Summary: Adds declarative schedulers: under functions in volcano-config.yaml with a non-destructive config deploy reconcile (create/update, name-keyed, idempotent), plus a --image flag for local mode that runs a locally-built server image and fails fast if it isn't present. Well-tested, and the reconcile semantics for payload/regions are carefully thought through.

Solid, cohesive change — one correctness issue worth fixing before merge.

[P2] config deploy can't re-enable a disabled scheduler when the manifest omits enabled

internal/projectconfig/deploy.gobuildSchedulerInput / schedulerNeedsUpdate

schedulerNeedsUpdate defaults a desired enabled to true when the manifest omits it, so if the server scheduler is currently disabled the diff reports a needed update. But buildSchedulerInput forwards manifest.Enabled verbatim (nil), and UpdateFunctionScheduler serializes that as an omitted field (enabled,omitempty) — which, per this PR's own payload/regions reasoning, the server leaves unchanged. The scheduler stays disabled, the run reports Schedulers: … updated, and the next deploy repeats indefinitely (never converges). This only arises when a scheduler is disabled server-side while the manifest omits enabled. Defaulting the input to match the comparison fixes it:

func buildSchedulerInput(manifest SchedulerManifest) api.FunctionSchedulerInput {
	enabled := true
	if manifest.Enabled != nil {
		enabled = *manifest.Enabled
	}
	return api.FunctionSchedulerInput{
		Name:           manifest.Name,
		CronExpression: manifest.Cron,
		Payload:        manifest.Payload,
		Regions:        manifest.Regions,
		Enabled:        &enabled,
	}
}

Minor (non-blocking): Restart runs Stop then Start, and the new --image existence preflight lives inside Start — so restart --image <missing> tears the environment down before failing. Moving the check ahead of Stop would preserve the fail-fast intent, but this matches the pre-existing "Start can fail after Stop" behavior, so it's optional.

Overall correctness: not fully correct — the enabled non-convergence is a real (if narrow) idempotency bug.

Two correctness fixes from the PR review:

config: when a manifest omits 'enabled', buildSchedulerInput forwarded a
nil *bool, which UpdateFunctionScheduler drops on the wire
(enabled,omitempty) and the server leaves unchanged. schedulerNeedsUpdate,
however, defaults a desired enabled to true, so a server-disabled scheduler
was reported as needing an update on every deploy without ever converging.
Default the input to true (matching the comparison and the documented
default) so the update actually re-enables the scheduler.

localmode: move the custom-image existence preflight into
ensureCustomImageAvailable and call it before teardown in Restart, so
'restart --image <missing>' fails fast while the running stack stays intact
instead of being stopped first.

Adds regression tests for re-enable convergence and restart preflight.
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ tkkhq
❌ marckong
You have signed the CLA already but the status is still pending? Let us recheck it.

…uard paginated list

Team decision: the manifest follows the server's own reconcile contract —
declared fields are enforced; omitted fields are server-managed.

- enabled: supersede the re-enable approach in e54ab31. buildSchedulerInput now
  forwards the manifest's enabled as-is (nil when omitted, which the update body
  drops so the server keeps the current state; create still defaults true), and
  schedulerNeedsUpdate only diffs enabled when the manifest declares it. This
  mirrors the server (omitted enabled = keep existing) and stops config deploy
  from re-enabling a deliberately-disabled scheduler — matching how payload and
  regions are already handled. Replaces the re-enable convergence test.
- pagination (swkeever): the scheduler list is read as a single page and the
  client has no page param; fail fast if has_more is ever set rather than
  silently recreating schedulers beyond page 1.
- README: document the single rule (declared=enforced, omitted=server-managed;
  deploy never deletes or disables; cron always required).

Keeps marckong's localmode restart --image preflight from e54ab31 intact.
@tkkhq

tkkhq commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Heads up on 07f6f5e (builds on @swkeever's review + @marckong's e54ab31):

The manifest now follows the server's own reconcile contract for schedulers — declared fields are enforced; omitted fields are server-managed. Concretely, an omitted enabled/payload/regions keeps whatever the server has (create still applies server defaults), and config deploy will not re-enable a scheduler disabled out-of-band unless the manifest says enabled: true.

This supersedes the re-enable approach for enabled from e54ab31 (buildSchedulerInput now forwards the manifest value as-is; schedulerNeedsUpdate only diffs enabled when declared). @marckong — I kept your restart --image preflight from e54ab31 intact; only the enabled handling changed. Also addressed the paginated-list guard, and documented the single rule in the README.

Known follow-ups (not in this PR): no server-side unique constraint on scheduler name-per-function (CLI guards defensively), and declarative deletion/pruning is intentionally out of scope (deploy is additive).

@tkkhq tkkhq requested review from marckong and swkeever June 26, 2026 17:11

@swkeever swkeever left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed — looks solid. The reconcile model (declared = enforced, omitted = server-managed) is applied consistently across enabled/payload/regions, and the tricky convergence cases (disabled scheduler left alone, explicit re-enable, omitted-payload no-churn) plus the pagination guard are all covered by tests. The --image local-only preflight and its placement before teardown in restart are a nice touch.

Approving. Two non-blocking nits inline.

if existing.Id == nil {
return fmt.Errorf("function %q scheduler %q: existing scheduler has no ID", fnManifest.Name, desired.Name)
}
schedulerID, err := uuid.Parse(existing.Id.String())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking nit: existing.Id is already a *uuid.UUID (openapi_types.UUID is a type alias for uuid.UUID), so this stringify→reparse round-trip reduces to schedulerID := *existing.Id. That drops the redundant parse and its never-hit error branch — the existing.Id == nil guard just above already covers the only real failure mode.

func (s Service) Restart(ctx context.Context, w io.Writer) error {
// Validate a custom image before tearing the environment down, so a bad
// --image leaves the running stack intact instead of stopped.
if err := s.ensureCustomImageAvailable(ctx); err != nil {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: this preflight runs before Start's checkDocker, so restart --image <custom> with the Docker daemon down fails with image "…" not found locally … build it … rather than a "Docker not available" message (docker image inspect errors → imageExistsLocally returns false). Low impact — building the image needs Docker anyway — but slightly misleading. A checkDocker here first would give the clearer message.

@marckong marckong added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit dc4cabf Jun 26, 2026
8 checks passed
@marckong marckong deleted the feat/local-scheduler branch June 26, 2026 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants