feat(BA-5373): Add blue-green deployment infrastructure and promote API #10426
jopemachine wants to merge 4 commits into main from
Conversation
- Add DeployingAwaitingPromotionHandler for AWAITING_PROMOTION sub-step
- Add promoteDeployment GraphQL mutation for manual blue-green promotion
- Add promote_deployment repository method with atomic route switch
- Wire promote through full stack: DTO, Action, Service, Processor, Adapter, GQL
- Add promote_route_ids to RouteChanges for blue-green traffic switch
- Add DEPLOYING_AWAITING_PROMOTION to DeploymentLifecycleSubStep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
Pull request overview
This PR adds infrastructure support for blue-green deployments by introducing an AWAITING_PROMOTION sub-step handler and wiring a manual “promote deployment” operation end-to-end (service → repository → GraphQL), including atomic route traffic switching and revision swap.
Changes:
- Added `DEPLOYING_AWAITING_PROMOTION` sub-step and a new `DeployingAwaitingPromotionHandler` to support the pause-before-promotion phase.
- Added `promoteDeployment` GraphQL mutation (DTOs, adapter, action, processor, service) for manual promotion.
- Implemented `DeploymentRepository.promote_deployment()` and extended strategy mutation plumbing to support "promote" route updates in the DB transaction.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ai/backend/manager/sokovan/deployment/handlers/deploying.py | Adds AWAITING_PROMOTION handler and adjusts DEPLOYING/PROVISIONING behavior. |
| src/ai/backend/manager/sokovan/deployment/handlers/__init__.py | Exports the new deploying handler. |
| src/ai/backend/manager/sokovan/deployment/coordinator.py | Registers the new DEPLOYING/AWAITING_PROMOTION handler. |
| src/ai/backend/manager/services/deployment/service.py | Adds promote_deployment() service method and route classification logic. |
| src/ai/backend/manager/services/deployment/processors.py | Wires the new promote action into processors/supported actions. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/promote_deployment.py | Introduces the promote action + result types. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/__init__.py | Exports the promote action types. |
| src/ai/backend/manager/repositories/deployment/repository.py | Adds promote_deployment() and extends apply_strategy_mutations() signature to include promote. |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Executes promote route updates as part of strategy mutation transaction. |
| src/ai/backend/manager/data/deployment/types.py | Adds DEPLOYING_AWAITING_PROMOTION to lifecycle sub-steps list. |
| src/ai/backend/manager/api/gql/schema.py | Registers promote_deployment mutation. |
| src/ai/backend/manager/api/gql/deployment/types/revision.py | Adds GraphQL input/payload types for promotion. |
| src/ai/backend/manager/api/gql/deployment/types/__init__.py | Exports promotion input/payload GraphQL types. |
| src/ai/backend/manager/api/gql/deployment/resolver/revision.py | Adds the promote_deployment mutation resolver. |
| src/ai/backend/manager/api/gql/deployment/resolver/__init__.py | Exports the new resolver symbol. |
| src/ai/backend/manager/api/gql/deployment/__init__.py | Re-exports new GraphQL types and resolver. |
| src/ai/backend/manager/api/adapters/deployment.py | Adds adapter method to trigger the promote action. |
| src/ai/backend/common/dto/manager/v2/deployment/response.py | Adds PromoteDeploymentPayload DTO. |
| src/ai/backend/common/dto/manager/v2/deployment/request.py | Adds PromoteDeploymentInput DTO. |
| docs/manager/graphql-reference/v2-schema.graphql | Documents the new mutation and input/payload types (also includes an unrelated schema change). |
| docs/manager/graphql-reference/supergraph.graphql | Same as above for the supergraph schema reference. |
Comments suppressed due to low confidence (2)
docs/manager/graphql-reference/v2-schema.graphql:2972
- The generated GraphQL reference removed `lastUsedAt` from `ImageV2MetadataInfo`, but the Strawberry schema still defines `last_used_at` (see `src/ai/backend/manager/api/gql/image/types.py:206`). This makes the published schema docs inconsistent with the actual API. Please regenerate these schema reference files from the current schema or revert the unrelated removal.
```graphql
type ImageV2MetadataInfo {
  """Config digest for verification."""
  digest: String
  """Image size in bytes."""
  sizeBytes: Int!
  """Image creation timestamp."""
  createdAt: DateTime
  """Timestamp of the most recent session created with this image."""
  lastUsedAt: DateTime
```
docs/manager/graphql-reference/supergraph.graphql:5323
- Same as `v2-schema.graphql`: `lastUsedAt` was removed from `ImageV2MetadataInfo` in the supergraph reference, but the Strawberry schema still exposes it. Regenerate or revert to keep schema references consistent.
```graphql
type ImageV2MetadataInfo
  @join__type(graph: STRAWBERRY)
{
  """Config digest for verification."""
  digest: String
  """Image size in bytes."""
  sizeBytes: Int!
  """Image creation timestamp."""
  createdAt: DateTime
  """Timestamp of the most recent session created with this image."""
  lastUsedAt: DateTime
  """Parsed tag components."""
  tags: [ImageV2TagEntry!]!
```
```python
async def apply_strategy_mutations(
    self,
    rollout: Sequence[RBACEntityCreator[RoutingRow]],
    drain: BatchUpdater[RoutingRow] | None,
    promote: BatchUpdater[RoutingRow] | None,
    completed_ids: set[UUID],
) -> int:
    """Apply route mutations from a strategy evaluation cycle.

    Performs route rollout/drain/promote and revision swap in a single transaction.
    Sub-step transitions are handled by the coordinator via
    ``EndpointLifecycleBatchUpdaterSpec``.

    Returns:
        Number of deployments whose revision was swapped.
    """
    return await self._db_source.apply_strategy_mutations(
        rollout=rollout,
        drain=drain,
        promote=promote,
        completed_ids=completed_ids,
    )
```
apply_strategy_mutations() now requires a promote updater, but at least one existing call site (e.g. src/ai/backend/manager/sokovan/deployment/strategy/applier.py:90-94) still calls it without that argument. This will raise a TypeError at runtime the first time the applier runs. Consider either adding a default value (promote: BatchUpdater[RoutingRow] | None = None) to keep backward compatibility, or update all call sites to pass promote=None explicitly.
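The backward-compatible option can be sketched in isolation. `Repository` below is a simplified stand-in for `DeploymentRepository`, not the actual class; the point is that making the parameters keyword-only lets `promote` carry a default even though `completed_ids` has none:

```python
import asyncio
from typing import Any, Optional, Sequence
from uuid import UUID, uuid4


class Repository:
    """Stand-in for DeploymentRepository; only the signature shape matters."""

    async def apply_strategy_mutations(
        self,
        *,
        rollout: Sequence[Any],
        drain: Optional[Any],
        completed_ids: set[UUID],
        promote: Optional[Any] = None,  # default keeps pre-existing call sites working
    ) -> int:
        # Keyword-only parameters may mix defaulted and non-defaulted
        # arguments in any order, so `promote` can default to None while
        # `completed_ids` stays required.
        return len(completed_ids)


async def main() -> int:
    repo = Repository()
    # An "old" call site that does not know about `promote` still works:
    return await repo.apply_strategy_mutations(
        rollout=[], drain=None, completed_ids={uuid4()}
    )


print(asyncio.run(main()))  # 1
```

Since every call site in the diff already passes arguments by keyword, switching to keyword-only parameters would not break existing callers.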
```python
@classmethod
@override
def status_transitions(cls) -> DeploymentStatusTransitions:
    return DeploymentStatusTransitions(
        success=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.READY,
            sub_step=None,
        ),
        need_retry=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_AWAITING_PROMOTION,
        ),
        expired=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_ROLLING_BACK,
        ),
        give_up=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_ROLLING_BACK,
        ),
    )
```
DeployingProvisioningHandler.execute() can return failures (from summary.errors), but status_transitions() does not define need_retry. In the coordinator, failures categorized as need_retry will be ignored when transitions.need_retry is None (no lifecycle update and no history record), potentially leaving deployments stuck without retries/history. Define a need_retry transition (likely staying in DEPLOYING_PROVISIONING) or ensure failures map to an explicit transition.
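The failure mode described above can be illustrated with a minimal sketch. All names here are simplified stand-ins for the coordinator's types, not the actual code; it only demonstrates how a missing `need_retry` transition silently drops failed deployments:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    lifecycle: str
    sub_step: Optional[str]


@dataclass
class Transitions:
    success: Transition
    need_retry: Optional[Transition] = None  # None => failures are silently dropped


def apply_failure(transitions: Transitions) -> Optional[str]:
    # Simplified coordinator behavior: a failed deployment only moves
    # (and gets a history record) if a need_retry transition exists.
    if transitions.need_retry is None:
        return None  # no lifecycle update, no history record -> stuck
    return transitions.need_retry.sub_step


without_retry = Transitions(success=Transition("READY", None))
with_retry = Transitions(
    success=Transition("READY", None),
    need_retry=Transition("DEPLOYING", "DEPLOYING_PROVISIONING"),
)

print(apply_failure(without_retry))  # None
print(apply_failure(with_retry))     # DEPLOYING_PROVISIONING
```

Defining `need_retry` to stay in `DEPLOYING_PROVISIONING`, as the comment suggests, makes the second case apply instead of the first.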
```python
        continue

    if spec.promote_delay_seconds > 0 and deployment.phase_started_at is not None:
        elapsed = (datetime.now(UTC) - deployment.phase_started_at).total_seconds()
```
The elapsed-time calculation mixes datetime.now(UTC) (tz-aware) with deployment.phase_started_at, which may be tz-naive (the coordinator explicitly handles both). Subtracting aware/naive datetimes raises TypeError. Normalize phase_started_at to UTC similarly to _is_transition_timed_out() (e.g., add UTC tzinfo when missing) before subtraction.
Suggested replacement for the `elapsed` line:

```python
phase_started_at = deployment.phase_started_at
if phase_started_at.tzinfo is None:
    phase_started_at = phase_started_at.replace(tzinfo=UTC)
elapsed = (datetime.now(UTC) - phase_started_at).total_seconds()
```
```python
@override
async def execute(
    self, deployments: Sequence[DeploymentWithHistory]
) -> DeploymentExecutionResult:
    successes: list[DeploymentWithHistory] = []
    skipped: list[DeploymentWithHistory] = []

    for deployment in deployments:
        info = deployment.deployment_info
        policy = info.policy
        if policy is None or not isinstance(policy.strategy_spec, BlueGreenSpec):
            skipped.append(deployment)
            continue

        spec: BlueGreenSpec = policy.strategy_spec
        if not spec.auto_promote:
            skipped.append(deployment)
            continue

        if spec.promote_delay_seconds > 0 and deployment.phase_started_at is not None:
            elapsed = (datetime.now(UTC) - deployment.phase_started_at).total_seconds()
            if elapsed < spec.promote_delay_seconds:
                skipped.append(deployment)
                continue

        promote_route_ids, drain_route_ids = await self._classify_routes(info)
        await self._deployment_repository.promote_deployment(
            deployment_id=info.id,
            promote_route_ids=promote_route_ids,
            drain_route_ids=drain_route_ids,
        )
        log.info("deployment {}: auto-promoted", info.id)
        successes.append(deployment)

    return DeploymentExecutionResult(successes=successes, skipped=skipped)
```
This handler relies on the coordinator's “skipped deployments timed out” path to trigger expired, but DeploymentWithHistory.phase_started_at is only populated when the latest history phase matches handler.name() (see DeploymentDBSource.fetch_deployments_for_handler). Since this handler typically returns skipped without causing any lifecycle transition/history write, phase_started_at may remain None indefinitely and the timeout rollback may never trigger. Consider ensuring an initial history record is written when entering/first processing AWAITING_PROMOTION so timeout logic has a start timestamp.
```python
# Classify into green (promote) and blue (drain)
promote_route_ids = []
drain_route_ids = []
for route in route_search.items:
    if route.revision_id == deploying_revision_id:
        if route.status == RouteStatus.HEALTHY:
            promote_route_ids.append(route.route_id)
    else:
        if route.status.is_active():
            drain_route_ids.append(route.route_id)

await self._deployment_repository.promote_deployment(
    deployment_id=action.deployment_id,
    promote_route_ids=promote_route_ids,
    drain_route_ids=drain_route_ids,
)
```
Manual promotion swaps deploying_revision → current_revision regardless of whether any green routes were classified as promotable. If promote_route_ids ends up empty (e.g., no HEALTHY routes for the deploying revision), this still drains blue routes and swaps the revision, which can leave the deployment with no active traffic-serving routes. Consider validating that there is at least one (or the expected count of) HEALTHY green routes before calling promote_deployment, and raise a clear error otherwise.
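A guard along the suggested lines could look like this minimal sketch. `InvalidEndpointState` is stubbed here (the real class lives in the manager's error module), and `validate_promotable` is a hypothetical helper name:

```python
from uuid import UUID, uuid4


class InvalidEndpointState(Exception):
    """Stand-in for the manager's InvalidEndpointState error."""


def validate_promotable(promote_route_ids: list[UUID], deployment_id: UUID) -> None:
    # Refuse to drain blue routes and swap the revision when no HEALTHY
    # green route exists; otherwise the deployment could end up with no
    # active traffic-serving routes at all.
    if not promote_route_ids:
        raise InvalidEndpointState(
            f"Deployment {deployment_id} has no HEALTHY routes for the "
            "deploying revision; refusing to promote."
        )


try:
    validate_promotable([], uuid4())
except InvalidEndpointState as e:
    print("rejected:", "no HEALTHY routes" in str(e))  # rejected: True
```

Calling such a check between route classification and `promote_deployment()` would turn the silent outage into an explicit, actionable error.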
```python
async def promote_deployment(
    self, action: PromoteDeploymentAction
) -> PromoteDeploymentActionResult:
    """Manually promote a blue-green deployment.

    Directly switches traffic from blue (old) to green (new) routes
    when the deployment is in AWAITING_PROMOTION state. This bypasses
    the FSM cycle and applies the promote/drain atomically.
    """
    deployment_info = await self._deployment_repository.get_endpoint_info(action.deployment_id)

    if deployment_info.sub_step != DeploymentLifecycleSubStep.DEPLOYING_AWAITING_PROMOTION:
        raise InvalidEndpointState(
            f"Deployment {action.deployment_id} is not in AWAITING_PROMOTION state "
            f"(current sub_step: {deployment_info.sub_step}). "
            "Manual promotion is only allowed during AWAITING_PROMOTION."
        )

    deploying_revision_id = deployment_info.deploying_revision_id
    if deploying_revision_id is None:
        raise InvalidEndpointState(
            f"Deployment {action.deployment_id} has no deploying_revision_id."
        )

    # Fetch non-terminated routes for this deployment
    route_search = await self._deployment_repository.search_routes(
        BatchQuerier(
            pagination=NoPagination(),
            conditions=[
                RouteQueryConditions.by_endpoint_ids({action.deployment_id}),
                RouteQueryConditions.exclude_statuses([RouteStatus.TERMINATED]),
            ],
        )
    )

    # Classify into green (promote) and blue (drain)
    promote_route_ids = []
    drain_route_ids = []
    for route in route_search.items:
        if route.revision_id == deploying_revision_id:
            if route.status == RouteStatus.HEALTHY:
                promote_route_ids.append(route.route_id)
        else:
            if route.status.is_active():
                drain_route_ids.append(route.route_id)

    await self._deployment_repository.promote_deployment(
        deployment_id=action.deployment_id,
        promote_route_ids=promote_route_ids,
        drain_route_ids=drain_route_ids,
    )

    log.info(
        "Manually promoted deployment {}: {} routes promoted, {} routes drained",
        action.deployment_id,
        len(promote_route_ids),
        len(drain_route_ids),
    )

    deployment_info = await self._deployment_repository.get_endpoint_info(action.deployment_id)

    return PromoteDeploymentActionResult(
        deployment=_convert_deployment_info_to_data(deployment_info),
    )
```
promote_deployment() introduces non-trivial new behavior (state validation + route classification + atomic promote/drain + revision swap), but there doesn't appear to be unit test coverage for this method while tests/unit/manager/services/deployment/test_deployment_service.py covers other service behaviors. Adding tests for: (1) rejecting when not in DEPLOYING_AWAITING_PROMOTION, (2) rejecting when deploying_revision_id is missing, and (3) calling DeploymentRepository.promote_deployment() with correctly classified route IDs would help prevent regressions.
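The suggested rejection cases can be sketched as standalone checks against a simplified stand-in for the service. Every type below (`SubStep`, `EndpointInfo`, the `promote` coroutine) is a simplified stand-in for illustration, not the actual manager classes:

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SubStep(Enum):  # stand-in for DeploymentLifecycleSubStep
    DEPLOYING_PROVISIONING = "provisioning"
    DEPLOYING_AWAITING_PROMOTION = "awaiting_promotion"


class InvalidEndpointState(Exception):
    pass


@dataclass
class EndpointInfo:
    sub_step: SubStep
    deploying_revision_id: Optional[str]


async def promote(info: EndpointInfo) -> str:
    # Mirrors the two guard clauses of the service method under review.
    if info.sub_step != SubStep.DEPLOYING_AWAITING_PROMOTION:
        raise InvalidEndpointState("not awaiting promotion")
    if info.deploying_revision_id is None:
        raise InvalidEndpointState("no deploying revision")
    return info.deploying_revision_id


def expect_rejection(info: EndpointInfo) -> bool:
    try:
        asyncio.run(promote(info))
    except InvalidEndpointState:
        return True
    return False


# (1) wrong sub-step is rejected
print(expect_rejection(EndpointInfo(SubStep.DEPLOYING_PROVISIONING, "rev-2")))     # True
# (2) missing deploying revision is rejected
print(expect_rejection(EndpointInfo(SubStep.DEPLOYING_AWAITING_PROMOTION, None)))  # True
# (3) valid state promotes
print(asyncio.run(promote(EndpointInfo(SubStep.DEPLOYING_AWAITING_PROMOTION, "rev-2"))))  # rev-2
```

Real tests would exercise the actual service with a mocked `DeploymentRepository`, additionally asserting case (3) calls `promote_deployment()` with the correctly classified route IDs.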
Resolves BA-5373.
Summary
- `DeployingAwaitingPromotionHandler` for blue-green AWAITING_PROMOTION sub-step processing
- `promoteDeployment` GraphQL mutation for manual blue-green promotion
- `promote_deployment` repository method with atomic route switch (promote green → ACTIVE, drain blue → TERMINATING, swap revision)
- `promote_route_ids` added to `RouteChanges` for blue-green traffic switch
- `DEPLOYING_AWAITING_PROMOTION` added to `DeploymentLifecycleSubStep`

Context
This PR provides the infrastructure layer for the blue-green deployment strategy (BA-3436). The core strategy FSM (BlueGreenStrategy) is in a stacked PR on top of this one.
Test Plan
🤖 Generated with Claude Code
📚 Documentation preview 📚: https://sorna--10426.org.readthedocs.build/en/10426/
📚 Documentation preview 📚: https://sorna-ko--10426.org.readthedocs.build/ko/10426/