
feat(BA-5373): Add blue-green deployment infrastructure and promote API #10426

Draft
jopemachine wants to merge 4 commits into main from BA-3436-promote-api

@jopemachine jopemachine commented Mar 23, 2026

Resolves BA-5373.

Summary

  • Add DeployingAwaitingPromotionHandler for blue-green AWAITING_PROMOTION sub-step processing
  • Add promoteDeployment GraphQL mutation for manual blue-green promotion
  • Add promote_deployment repository method with atomic route switch (promote green → ACTIVE, drain blue → TERMINATING, swap revision)
  • Wire promote through full stack: DTO → Action → Service → Processor → Adapter → GQL
  • Add promote_route_ids to RouteChanges for blue-green traffic switch
  • Add DEPLOYING_AWAITING_PROMOTION to DeploymentLifecycleSubStep
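
The route switch listed in the summary (promote green → ACTIVE, drain blue → TERMINATING, swap revision) can be pictured with a small in-memory sketch. This is not the PR's repository code: `RouteState`, `Route`, `Deployment`, and `promote` are hypothetical stand-ins, and clearing `deploying_revision_id` after the swap is an assumption. A real implementation would run the three updates inside a single database transaction.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional
from uuid import UUID, uuid4


class RouteState(Enum):
    """Hypothetical stand-in for the project's route status enum."""

    HEALTHY = "healthy"
    ACTIVE = "active"
    TERMINATING = "terminating"


@dataclass
class Route:
    route_id: UUID
    revision_id: UUID
    state: RouteState


@dataclass
class Deployment:
    current_revision_id: UUID
    deploying_revision_id: Optional[UUID]
    routes: Dict[UUID, Route] = field(default_factory=dict)


def promote(
    dep: Deployment,
    promote_route_ids: List[UUID],
    drain_route_ids: List[UUID],
) -> None:
    """Switch traffic from blue to green as one all-or-nothing step."""
    if dep.deploying_revision_id is None:
        raise ValueError("no deploying revision to promote")
    # Validate everything before mutating anything, so a bad ID changes nothing.
    missing = [
        rid for rid in (*promote_route_ids, *drain_route_ids) if rid not in dep.routes
    ]
    if missing:
        raise KeyError(f"unknown route ids: {missing}")
    for rid in promote_route_ids:
        dep.routes[rid].state = RouteState.ACTIVE  # green starts serving
    for rid in drain_route_ids:
        dep.routes[rid].state = RouteState.TERMINATING  # blue is drained
    # Assumption: the deploying revision becomes current and the slot is cleared.
    dep.current_revision_id = dep.deploying_revision_id
    dep.deploying_revision_id = None
```

Validating before mutating gives the sketch the same all-or-nothing property that the transaction provides in the database-backed version.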

Context

This PR provides the infrastructure layer for the blue-green deployment strategy (BA-3436). The core strategy FSM (BlueGreenStrategy) is in a stacked PR on top of this one.

Test Plan

  • Existing deployment coordinator tests pass
  • ruff lint/format passes

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://sorna--10426.org.readthedocs.build/en/10426/


📚 Documentation preview 📚: https://sorna-ko--10426.org.readthedocs.build/ko/10426/

- Add DeployingAwaitingPromotionHandler for AWAITING_PROMOTION sub-step
- Add promoteDeployment GraphQL mutation for manual blue-green promotion
- Add promote_deployment repository method with atomic route switch
- Wire promote through full stack: DTO, Action, Service, Processor, Adapter, GQL
- Add promote_route_ids to RouteChanges for blue-green traffic switch
- Add DEPLOYING_AWAITING_PROMOTION to DeploymentLifecycleSubStep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 23, 2026 10:34
@github-actions github-actions bot added size:XL 500~ LoC area:docs Documentations comp:manager Related to Manager component comp:common Related to Common component labels Mar 23, 2026
@jopemachine jopemachine changed the title feat(BA-3436): Add blue-green deployment infrastructure and promote API feat(BA-5373): Add blue-green deployment infrastructure and promote API Mar 23, 2026
Co-authored-by: octodog <mu001@lablup.com>
@jopemachine jopemachine added this to the 26.4 milestone Mar 23, 2026
@jopemachine jopemachine marked this pull request as draft March 23, 2026 10:37

Copilot AI left a comment

Pull request overview

This PR adds infrastructure support for blue-green deployments by introducing an AWAITING_PROMOTION sub-step handler and wiring a manual “promote deployment” operation end-to-end (service → repository → GraphQL), including atomic route traffic switching and revision swap.

Changes:

  • Added DEPLOYING_AWAITING_PROMOTION sub-step and a new DeployingAwaitingPromotionHandler to support the pause-before-promotion phase.
  • Added promoteDeployment GraphQL mutation (DTOs, adapter, action, processor, service) for manual promotion.
  • Implemented DeploymentRepository.promote_deployment() and extended strategy mutation plumbing to support “promote” route updates in the DB transaction.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/ai/backend/manager/sokovan/deployment/handlers/deploying.py | Adds AWAITING_PROMOTION handler and adjusts DEPLOYING/PROVISIONING behavior. |
| src/ai/backend/manager/sokovan/deployment/handlers/__init__.py | Exports the new deploying handler. |
| src/ai/backend/manager/sokovan/deployment/coordinator.py | Registers the new DEPLOYING/AWAITING_PROMOTION handler. |
| src/ai/backend/manager/services/deployment/service.py | Adds promote_deployment() service method and route classification logic. |
| src/ai/backend/manager/services/deployment/processors.py | Wires the new promote action into processors/supported actions. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/promote_deployment.py | Introduces the promote action + result types. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/__init__.py | Exports the promote action types. |
| src/ai/backend/manager/repositories/deployment/repository.py | Adds promote_deployment() and extends apply_strategy_mutations() signature to include promote. |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Executes promote route updates as part of strategy mutation transaction. |
| src/ai/backend/manager/data/deployment/types.py | Adds DEPLOYING_AWAITING_PROMOTION to lifecycle sub-steps list. |
| src/ai/backend/manager/api/gql/schema.py | Registers promote_deployment mutation. |
| src/ai/backend/manager/api/gql/deployment/types/revision.py | Adds GraphQL input/payload types for promotion. |
| src/ai/backend/manager/api/gql/deployment/types/__init__.py | Exports promotion input/payload GraphQL types. |
| src/ai/backend/manager/api/gql/deployment/resolver/revision.py | Adds the promote_deployment mutation resolver. |
| src/ai/backend/manager/api/gql/deployment/resolver/__init__.py | Exports the new resolver symbol. |
| src/ai/backend/manager/api/gql/deployment/__init__.py | Re-exports new GraphQL types and resolver. |
| src/ai/backend/manager/api/adapters/deployment.py | Adds adapter method to trigger the promote action. |
| src/ai/backend/common/dto/manager/v2/deployment/response.py | Adds PromoteDeploymentPayload DTO. |
| src/ai/backend/common/dto/manager/v2/deployment/request.py | Adds PromoteDeploymentInput DTO. |
| docs/manager/graphql-reference/v2-schema.graphql | Documents the new mutation and input/payload types (also includes an unrelated schema change). |
| docs/manager/graphql-reference/supergraph.graphql | Same as above for the supergraph schema reference. |

Comments suppressed due to low confidence (2)

docs/manager/graphql-reference/v2-schema.graphql:2972

  • The generated GraphQL reference removed lastUsedAt from ImageV2MetadataInfo, but the Strawberry schema still defines last_used_at (see src/ai/backend/manager/api/gql/image/types.py:206). This makes the published schema docs inconsistent with the actual API. Please regenerate these schema reference files from the current schema or revert the unrelated removal.
type ImageV2MetadataInfo {
  """Config digest for verification."""
  digest: String

  """Image size in bytes."""
  sizeBytes: Int!

  """Image creation timestamp."""
  createdAt: DateTime

  """Timestamp of the most recent session created with this image."""
  lastUsedAt: DateTime

docs/manager/graphql-reference/supergraph.graphql:5323

  • Same as v2-schema.graphql: lastUsedAt was removed from ImageV2MetadataInfo in the supergraph reference, but the Strawberry schema still exposes it. Regenerate or revert to keep schema references consistent.
type ImageV2MetadataInfo
  @join__type(graph: STRAWBERRY)
{
  """Config digest for verification."""
  digest: String

  """Image size in bytes."""
  sizeBytes: Int!

  """Image creation timestamp."""
  createdAt: DateTime

  """Timestamp of the most recent session created with this image."""
  lastUsedAt: DateTime

  """Parsed tag components."""
  tags: [ImageV2TagEntry!]!


Comment on lines 1441 to 1462
async def apply_strategy_mutations(
    self,
    rollout: Sequence[RBACEntityCreator[RoutingRow]],
    drain: BatchUpdater[RoutingRow] | None,
+   promote: BatchUpdater[RoutingRow] | None,
    completed_ids: set[UUID],
) -> int:
    """Apply route mutations from a strategy evaluation cycle.

-   Performs route rollout/drain and revision swap in a single transaction.
+   Performs route rollout/drain/promote and revision swap in a single transaction.
    Sub-step transitions are handled by the coordinator via
    ``EndpointLifecycleBatchUpdaterSpec``.

    Returns:
        Number of deployments whose revision was swapped.
    """
    return await self._db_source.apply_strategy_mutations(
        rollout=rollout,
        drain=drain,
+       promote=promote,
        completed_ids=completed_ids,
    )

Copilot AI Mar 23, 2026

apply_strategy_mutations() now requires a promote updater, but at least one existing call site (e.g. src/ai/backend/manager/sokovan/deployment/strategy/applier.py:90-94) still calls it without that argument. This will raise a TypeError at runtime the first time the applier runs. Consider either adding a default value (promote: BatchUpdater[RoutingRow] | None = None) to keep backward compatibility, or update all call sites to pass promote=None explicitly.
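
A minimal sketch of the backward-compatible option, assuming every caller passes keyword arguments (as the wrapper quoted above does). `FakeRepository` and its string-typed parameters are hypothetical stand-ins for the real repository and updater types. One detail worth noting: because `completed_ids` has no default, `promote` can only take a default in its current position if the parameters are keyword-only (after `*`); otherwise Python rejects the signature with "non-default argument follows default argument".

```python
import asyncio
from typing import Optional, Sequence, Set


class FakeRepository:
    """Hypothetical stand-in that only records the arguments it receives."""

    async def apply_strategy_mutations(
        self,
        *,  # keyword-only, so a defaulted parameter may precede a required one
        rollout: Sequence[str],
        drain: Optional[str],
        promote: Optional[str] = None,  # new parameter; old call sites omit it
        completed_ids: Set[str],
    ) -> dict:
        return {
            "rollout": list(rollout),
            "drain": drain,
            "promote": promote,
            "completed_ids": completed_ids,
        }


async def main() -> None:
    repo = FakeRepository()
    # A call site written before the change: no promote argument, no TypeError.
    legacy = await repo.apply_strategy_mutations(
        rollout=[], drain=None, completed_ids=set()
    )
    # A call site updated for blue-green promotion.
    updated = await repo.apply_strategy_mutations(
        rollout=[], drain=None, promote="promote-updater", completed_ids={"d1"}
    )
    print(legacy["promote"], updated["promote"])


if __name__ == "__main__":
    asyncio.run(main())
```

Making the parameters keyword-only would break any positional caller, so the alternative of passing `promote=None` explicitly at each call site may be the safer change if positional calls exist.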

Comment on lines 120 to 136
@classmethod
@override
def status_transitions(cls) -> DeploymentStatusTransitions:
    return DeploymentStatusTransitions(
        success=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.READY,
            sub_step=None,
        ),
        need_retry=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
-           sub_step=DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING,
+           sub_step=DeploymentLifecycleSubStep.DEPLOYING_AWAITING_PROMOTION,
        ),
        expired=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_ROLLING_BACK,
        ),
        give_up=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_ROLLING_BACK,
        ),
    )

Copilot AI Mar 23, 2026

DeployingProvisioningHandler.execute() can return failures (from summary.errors), but status_transitions() does not define need_retry. In the coordinator, failures categorized as need_retry will be ignored when transitions.need_retry is None (no lifecycle update and no history record), potentially leaving deployments stuck without retries/history. Define a need_retry transition (likely staying in DEPLOYING_PROVISIONING) or ensure failures map to an explicit transition.

            continue

        if spec.promote_delay_seconds > 0 and deployment.phase_started_at is not None:
            elapsed = (datetime.now(UTC) - deployment.phase_started_at).total_seconds()

Copilot AI Mar 23, 2026

The elapsed-time calculation mixes datetime.now(UTC) (tz-aware) with deployment.phase_started_at, which may be tz-naive (the coordinator explicitly handles both). Subtracting aware/naive datetimes raises TypeError. Normalize phase_started_at to UTC similarly to _is_transition_timed_out() (e.g., add UTC tzinfo when missing) before subtraction.

Suggested change
-   elapsed = (datetime.now(UTC) - deployment.phase_started_at).total_seconds()
+   phase_started_at = deployment.phase_started_at
+   if phase_started_at.tzinfo is None:
+       phase_started_at = phase_started_at.replace(tzinfo=UTC)
+   elapsed = (datetime.now(UTC) - phase_started_at).total_seconds()
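
The failure mode and the suggested normalization can be reproduced in isolation. The sketch below is a standalone illustration, not the handler's code: `elapsed_seconds` is a hypothetical helper, and it uses `timezone.utc` rather than the `datetime.UTC` alias so it also runs on Python versions older than 3.11. It assumes, as the suggestion does, that a naive timestamp was recorded in UTC.

```python
from datetime import datetime, timezone
from typing import Optional


def elapsed_seconds(phase_started_at: datetime, now: Optional[datetime] = None) -> float:
    """Seconds since phase_started_at, treating a naive timestamp as UTC."""
    if phase_started_at.tzinfo is None:
        # Assumption: naive values stored by the database are UTC wall-clock times.
        phase_started_at = phase_started_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - phase_started_at).total_seconds()


def subtract_without_normalizing(aware: datetime, naive: datetime) -> str:
    """Show what the unguarded subtraction does with mixed inputs."""
    try:
        return str(aware - naive)
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    now = datetime(2026, 3, 23, 10, 0, 30, tzinfo=timezone.utc)
    naive_start = datetime(2026, 3, 23, 10, 0, 0)
    print(subtract_without_normalizing(now, naive_start))
    print(elapsed_seconds(naive_start, now))
```

Passing `now` explicitly keeps the helper deterministic in tests; the handler would call it without that argument.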

Comment on lines +240 to +275
@override
async def execute(
    self, deployments: Sequence[DeploymentWithHistory]
) -> DeploymentExecutionResult:
    successes: list[DeploymentWithHistory] = []
    skipped: list[DeploymentWithHistory] = []

    for deployment in deployments:
        info = deployment.deployment_info
        policy = info.policy
        if policy is None or not isinstance(policy.strategy_spec, BlueGreenSpec):
            skipped.append(deployment)
            continue

        spec: BlueGreenSpec = policy.strategy_spec
        if not spec.auto_promote:
            skipped.append(deployment)
            continue

        if spec.promote_delay_seconds > 0 and deployment.phase_started_at is not None:
            elapsed = (datetime.now(UTC) - deployment.phase_started_at).total_seconds()
            if elapsed < spec.promote_delay_seconds:
                skipped.append(deployment)
                continue

        promote_route_ids, drain_route_ids = await self._classify_routes(info)
        await self._deployment_repository.promote_deployment(
            deployment_id=info.id,
            promote_route_ids=promote_route_ids,
            drain_route_ids=drain_route_ids,
        )
        log.info("deployment {}: auto-promoted", info.id)
        successes.append(deployment)

    return DeploymentExecutionResult(successes=successes, skipped=skipped)


Copilot AI Mar 23, 2026

This handler relies on the coordinator's “skipped deployments timed out” path to trigger expired, but DeploymentWithHistory.phase_started_at is only populated when the latest history phase matches handler.name() (see DeploymentDBSource.fetch_deployments_for_handler). Since this handler typically returns skipped without causing any lifecycle transition/history write, phase_started_at may remain None indefinitely and the timeout rollback may never trigger. Consider ensuring an initial history record is written when entering/first processing AWAITING_PROMOTION so timeout logic has a start timestamp.

Comment on lines +762 to +777
# Classify into green (promote) and blue (drain)
promote_route_ids = []
drain_route_ids = []
for route in route_search.items:
    if route.revision_id == deploying_revision_id:
        if route.status == RouteStatus.HEALTHY:
            promote_route_ids.append(route.route_id)
    else:
        if route.status.is_active():
            drain_route_ids.append(route.route_id)

await self._deployment_repository.promote_deployment(
    deployment_id=action.deployment_id,
    promote_route_ids=promote_route_ids,
    drain_route_ids=drain_route_ids,
)

Copilot AI Mar 23, 2026

Manual promotion swaps deploying_revision → current_revision regardless of whether any green routes were classified as promotable. If promote_route_ids ends up empty (e.g., no HEALTHY routes for the deploying revision), this still drains blue routes and swaps the revision, which can leave the deployment with no active traffic-serving routes. Consider validating that there is at least one (or the expected count of) HEALTHY green routes before calling promote_deployment, and raise a clear error otherwise.
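
One way to express the suggested guard is to fold it into the classification step, so the repository call is never reached with an empty promote list. The sketch below is self-contained and uses hypothetical names throughout: `RouteView` flattens the route status into two booleans, and `NoPromotableRoutes` stands in for whatever error type the service would actually raise (the code under review uses `InvalidEndpointState` for its other checks).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


class NoPromotableRoutes(Exception):
    """Hypothetical error: promotion would leave no traffic-serving routes."""


@dataclass(frozen=True)
class RouteView:
    """Hypothetical, simplified view of a route row."""

    route_id: str
    revision_id: str
    healthy: bool  # stands in for status == RouteStatus.HEALTHY
    active: bool  # stands in for status.is_active()


def classify_for_promotion(
    routes: Iterable[RouteView],
    deploying_revision_id: str,
    min_healthy: int = 1,
) -> Tuple[List[str], List[str]]:
    """Split routes into (promote, drain) and refuse an unsafe promotion."""
    promote: List[str] = []
    drain: List[str] = []
    for route in routes:
        if route.revision_id == deploying_revision_id:
            if route.healthy:
                promote.append(route.route_id)  # green and ready for traffic
        elif route.active:
            drain.append(route.route_id)  # blue and still serving
    if len(promote) < min_healthy:
        raise NoPromotableRoutes(
            f"{len(promote)} healthy green route(s), need at least {min_healthy}"
        )
    return promote, drain
```

The `min_healthy` parameter is an assumption covering the "expected count" variant mentioned in the comment; the default of 1 corresponds to the weaker "at least one" check.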

Comment on lines +727 to +791
async def promote_deployment(
    self, action: PromoteDeploymentAction
) -> PromoteDeploymentActionResult:
    """Manually promote a blue-green deployment.

    Directly switches traffic from blue (old) to green (new) routes
    when the deployment is in AWAITING_PROMOTION state. This bypasses
    the FSM cycle and applies the promote/drain atomically.
    """
    deployment_info = await self._deployment_repository.get_endpoint_info(action.deployment_id)

    if deployment_info.sub_step != DeploymentLifecycleSubStep.DEPLOYING_AWAITING_PROMOTION:
        raise InvalidEndpointState(
            f"Deployment {action.deployment_id} is not in AWAITING_PROMOTION state "
            f"(current sub_step: {deployment_info.sub_step}). "
            "Manual promotion is only allowed during AWAITING_PROMOTION."
        )

    deploying_revision_id = deployment_info.deploying_revision_id
    if deploying_revision_id is None:
        raise InvalidEndpointState(
            f"Deployment {action.deployment_id} has no deploying_revision_id."
        )

    # Fetch non-terminated routes for this deployment
    route_search = await self._deployment_repository.search_routes(
        BatchQuerier(
            pagination=NoPagination(),
            conditions=[
                RouteQueryConditions.by_endpoint_ids({action.deployment_id}),
                RouteQueryConditions.exclude_statuses([RouteStatus.TERMINATED]),
            ],
        )
    )

    # Classify into green (promote) and blue (drain)
    promote_route_ids = []
    drain_route_ids = []
    for route in route_search.items:
        if route.revision_id == deploying_revision_id:
            if route.status == RouteStatus.HEALTHY:
                promote_route_ids.append(route.route_id)
        else:
            if route.status.is_active():
                drain_route_ids.append(route.route_id)

    await self._deployment_repository.promote_deployment(
        deployment_id=action.deployment_id,
        promote_route_ids=promote_route_ids,
        drain_route_ids=drain_route_ids,
    )

    log.info(
        "Manually promoted deployment {}: {} routes promoted, {} routes drained",
        action.deployment_id,
        len(promote_route_ids),
        len(drain_route_ids),
    )

    deployment_info = await self._deployment_repository.get_endpoint_info(action.deployment_id)

    return PromoteDeploymentActionResult(
        deployment=_convert_deployment_info_to_data(deployment_info),
    )


Copilot AI Mar 23, 2026

promote_deployment() introduces non-trivial new behavior (state validation + route classification + atomic promote/drain + revision swap), but there doesn't appear to be unit test coverage for this method while tests/unit/manager/services/deployment/test_deployment_service.py covers other service behaviors. Adding tests for: (1) rejecting when not in DEPLOYING_AWAITING_PROMOTION, (2) rejecting when deploying_revision_id is missing, and (3) calling DeploymentRepository.promote_deployment() with correctly classified route IDs would help prevent regressions.
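
The three cases listed above could be shaped roughly as follows. This is a sketch only: it does not import the project's service or fixtures, so `promote_deployment` here is a condensed, hypothetical stand-in for the method under review, `InvalidState` replaces `InvalidEndpointState`, and the repository is a plain `unittest.mock.AsyncMock` whose `search_routes` takes a deployment ID instead of a `BatchQuerier`. Real tests would target `DeploymentService.promote_deployment` with the repository mocked in the same way.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

AWAITING = "DEPLOYING_AWAITING_PROMOTION"


class InvalidState(Exception):
    """Hypothetical stand-in for InvalidEndpointState."""


async def promote_deployment(repo, deployment_id):
    """Condensed stand-in mirroring the validation and classification steps."""
    info = await repo.get_endpoint_info(deployment_id)
    if info.sub_step != AWAITING:
        raise InvalidState("not in AWAITING_PROMOTION")
    green = info.deploying_revision_id
    if green is None:
        raise InvalidState("no deploying_revision_id")
    routes = await repo.search_routes(deployment_id)
    promote = [r.route_id for r in routes if r.revision_id == green and r.healthy]
    drain = [r.route_id for r in routes if r.revision_id != green and r.active]
    await repo.promote_deployment(
        deployment_id=deployment_id,
        promote_route_ids=promote,
        drain_route_ids=drain,
    )


def make_repo(sub_step, deploying_revision_id, routes=()):
    repo = AsyncMock()
    repo.get_endpoint_info.return_value = SimpleNamespace(
        sub_step=sub_step, deploying_revision_id=deploying_revision_id
    )
    repo.search_routes.return_value = list(routes)
    return repo


def _expect_invalid_state(repo):
    try:
        asyncio.run(promote_deployment(repo, "dep-1"))
    except InvalidState:
        repo.promote_deployment.assert_not_awaited()
        return
    raise AssertionError("expected InvalidState")


def test_rejects_wrong_sub_step():
    _expect_invalid_state(make_repo("DEPLOYING_PROVISIONING", "green"))


def test_rejects_missing_deploying_revision():
    _expect_invalid_state(make_repo(AWAITING, None))


def test_passes_classified_route_ids():
    routes = [
        SimpleNamespace(route_id="g1", revision_id="green", healthy=True, active=True),
        SimpleNamespace(route_id="g2", revision_id="green", healthy=False, active=True),
        SimpleNamespace(route_id="b1", revision_id="blue", healthy=True, active=True),
        SimpleNamespace(route_id="b2", revision_id="blue", healthy=False, active=False),
    ]
    repo = make_repo(AWAITING, "green", routes)
    asyncio.run(promote_deployment(repo, "dep-1"))
    repo.promote_deployment.assert_awaited_once_with(
        deployment_id="dep-1", promote_route_ids=["g1"], drain_route_ids=["b1"]
    )
```

The functions follow pytest's `test_` naming so they would be collected as-is, but they use only the standard library and can also be called directly.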

…ns call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

area:docs Documentations comp:common Related to Common component comp:manager Related to Manager component size:XL 500~ LoC

2 participants