feat(BA-5373): Add blue-green deployment infrastructure and promote API #10426
jopemachine wants to merge 4 commits into main from
Conversation
- Add DeployingAwaitingPromotionHandler for AWAITING_PROMOTION sub-step
- Add promoteDeployment GraphQL mutation for manual blue-green promotion
- Add promote_deployment repository method with atomic route switch
- Wire promote through full stack: DTO, Action, Service, Processor, Adapter, GQL
- Add promote_route_ids to RouteChanges for blue-green traffic switch
- Add DEPLOYING_AWAITING_PROMOTION to DeploymentLifecycleSubStep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
Pull request overview
This PR adds infrastructure support for blue-green deployments by introducing an AWAITING_PROMOTION sub-step handler and wiring a manual “promote deployment” operation end-to-end (service → repository → GraphQL), including atomic route traffic switching and revision swap.
Changes:
- Added `DEPLOYING_AWAITING_PROMOTION` sub-step and a new `DeployingAwaitingPromotionHandler` to support the pause-before-promotion phase.
- Added `promoteDeployment` GraphQL mutation (DTOs, adapter, action, processor, service) for manual promotion.
- Implemented `DeploymentRepository.promote_deployment()` and extended strategy mutation plumbing to support "promote" route updates in the DB transaction.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ai/backend/manager/sokovan/deployment/handlers/deploying.py | Adds AWAITING_PROMOTION handler and adjusts DEPLOYING/PROVISIONING behavior. |
| src/ai/backend/manager/sokovan/deployment/handlers/__init__.py | Exports the new deploying handler. |
| src/ai/backend/manager/sokovan/deployment/coordinator.py | Registers the new DEPLOYING/AWAITING_PROMOTION handler. |
| src/ai/backend/manager/services/deployment/service.py | Adds promote_deployment() service method and route classification logic. |
| src/ai/backend/manager/services/deployment/processors.py | Wires the new promote action into processors/supported actions. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/promote_deployment.py | Introduces the promote action + result types. |
| src/ai/backend/manager/services/deployment/actions/revision_operations/__init__.py | Exports the promote action types. |
| src/ai/backend/manager/repositories/deployment/repository.py | Adds promote_deployment() and extends apply_strategy_mutations() signature to include promote. |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Executes promote route updates as part of strategy mutation transaction. |
| src/ai/backend/manager/data/deployment/types.py | Adds DEPLOYING_AWAITING_PROMOTION to lifecycle sub-steps list. |
| src/ai/backend/manager/api/gql/schema.py | Registers promote_deployment mutation. |
| src/ai/backend/manager/api/gql/deployment/types/revision.py | Adds GraphQL input/payload types for promotion. |
| src/ai/backend/manager/api/gql/deployment/types/__init__.py | Exports promotion input/payload GraphQL types. |
| src/ai/backend/manager/api/gql/deployment/resolver/revision.py | Adds the promote_deployment mutation resolver. |
| src/ai/backend/manager/api/gql/deployment/resolver/__init__.py | Exports the new resolver symbol. |
| src/ai/backend/manager/api/gql/deployment/__init__.py | Re-exports new GraphQL types and resolver. |
| src/ai/backend/manager/api/adapters/deployment.py | Adds adapter method to trigger the promote action. |
| src/ai/backend/common/dto/manager/v2/deployment/response.py | Adds PromoteDeploymentPayload DTO. |
| src/ai/backend/common/dto/manager/v2/deployment/request.py | Adds PromoteDeploymentInput DTO. |
| docs/manager/graphql-reference/v2-schema.graphql | Documents the new mutation and input/payload types (also includes an unrelated schema change). |
| docs/manager/graphql-reference/supergraph.graphql | Same as above for the supergraph schema reference. |
Comments suppressed due to low confidence (2)
docs/manager/graphql-reference/v2-schema.graphql:2972
- The generated GraphQL reference removed `lastUsedAt` from `ImageV2MetadataInfo`, but the Strawberry schema still defines `last_used_at` (see `src/ai/backend/manager/api/gql/image/types.py:206`). This makes the published schema docs inconsistent with the actual API. Please regenerate these schema reference files from the current schema or revert the unrelated removal.
```graphql
type ImageV2MetadataInfo {
  """Config digest for verification."""
  digest: String
  """Image size in bytes."""
  sizeBytes: Int!
  """Image creation timestamp."""
  createdAt: DateTime
  """Timestamp of the most recent session created with this image."""
  lastUsedAt: DateTime
```
docs/manager/graphql-reference/supergraph.graphql:5323
- Same as `v2-schema.graphql`: `lastUsedAt` was removed from `ImageV2MetadataInfo` in the supergraph reference, but the Strawberry schema still exposes it. Regenerate or revert to keep schema references consistent.
```graphql
type ImageV2MetadataInfo
  @join__type(graph: STRAWBERRY)
{
  """Config digest for verification."""
  digest: String
  """Image size in bytes."""
  sizeBytes: Int!
  """Image creation timestamp."""
  createdAt: DateTime
  """Timestamp of the most recent session created with this image."""
  lastUsedAt: DateTime
  """Parsed tag components."""
  tags: [ImageV2TagEntry!]!
```
```python
async def apply_strategy_mutations(
    self,
    rollout: Sequence[RBACEntityCreator[RoutingRow]],
    drain: BatchUpdater[RoutingRow] | None,
    promote: BatchUpdater[RoutingRow] | None,
    completed_ids: set[UUID],
) -> int:
    """Apply route mutations from a strategy evaluation cycle.

    Performs route rollout/drain/promote and revision swap in a single transaction.
    Sub-step transitions are handled by the coordinator via
    ``EndpointLifecycleBatchUpdaterSpec``.

    Returns:
        Number of deployments whose revision was swapped.
    """
    return await self._db_source.apply_strategy_mutations(
        rollout=rollout,
        drain=drain,
        promote=promote,
        completed_ids=completed_ids,
    )
```
apply_strategy_mutations() now requires a promote updater, but at least one existing call site (e.g. src/ai/backend/manager/sokovan/deployment/strategy/applier.py:90-94) still calls it without that argument. This will raise a TypeError at runtime the first time the applier runs. Consider either adding a default value (promote: BatchUpdater[RoutingRow] | None = None) to keep backward compatibility, or update all call sites to pass promote=None explicitly.
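The backward-compatible option can be sketched in isolation. `Repository` below is a simplified stand-in for `DeploymentRepository`, not the actual class; the point is that making the parameters keyword-only lets `promote` carry a default even though `completed_ids` has none:

```python
import asyncio
from typing import Any, Optional, Sequence
from uuid import UUID, uuid4


class Repository:
    """Stand-in for DeploymentRepository; only the signature shape matters."""

    async def apply_strategy_mutations(
        self,
        *,
        rollout: Sequence[Any],
        drain: Optional[Any],
        completed_ids: set[UUID],
        promote: Optional[Any] = None,  # default keeps pre-existing call sites working
    ) -> int:
        # Keyword-only parameters may mix defaulted and non-defaulted
        # arguments in any order, so `promote` can default to None while
        # `completed_ids` stays required.
        return len(completed_ids)


async def main() -> int:
    repo = Repository()
    # An "old" call site that does not know about `promote` still works:
    return await repo.apply_strategy_mutations(
        rollout=[], drain=None, completed_ids={uuid4()}
    )


print(asyncio.run(main()))  # 1
```

Since every call site in the diff already passes arguments by keyword, switching to keyword-only parameters would not break existing callers.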
```python
@classmethod
@override
def status_transitions(cls) -> DeploymentStatusTransitions:
    return DeploymentStatusTransitions(
        success=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.READY,
            sub_step=None,
        ),
        need_retry=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_AWAITING_PROMOTION,
        ),
        expired=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_ROLLING_BACK,
        ),
        give_up=DeploymentLifecycleStatus(
            lifecycle=EndpointLifecycle.DEPLOYING,
            sub_step=DeploymentLifecycleSubStep.DEPLOYING_ROLLING_BACK,
        ),
    )
```
DeployingProvisioningHandler.execute() can return failures (from summary.errors), but status_transitions() does not define need_retry. In the coordinator, failures categorized as need_retry will be ignored when transitions.need_retry is None (no lifecycle update and no history record), potentially leaving deployments stuck without retries/history. Define a need_retry transition (likely staying in DEPLOYING_PROVISIONING) or ensure failures map to an explicit transition.
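The failure mode described above can be illustrated with a minimal sketch. All names here are simplified stand-ins for the coordinator's types, not the actual code; it only demonstrates how a missing `need_retry` transition silently drops failed deployments:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    lifecycle: str
    sub_step: Optional[str]


@dataclass
class Transitions:
    success: Transition
    need_retry: Optional[Transition] = None  # None => failures are silently dropped


def apply_failure(transitions: Transitions) -> Optional[str]:
    # Simplified coordinator behavior: a failed deployment only moves
    # (and gets a history record) if a need_retry transition exists.
    if transitions.need_retry is None:
        return None  # no lifecycle update, no history record -> stuck
    return transitions.need_retry.sub_step


without_retry = Transitions(success=Transition("READY", None))
with_retry = Transitions(
    success=Transition("READY", None),
    need_retry=Transition("DEPLOYING", "DEPLOYING_PROVISIONING"),
)

print(apply_failure(without_retry))  # None
print(apply_failure(with_retry))     # DEPLOYING_PROVISIONING
```

Defining `need_retry` to stay in `DEPLOYING_PROVISIONING`, as the comment suggests, makes the second case apply instead of the first.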
```python
        continue

    if spec.promote_delay_seconds > 0 and deployment.phase_started_at is not None:
        elapsed = (datetime.now(UTC) - deployment.phase_started_at).total_seconds()
```
The elapsed-time calculation mixes datetime.now(UTC) (tz-aware) with deployment.phase_started_at, which may be tz-naive (the coordinator explicitly handles both). Subtracting aware/naive datetimes raises TypeError. Normalize phase_started_at to UTC similarly to _is_transition_timed_out() (e.g., add UTC tzinfo when missing) before subtraction.
Suggested replacement for the `elapsed` line:

```python
phase_started_at = deployment.phase_started_at
if phase_started_at.tzinfo is None:
    phase_started_at = phase_started_at.replace(tzinfo=UTC)
elapsed = (datetime.now(UTC) - phase_started_at).total_seconds()
```
```python
@override
async def execute(
    self, deployments: Sequence[DeploymentWithHistory]
) -> DeploymentExecutionResult:
    successes: list[DeploymentWithHistory] = []
    skipped: list[DeploymentWithHistory] = []

    for deployment in deployments:
        info = deployment.deployment_info
        policy = info.policy
        if policy is None or not isinstance(policy.strategy_spec, BlueGreenSpec):
            skipped.append(deployment)
            continue

        spec: BlueGreenSpec = policy.strategy_spec
        if not spec.auto_promote:
            skipped.append(deployment)
            continue

        if spec.promote_delay_seconds > 0 and deployment.phase_started_at is not None:
            elapsed = (datetime.now(UTC) - deployment.phase_started_at).total_seconds()
            if elapsed < spec.promote_delay_seconds:
                skipped.append(deployment)
                continue

        promote_route_ids, drain_route_ids = await self._classify_routes(info)
        await self._deployment_repository.promote_deployment(
            deployment_id=info.id,
            promote_route_ids=promote_route_ids,
            drain_route_ids=drain_route_ids,
        )
        log.info("deployment {}: auto-promoted", info.id)
        successes.append(deployment)

    return DeploymentExecutionResult(successes=successes, skipped=skipped)
```
This handler relies on the coordinator's “skipped deployments timed out” path to trigger expired, but DeploymentWithHistory.phase_started_at is only populated when the latest history phase matches handler.name() (see DeploymentDBSource.fetch_deployments_for_handler). Since this handler typically returns skipped without causing any lifecycle transition/history write, phase_started_at may remain None indefinitely and the timeout rollback may never trigger. Consider ensuring an initial history record is written when entering/first processing AWAITING_PROMOTION so timeout logic has a start timestamp.
```python
# Classify into green (promote) and blue (drain)
promote_route_ids = []
drain_route_ids = []
for route in route_search.items:
    if route.revision_id == deploying_revision_id:
        if route.status == RouteStatus.HEALTHY:
            promote_route_ids.append(route.route_id)
    else:
        if route.status.is_active():
            drain_route_ids.append(route.route_id)

await self._deployment_repository.promote_deployment(
    deployment_id=action.deployment_id,
    promote_route_ids=promote_route_ids,
    drain_route_ids=drain_route_ids,
)
```
Manual promotion swaps deploying_revision → current_revision regardless of whether any green routes were classified as promotable. If promote_route_ids ends up empty (e.g., no HEALTHY routes for the deploying revision), this still drains blue routes and swaps the revision, which can leave the deployment with no active traffic-serving routes. Consider validating that there is at least one (or the expected count of) HEALTHY green routes before calling promote_deployment, and raise a clear error otherwise.
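A guard along the suggested lines could look like this minimal sketch. `InvalidEndpointState` is stubbed here (the real class lives in the manager's error module), and `validate_promotable` is a hypothetical helper name:

```python
from uuid import UUID, uuid4


class InvalidEndpointState(Exception):
    """Stand-in for the manager's InvalidEndpointState error."""


def validate_promotable(promote_route_ids: list[UUID], deployment_id: UUID) -> None:
    # Refuse to drain blue routes and swap the revision when no HEALTHY
    # green route exists; otherwise the deployment could end up with no
    # active traffic-serving routes at all.
    if not promote_route_ids:
        raise InvalidEndpointState(
            f"Deployment {deployment_id} has no HEALTHY routes for the "
            "deploying revision; refusing to promote."
        )


try:
    validate_promotable([], uuid4())
except InvalidEndpointState as e:
    print("rejected:", "no HEALTHY routes" in str(e))  # rejected: True
```

Calling such a check between route classification and `promote_deployment()` would turn the silent outage into an explicit, actionable error.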
```python
async def promote_deployment(
    self, action: PromoteDeploymentAction
) -> PromoteDeploymentActionResult:
    """Manually promote a blue-green deployment.

    Directly switches traffic from blue (old) to green (new) routes
    when the deployment is in AWAITING_PROMOTION state. This bypasses
    the FSM cycle and applies the promote/drain atomically.
    """
    deployment_info = await self._deployment_repository.get_endpoint_info(action.deployment_id)

    if deployment_info.sub_step != DeploymentLifecycleSubStep.DEPLOYING_AWAITING_PROMOTION:
        raise InvalidEndpointState(
            f"Deployment {action.deployment_id} is not in AWAITING_PROMOTION state "
            f"(current sub_step: {deployment_info.sub_step}). "
            "Manual promotion is only allowed during AWAITING_PROMOTION."
        )

    deploying_revision_id = deployment_info.deploying_revision_id
    if deploying_revision_id is None:
        raise InvalidEndpointState(
            f"Deployment {action.deployment_id} has no deploying_revision_id."
        )

    # Fetch non-terminated routes for this deployment
    route_search = await self._deployment_repository.search_routes(
        BatchQuerier(
            pagination=NoPagination(),
            conditions=[
                RouteQueryConditions.by_endpoint_ids({action.deployment_id}),
                RouteQueryConditions.exclude_statuses([RouteStatus.TERMINATED]),
            ],
        )
    )

    # Classify into green (promote) and blue (drain)
    promote_route_ids = []
    drain_route_ids = []
    for route in route_search.items:
        if route.revision_id == deploying_revision_id:
            if route.status == RouteStatus.HEALTHY:
                promote_route_ids.append(route.route_id)
        else:
            if route.status.is_active():
                drain_route_ids.append(route.route_id)

    await self._deployment_repository.promote_deployment(
        deployment_id=action.deployment_id,
        promote_route_ids=promote_route_ids,
        drain_route_ids=drain_route_ids,
    )

    log.info(
        "Manually promoted deployment {}: {} routes promoted, {} routes drained",
        action.deployment_id,
        len(promote_route_ids),
        len(drain_route_ids),
    )

    deployment_info = await self._deployment_repository.get_endpoint_info(action.deployment_id)

    return PromoteDeploymentActionResult(
        deployment=_convert_deployment_info_to_data(deployment_info),
    )
```
promote_deployment() introduces non-trivial new behavior (state validation + route classification + atomic promote/drain + revision swap), but there doesn't appear to be unit test coverage for this method while tests/unit/manager/services/deployment/test_deployment_service.py covers other service behaviors. Adding tests for: (1) rejecting when not in DEPLOYING_AWAITING_PROMOTION, (2) rejecting when deploying_revision_id is missing, and (3) calling DeploymentRepository.promote_deployment() with correctly classified route IDs would help prevent regressions.
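The suggested rejection cases can be sketched as standalone checks against a simplified stand-in for the service. Every type below (`SubStep`, `EndpointInfo`, the `promote` coroutine) is a simplified stand-in for illustration, not the actual manager classes:

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SubStep(Enum):  # stand-in for DeploymentLifecycleSubStep
    DEPLOYING_PROVISIONING = "provisioning"
    DEPLOYING_AWAITING_PROMOTION = "awaiting_promotion"


class InvalidEndpointState(Exception):
    pass


@dataclass
class EndpointInfo:
    sub_step: SubStep
    deploying_revision_id: Optional[str]


async def promote(info: EndpointInfo) -> str:
    # Mirrors the two guard clauses of the service method under review.
    if info.sub_step != SubStep.DEPLOYING_AWAITING_PROMOTION:
        raise InvalidEndpointState("not awaiting promotion")
    if info.deploying_revision_id is None:
        raise InvalidEndpointState("no deploying revision")
    return info.deploying_revision_id


def expect_rejection(info: EndpointInfo) -> bool:
    try:
        asyncio.run(promote(info))
    except InvalidEndpointState:
        return True
    return False


# (1) wrong sub-step is rejected
print(expect_rejection(EndpointInfo(SubStep.DEPLOYING_PROVISIONING, "rev-2")))     # True
# (2) missing deploying revision is rejected
print(expect_rejection(EndpointInfo(SubStep.DEPLOYING_AWAITING_PROMOTION, None)))  # True
# (3) valid state promotes
print(asyncio.run(promote(EndpointInfo(SubStep.DEPLOYING_AWAITING_PROMOTION, "rev-2"))))  # rev-2
```

Real tests would exercise the actual service with a mocked `DeploymentRepository`, additionally asserting case (3) calls `promote_deployment()` with the correctly classified route IDs.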
Resolves BA-5373.
Summary
- `DeployingAwaitingPromotionHandler` for blue-green AWAITING_PROMOTION sub-step processing
- `promoteDeployment` GraphQL mutation for manual blue-green promotion
- `promote_deployment` repository method with atomic route switch (promote green → ACTIVE, drain blue → TERMINATING, swap revision)
- `promote_route_ids` added to `RouteChanges` for blue-green traffic switch
- `DEPLOYING_AWAITING_PROMOTION` added to `DeploymentLifecycleSubStep`

Context
This PR provides the infrastructure layer for the blue-green deployment strategy (BA-3436). The core strategy FSM (BlueGreenStrategy) is in a stacked PR on top of this one.
Test Plan
🤖 Generated with Claude Code
📚 Documentation preview 📚: https://sorna--10426.org.readthedocs.build/en/10426/
📚 Documentation preview 📚: https://sorna-ko--10426.org.readthedocs.build/ko/10426/