Skip to content

feat(BA-5389): resolve model definition at revision creation time instead of PROVISIONING#10458

Open
jopemachine wants to merge 29 commits intomainfrom
BA-5389
Open

feat(BA-5389): resolve model definition at revision creation time instead of PROVISIONING#10458
jopemachine wants to merge 29 commits intomainfrom
BA-5389

Conversation

@jopemachine
Copy link
Member

@jopemachine jopemachine commented Mar 24, 2026

resolves #10457 (BA-5389)

Summary

Move model definition resolution from every PROVISIONING cycle to revision creation time. The DB now stores the fully-resolved definition; PROVISIONING reads it directly without vfolder access.

What changed

  • DeploymentService._resolve_model_definition() resolves the final definition before creating the revision row
  • Executor no longer calls generate_model_definition() — uses stored target_revision.model_definition directly
  • New ModelDefinitionContext replaces ModelRevisionSpec as the generator input (only the fields generators actually need)
  • Mapping[str, Any]ModelDefinition strong typing across revision spec and creator
  • GQL model_definition field annotated with NEXT_RELEASE_VERSION

Merge priority (applied at revision creation)

  1. Generator-produced definition (lowest)
  2. Vfolder file override (non-CUSTOM only, requires enable_model_definition_override flag)
  3. User-provided model_definition override (highest)

Not included

The legacy session creation path (_get_health_check_info in registry.py) still reads model definition from vfolder at runtime. This serves the old endpoint system and should be migrated separately.

Merge dependency

Depends on #10459 (BA-5390) which adds the write-side: accepting model_definition as input in add_model_revision API. Merge #10459 first.

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--10458.org.readthedocs.build/en/10458/


📚 Documentation preview 📚: https://sorna-ko--10458.org.readthedocs.build/ko/10458/

@github-actions github-actions bot added size:L 100~500 LoC comp:manager Related to Manager component labels Mar 24, 2026
@jopemachine jopemachine changed the title feat(BA-5389): merge deployment revision DB model_definition into generated model definition feat(BA-5389): merge deployment revision DB model_definition into generated model definition Mar 24, 2026
@jopemachine jopemachine added this to the 26.4 milestone Mar 24, 2026
@github-actions github-actions bot added the comp:common Related to Common component label Mar 25, 2026
@jopemachine jopemachine requested a review from a team March 25, 2026 06:14
@jopemachine jopemachine marked this pull request as ready for review March 25, 2026 06:14
Copilot AI review requested due to automatic review settings March 25, 2026 06:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ensures deployment revision model_definition stored in the deployment_revisions DB record is merged into the generated model definition during inference session creation, aligning runtime behavior with the intended override precedence.

Changes:

  • Added model_definition to ModelRevisionSpec and propagated it from DeploymentRevisionRow into revision specs.
  • Updated ModelDefinitionGeneratorRegistry.generate_model_definition() to merge DB model_definition before applying optional storage-file overrides.
  • Introduced ModelDefinition.merge() (deep-merge helper) and expanded unit tests to cover DB merge and precedence vs storage overrides.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/ai/backend/manager/data/deployment/types.py Adds model_definition field to revision spec for downstream merging.
src/ai/backend/manager/models/endpoint/row.py Propagates DeploymentRevisionRow.model_definition into ModelRevisionSpec.
src/ai/backend/manager/sokovan/deployment/definition_generator/registry.py Merges DB model_definition into generated definition and reuses ModelDefinition.merge() for storage overrides.
src/ai/backend/common/config.py Adds ModelDefinition.merge() deep-merge helper (and imports deep_merge).
tests/unit/manager/sokovan/deployment/definition_generator/test_registry.py Adds tests/fixtures for DB merge behavior and override precedence.
changes/10458.feature.md Adds changelog entry for the feature.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added the area:docs Documentations label Mar 25, 2026
@jopemachine jopemachine changed the title feat(BA-5389): merge deployment revision DB model_definition into generated model definition refactor(BA-5389): resolve model definition at revision creation time instead of PROVISIONING Mar 25, 2026
Copy link
Member

@fregataa fregataa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Version annotations are needed
  2. Does the new model definition value override a stored value or is merged with it? Could you describe how the model definition of model deployment and revision changes after this work?

@jopemachine jopemachine marked this pull request as draft March 25, 2026 09:04
@github-actions github-actions bot added size:XL 500~ LoC size:L 100~500 LoC comp:storage-proxy Related to Storage proxy component require:db-migration Automatically set when alembic migrations are added or updated and removed size:L 100~500 LoC size:XL 500~ LoC labels Mar 25, 2026
@github-actions github-actions bot added size:XL 500~ LoC and removed size:L 100~500 LoC labels Mar 25, 2026
@github-actions github-actions bot added size:L 100~500 LoC and removed size:XL 500~ LoC labels Mar 26, 2026
@jopemachine
Copy link
Member Author

  1. Version annotations are needed
  2. Does the new model definition value override a stored value or is merged with it? Could you describe how the model definition of model deployment and revision changes after this work?

The system no longer references the vfolder's model definition at the PROVISIONING stage.
The model_definition is now merged with the vfolder’s model definition at revision creation time and stored in the database, rather than being resolved during the PROVISIONING stage.

jopemachine and others added 18 commits March 26, 2026 17:51
Replace loose Mapping[str, Any] with strongly-typed ModelDefinition
across ModelRevisionSpec, DeploymentRevisionCreatorSpec, and merge()
signature. Improve field descriptions for clarity per review feedback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move model definition resolution from PROVISIONING (executor) to
add_model_revision (service layer). The DB now stores the fully-merged
definition so runtime only reads the stored value.

- Introduce ModelDefinitionContext as lightweight generator input
- Remove generate_model_definition call from executor
- Add _resolve_model_definition to DeploymentService
- Update all generators to accept ModelDefinitionContext
- Inline _build_revision into add_model_revision

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add gql_added_field with NEXT_RELEASE_VERSION to model_definition
fields in CreateRevisionInput and AddRevisionInput, since they were
added after the parent type's initial version (25.19.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
- Make model_definition field optional (None default) across all DTOs,
  GQL inputs, and ModelRevisionCreator so deployments can be created
  without providing a model_definition override
- Fix unit test failures: add missing model_definition to test helpers,
  patch _register_endpoint in executor test, fix mypy attr-defined error
- Add merge verification unit tests for custom and non-custom variants

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e dataclass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
- Rename should_fetch to expect_vfolder_fetch for clarity
- Extract user override data into fixtures (custom_user_override, vllm_user_override)
- Rename test_three_layer_merge to test_vfolder_and_user_override_merged_on_generated_definition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mediate layers

Remove = None default from ModelRevisionCreator.model_definition and
ModelDefinitionContext.model_definition to force callers to pass it
explicitly. API DTO layer retains default=None since users may omit it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract mock_model_definition_generator_registry as a separate fixture
- Remove cast(AsyncMock, ...) by using the fixture directly in tests
- Remove resolved_model_definition fixture, reuse sample_model_definition
- Remove unused cast import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hods

Split generate_model_definition() merge steps into _apply_vfolder_override()
and _apply_user_override() for readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n component test

The deployment component test used MagicMock for the generator registry,
which cannot be awaited. This caused a TypeError in add_revision flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove 32f0e1a3420f which duplicates f5338adb2de1 from main,
both merging the same heads (cff56a8381dd, d7879c511ea1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size:XL 500~ LoC and removed size:L 100~500 LoC labels Mar 26, 2026
…type

Replace _to_deployment_info_legacy fallback (method doesn't exist)
with _to_deployment_info_with_revisions([]) empty list fallback.
Fix Sequence -> list type for model_revisions parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jopemachine jopemachine marked this pull request as draft March 27, 2026 04:15
@github-actions github-actions bot added size:L 100~500 LoC and removed size:XL 500~ LoC labels Mar 27, 2026
@jopemachine jopemachine marked this pull request as ready for review March 27, 2026 04:24
@github-actions github-actions bot added size:XL 500~ LoC and removed size:L 100~500 LoC labels Mar 27, 2026
# Assert - Deployment skipped due to missing revision
assert len(result.successes) == 0
assert len(result.failures) == 0
assert len(result.skipped) == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it removed?

Comment on lines +520 to +523
assert model.service.start_command == "vfolder-command"
assert model.service.health_check is not None
# from vfolder (preserved through user override)
assert model.service.health_check.initial_delay == 300.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vllm_vfolder_definition and vllm_user_override share the same start_command and initial_delay fixtures, which leads to weak assertion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentations comp:cli Related to CLI component comp:client Related to Client component comp:common Related to Common component comp:manager Related to Manager component comp:storage-proxy Related to Storage proxy component require:db-migration Automatically set when alembic migrations are added or updated size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Resolve model definition at revision creation time instead of PROVISIONING

4 participants