
fix(backend): add db backoff retry for resolvers#8712

Merged
fatih-acar merged 2 commits into stable from fac-add-resolver-db-retry on Mar 31, 2026

Conversation

@fatih-acar
Contributor

@fatih-acar fatih-acar commented Mar 26, 2026

This PR improves the retry_db_transaction decorator by making the backoff retry configurable.
It also enables the decorator on the GraphQL query resolvers.

It adds component tests to check that the retry behavior resolves hard failures
when the Neo4j thread pool is exhausted: we must start a new container with a
low thread pool size.

I had to duplicate a few fixtures to get the new Neo4j container working... I'm not very satisfied with the tests and am looking for feedback on them. There's also a high chance they are flaky, so we probably want to disable them in CI for now.

Fixes #8696

Summary by CodeRabbit

  • Bug Fixes

    • GraphQL read operations now automatically retry with exponential backoff and jitter to reduce transient failures.
  • New Features

    • Three new configurable database retry parameters: base delay, max delay, and jitter.
  • Documentation

    • Configuration reference updated with the new retry environment variables and defaults.
  • Tests

    • Added unit and component tests validating retry behavior and backoff timing under contention.
  • Chores

    • Minor test fixture refactor and changelog entry added.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82ca71e7-17b3-4ce8-8375-a992e2e14536

📥 Commits

Reviewing files that changed from the base of the PR and between 55179e3 and 40321e2.

📒 Files selected for processing (12)
  • backend/infrahub/config.py
  • backend/infrahub/database/__init__.py
  • backend/infrahub/graphql/resolvers/account_metadata.py
  • backend/infrahub/graphql/resolvers/ipam.py
  • backend/infrahub/graphql/resolvers/resolver.py
  • backend/tests/component/graphql/test_graphql_read_retry.py
  • backend/tests/conftest.py
  • backend/tests/helpers/utils.py
  • backend/tests/unit/graphql/resolvers/__init__.py
  • backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py
  • changelog/8696.fixed.md
  • docs/docs/reference/configuration.mdx
✅ Files skipped from review due to trivial changes (3)
  • changelog/8696.fixed.md
  • docs/docs/reference/configuration.mdx
  • backend/tests/component/graphql/test_graphql_read_retry.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • backend/infrahub/graphql/resolvers/ipam.py
  • backend/tests/helpers/utils.py
  • backend/infrahub/graphql/resolvers/account_metadata.py
  • backend/infrahub/database/__init__.py
  • backend/tests/conftest.py
  • backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py
  • backend/infrahub/graphql/resolvers/resolver.py

Walkthrough

Added three database settings: retry_base_delay, retry_max_delay, and retry_jitter_max to configure exponential backoff timing. Updated retry_db_transaction to use functools.wraps and compute exponential backoff with uniform jitter clamped by retry_max_delay instead of a fixed random delay. Applied the decorator to multiple GraphQL resolvers (account, default, paginated, relationship, hierarchy, account metadata, IPAM). Added unit and component tests for retry behavior and timing, extended test helper to accept extra env, and updated docs and changelog.
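The backoff computation described in the walkthrough can be sketched as follows. `backoff_delay` is a hypothetical helper (the real logic lives inline in `retry_db_transaction`), and the `base_delay` default here is illustrative; only `max_delay=2.0` and `jitter_max=0.1` match the defaults shown in the config excerpt reviewed below.

```python
import random


def backoff_delay(
    attempt: int,
    base_delay: float = 0.1,  # illustrative; stands in for retry_base_delay
    max_delay: float = 2.0,   # retry_max_delay default per this PR
    jitter_max: float = 0.1,  # retry_jitter_max default per this PR
) -> float:
    """Exponential backoff with uniform jitter, clamped by max_delay.

    attempt is 1-based, matching the retry loop in the walkthrough.
    """
    jitter = random.uniform(0, jitter_max)
    return min(base_delay * (2 ** (attempt - 1)) + jitter, max_delay)


# With jitter disabled, the delay doubles each attempt until the cap:
# attempt 1 -> 0.1s, attempt 2 -> 0.2s, attempt 3 -> 0.4s, attempt 6+ -> 2.0s
```

The jitter spreads out retries from concurrent resolvers so they do not all hammer the database at the same instant (the "thundering herd" problem mentioned in the config description).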

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(backend): add db backoff retry for resolvers' accurately summarizes the main change: adding a configurable backoff retry mechanism to the retry_db_transaction decorator and enabling it on GraphQL resolvers.
Description check ✅ Passed The PR description covers key aspects: the backoff retry improvement, enabling the decorator on GraphQL resolvers, addition of component tests validating retry behavior, test infrastructure concerns, and a reference to the fixed issue.
Linked Issues check ✅ Passed The PR fully addresses the linked issue #8696 objectives: retry_db_transaction is applied to read-path resolvers (account_resolver, default_resolver, default_paginated_list_resolver, single_relationship_resolver, many_relationship_resolver, hierarchy_resolver) and exponential backoff is implemented with configurable timing parameters.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing exponential backoff retry and enabling it on GraphQL resolvers; no unrelated modifications are present in the changeset.


@github-actions github-actions Bot added type/documentation Improvements or additions to documentation group/backend Issue related to the backend (API Server, Git Agent) labels Mar 26, 2026
@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch from 9ee1e8d to 9ad1b42 Compare March 26, 2026 20:18
@codspeed-hq

codspeed-hq Bot commented Mar 26, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing fac-add-resolver-db-retry (17ed35f) with stable (9126527)¹

Open in CodSpeed

Footnotes

  1. No successful run was found on stable (a280f9a) during the generation of this report, so 9126527 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@fatih-acar fatih-acar marked this pull request as ready for review March 26, 2026 20:48
@fatih-acar fatih-acar requested review from a team as code owners March 26, 2026 20:48
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/infrahub/database/__init__.py (1)

520-537: ⚠️ Potential issue | 🟡 Minor

Don't sleep or count a retry after the last attempt.

On the terminal failure path this block still computes backoff, logs Retrying ..., increments TRANSACTION_RETRIES, and awaits asyncio.sleep(retry_time) before raising. That adds avoidable latency and overcounts retries whenever the final attempt fails.

💡 Proposed fix
         async def wrapper(*args: Any, **kwargs: Any) -> R:
             error = Exception()
             for attempt in range(1, config.SETTINGS.database.retry_limit + 1):
                 try:
                     return await func(*args, **kwargs)
                 except (TransientError, ClientError) as exc:
                     if isinstance(exc, ClientError):
                         if exc.code != "Neo.ClientError.Statement.EntityNotFound":
                             raise exc
+                    if attempt == config.SETTINGS.database.retry_limit:
+                        error = exc
+                        break
                     base_delay = config.SETTINGS.database.retry_base_delay
                     max_delay = config.SETTINGS.database.retry_max_delay
                     jitter = random.uniform(0, config.SETTINGS.database.retry_jitter_max)
                     retry_time = min(base_delay * (2 ** (attempt - 1)) + jitter, max_delay)
                     log.exception("Retry handler caught database error")
@@
                     )
                     log.debug("Database transaction failed", message=exc.message)
                     TRANSACTION_RETRIES.labels(name).inc()
                     await asyncio.sleep(retry_time)
-                    if attempt == config.SETTINGS.database.retry_limit:
-                        error = exc
-                        break
             raise error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/infrahub/database/__init__.py` around lines 520 - 537, The retry
branch inside the transaction retry loop wrongly performs backoff work even on
the final attempt; update the block that handles (TransientError, ClientError)
so that it checks if attempt == config.SETTINGS.database.retry_limit first and,
on the last attempt, sets the final error (error = exc) and breaks/raises
without computing retry_time, calling log.info("Retrying..."), incrementing
TRANSACTION_RETRIES.labels(name).inc(), or awaiting asyncio.sleep(retry_time);
only compute jitter/base_delay/max_delay, log the retry, increment the metric
and await asyncio.sleep when there will be another attempt (i.e., attempt <
retry_limit). Ensure you reference the existing symbols: attempt,
config.SETTINGS.database.retry_limit, TRANSACTION_RETRIES.labels(name).inc(),
asyncio.sleep, log.info and error = exc.
🧹 Nitpick comments (2)
backend/infrahub/graphql/resolvers/resolver.py (1)

28-29: Per-attempt span visibility is lost with trace decorator above retry logic.

With @trace.get_tracer(__name__).start_as_current_span() above @retry_db_transaction, all retry attempts execute within a single span. Attempt-level information (which attempt failed/succeeded) is logged but not separated into distinct spans. To enable per-attempt span visibility, either swap the decorator order so @retry_db_transaction wraps the trace spans, or add span instrumentation inside the retry_db_transaction decorator.

Also applies to: 54-56, 187-189, 288-290, 296-298, 318-320

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/infrahub/graphql/resolvers/resolver.py` around lines 28 - 29, The
trace span decorator is placed above the retry decorator so all attempts run
inside one span; swap the decorator order so
`@retry_db_transaction(name="account_resolver")` is above
trace.get_tracer(__name__).start_as_current_span("account_resolver") (or
alternatively add instrumentation to open a new span inside
retry_db_transaction) for the account_resolver and the other listed resolvers
(the decorators at lines around 54-56, 187-189, 288-290, 296-298, 318-320) so
each retry attempt creates its own span and attempt-level visibility is
preserved.
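A toy illustration of why decorator order matters here: `retry` and `traced` below are minimal stand-ins for `retry_db_transaction` and the OpenTelemetry span decorator, not the real implementations. With the retry decorator on the outside, each attempt re-enters the span decorator, so every attempt gets its own span.

```python
import functools

spans: list[str] = []          # records one entry per "span" opened
calls = {"n": 0}               # counts invocations of the wrapped function


def retry(times: int):
    """Toy stand-in for retry_db_transaction: retries on ValueError."""
    def deco(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except ValueError:
                    if attempt == times:
                        raise
        return wrapper
    return deco


def traced(name: str):
    """Toy stand-in for start_as_current_span: records a 'span' per call."""
    def deco(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            spans.append(name)
            return func(*args, **kwargs)
        return wrapper
    return deco


@retry(times=3)       # outer: the retry loop wraps the span...
@traced("resolver")   # ...so each attempt opens a fresh span
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient")
    return "ok"


assert flaky() == "ok"
assert len(spans) == 3  # one span per attempt, not one span total
```

With the order reversed (`traced` outside `retry`), `spans` would hold a single entry for all three attempts, which is the loss of per-attempt visibility the comment describes.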
backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py (1)

33-39: Replace test helper with a property mock to avoid depending on Neo4j's private exception implementation.

The _make_client_error helper (lines 33–39) sets private attributes (_neo4j_code, _classification, _category, _title) that the Neo4j driver may change internally without notice. While production code correctly uses the public exc.code property, the test's reliance on private fields is brittle.

Instead, construct the exception and mock its code property directly:

def _make_client_error(
    message: str = "not found", code: str = "Neo.ClientError.Statement.EntityNotFound"
) -> ClientError:
    exc = ClientError(message)
    exc.code = code  # or use unittest.mock.PropertyMock if needed
    return exc

Or use unittest.mock.patch.object on the code property when setting up test fixtures to ensure the mock aligns with the public API the retry logic actually reads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py` around
lines 33 - 39, The helper _make_client_error currently sets Neo4j private
attributes (_neo4j_code, _classification, _category, _title) on a ClientError
which is brittle; change it to set or mock the public .code property instead
(either assign exc.code = code if possible or use
unittest.mock.PropertyMock/patch.object to patch ClientError.code) so tests use
the same public API that retry_db_transaction reads; update any test fixtures
that call _make_client_error accordingly.
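A rough sketch of the `PropertyMock` approach suggested above. `ClientError` here is a stand-in class with a read-only `code` property, not the real neo4j exception, so the snippet stays self-contained:

```python
from unittest.mock import PropertyMock, patch


class ClientError(Exception):
    """Stand-in for neo4j.exceptions.ClientError; .code is read-only."""

    @property
    def code(self) -> str:
        return "Neo.ClientError.Unknown"


exc = ClientError("not found")

# Patch the public property the retry logic actually reads, rather than
# poking at private driver attributes like _neo4j_code.
with patch.object(ClientError, "code", new_callable=PropertyMock) as mock_code:
    mock_code.return_value = "Neo.ClientError.Statement.EntityNotFound"
    assert exc.code == "Neo.ClientError.Statement.EntityNotFound"

# Outside the patch, the real property is restored:
assert exc.code == "Neo.ClientError.Unknown"
```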

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b91d40ce-b414-452d-865e-0edf9252f300

📥 Commits

Reviewing files that changed from the base of the PR and between aeee493 and 9ad1b42.

📒 Files selected for processing (11)
  • backend/infrahub/config.py
  • backend/infrahub/database/__init__.py
  • backend/infrahub/graphql/resolvers/account_metadata.py
  • backend/infrahub/graphql/resolvers/ipam.py
  • backend/infrahub/graphql/resolvers/resolver.py
  • backend/tests/component/graphql/test_graphql_read_retry.py
  • backend/tests/helpers/utils.py
  • backend/tests/unit/graphql/resolvers/__init__.py
  • backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py
  • changelog/8696.fixed.md
  • docs/docs/reference/configuration.mdx

Comment on lines +69 to +77
def neo4j_small_thread_pool(request: pytest.FixtureRequest) -> dict[int, int]:
    """Start a dedicated Neo4j container with a minimal Bolt thread pool (size=10)."""
    if not INFRAHUB_USE_TEST_CONTAINERS:
        pytest.skip("Test containers are disabled")

    container = start_neo4j_container(
        neo4j_image=NEO4J_IMAGE,
        extra_env={"NEO4J_server_bolt_thread__pool__max__size": "10"},
    )
Contributor


⚠️ Potential issue | 🟠 Major

Make the thread-pool bottleneck deterministic here.

The module docstring says this container should run with server.bolt.thread_pool_max_size=1, but this fixture actually sets it to 10. With the success case only issuing 10 concurrent queries on Line 312, the test can pass without ever exhausting the pool or exercising retry_db_transaction.

💡 Proposed fix
 @pytest.fixture(scope="module")
 def neo4j_small_thread_pool(request: pytest.FixtureRequest) -> dict[int, int]:
-    """Start a dedicated Neo4j container with a minimal Bolt thread pool (size=10)."""
+    """Start a dedicated Neo4j container with a minimal Bolt thread pool (size=1)."""
@@
     container = start_neo4j_container(
         neo4j_image=NEO4J_IMAGE,
-        extra_env={"NEO4J_server_bolt_thread__pool__max__size": "10"},
+        extra_env={"NEO4J_server_bolt_thread__pool__max__size": "1"},
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 69 -
77, The neo4j_small_thread_pool fixture currently sets the Bolt thread pool max
size to "10" but should be deterministic and match the module docstring
(size=1); update the extra_env value passed to start_neo4j_container in
neo4j_small_thread_pool to set "NEO4J_server_bolt_thread__pool__max__size" to
"1" (and ensure the fixture docstring reflects size=1) so the test reliably
exhausts the pool and exercises retry_db_transaction.

Comment on lines +84 to +102
async def db_small_thread_pool(
    neo4j_small_thread_pool: dict[int, int],
) -> AsyncGenerator[InfrahubDatabase, None]:
    """Create an InfrahubDatabase connected to the small-thread-pool Neo4j."""
    original_address = config.SETTINGS.database.address
    original_port = config.SETTINGS.database.port

    config.SETTINGS.database.address = "localhost"
    config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]

    driver = await get_db(retry=5)
    db = InfrahubDatabase(driver=driver)
    await add_indexes(db=db)

    yield db

    await db.close()
    config.SETTINGS.database.address = original_address
    config.SETTINGS.database.port = original_port
Contributor


⚠️ Potential issue | 🟠 Major

Harden this fixture against stale validation state and setup failures.

get_db(retry=5) only waits through validate_database(), but that path is skipped once backend/infrahub/database/__init__.py::validated_database already contains the default database name. In a full suite run this fixture can therefore skip readiness checks for the dedicated container, and if setup fails before yield, the overridden config.SETTINGS.database.address / port never get restored.

💡 Proposed fix
-from infrahub.database import InfrahubDatabase, get_db
+from infrahub.database import InfrahubDatabase, get_db, validate_database
 @pytest.fixture(scope="module")
 async def db_small_thread_pool(
     neo4j_small_thread_pool: dict[int, int],
 ) -> AsyncGenerator[InfrahubDatabase, None]:
     """Create an InfrahubDatabase connected to the small-thread-pool Neo4j."""
     original_address = config.SETTINGS.database.address
     original_port = config.SETTINGS.database.port
+    db: InfrahubDatabase | None = None
 
-    config.SETTINGS.database.address = "localhost"
-    config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
-
-    driver = await get_db(retry=5)
-    db = InfrahubDatabase(driver=driver)
-    await add_indexes(db=db)
-
-    yield db
-
-    await db.close()
-    config.SETTINGS.database.address = original_address
-    config.SETTINGS.database.port = original_port
+    try:
+        config.SETTINGS.database.address = "localhost"
+        config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
+
+        driver = await get_db(retry=5)
+        await validate_database(
+            driver=driver,
+            database_name=config.SETTINGS.database.database_name,
+            retry=5,
+            create_db=True,
+        )
+        db = InfrahubDatabase(driver=driver)
+        await add_indexes(db=db)
+
+        yield db
+    finally:
+        if db is not None:
+            await db.close()
+        config.SETTINGS.database.address = original_address
+        config.SETTINGS.database.port = original_port
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 84 -
102, The fixture db_small_thread_pool can skip readiness because
backend/infrahub/database/__init__.py::validated_database may already be set,
and it leaves config.SETTINGS.database.address/port mutated on exceptions; fix
by forcing validation and ensuring cleanup: save and clear the module-level
validated_database before calling get_db (or call get_db with a force/validate
flag if available), wrap the setup/yield/teardown in a try/finally so you always
restore config.SETTINGS.database.address and .port and restore the original
validated_database, and ensure any created driver/db (from get_db and
InfrahubDatabase) are closed in the finally if setup failed before yield;
reference db_small_thread_pool, get_db, validated_database, config.SETTINGS,
InfrahubDatabase, and add_indexes to locate changed code.

Comment on lines +19 to +24
def _set_retry_limit() -> Generator[None, None, None]:
    original = config.SETTINGS.database.retry_limit
    config.SETTINGS.database.retry_limit = 3
    yield
    config.SETTINGS.database.retry_limit = original

Contributor


⚠️ Potential issue | 🟡 Minor

Pin retry timing settings in the fixture to keep these tests deterministic.

Line 19-24 only fixes retry_limit, but Line 105-112 and Line 128-131 assert timing behavior tied to base/max/jitter defaults. Please set and restore all retry timing settings in the fixture to avoid unrelated config drift breaking unit tests.

Suggested patch
 @pytest.fixture(autouse=True)
 def _set_retry_limit() -> Generator[None, None, None]:
-    original = config.SETTINGS.database.retry_limit
+    original_retry_limit = config.SETTINGS.database.retry_limit
+    original_base_delay = config.SETTINGS.database.retry_base_delay
+    original_max_delay = config.SETTINGS.database.retry_max_delay
+    original_jitter_max = config.SETTINGS.database.retry_jitter_max
+
     config.SETTINGS.database.retry_limit = 3
+    config.SETTINGS.database.retry_base_delay = 0.1
+    config.SETTINGS.database.retry_max_delay = 2.0
+    config.SETTINGS.database.retry_jitter_max = 0.0
     yield
-    config.SETTINGS.database.retry_limit = original
+    config.SETTINGS.database.retry_limit = original_retry_limit
+    config.SETTINGS.database.retry_base_delay = original_base_delay
+    config.SETTINGS.database.retry_max_delay = original_max_delay
+    config.SETTINGS.database.retry_jitter_max = original_jitter_max

Also applies to: 105-112, 128-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py` around
lines 19 - 24, The fixture _set_retry_limit only pins retry_limit; update it to
also capture and restore the timing-related settings
(config.SETTINGS.database.retry_base, config.SETTINGS.database.retry_max, and
config.SETTINGS.database.retry_jitter) alongside retry_limit so tests asserting
timing (lines referencing those defaults) remain deterministic—store each
original value before changing, set the desired deterministic values in the
fixture, yield, then restore all originals after the test.

    default=2.0, ge=0, description="Maximum delay in seconds for exponential backoff on transaction retries."
)
retry_jitter_max: float = Field(
    default=0.1, ge=0, description="Maximum jitter in seconds added to retry delay to avoid thundering herd."
Contributor


"thundering herd" 👍

"""Test that without retries, concurrent queries on a small thread pool fail
with TransientError, proving the retry mechanism is needed."""
original_retry_limit = config.SETTINGS.database.retry_limit
config.SETTINGS.database.retry_limit = 1 # Only 1 attempt, no retries
Contributor


should be able to use a fixture with the yield pattern for this and avoid the try-finally below

    db_small_thread_pool: InfrahubDatabase,
    default_branch_stp: Branch,
) -> SchemaBranch:
    """Register data_schema + node_group_schema + car_person_schema on the small-thread-pool DB.
Contributor


we have a pattern that we've used to rescope fixtures in other places that I think would let you remove all or almost all of this function. I'm pretty sure we've got fixtures for all of these (maybe not CoreNode) in other places with the default or "class" scope

this is one example

@pytest.fixture
async def default_branch(
    reset_registry: None, local_storage_dir: Path, empty_database: None, db: InfrahubDatabase
) -> Branch:
    return await do_default_branch(db=db)


@pytest.fixture(scope="class")
async def default_branch_scope_class(
    reset_registry_scope_class: None,
    local_storage_dir_scope_class: Path,
    empty_database_scope_class: None,
    db: InfrahubDatabase,
) -> Branch:
    return await do_default_branch(db=db)


async def do_default_branch(db: InfrahubDatabase) -> Branch:
    branch = await create_default_branch(db=db)
    await create_global_branch(db=db)
    registry.schema = SchemaManager()
    return branch

Contributor Author


thanks, I think there's less duplication now

Comment thread: backend/tests/component/graphql/test_graphql_read_retry.py
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/tests/component/graphql/test_graphql_read_retry.py (2)

83-102: ⚠️ Potential issue | 🟡 Minor

Config restoration is not guaranteed on setup failure.

If an exception occurs during setup (lines 91-96) before the yield, the original config.SETTINGS.database.address and port values won't be restored, potentially affecting subsequent tests. Wrap the setup/yield/teardown in a try/finally block.

🛡️ Proposed fix
 @pytest.fixture(scope="module")
 async def db_small_thread_pool(
     neo4j_small_thread_pool: dict[int, int],
 ) -> AsyncGenerator[InfrahubDatabase, None]:
     """Create an InfrahubDatabase connected to the small-thread-pool Neo4j."""
     original_address = config.SETTINGS.database.address
     original_port = config.SETTINGS.database.port
+    db: InfrahubDatabase | None = None
 
-    config.SETTINGS.database.address = "localhost"
-    config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
-
-    driver = await get_db(retry=5)
-    db = InfrahubDatabase(driver=driver)
-    await add_indexes(db=db)
-
-    yield db
-
-    await db.close()
-    config.SETTINGS.database.address = original_address
-    config.SETTINGS.database.port = original_port
+    try:
+        config.SETTINGS.database.address = "localhost"
+        config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
+
+        driver = await get_db(retry=5)
+        db = InfrahubDatabase(driver=driver)
+        await add_indexes(db=db)
+
+        yield db
+    finally:
+        if db is not None:
+            await db.close()
+        config.SETTINGS.database.address = original_address
+        config.SETTINGS.database.port = original_port
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 83 -
102, The fixture db_small_thread_pool currently sets
config.SETTINGS.database.address/port and performs async setup (get_db,
InfrahubDatabase, add_indexes) but if an exception occurs before yield the
original config isn't restored; modify db_small_thread_pool to wrap the setup,
yield and teardown in a try/finally so that in the finally block you always call
await db.close() if db was created and restore config.SETTINGS.database.address
and .port regardless of errors (referencing db_small_thread_pool, get_db,
add_indexes, db.close and config.SETTINGS.database.address/port to locate the
code).

1-9: ⚠️ Potential issue | 🟡 Minor

Module docstring is inconsistent with actual configuration.

The module docstring states server.bolt.thread_pool_max_size=1 (Line 4), but the fixture at Line 76 actually configures "NEO4J_server_bolt_thread__pool__max__size": "10". Update the module docstring to reflect the actual thread pool size used.

 """Component tests for GraphQL read resolver retry on transient Neo4j errors.
 
 These tests simulate Neo4j server-side Bolt thread pool exhaustion by spinning up
-a dedicated Neo4j container with server.bolt.thread_pool_max_size=1 and firing
+a dedicated Neo4j container with server.bolt.thread_pool_max_size=10 and firing
 concurrent queries that hit many-cardinality relationship resolvers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 1 -
9, Update the module docstring at the top of
backend/tests/component/graphql/test_graphql_read_retry.py to match the actual
Neo4j configuration used in the tests: change the described Bolt thread pool max
size from 1 to 10 to reflect the fixture that sets
"NEO4J_server_bolt_thread__pool__max__size": "10"; ensure the docstring wording
mentions server.bolt.thread_pool_max_size=10 so it accurately describes the test
setup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/tests/component/graphql/test_graphql_read_retry.py`:
- Around line 155-165: The return type hint for _run_graphql_query is incorrect:
the function returns the ExecutionResult from graphql(), not a dict; update the
function signature to use the correct type (e.g., graphql.ExecutionResult or the
appropriate ExecutionResult import) and add/update the import if needed so
_run_graphql_query(db: InfrahubDatabase, branch: Branch) -> ExecutionResult
accurately reflects the value returned by the graphql(...) call.

---

Duplicate comments:
In `@backend/tests/component/graphql/test_graphql_read_retry.py`:
- Around line 83-102: The fixture db_small_thread_pool currently sets
config.SETTINGS.database.address/port and performs async setup (get_db,
InfrahubDatabase, add_indexes) but if an exception occurs before yield the
original config isn't restored; modify db_small_thread_pool to wrap the setup,
yield and teardown in a try/finally so that in the finally block you always call
await db.close() if db was created and restore config.SETTINGS.database.address
and .port regardless of errors (referencing db_small_thread_pool, get_db,
add_indexes, db.close and config.SETTINGS.database.address/port to locate the
code).
- Around line 1-9: Update the module docstring at the top of
backend/tests/component/graphql/test_graphql_read_retry.py to match the actual
Neo4j configuration used in the tests: change the described Bolt thread pool max
size from 1 to 10 to reflect the fixture that sets
"NEO4J_server_bolt_thread__pool__max__size": "10"; ensure the docstring wording
mentions server.bolt.thread_pool_max_size=10 so it accurately describes the test
setup.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 97d9397e-7b65-4583-87d9-3c80efbf9f12

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad1b42 and 55179e3.

📒 Files selected for processing (2)
  • backend/tests/component/graphql/test_graphql_read_retry.py
  • backend/tests/conftest.py

@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch 2 times, most recently from d5d9ed3 to 40321e2 on March 27, 2026 15:42
@BeArchiTek
Contributor

@cubic review this PR

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Mar 30, 2026

@cubic review this PR

@BeArchiTek I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 12 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py">

<violation number="1" location="backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py:115">
P2: Custom agent: **Detect try/finally-based ad-hoc cleanup in tests (encourage yield fixtures)**

This test uses `try/finally` for setup/teardown of `retry_limit`, duplicating the yield-fixture pattern already used by `_set_retry_limit` in the same file. Extract a parameterized or dedicated yield fixture (e.g., `_set_high_retry_limit`) to centralize the cleanup and keep the test body focused on assertions.</violation>
</file>

<file name="backend/tests/component/graphql/test_graphql_read_retry.py">

<violation number="1" location="backend/tests/component/graphql/test_graphql_read_retry.py:175">
P1: Custom agent: **Flag ad-hoc reproducer tests committed to CI (unmarked/flaky reproducer code)**

This test is unmarked but lives in a file flagged as flaky/reproducer. The file explicitly contains 'flaky' and 'Reproduction test for issue #8696' tokens (line 207), and the PR author states these tests should be disabled in CI. Per the rule, tests in files containing these keywords must be gated with `@pytest.mark.skip` or `@pytest.mark.xfail`. Add a skip marker to prevent CI instability.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Contributor

@ajtmccarty ajtmccarty left a comment

just some test comments that don't need to block merging

Contributor

not certain if this should be a component test
I'd think of spinning up a new container as more of an integration or integration_docker test... BUT it is just another database container, so I'm not sure. Asking for the team's input in Slack, but this does not need to block merging this PR.

Contributor

@polmichel polmichel Mar 30, 2026

In its current state, based on the dev/guidelines/backend/testing.md file, I think this should go into component, as it is only related to the database, but maybe in a separate folder that specifically tests the database container and its settings. I do not see an existing database folder in component tests; we could create one if this is the first time we try to solve this kind of db-specific issue.

In my opinion, if we want this test scenario to sit alongside similar integration tests, it may be justified to group this kind of test. In that case, we should perhaps send the requests via the "accessible" API infrahub_sdk.client.InfrahubClient.execute_graphql rather than directly through the inner graphql library?

Contributor

I think either way is fine as long as the test exercises the resolvers that have the new decorator

Contributor Author

I understand the debate but this felt obvious to me: I just want to test an internal behavior (the graphql resolvers) against a custom, real database... component test is the right place in my opinion

@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch 2 times, most recently from 207d096 to bd23d0d on March 31, 2026 07:03
This PR improves the retry_db_transaction decorator by making its
backoff retry configurable.
It also enables the decorator on the GraphQL query resolvers.
Adds component tests to check that the retry behavior solves hard
failures when the Neo4j thread pool is exhausted: we must start a new
container with a low thread pool size.

Signed-off-by: Fatih Acar <fatih@opsmill.com>
Signed-off-by: Fatih Acar <fatih@opsmill.com>
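
The configurable backoff the commit message describes can be sketched as follows; the decorator name matches the PR, but the parameter names (`base_delay`, `max_delay`, `jitter`) and the error type are illustrative stand-ins for the three new settings and the driver's transient errors:

```python
import asyncio
import random
from functools import wraps


class TransientDatabaseError(Exception):
    """Stand-in for the transient Neo4j driver errors the real decorator retries on."""


def retry_db_transaction(retries: int = 3, base_delay: float = 0.05,
                         max_delay: float = 1.0, jitter: float = 0.05):
    """Retry an async function on transient errors with exponential backoff plus jitter."""

    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return await func(*args, **kwargs)
                except TransientDatabaseError:
                    if attempt == retries:
                        raise  # out of retries: surface the error
                    # Exponential backoff capped at max_delay, plus random jitter
                    # so concurrent resolvers do not retry in lockstep.
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    await asyncio.sleep(delay + random.uniform(0, jitter))

        return wrapper

    return decorator
```

Applied to a resolver, a query that fails twice on pool exhaustion and then succeeds would return normally on the third attempt instead of surfacing an error to the client.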
@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch from bd23d0d to 17ed35f on March 31, 2026 13:27
@fatih-acar fatih-acar merged commit 69eb282 into stable Mar 31, 2026
46 of 47 checks passed
@fatih-acar fatih-acar deleted the fac-add-resolver-db-retry branch March 31, 2026 14:13

Labels

group/backend Issue related to the backend (API Server, Git Agent) type/documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

task: add database query retry for read queries (not only mutations)

5 participants