
fix(backend): add db backoff retry for resolvers#8712

Merged
fatih-acar merged 2 commits into stable from fac-add-resolver-db-retry on Mar 31, 2026

Conversation

@fatih-acar
Contributor

@fatih-acar fatih-acar commented Mar 26, 2026

This PR improves the retry_db_transaction decorator by making the backoff retry configurable.
It also enables the decorator on the GraphQL query resolvers.

It adds component tests to check that the retry behavior resolves hard failures
when the Neo4j thread pool is exhausted: we must start a new container with a
low thread pool size.

I had to duplicate a few fixtures to get the new Neo4j container working... I'm not very satisfied with the tests and am looking for feedback on them. There's also a high chance they are flaky, so we probably want to disable them in CI for now.

Fixes #8696

Summary by CodeRabbit

  • Bug Fixes

    • GraphQL read operations now automatically retry with exponential backoff and jitter to reduce transient failures.
  • New Features

    • Three new configurable database retry parameters: base delay, max delay, and jitter.
  • Documentation

    • Configuration reference updated with the new retry environment variables and defaults.
  • Tests

    • Added unit and component tests validating retry behavior and backoff timing under contention.
  • Chores

    • Minor test fixture refactor and changelog entry added.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82ca71e7-17b3-4ce8-8375-a992e2e14536

📥 Commits

Reviewing files that changed from the base of the PR and between 55179e3 and 40321e2.

📒 Files selected for processing (12)
  • backend/infrahub/config.py
  • backend/infrahub/database/__init__.py
  • backend/infrahub/graphql/resolvers/account_metadata.py
  • backend/infrahub/graphql/resolvers/ipam.py
  • backend/infrahub/graphql/resolvers/resolver.py
  • backend/tests/component/graphql/test_graphql_read_retry.py
  • backend/tests/conftest.py
  • backend/tests/helpers/utils.py
  • backend/tests/unit/graphql/resolvers/__init__.py
  • backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py
  • changelog/8696.fixed.md
  • docs/docs/reference/configuration.mdx
✅ Files skipped from review due to trivial changes (3)
  • changelog/8696.fixed.md
  • docs/docs/reference/configuration.mdx
  • backend/tests/component/graphql/test_graphql_read_retry.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • backend/infrahub/graphql/resolvers/ipam.py
  • backend/tests/helpers/utils.py
  • backend/infrahub/graphql/resolvers/account_metadata.py
  • backend/infrahub/database/__init__.py
  • backend/tests/conftest.py
  • backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py
  • backend/infrahub/graphql/resolvers/resolver.py

Walkthrough

Added three database settings: retry_base_delay, retry_max_delay, and retry_jitter_max to configure exponential backoff timing. Updated retry_db_transaction to use functools.wraps and compute exponential backoff with uniform jitter clamped by retry_max_delay instead of a fixed random delay. Applied the decorator to multiple GraphQL resolvers (account, default, paginated, relationship, hierarchy, account metadata, IPAM). Added unit and component tests for retry behavior and timing, extended test helper to accept extra env, and updated docs and changelog.
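The backoff computation described in the walkthrough can be sketched as follows. `backoff_delay` is a hypothetical helper (the real logic lives inline in `retry_db_transaction`), and the `base_delay` default here is illustrative; only `max_delay=2.0` and `jitter_max=0.1` match the defaults shown in the config excerpt reviewed below.

```python
import random


def backoff_delay(
    attempt: int,
    base_delay: float = 0.1,  # illustrative; stands in for retry_base_delay
    max_delay: float = 2.0,   # retry_max_delay default per this PR
    jitter_max: float = 0.1,  # retry_jitter_max default per this PR
) -> float:
    """Exponential backoff with uniform jitter, clamped by max_delay.

    attempt is 1-based, matching the retry loop in the walkthrough.
    """
    jitter = random.uniform(0, jitter_max)
    return min(base_delay * (2 ** (attempt - 1)) + jitter, max_delay)


# With jitter disabled, the delay doubles each attempt until the cap:
# attempt 1 -> 0.1s, attempt 2 -> 0.2s, attempt 3 -> 0.4s, attempt 6+ -> 2.0s
```

The jitter spreads out retries from concurrent resolvers so they do not all hammer the database at the same instant (the "thundering herd" problem mentioned in the config description).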

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(backend): add db backoff retry for resolvers' accurately summarizes the main change: adding a configurable backoff retry mechanism to the retry_db_transaction decorator and enabling it on GraphQL resolvers.
Description check ✅ Passed The PR description covers key aspects: the backoff retry improvement, enabling the decorator on GraphQL resolvers, addition of component tests validating retry behavior, test infrastructure concerns, and a reference to the fixed issue.
Linked Issues check ✅ Passed The PR fully addresses the linked issue #8696 objectives: retry_db_transaction is applied to read-path resolvers (account_resolver, default_resolver, default_paginated_list_resolver, single_relationship_resolver, many_relationship_resolver, hierarchy_resolver) and exponential backoff is implemented with configurable timing parameters.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing exponential backoff retry and enabling it on GraphQL resolvers; no unrelated modifications are present in the changeset.


@github-actions github-actions Bot added type/documentation Improvements or additions to documentation group/backend Issue related to the backend (API Server, Git Agent) labels Mar 26, 2026
@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch from 9ee1e8d to 9ad1b42 Compare March 26, 2026 20:18
@codspeed-hq

codspeed-hq Bot commented Mar 26, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing fac-add-resolver-db-retry (17ed35f) with stable (9126527)¹

Open in CodSpeed

Footnotes

  1. No successful run was found on stable (a280f9a) during the generation of this report, so 9126527 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@fatih-acar fatih-acar marked this pull request as ready for review March 26, 2026 20:48
@fatih-acar fatih-acar requested review from a team as code owners March 26, 2026 20:48
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/infrahub/database/__init__.py (1)

520-537: ⚠️ Potential issue | 🟡 Minor

Don't sleep or count a retry after the last attempt.

On the terminal failure path this block still computes backoff, logs Retrying ..., increments TRANSACTION_RETRIES, and awaits asyncio.sleep(retry_time) before raising. That adds avoidable latency and overcounts retries whenever the final attempt fails.

💡 Proposed fix
         async def wrapper(*args: Any, **kwargs: Any) -> R:
             error = Exception()
             for attempt in range(1, config.SETTINGS.database.retry_limit + 1):
                 try:
                     return await func(*args, **kwargs)
                 except (TransientError, ClientError) as exc:
                     if isinstance(exc, ClientError):
                         if exc.code != "Neo.ClientError.Statement.EntityNotFound":
                             raise exc
+                    if attempt == config.SETTINGS.database.retry_limit:
+                        error = exc
+                        break
                     base_delay = config.SETTINGS.database.retry_base_delay
                     max_delay = config.SETTINGS.database.retry_max_delay
                     jitter = random.uniform(0, config.SETTINGS.database.retry_jitter_max)
                     retry_time = min(base_delay * (2 ** (attempt - 1)) + jitter, max_delay)
                     log.exception("Retry handler caught database error")
@@
                     )
                     log.debug("Database transaction failed", message=exc.message)
                     TRANSACTION_RETRIES.labels(name).inc()
                     await asyncio.sleep(retry_time)
-                    if attempt == config.SETTINGS.database.retry_limit:
-                        error = exc
-                        break
             raise error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/infrahub/database/__init__.py` around lines 520 - 537, The retry
branch inside the transaction retry loop wrongly performs backoff work even on
the final attempt; update the block that handles (TransientError, ClientError)
so that it checks if attempt == config.SETTINGS.database.retry_limit first and,
on the last attempt, sets the final error (error = exc) and breaks/raises
without computing retry_time, calling log.info("Retrying..."), incrementing
TRANSACTION_RETRIES.labels(name).inc(), or awaiting asyncio.sleep(retry_time);
only compute jitter/base_delay/max_delay, log the retry, increment the metric
and await asyncio.sleep when there will be another attempt (i.e., attempt <
retry_limit). Ensure you reference the existing symbols: attempt,
config.SETTINGS.database.retry_limit, TRANSACTION_RETRIES.labels(name).inc(),
asyncio.sleep, log.info and error = exc.
🧹 Nitpick comments (2)
backend/infrahub/graphql/resolvers/resolver.py (1)

28-29: Per-attempt span visibility is lost with trace decorator above retry logic.

With @trace.get_tracer(__name__).start_as_current_span() above @retry_db_transaction, all retry attempts execute within a single span. Attempt-level information (which attempt failed/succeeded) is logged but not separated into distinct spans. To enable per-attempt span visibility, either swap the decorator order so @retry_db_transaction wraps the trace spans, or add span instrumentation inside the retry_db_transaction decorator.

Also applies to: 54-56, 187-189, 288-290, 296-298, 318-320

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/infrahub/graphql/resolvers/resolver.py` around lines 28 - 29, The
trace span decorator is placed above the retry decorator so all attempts run
inside one span; swap the decorator order so
`@retry_db_transaction(name="account_resolver")` is above
trace.get_tracer(__name__).start_as_current_span("account_resolver") (or
alternatively add instrumentation to open a new span inside
retry_db_transaction) for the account_resolver and the other listed resolvers
(the decorators at lines around 54-56, 187-189, 288-290, 296-298, 318-320) so
each retry attempt creates its own span and attempt-level visibility is
preserved.
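A toy illustration of why decorator order matters here: `retry` and `traced` below are minimal stand-ins for `retry_db_transaction` and the OpenTelemetry span decorator, not the real implementations. With the retry decorator on the outside, each attempt re-enters the span decorator, so every attempt gets its own span.

```python
import functools

spans: list[str] = []          # records one entry per "span" opened
calls = {"n": 0}               # counts invocations of the wrapped function


def retry(times: int):
    """Toy stand-in for retry_db_transaction: retries on ValueError."""
    def deco(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except ValueError:
                    if attempt == times:
                        raise
        return wrapper
    return deco


def traced(name: str):
    """Toy stand-in for start_as_current_span: records a 'span' per call."""
    def deco(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            spans.append(name)
            return func(*args, **kwargs)
        return wrapper
    return deco


@retry(times=3)       # outer: the retry loop wraps the span...
@traced("resolver")   # ...so each attempt opens a fresh span
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient")
    return "ok"


assert flaky() == "ok"
assert len(spans) == 3  # one span per attempt, not one span total
```

With the order reversed (`traced` outside `retry`), `spans` would hold a single entry for all three attempts, which is the loss of per-attempt visibility the comment describes.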
backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py (1)

33-39: Replace test helper with a property mock to avoid depending on Neo4j's private exception implementation.

The _make_client_error helper (lines 33–39) sets private attributes (_neo4j_code, _classification, _category, _title) that the Neo4j driver may change internally without notice. While production code correctly uses the public exc.code property, the test's reliance on private fields is brittle.

Instead, construct the exception and mock its code property directly:

def _make_client_error(
    message: str = "not found", code: str = "Neo.ClientError.Statement.EntityNotFound"
) -> ClientError:
    exc = ClientError(message)
    exc.code = code  # or use unittest.mock.PropertyMock if needed
    return exc

Or use unittest.mock.patch.object on the code property when setting up test fixtures to ensure the mock aligns with the public API the retry logic actually reads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py` around
lines 33 - 39, The helper _make_client_error currently sets Neo4j private
attributes (_neo4j_code, _classification, _category, _title) on a ClientError
which is brittle; change it to set or mock the public .code property instead
(either assign exc.code = code if possible or use
unittest.mock.PropertyMock/patch.object to patch ClientError.code) so tests use
the same public API that retry_db_transaction reads; update any test fixtures
that call _make_client_error accordingly.
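A rough sketch of the `PropertyMock` approach suggested above. `ClientError` here is a stand-in class with a read-only `code` property, not the real neo4j exception, so the snippet stays self-contained:

```python
from unittest.mock import PropertyMock, patch


class ClientError(Exception):
    """Stand-in for neo4j.exceptions.ClientError; .code is read-only."""

    @property
    def code(self) -> str:
        return "Neo.ClientError.Unknown"


exc = ClientError("not found")

# Patch the public property the retry logic actually reads, rather than
# poking at private driver attributes like _neo4j_code.
with patch.object(ClientError, "code", new_callable=PropertyMock) as mock_code:
    mock_code.return_value = "Neo.ClientError.Statement.EntityNotFound"
    assert exc.code == "Neo.ClientError.Statement.EntityNotFound"

# Outside the patch, the real property is restored:
assert exc.code == "Neo.ClientError.Unknown"
```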

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b91d40ce-b414-452d-865e-0edf9252f300

📥 Commits

Reviewing files that changed from the base of the PR and between aeee493 and 9ad1b42.

📒 Files selected for processing (11)
  • backend/infrahub/config.py
  • backend/infrahub/database/__init__.py
  • backend/infrahub/graphql/resolvers/account_metadata.py
  • backend/infrahub/graphql/resolvers/ipam.py
  • backend/infrahub/graphql/resolvers/resolver.py
  • backend/tests/component/graphql/test_graphql_read_retry.py
  • backend/tests/helpers/utils.py
  • backend/tests/unit/graphql/resolvers/__init__.py
  • backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py
  • changelog/8696.fixed.md
  • docs/docs/reference/configuration.mdx

Comment on lines +69 to +77
def neo4j_small_thread_pool(request: pytest.FixtureRequest) -> dict[int, int]:
    """Start a dedicated Neo4j container with a minimal Bolt thread pool (size=10)."""
    if not INFRAHUB_USE_TEST_CONTAINERS:
        pytest.skip("Test containers are disabled")

    container = start_neo4j_container(
        neo4j_image=NEO4J_IMAGE,
        extra_env={"NEO4J_server_bolt_thread__pool__max__size": "10"},
    )
Contributor


⚠️ Potential issue | 🟠 Major

Make the thread-pool bottleneck deterministic here.

The module docstring says this container should run with server.bolt.thread_pool_max_size=1, but this fixture actually sets it to 10. With the success case only issuing 10 concurrent queries on Line 312, the test can pass without ever exhausting the pool or exercising retry_db_transaction.

💡 Proposed fix
 @pytest.fixture(scope="module")
 def neo4j_small_thread_pool(request: pytest.FixtureRequest) -> dict[int, int]:
-    """Start a dedicated Neo4j container with a minimal Bolt thread pool (size=10)."""
+    """Start a dedicated Neo4j container with a minimal Bolt thread pool (size=1)."""
@@
     container = start_neo4j_container(
         neo4j_image=NEO4J_IMAGE,
-        extra_env={"NEO4J_server_bolt_thread__pool__max__size": "10"},
+        extra_env={"NEO4J_server_bolt_thread__pool__max__size": "1"},
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 69 -
77, The neo4j_small_thread_pool fixture currently sets the Bolt thread pool max
size to "10" but should be deterministic and match the module docstring
(size=1); update the extra_env value passed to start_neo4j_container in
neo4j_small_thread_pool to set "NEO4J_server_bolt_thread__pool__max__size" to
"1" (and ensure the fixture docstring reflects size=1) so the test reliably
exhausts the pool and exercises retry_db_transaction.

Comment on lines +84 to +102
async def db_small_thread_pool(
    neo4j_small_thread_pool: dict[int, int],
) -> AsyncGenerator[InfrahubDatabase, None]:
    """Create an InfrahubDatabase connected to the small-thread-pool Neo4j."""
    original_address = config.SETTINGS.database.address
    original_port = config.SETTINGS.database.port

    config.SETTINGS.database.address = "localhost"
    config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]

    driver = await get_db(retry=5)
    db = InfrahubDatabase(driver=driver)
    await add_indexes(db=db)

    yield db

    await db.close()
    config.SETTINGS.database.address = original_address
    config.SETTINGS.database.port = original_port
Contributor


⚠️ Potential issue | 🟠 Major

Harden this fixture against stale validation state and setup failures.

get_db(retry=5) only waits through validate_database(), but that path is skipped once backend/infrahub/database/__init__.py::validated_database already contains the default database name. In a full suite run this fixture can therefore skip readiness checks for the dedicated container, and if setup fails before yield, the overridden config.SETTINGS.database.address / port never get restored.

💡 Proposed fix
-from infrahub.database import InfrahubDatabase, get_db
+from infrahub.database import InfrahubDatabase, get_db, validate_database
 @pytest.fixture(scope="module")
 async def db_small_thread_pool(
     neo4j_small_thread_pool: dict[int, int],
 ) -> AsyncGenerator[InfrahubDatabase, None]:
     """Create an InfrahubDatabase connected to the small-thread-pool Neo4j."""
     original_address = config.SETTINGS.database.address
     original_port = config.SETTINGS.database.port
+    db: InfrahubDatabase | None = None
 
-    config.SETTINGS.database.address = "localhost"
-    config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
-
-    driver = await get_db(retry=5)
-    db = InfrahubDatabase(driver=driver)
-    await add_indexes(db=db)
-
-    yield db
-
-    await db.close()
-    config.SETTINGS.database.address = original_address
-    config.SETTINGS.database.port = original_port
+    try:
+        config.SETTINGS.database.address = "localhost"
+        config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
+
+        driver = await get_db(retry=5)
+        await validate_database(
+            driver=driver,
+            database_name=config.SETTINGS.database.database_name,
+            retry=5,
+            create_db=True,
+        )
+        db = InfrahubDatabase(driver=driver)
+        await add_indexes(db=db)
+
+        yield db
+    finally:
+        if db is not None:
+            await db.close()
+        config.SETTINGS.database.address = original_address
+        config.SETTINGS.database.port = original_port
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 84 -
102, The fixture db_small_thread_pool can skip readiness because
backend/infrahub/database/__init__.py::validated_database may already be set,
and it leaves config.SETTINGS.database.address/port mutated on exceptions; fix
by forcing validation and ensuring cleanup: save and clear the module-level
validated_database before calling get_db (or call get_db with a force/validate
flag if available), wrap the setup/yield/teardown in a try/finally so you always
restore config.SETTINGS.database.address and .port and restore the original
validated_database, and ensure any created driver/db (from get_db and
InfrahubDatabase) are closed in the finally if setup failed before yield;
reference db_small_thread_pool, get_db, validated_database, config.SETTINGS,
InfrahubDatabase, and add_indexes to locate changed code.

Comment on lines +19 to +24
def _set_retry_limit() -> Generator[None, None, None]:
    original = config.SETTINGS.database.retry_limit
    config.SETTINGS.database.retry_limit = 3
    yield
    config.SETTINGS.database.retry_limit = original

Contributor


⚠️ Potential issue | 🟡 Minor

Pin retry timing settings in the fixture to keep these tests deterministic.

Line 19-24 only fixes retry_limit, but Line 105-112 and Line 128-131 assert timing behavior tied to base/max/jitter defaults. Please set and restore all retry timing settings in the fixture to avoid unrelated config drift breaking unit tests.

Suggested patch
 @pytest.fixture(autouse=True)
 def _set_retry_limit() -> Generator[None, None, None]:
-    original = config.SETTINGS.database.retry_limit
+    original_retry_limit = config.SETTINGS.database.retry_limit
+    original_base_delay = config.SETTINGS.database.retry_base_delay
+    original_max_delay = config.SETTINGS.database.retry_max_delay
+    original_jitter_max = config.SETTINGS.database.retry_jitter_max
+
     config.SETTINGS.database.retry_limit = 3
+    config.SETTINGS.database.retry_base_delay = 0.1
+    config.SETTINGS.database.retry_max_delay = 2.0
+    config.SETTINGS.database.retry_jitter_max = 0.0
     yield
-    config.SETTINGS.database.retry_limit = original
+    config.SETTINGS.database.retry_limit = original_retry_limit
+    config.SETTINGS.database.retry_base_delay = original_base_delay
+    config.SETTINGS.database.retry_max_delay = original_max_delay
+    config.SETTINGS.database.retry_jitter_max = original_jitter_max

Also applies to: 105-112, 128-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py` around
lines 19 - 24, The fixture _set_retry_limit only pins retry_limit; update it to
also capture and restore the timing-related settings
(config.SETTINGS.database.retry_base, config.SETTINGS.database.retry_max, and
config.SETTINGS.database.retry_jitter) alongside retry_limit so tests asserting
timing (lines referencing those defaults) remain deterministic—store each
original value before changing, set the desired deterministic values in the
fixture, yield, then restore all originals after the test.

    default=2.0, ge=0, description="Maximum delay in seconds for exponential backoff on transaction retries."
)
retry_jitter_max: float = Field(
    default=0.1, ge=0, description="Maximum jitter in seconds added to retry delay to avoid thundering herd."
Contributor


"thundering herd" 👍

"""Test that without retries, concurrent queries on a small thread pool fail
with TransientError, proving the retry mechanism is needed."""
original_retry_limit = config.SETTINGS.database.retry_limit
config.SETTINGS.database.retry_limit = 1 # Only 1 attempt, no retries
Contributor


should be able to use a fixture with the yield pattern for this and avoid the try-finally below

    db_small_thread_pool: InfrahubDatabase,
    default_branch_stp: Branch,
) -> SchemaBranch:
    """Register data_schema + node_group_schema + car_person_schema on the small-thread-pool DB.
Contributor


we have a pattern that we've used to rescope fixtures in other places that I think would let you remove all or almost all of this function. I'm pretty sure we've got fixtures for all of these (maybe not CoreNode) in other places with the default or "class" scope

this is one example

@pytest.fixture
async def default_branch(
    reset_registry: None, local_storage_dir: Path, empty_database: None, db: InfrahubDatabase
) -> Branch:
    return await do_default_branch(db=db)


@pytest.fixture(scope="class")
async def default_branch_scope_class(
    reset_registry_scope_class: None,
    local_storage_dir_scope_class: Path,
    empty_database_scope_class: None,
    db: InfrahubDatabase,
) -> Branch:
    return await do_default_branch(db=db)


async def do_default_branch(db: InfrahubDatabase) -> Branch:
    branch = await create_default_branch(db=db)
    await create_global_branch(db=db)
    registry.schema = SchemaManager()
    return branch

Contributor Author


thanks, I think there's less duplication now

Comment thread: backend/tests/component/graphql/test_graphql_read_retry.py
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/tests/component/graphql/test_graphql_read_retry.py (2)

83-102: ⚠️ Potential issue | 🟡 Minor

Config restoration is not guaranteed on setup failure.

If an exception occurs during setup (lines 91-96) before the yield, the original config.SETTINGS.database.address and port values won't be restored, potentially affecting subsequent tests. Wrap the setup/yield/teardown in a try/finally block.

🛡️ Proposed fix
 @pytest.fixture(scope="module")
 async def db_small_thread_pool(
     neo4j_small_thread_pool: dict[int, int],
 ) -> AsyncGenerator[InfrahubDatabase, None]:
     """Create an InfrahubDatabase connected to the small-thread-pool Neo4j."""
     original_address = config.SETTINGS.database.address
     original_port = config.SETTINGS.database.port
+    db: InfrahubDatabase | None = None
 
-    config.SETTINGS.database.address = "localhost"
-    config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
-
-    driver = await get_db(retry=5)
-    db = InfrahubDatabase(driver=driver)
-    await add_indexes(db=db)
-
-    yield db
-
-    await db.close()
-    config.SETTINGS.database.address = original_address
-    config.SETTINGS.database.port = original_port
+    try:
+        config.SETTINGS.database.address = "localhost"
+        config.SETTINGS.database.port = neo4j_small_thread_pool[PORT_BOLT_NEO4J]
+
+        driver = await get_db(retry=5)
+        db = InfrahubDatabase(driver=driver)
+        await add_indexes(db=db)
+
+        yield db
+    finally:
+        if db is not None:
+            await db.close()
+        config.SETTINGS.database.address = original_address
+        config.SETTINGS.database.port = original_port
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 83 -
102, The fixture db_small_thread_pool currently sets
config.SETTINGS.database.address/port and performs async setup (get_db,
InfrahubDatabase, add_indexes) but if an exception occurs before yield the
original config isn't restored; modify db_small_thread_pool to wrap the setup,
yield and teardown in a try/finally so that in the finally block you always call
await db.close() if db was created and restore config.SETTINGS.database.address
and .port regardless of errors (referencing db_small_thread_pool, get_db,
add_indexes, db.close and config.SETTINGS.database.address/port to locate the
code).

1-9: ⚠️ Potential issue | 🟡 Minor

Module docstring is inconsistent with actual configuration.

The module docstring states server.bolt.thread_pool_max_size=1 (Line 4), but the fixture at Line 76 actually configures "NEO4J_server_bolt_thread__pool__max__size": "10". Update the module docstring to reflect the actual thread pool size used.

 """Component tests for GraphQL read resolver retry on transient Neo4j errors.
 
 These tests simulate Neo4j server-side Bolt thread pool exhaustion by spinning up
-a dedicated Neo4j container with server.bolt.thread_pool_max_size=1 and firing
+a dedicated Neo4j container with server.bolt.thread_pool_max_size=10 and firing
 concurrent queries that hit many-cardinality relationship resolvers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/component/graphql/test_graphql_read_retry.py` around lines 1 -
9, Update the module docstring at the top of
backend/tests/component/graphql/test_graphql_read_retry.py to match the actual
Neo4j configuration used in the tests: change the described Bolt thread pool max
size from 1 to 10 to reflect the fixture that sets
"NEO4J_server_bolt_thread__pool__max__size": "10"; ensure the docstring wording
mentions server.bolt.thread_pool_max_size=10 so it accurately describes the test
setup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/tests/component/graphql/test_graphql_read_retry.py`:
- Around line 155-165: The return type hint for _run_graphql_query is incorrect:
the function returns the ExecutionResult from graphql(), not a dict; update the
function signature to use the correct type (e.g., graphql.ExecutionResult or the
appropriate ExecutionResult import) and add/update the import if needed so
_run_graphql_query(db: InfrahubDatabase, branch: Branch) -> ExecutionResult
accurately reflects the value returned by the graphql(...) call.

---

Duplicate comments:
In `@backend/tests/component/graphql/test_graphql_read_retry.py`:
- Around line 83-102: The fixture db_small_thread_pool currently sets
config.SETTINGS.database.address/port and performs async setup (get_db,
InfrahubDatabase, add_indexes) but if an exception occurs before yield the
original config isn't restored; modify db_small_thread_pool to wrap the setup,
yield and teardown in a try/finally so that in the finally block you always call
await db.close() if db was created and restore config.SETTINGS.database.address
and .port regardless of errors (referencing db_small_thread_pool, get_db,
add_indexes, db.close and config.SETTINGS.database.address/port to locate the
code).
- Around line 1-9: Update the module docstring at the top of
backend/tests/component/graphql/test_graphql_read_retry.py to match the actual
Neo4j configuration used in the tests: change the described Bolt thread pool max
size from 1 to 10 to reflect the fixture that sets
"NEO4J_server_bolt_thread__pool__max__size": "10"; ensure the docstring wording
mentions server.bolt.thread_pool_max_size=10 so it accurately describes the test
setup.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 97d9397e-7b65-4583-87d9-3c80efbf9f12

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad1b42 and 55179e3.

📒 Files selected for processing (2)
  • backend/tests/component/graphql/test_graphql_read_retry.py
  • backend/tests/conftest.py

@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch 2 times, most recently from d5d9ed3 to 40321e2 on March 27, 2026 15:42
@BeArchiTek
Contributor

@cubic review this PR

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Mar 30, 2026

@cubic review this PR

@BeArchiTek I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 12 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py">

<violation number="1" location="backend/tests/unit/graphql/resolvers/test_retry_db_transaction.py:115">
P2: Custom agent: **Detect try/finally-based ad-hoc cleanup in tests (encourage yield fixtures)**

This test uses `try/finally` for setup/teardown of `retry_limit`, duplicating the yield-fixture pattern already used by `_set_retry_limit` in the same file. Extract a parameterized or dedicated yield fixture (e.g., `_set_high_retry_limit`) to centralize the cleanup and keep the test body focused on assertions.</violation>
</file>

<file name="backend/tests/component/graphql/test_graphql_read_retry.py">

<violation number="1" location="backend/tests/component/graphql/test_graphql_read_retry.py:175">
P1: Custom agent: **Flag ad-hoc reproducer tests committed to CI (unmarked/flaky reproducer code)**

This test is unmarked but lives in a file flagged as flaky/reproducer. The file explicitly contains 'flaky' and 'Reproduction test for issue #8696' tokens (line 207), and the PR author states these tests should be disabled in CI. Per the rule, tests in files containing these keywords must be gated with `@pytest.mark.skip` or `@pytest.mark.xfail`. Add a skip marker to prevent CI instability.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Contributor

@ajtmccarty ajtmccarty left a comment

just some test comments that don't need to block merging

Contributor

not certain if this should be a component test
I'd think of spinning up a new container as more of an integration or integration_docker test... BUT it is just another database container, so I'm not sure. Asking for the team's input in Slack, but this does not need to block merging this PR.

Contributor

@polmichel polmichel Mar 30, 2026

In its current state, based on the dev/guidelines/backend/testing.md file, I think this should go into component, as it is only related to the database, but maybe in a separate folder that specifically tests the database container and its settings. I do not see an existing database folder in component tests; we could create one if this is the first time we try to solve this kind of db-specific issue.

In my opinion, if we want this test scenario to sit alongside similar integration tests, it may be justified to group this kind of test. In that case, we should perhaps send the requests via the "accessible" API infrahub_sdk.client.InfrahubClient.execute_graphql rather than directly through the inner graphql library?

Contributor

I think either way is fine as long as the test exercises the resolvers that have the new decorator

Contributor Author

I understand the debate but this felt obvious to me: I just want to test an internal behavior (the graphql resolvers) against a custom, real database... component test is the right place in my opinion

@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch 2 times, most recently from 207d096 to bd23d0d on March 31, 2026 07:03
This PR improves the retry_db_transaction decorator by making its
backoff retry configurable.
It also enables the decorator on the GraphQL query resolvers.
Adds component tests to check that the retry behavior solves hard
failures when the Neo4j thread pool is exhausted: we must start a new
container with a low thread pool size.

Signed-off-by: Fatih Acar <fatih@opsmill.com>
Signed-off-by: Fatih Acar <fatih@opsmill.com>
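
The configurable backoff the commit message describes can be sketched as follows; the decorator name matches the PR, but the parameter names (`base_delay`, `max_delay`, `jitter`) and the error type are illustrative stand-ins for the three new settings and the driver's transient errors:

```python
import asyncio
import random
from functools import wraps


class TransientDatabaseError(Exception):
    """Stand-in for the transient Neo4j driver errors the real decorator retries on."""


def retry_db_transaction(retries: int = 3, base_delay: float = 0.05,
                         max_delay: float = 1.0, jitter: float = 0.05):
    """Retry an async function on transient errors with exponential backoff plus jitter."""

    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return await func(*args, **kwargs)
                except TransientDatabaseError:
                    if attempt == retries:
                        raise  # out of retries: surface the error
                    # Exponential backoff capped at max_delay, plus random jitter
                    # so concurrent resolvers do not retry in lockstep.
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    await asyncio.sleep(delay + random.uniform(0, jitter))

        return wrapper

    return decorator
```

Applied to a resolver, a query that fails twice on pool exhaustion and then succeeds would return normally on the third attempt instead of surfacing an error to the client.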
@fatih-acar fatih-acar force-pushed the fac-add-resolver-db-retry branch from bd23d0d to 17ed35f on March 31, 2026 13:27
@fatih-acar fatih-acar merged commit 69eb282 into stable Mar 31, 2026
46 of 47 checks passed
@fatih-acar fatih-acar deleted the fac-add-resolver-db-retry branch March 31, 2026 14:13

Labels

group/backend Issue related to the backend (API Server, Git Agent) type/documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

task: add database query retry for read queries (not only mutations)

5 participants