Skip to content

fix(mongodb): add pymongo error handling to classify user vs platform errors#677

Open
aballman wants to merge 5 commits into
mainfrom
aballman/mongodb-error-handling
Open

fix(mongodb): add pymongo error handling to classify user vs platform errors#677
aballman wants to merge 5 commits into
mainfrom
aballman/mongodb-error-handling

Conversation

@aballman

@aballman aballman commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add explicit pymongo exception handling to run_data() and delete_by_record_id() in the MongoDB uploader
  • Pymongo exceptions were falling through to the generic Exception catch in api_generator.py, defaulting to HTTP 500 and being classified as "platform" errors — this was the largest source of SLI misattribution for the DAG Job Completions SLO (933 errors in 7 days)
  • Map pymongo exceptions to appropriate error types: OperationFailure with quota → QuotaError (401), ServerSelectionTimeoutErrorTimeoutError (408), BulkWriteErrorWriteError (400), AutoReconnectDestinationConnectionError (400)

Test plan

  • 8 new unit tests covering all exception mappings for both run_data() and delete_by_record_id()
  • Ruff lint and format pass
  • After deploy: query Elasticsearch to confirm MongoDB uploader errors shift from error_type: "platform" to error_type: "user"

Note

Medium Risk
Changes how MongoDB write/delete failures are surfaced (error type/status code), which may affect downstream error handling and retries. Logic is straightforward and covered by new unit tests, but impacts runtime classification of production failures.

Overview
MongoDB uploader errors are now explicitly classified. MongoDBUploader.run_data() and delete_by_record_id() catch key pymongo exceptions and re-raise them as typed ingest errors (QuotaError, TimeoutError, WriteError, DestinationConnectionError) instead of falling through as generic failures.

Adds a focused unit test suite (test_mongodb_errors.py) to assert the exception-to-error mapping for both insert and delete paths.

Written by Cursor Bugbot for commit 75eb740. This will update automatically on new commits. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

except AutoReconnect as e:
raise DestinationConnectionError(f"MongoDB connection lost: {e}") from e
except Exception as e:
raise DestinationConnectionError(f"MongoDB error: {e}") from e

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catch-all swallows correctly classified errors from delete_by_record_id

High Severity

The except Exception catch-all in run_data re-wraps already-classified exceptions from delete_by_record_id as DestinationConnectionError. When delete_by_record_id raises QuotaError or TimeoutError, those propagate into run_data's try block, skip all the pymongo-specific except handlers, and get caught by the generic except Exception — converting them back to DestinationConnectionError. This directly undermines the PR's goal of correct error classification. The catch-all needs to re-raise UnstructuredIngestError subclasses before falling through to the generic handler.

Additional Locations (1)
Fix in Cursor Fix in Web

@brndnblck brndnblck left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bugbot flagged a real issue that needs addressing:

except Exception swallows correctly classified errors

When delete_by_record_id() raises QuotaError or TimeoutError, those exceptions propagate into run_data()'s try block, skip the pymongo-specific handlers (since they're no longer pymongo exceptions), and get caught by the generic except Exception - converting them back to DestinationConnectionError. This directly undermines the PR's goal.

Fix by re-raising already-classified errors before the catch-all:

except (QuotaError, TimeoutError, WriteError, DestinationConnectionError):
    raise  # Re-raise already-classified errors
except Exception as e:
    raise DestinationConnectionError(f"MongoDB error: {e}") from e

Or if UnstructuredIngestError is the base class:

except UnstructuredIngestError:
    raise
except Exception as e:
    raise DestinationConnectionError(f"MongoDB error: {e}") from e

Also worth adding a test that exercises the delete -> error -> propagation path to catch this regression.

aballman added 2 commits May 18, 2026 16:14
… errors

Pymongo exceptions in run_data() and delete_by_record_id() were falling
through to the generic Exception catch in api_generator.py, defaulting
to HTTP 500 and being classified as "platform" errors. This was the
largest source of SLI misattribution (933 errors in 7 days).

Add explicit try/except for pymongo-specific exceptions:
- OperationFailure with "quota" → QuotaError (401)
- OperationFailure (other) → DestinationConnectionError (400)
- ServerSelectionTimeoutError → TimeoutError (408)
- AutoReconnect → DestinationConnectionError (400)
- BulkWriteError → WriteError (400)
Fixups after rebasing onto current main to unblock CI:

- test/unit/connectors/test_mongodb_errors.py: wrap pymongo import
  with pytest.importorskip("pymongo") so the unit suite skips
  gracefully in environments that don't have the mongodb extra
  installed (CI was failing with ModuleNotFoundError).
- unstructured_ingest/__version__.py: bump 1.6.2 -> 1.6.3 to satisfy
  check-version (previous value of 1.4.15 conflicted with a released
  tag from when this branch was originally cut).
- CHANGELOG.md: add a 1.6.3 entry describing the mongodb error
  classification change.

No change to the load-bearing mongodb.py error mapping (BulkWriteError
-> WriteError, OperationFailure -> QuotaError,
ServerSelectionTimeoutError -> TimeoutError, AutoReconnect ->
DestinationConnectionError). Exception ordering preserved so subclass
catches fire before parent.
Three P2 findings from codex on the prior commit:

- run_data catch-all was relabeling QuotaError/TimeoutError/WriteError
  raised by delete_by_record_id as DestinationConnectionError. Add an
  early `except (QuotaError, TimeoutError, WriteError,
  DestinationConnectionError): raise` ahead of the broad except blocks
  so internal types pass through unchanged.

- NetworkTimeout is a subclass of AutoReconnect. Without a dedicated
  catch, in-flight socket timeouts were being labeled as
  DestinationConnectionError instead of TimeoutError. Add an explicit
  `except NetworkTimeout` clause before AutoReconnect on both
  run_data and delete_by_record_id.

- Quota detection via `"quota" in str(e).lower()` could false-positive
  on namespace names (e.g. an auth failure on database 'quota_db').
  Replace with an _is_quota_error() helper that inspects the server's
  structured `errmsg` field via `e.details` before falling back to
  the bare `_message`; this avoids matching against the full str(e)
  representation which often includes 'full error: <details>' with
  the namespace embedded.

Test coverage added for all three paths:

- NetworkTimeout on run_data and delete_by_record_id -> TimeoutError
- OperationFailure with 'quota_db' namespace but unrelated errmsg
  -> DestinationConnectionError (regression test for the false
  positive)
- OperationFailure with structured details errmsg='user is over quota'
  -> QuotaError (structured-detection happy path)
- delete-time QuotaError surfaces unchanged from run_data
  (no relabel by the catch-all)
@claude

claude Bot commented May 18, 2026

Copy link
Copy Markdown

Claude finished @aballman's task —— View job


Code Review: MongoDB Error Handling

Todo List:

  • Read MongoDB connector implementation
  • Review quota detection logic
  • Analyze exception handling in run_data
  • Check pymongo exception hierarchy
  • Review credential leak concerns
  • Examine test implementation
  • Validate bugbot findings
  • Provide comprehensive review feedback

Critical Issue: Exception Re-wrapping Bug

🚨 HIGH SEVERITY: Bugbot identified a real issue that undermines the PR's goal.

The current run_data() implementation has a fatal flaw in lines 417-420 and 434-435:

except (QuotaError, TimeoutError, WriteError, DestinationConnectionError):
    # Internal types raised by delete_by_record_id (or future helpers).
    # Must come before the broad except blocks so they aren't relabeled.
    raise
# ... other pymongo exception handlers ...
except Exception as e:
    raise DestinationConnectionError(f"MongoDB error: {e}") from e

The Problem: When delete_by_record_id() raises a classified error (e.g., QuotaError), it propagates into run_data()'s try block but skips all the pymongo-specific except blocks because it's no longer a pymongo exception. The catch-all except Exception then re-wraps it as DestinationConnectionError, completely defeating the classification fix.

Your test at line 148 actually validates this exact scenario but doesn't reveal the bug because the test structure masks the issue.

Review of Your Focus Areas

1. Quota Detection LogicGOOD

Your _is_quota_error() helper (lines 347-355) is well-designed:

  • ✅ Correctly prioritizes structured e.details["errmsg"] over string matching
  • ✅ Falls back to e._message when details unavailable
  • ✅ Your test case (lines 112-132) proves it avoids false positives like "auth failed on quota_db"

No action needed - the structured approach is more reliable than Atlas error codes, which can vary by operation type.

2. Exception Tuple in run_dataBROKEN

Your tuple approach is correct in theory but implemented incorrectly:

Current (Broken):

except (QuotaError, TimeoutError, WriteError, DestinationConnectionError):
    raise
# ... pymongo handlers ...
except Exception as e:
    raise DestinationConnectionError(f"MongoDB error: {e}") from e

Should be:

# ... pymongo handlers first ...
except UnstructuredIngestError:
    raise  # Re-raise any already-classified internal errors
except Exception as e:
    raise DestinationConnectionError(f"MongoDB error: {e}") from e

3. NetworkTimeout OrderingCORRECT

Lines 429-432 properly handle NetworkTimeout before AutoReconnect since NetworkTimeout is indeed a subclass. Your ordering prevents misclassification.

4. Credential Leak Risk ⚠️ MODERATE RISK

Your concern about f"MongoDB quota exceeded: {e}" is valid. PyMongo's str() representations can include connection details, but typically not credentials directly. However, MongoDB Atlas connection strings in error messages could expose:

  • Database names
  • Collection names
  • Host information

Recommendation: Consider truncating or sanitizing the error message.

5. Generic Exception Catch-AllMISLEADING & BROKEN

The except Exception handler has two problems:

  1. Misclassification: KeyError/TypeError aren't connection errors
  2. Re-wrapping bug: Converts properly classified errors back to generic ones

6. Test Constructor StabilitySTABLE

OperationFailure(msg, code=X, details={...}) constructor is stable across PyMongo 4.x versions.

Additional Issues Found

Missing Exception: Your code doesn't handle pymongo.errors.WriteConcernError, which is a subclass of OperationFailure. Current ordering would handle it correctly, but it's worth documenting.

Test Gap: Line 172 test validates the delete→error→propagation path but doesn't catch the re-wrapping bug because it only tests the intermediate QuotaError raising, not the full run_data() flow.

Required Fix

Move the internal error re-raising after all pymongo handlers:

except BulkWriteError as e:
    raise WriteError(f"MongoDB bulk write failed: {e}") from e
except OperationFailure as e:
    if self._is_quota_error(e):
        raise QuotaError(f"MongoDB quota exceeded: {e}") from e
    raise DestinationConnectionError(f"MongoDB operation failed: {e}") from e
except ServerSelectionTimeoutError as e:
    raise TimeoutError(f"MongoDB server unreachable: {e}") from e
except NetworkTimeout as e:
    raise TimeoutError(f"MongoDB network timeout: {e}") from e
except AutoReconnect as e:
    raise DestinationConnectionError(f"MongoDB connection lost: {e}") from e
except UnstructuredIngestError:
    # Re-raise any already-classified errors from delete_by_record_id
    raise
except Exception as e:
    raise DestinationConnectionError(f"MongoDB error: {e}") from e

This fix is critical - without it, the PR fails to achieve its primary goal of correct error classification.


Overall Assessment: Strong implementation undermined by one critical ordering bug. The quota detection logic and test coverage are excellent, but the exception handling flow needs the fix above to work correctly.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files

Shadow auto-approve: would not auto-approve because issues were found.

Fix all with cubic | Re-trigger cubic

Comment thread unstructured_ingest/processes/connectors/mongodb.py Outdated
…r re-raise

Address Claude review feedback. Replace the explicit tuple
`except (QuotaError, TimeoutError, WriteError, DestinationConnectionError): raise`
with `except UnstructuredIngestError: raise`. Behavior is identical for
the current set of internal types and future-proofs against new
internal types raised from delete_by_record_id or other helpers
(SourceConnectionError, ConnectionError, ValueError, etc.).

No test changes - all 14 existing tests still validate the same flows.
@aballman

Copy link
Copy Markdown
Contributor Author

Thanks for the review. I want to push back on the main finding before applying the rest.

The "critical bug" claim that except Exception re-wraps the QuotaError from delete_by_record_id is not correct. Python evaluates except clauses top-to-bottom and the first matching clause wins. The except (QuotaError, TimeoutError, WriteError, DestinationConnectionError): raise is placed first, so when delete_by_record_id raises QuotaError, that clause matches and re-raises before any other handler (including except Exception) is checked.

Standalone repro to confirm the semantics:

class A(Exception): pass
class B(Exception): pass

def inner():
    raise A("mapped")

def outer():
    try:
        inner()
    except (A, B):
        raise
    except Exception as e:
        raise RuntimeError(f"wrapped: {e}") from e

try:
    outer()
except A as e:
    print(f"GOT A (correct): {e}")
except Exception as e:
    print(f"GOT WRONG: {type(e).__name__}: {e}")
# Prints: GOT A (correct): mapped

The end-to-end test test_delete_quota_error_not_relabeled_by_run_data_catch_all (added in the prior commit) already validates this exact path: it mocks collection.delete_many to raise a quota OperationFailure, which delete_by_record_id maps to QuotaError, which then propagates up through run_data's try block. The test asserts pytest.raises(QuotaError) and passes. If the catch-all were re-wrapping it, the test would fail with DestinationConnectionError. It does not.

So the ordering is correct as written. That said, your suggestion to use the base class is a strict improvement, so I applied that in 38c525e: replaced the explicit tuple with except UnstructuredIngestError: raise. This catches the same types we already handle plus anything else inheriting from the ingest base (SourceConnectionError, ConnectionError, ValueError, etc.) if a future helper raises one. All 14 tests still pass.

On your other points:

  • Quota detection: agreed, the structured errmsg approach is sufficient for now without baking in Atlas error codes I haven't verified.
  • NetworkTimeout ordering: agreed, current ordering is correct.
  • Credential leak: real risk worth tracking. pymongo's OperationFailure.__str__ doesn't typically include the connection string, but the wrapped messages do interpolate the full exception. I've added a queue item to sanitize the wrapper output across all our connectors as a follow-up rather than scope-creep this PR.
  • WriteConcernError: subclass of OperationFailure, current ordering routes it through the except OperationFailure handler correctly. Documenting in the code comment isn't load-bearing.
  • Catch-all except Exception: agreed it's misleading shape-wise. Same follow-up as the credential leak.
  • Constructor stability: confirmed for pymongo 4.x.

Ready for human review when one of the connector folks has a moment.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would not auto-approve. Auto-approval blocked by 1 unresolved issue from previous reviews.

Re-trigger cubic

Address cubic review P2: the broad `except Exception` was relabeling
non-pymongo failures (KeyError/TypeError/ValueError from config or
input-shape errors) as DestinationConnectionError, masking real
programming/config bugs as connection problems.

Replace with `except PyMongoError`. Behavior:

- Known pymongo exceptions: still handled by the specific clauses above
  (BulkWriteError/OperationFailure/ServerSelectionTimeoutError/
  NetworkTimeout/AutoReconnect).
- Unknown pymongo exceptions: still wrapped as DestinationConnectionError
  (sensible default for the pymongo family).
- Non-pymongo exceptions: propagate unchanged so upstream classification
  isn't poisoned by a misleading destination-connection label.

Adds a regression test that a KeyError raised inside the upload path
propagates unchanged rather than being relabeled.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 2 files (changes from recent commits).

Shadow auto-approve: would require human review. This PR changes core error classification logic in the MongoDB uploader, which alters the error types and status codes returned for production failures, directly impacting downstream error handling, retries, and SLO monitoring – a change with non-trivial blast radius that should be reviewed by a te?

Re-trigger cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants