
Robustness & Error Handling

How Querchecker handles failures gracefully. For ops teams and developers managing production deployments.


API Rate-Limiting

Groq HTTP 429 Handling

Scenario: Groq rejects request with HTTP 429 "Too Many Requests" (6000 TPM limit exceeded).

Backend Response:

  1. AbstractLlmExtractionClient catches HttpClientErrorException (status 429)
  2. Parses Retry-After header → RateLimitException(retryAfterSeconds, provider)
  3. For DL extraction (DlExtractionService):
    • If retryAfterSeconds ≤ 20: Thread.sleep(retryAfter + 500ms) → retry once inline
    • If > 20: Save with FAILED status
  4. For spec-lookup (ProductLookupService):
    • If retryAfterSeconds ≤ 20: Schedule async retry via CompletableFuture.delayedExecutor()
    • Broadcast SSE lookup-result event with RATE_LIMITED status
    • If > 20: Fall back to bestPartial cached result (from earlier Brave search) if available
    • If no cached result: Save with RATE_LIMITED status
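The 20-second cutoff described above can be sketched as a small decision function. This is an illustrative sketch, not the actual Querchecker API: the class and method names here are invented, only the threshold and outcomes come from the docs, and the ICECAT-independent details (sleep padding, SSE broadcast) are omitted.

```java
// Sketch of the 20-second retry threshold (names are illustrative).
class RateLimitPolicy {
    enum RetryDecision { RETRY_INLINE, SCHEDULE_ASYNC, GIVE_UP }

    /** DL extraction: short waits are retried inline, long waits fail the run. */
    static RetryDecision forDlExtraction(long retryAfterSeconds) {
        return retryAfterSeconds <= 20 ? RetryDecision.RETRY_INLINE : RetryDecision.GIVE_UP;
    }

    /** Spec-lookup: short waits get an async retry, long waits fall back to cache. */
    static RetryDecision forSpecLookup(long retryAfterSeconds) {
        return retryAfterSeconds <= 20 ? RetryDecision.SCHEDULE_ASYNC : RetryDecision.GIVE_UP;
    }

    public static void main(String[] args) {
        System.out.println(forDlExtraction(10)); // RETRY_INLINE
        System.out.println(forDlExtraction(30)); // GIVE_UP
    }
}
```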

Frontend UX:

  • SSE event triggers re-fetch automatically
  • User sees "Loading..." → eventually results appear (or "Error — try later")
  • No page reload, no lost context

Brave Search HTTP 429

Similar handling in BraveWebSearchService:

  • Throws RateLimitException(retryAfterSeconds, BRAVE)
  • Caught by ProductLookupService → triggers retry cascade

LLM Response Errors

Invalid JSON

Scenario: LLM returns malformed JSON (e.g., extra commas, missing quotes).

Handling:

  1. First attempt: Standard ObjectMapper.readValue()
  2. If that fails: tryParseJson() — a lenient fallback parser
  3. If still fails: Mark run FAILED (for DL) or fall back to next source (for lookup)
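The internals of tryParseJson() are not documented, but a lenient fallback typically does repairs of this kind. The sketch below is an assumption showing two common fixes, trimming non-JSON prose around the object and removing trailing commas, using only the standard library:

```java
// Illustrative repair pass in the spirit of tryParseJson(); the real
// implementation may differ. Two typical fixes are shown:
//   1. cut away prose before the first '{' and after the last '}'
//   2. drop trailing commas before '}' or ']'
class JsonRepair {
    static String repair(String raw) {
        int start = raw.indexOf('{');
        int end = raw.lastIndexOf('}');
        if (start < 0 || end <= start) return raw;      // nothing recoverable
        String json = raw.substring(start, end + 1);
        return json.replaceAll(",\\s*([}\\]])", "$1");  // remove trailing commas
    }

    public static void main(String[] args) {
        String bad = "Here is the result:\n{\"ram\": \"16 GB\", \"cpu\": \"i7\",}";
        System.out.println(repair(bad)); // {"ram": "16 GB", "cpu": "i7"}
    }
}
```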

Hallucinated URLs & IDs

Scenario: LLM invents fake icecatId or sourceUrl that doesn't exist.

Handling (UrlValidator):

  • resolveSourceUrl(llmUrl, braveResults) — validate LLM URL against actual Brave results
    • If match found: use it
    • Otherwise: fall back to top Brave URL
  • resolveIcecatId(llmId, braveResults) — check if ID appears in any Brave result
  • matchesExpectedPattern(url, sourceType) — regex validation per source type
    • ICECAT: icecat.biz/p/[name]-[id].html
    • GSMARENA: gsmarena.com/[name]-[id].php
    • FLATPANELSHD: flatpanelshd.com/[\w\-]+.php
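The per-source patterns above can be checked with plain java.util.regex. The exact regexes inside UrlValidator are not shown in the docs, so the ones below are reconstructions from the listed URL shapes; treat them as assumptions.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of matchesExpectedPattern(); regexes reconstructed from the
// documented URL shapes, not copied from UrlValidator.
class SourceUrlPatterns {
    static final Map<String, Pattern> PATTERNS = Map.of(
        "ICECAT",       Pattern.compile("https?://(www\\.)?icecat\\.biz/p/[\\w\\-]+-\\d+\\.html"),
        "GSMARENA",     Pattern.compile("https?://(www\\.)?gsmarena\\.com/[\\w\\-]+-\\d+\\.php"),
        "FLATPANELSHD", Pattern.compile("https?://(www\\.)?flatpanelshd\\.com/[\\w\\-]+\\.php")
    );

    static boolean matchesExpectedPattern(String url, String sourceType) {
        Pattern p = PATTERNS.get(sourceType);
        return p != null && p.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesExpectedPattern(
            "https://icecat.biz/p/lenovo-thinkpad-12345.html", "ICECAT")); // true
        System.out.println(matchesExpectedPattern(
            "https://example.com/fake.html", "ICECAT"));                   // false
    }
}
```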

Filler Values in QuickFacts

Scenario: LLM extracts "unbekannt", "-", "n/a", "unknown" as a specification value.

Handling (AbstractLlmExtractionClient):

Set<String> FILLER_VALUES = Set.of("unbekannt", "-", "n/a", "unknown", "not specified", ...);

// After JSON parsing, strip fillers:
quickFacts.entrySet().removeIf(e -> FILLER_VALUES.contains(e.getValue().toLowerCase()));

Benefit: Filler values don't count as "covered" in quality evaluation (GOOD/PARTIAL/EMPTY).

Inch-Mark JSON Errors

Scenario: LLM writes "24"" instead of "24 Zoll" (common with non-ASCII quotes).

Handling (AbstractLlmExtractionClient.sanitizeLlmOutput()):

// Before JSON parsing:
rawOutput = rawOutput.replaceAll("(\\d+)\"\"([,\\}])", "$1 Zoll\"$2");

Also documented in PRODUCT_NAME system prompt to prevent at the source:

"Verwende einfache ASCII-Leerzeichen zwischen Zahl und Einheit, nie Gänsefüßchen." ("Use plain ASCII spaces between number and unit, never quotation marks.")


Extraction Quality Evaluation

ExtractionQualityEvaluator assigns a grade to each lookup result:

| Grade | Condition | Action |
| --- | --- | --- |
| GOOD | ≥60% of SYSTEM fields populated + icecatId valid (for ICECAT) | Stop, persist COMPLETE |
| PARTIAL | >0% but <60% of SYSTEM fields | Try next source |
| EMPTY | 0% of SYSTEM fields | Try next source |
| FAILED_NO_CRITERIA | No SYSTEM fields configured for category | Try next source |

System fields: Named attributes (RAM, CPU, Display Size, etc.) from CategorySpecPreference. User fields: Search keywords ("OLED", "Core i7") — excluded from quality check.
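The grading rules reduce to a coverage ratio over SYSTEM fields. The sketch below is an assumption about the shape of ExtractionQualityEvaluator (its real signature is not documented), and it omits the ICECAT-specific icecatId validity check that GOOD additionally requires:

```java
// Sketch of the grading table; names and signature are illustrative.
// systemFieldsCovered counts non-filler SYSTEM fields the LLM filled,
// systemFieldsTotal comes from CategorySpecPreference.
class QualityGrade {
    static String grade(int systemFieldsCovered, int systemFieldsTotal) {
        if (systemFieldsTotal == 0) return "FAILED_NO_CRITERIA"; // nothing configured
        double coverage = (double) systemFieldsCovered / systemFieldsTotal;
        if (coverage >= 0.60) return "GOOD";    // stop, persist COMPLETE
        if (coverage > 0.0)   return "PARTIAL"; // try next source
        return "EMPTY";                         // try next source
    }

    public static void main(String[] args) {
        System.out.println(grade(5, 8)); // GOOD (62.5%)
        System.out.println(grade(2, 8)); // PARTIAL
        System.out.println(grade(0, 8)); // EMPTY
    }
}
```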


Cache & TTL Strategy

ProductLookup entity caches lookup results by lookupTerm:

COMPLETE (permanent)

  • Persisted once, never expires
  • Only deleted via Settings cleanup or manual SQL
  • Safe: product specs don't change post-release

FAILED (24h TTL)

  • Cached for 24 hours (configurable: AppConfig key product.lookup.failed.ttl.hours)
  • After expiry: next lookup request retries the multi-source loop
  • Handles: transient network blips, temporary source outages

ERROR (10min TTL)

  • Cached for 10 minutes (configurable: product.lookup.error.ttl.minutes)
  • Example: Jsoup timeout on HTML-fetch
  • After expiry: retry

NO_SOURCES (never cached)

  • Virtual status: no DB entry
  • Every call re-checks CategorySearchSource entries
  • Handles: dynamic category configuration changes

QUOTA_EXCEEDED (re-checked each call)

  • Checked against current quota at request time
  • Not persisted; status re-evaluated if period rolls over
  • Safe: quota limits reset on schedule (daily/monthly)

RATE_LIMITED (never cached)

  • Not persisted; async retry scheduled
  • Frontend receives SSE lookup-result event when retry completes
  • Handles: transient API overload
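The status-dependent TTLs above boil down to one expiry check. This is a sketch under stated assumptions: class and method names are invented, and the hard-coded 24h/10min values mirror the documented defaults rather than reading them from AppConfig.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the cache expiry rules; in Querchecker the TTLs are
// configurable (product.lookup.failed.ttl.hours / error.ttl.minutes).
class LookupCachePolicy {
    static boolean isExpired(String status, Instant persistedAt, Instant now) {
        switch (status) {
            case "COMPLETE": return false; // permanent, never expires
            case "FAILED":   return persistedAt.plus(Duration.ofHours(24)).isBefore(now);
            case "ERROR":    return persistedAt.plus(Duration.ofMinutes(10)).isBefore(now);
            default:         return true;  // NO_SOURCES / RATE_LIMITED etc.: always re-check
        }
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(isExpired("COMPLETE", t0, t0.plus(Duration.ofDays(365)))); // false
        System.out.println(isExpired("ERROR", t0, t0.plus(Duration.ofMinutes(11)))); // true
    }
}
```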

Quota & Limits

Provider Quotas

Configured per provider in application.yml:

querchecker:
  api:
    providers:
      brave:
        free-limit: 1000
        free-limit-period: MONTHLY
      groq:
        free-limit: 25000
        free-limit-period: DAILY
      icecat:
        free-limit: 0  # No quota (free access)

Quota Checks

QuotaService.checkQuota(provider) returns:

| Status | When | Action |
| --- | --- | --- |
| OK | Usage < 80% | Proceed |
| WARNING | 80% ≤ Usage < 100% | Proceed + show icon in Settings |
| BLOCKED | Usage ≥ 100% | Reject, return QUOTA_EXCEEDED |
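The thresholds can be sketched as a pure function. The names below are illustrative, not the real QuotaService signature, and treating free-limit 0 as "unmetered" is an assumption based on the icecat config comment above:

```java
// Sketch of the 80%/100% quota thresholds (illustrative names).
class QuotaCheck {
    static String checkQuota(long used, long freeLimit) {
        if (freeLimit <= 0) return "OK";    // 0 = unmetered provider (e.g. icecat)
        double usage = (double) used / freeLimit;
        if (usage >= 1.0) return "BLOCKED"; // reject, return QUOTA_EXCEEDED
        if (usage >= 0.8) return "WARNING"; // proceed + show icon in Settings
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(checkQuota(500, 1000));  // OK
        System.out.println(checkQuota(850, 1000));  // WARNING
        System.out.println(checkQuota(1000, 1000)); // BLOCKED
    }
}
```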

Monitoring

Settings → Usage Monitor:

  • Provider cards with current period usage
  • Call counts (this period + today)
  • Token budgets (IN / OUT)
  • Visual gradient bar (green → yellow → red)

DL Extraction Queue

Overflow Handling

Scenario: User rapidly clicks detail panel (5 listings in quick succession).

Queue Behavior:

  1. First extraction: INIT → queued
  2. Second click (400ms debounce): scheduling delayed, extraction still running
  3. Third click: new INIT run added to queue (now 2 waiting)
  4. Fourth click: new INIT run (now 3 waiting)
  5. Fifth click: exceeds limit (default 10) → pollLast() removes lowest-priority → marked CANCELLED + saved

Resolution: User re-opens the cancelled listing → openDetail() calls scheduleExtraction() → new INIT run created (retry).
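The overflow rule can be modeled with a bounded deque. This sketch assumes a simple ordering where the newest run sits at the head and the oldest (lowest-priority) at the tail; the real queue's priority ordering and the DlExtractionRun persistence are not shown here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the overflow rule: past the limit (default 10), pollLast()
// evicts the lowest-priority entry, which would then be marked CANCELLED
// and saved. Head-is-newest ordering is an assumption.
class ExtractionQueue {
    private final Deque<String> queue = new ArrayDeque<>();
    private final int limit;

    ExtractionQueue(int limit) { this.limit = limit; }

    /** Enqueues a run; returns the id of the evicted run, or null if none. */
    String enqueue(String runId) {
        queue.addFirst(runId);                               // newest = highest priority
        return queue.size() > limit ? queue.pollLast() : null; // evict lowest priority
    }

    public static void main(String[] args) {
        ExtractionQueue q = new ExtractionQueue(2);
        q.enqueue("run-1");
        q.enqueue("run-2");
        System.out.println(q.enqueue("run-3")); // run-1 (evicted, would be CANCELLED)
    }
}
```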

Duplicate Detection

existsByItemTextAndModelConfigAndStatusIn([DONE, INIT, PENDING]):

  • If found: skip (don't queue again)
  • Exception: CANCELLED status NOT in skip list → creates new INIT (retry)

Why? CANCELLED runs must be retryable without user intervention.


Server Restarts & SSE Reconnection

Soft Restart (No Page Reload)

Frontend:

  1. EventSourceServerService detects SSE close
  2. Calls health.notifyServerError() → polling switches to rapid 3s retry
  3. When backend comes back: SSE re-connects automatically
  4. Token validation: new token issued, SSE event updates frontend state
  5. MatSnackBar: "Server neugestartet — Verbindung wiederhergestellt" ("Server restarted, connection restored")

Benefit: User keeps context (scroll position, search filters, form state).

40-Second Stale Watchdog

Scenario: Network hangs (no data, no error).

Handling (EventSourceServerService):

  • SSE idle for >40s → close connection + reconnect
  • Prevents half-dead connections from blocking new messages

Database Resilience

Connection Pooling

Spring Boot's default HikariCP pool, with its stock settings:

  • Max pool size: 10
  • Idle timeout: 10 minutes
  • Connection timeout: 30 seconds

Transaction Handling

  • Read-only queries use @Transactional(readOnly = true) (hint to DB)
  • Write operations auto-wrapped in @Transactional
  • No manual rollback logic (Spring handles exceptions)

Enum Storage

Stored as VARCHAR (not a PostgreSQL native enum) → safer schema migrations (Flyway-compatible).

Example: ExtractionStatus enum:

INIT | PENDING | DONE | FAILED | NO_IMPLEMENTATION | RE_EVALUATE | CANCELLED
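The migration benefit follows from the column round-tripping through Enum.name()/valueOf(): adding a new constant needs no ALTER TYPE on the database side. With JPA this corresponds to @Enumerated(EnumType.STRING). A minimal sketch (the helper names are invented):

```java
// Why VARCHAR storage is migration-friendly: the value is just the
// constant's name, so new enum constants need no DB-side type change.
class EnumStorage {
    enum ExtractionStatus { INIT, PENDING, DONE, FAILED, NO_IMPLEMENTATION, RE_EVALUATE, CANCELLED }

    static String toColumn(ExtractionStatus s) { return s.name(); }          // → VARCHAR
    static ExtractionStatus fromColumn(String v) { return ExtractionStatus.valueOf(v); }

    public static void main(String[] args) {
        System.out.println(fromColumn(toColumn(ExtractionStatus.CANCELLED))); // CANCELLED
    }
}
```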

Logging & Observability

TRACE Level

Enable for troubleshooting:

export LOGGING_LEVEL_AT_QUERCHECKER=TRACE

AbstractLlmExtractionClient logs full interpolated prompts at TRACE level:

[TRACE] System Prompt: [full prompt text]
[TRACE] User Prompt: [full prompt text]

Useful for debugging LLM responses.

Structured Logging

All services log relevant context:

  • whListingId, listingId, lookupTerm, provider, retryAfterSeconds, etc.

Easier to trace cross-service request flows.


Monitoring Checklist for Ops

  • SSE stream health: Check backend logs for SseHub.broadcast() calls
  • API quota usage: Monitor Settings page or ApiUsageLog table
  • Rate limits: Alert when Retry-After > 5 minutes (provider overload)
  • Queue depth: Monitor DlExtractionRun rows with status = INIT (backlog size)
  • Cache hit rate: Compare COMPLETE / FAILED lookups (high reuse = good)
  • Error rate: Track FAILED extractions by model (degraded LLM quality?)
  • Database size: ProductLookup table grows over time (monitor retention)