How Querchecker handles failures gracefully. For ops teams and developers managing production deployments.
Scenario: Groq rejects request with HTTP 429 "Too Many Requests" (6000 TPM limit exceeded).
Backend Response:
- `AbstractLlmExtractionClient` catches `HttpClientErrorException` (status 429)
- Parses the `Retry-After` header → `RateLimitException(retryAfterSeconds, provider)`
- For DL extraction (`DlExtractionService`):
  - If `retryAfterSeconds` ≤ 20: `Thread.sleep(retryAfter + 500ms)` → retry once inline
  - If > 20: save with `FAILED` status
- For spec-lookup (`ProductLookupService`):
  - If `retryAfterSeconds` ≤ 20: schedule an async retry via `CompletableFuture.delayedExecutor()`; broadcast SSE `lookup-result` event with `RATE_LIMITED` status
  - If > 20: fall back to the `bestPartial` cached result (from the earlier Brave search) if available; if no cached result, save with `RATE_LIMITED` status
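The 429 branch above can be sketched as follows. `RateLimitException(retryAfterSeconds, provider)` mirrors the exception described in this document; the enclosing class name, the header-parsing helper, and the 60 s fallback default are illustrative assumptions, not the real implementation:

```java
import java.time.Duration;
import java.util.Map;

public class RateLimitHandling {

    // Mirrors RateLimitException(retryAfterSeconds, provider) from the docs.
    static class RateLimitException extends RuntimeException {
        final long retryAfterSeconds;
        final String provider;

        RateLimitException(long retryAfterSeconds, String provider) {
            super("Rate limited by " + provider + ", retry after " + retryAfterSeconds + "s");
            this.retryAfterSeconds = retryAfterSeconds;
            this.provider = provider;
        }
    }

    // Parse the Retry-After header (delta-seconds form); the 60 s default
    // for a missing/invalid header is an assumption for this sketch.
    static long parseRetryAfter(Map<String, String> headers) {
        String value = headers.get("Retry-After");
        if (value == null) return 60;
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return 60;
        }
    }

    // Decision rule from the docs: waits of <= 20 s are retried inline.
    static boolean shouldRetryInline(long retryAfterSeconds) {
        return retryAfterSeconds <= 20;
    }

    // Inline retry sleeps for retryAfter plus a 500 ms safety margin.
    static Duration inlineSleep(long retryAfterSeconds) {
        return Duration.ofSeconds(retryAfterSeconds).plusMillis(500);
    }
}
```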
Frontend UX:
- SSE event triggers re-fetch automatically
- User sees "Loading..." → eventually results appear (or "Error — try later")
- No page reload, no lost context
Similar handling in `BraveWebSearchService`:
- Throws `RateLimitException(retryAfterSeconds, BRAVE)`
- Caught by `ProductLookupService` → triggers the retry cascade
Scenario: LLM returns malformed JSON (e.g., extra commas, missing quotes).
Handling:
- First attempt: standard `ObjectMapper.readValue()`
- If that fails: `tryParseJson()` — try alternative parsing logic
- If it still fails: mark the run `FAILED` (for DL) or fall back to the next source (for lookup)
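As one illustration of what alternative parsing logic might do (the actual repair steps inside `tryParseJson()` aren't specified here), a common pre-parse fix for LLM output is stripping trailing commas before a second `ObjectMapper` pass:

```java
public class JsonRepair {
    // One plausible repair step before re-parsing: remove trailing commas
    // before object/array closers, a frequent LLM JSON error.
    // (Illustrative assumption; not the documented implementation.)
    static String stripTrailingCommas(String json) {
        return json.replaceAll(",\\s*([}\\]])", "$1");
    }
}
```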
Scenario: LLM invents fake icecatId or sourceUrl that doesn't exist.
Handling (`UrlValidator`):
- `resolveSourceUrl(llmUrl, braveResults)` — validate the LLM URL against actual Brave results
  - If a match is found: use it
  - Otherwise: fall back to the top Brave URL
- `resolveIcecatId(llmId, braveResults)` — check whether the ID appears in any Brave result
- `matchesExpectedPattern(url, sourceType)` — regex validation per source type:
  - ICECAT: `icecat.biz/p/[name]-[id].html`
  - GSMARENA: `gsmarena.com/[name]-[id].php`
  - FLATPANELSHD: `flatpanelshd.com/[\w\-]+.php`
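A minimal sketch of the per-source regex check, assuming patterns equivalent to those listed above (the production `UrlValidator` patterns may differ in detail):

```java
import java.util.Map;
import java.util.regex.Pattern;

public class UrlValidatorSketch {
    // Regexes mirroring the URL shapes listed above; illustrative, not
    // the exact production patterns.
    static final Map<String, Pattern> PATTERNS = Map.of(
        "ICECAT", Pattern.compile("icecat\\.biz/p/[\\w\\-]+-\\d+\\.html"),
        "GSMARENA", Pattern.compile("gsmarena\\.com/[\\w\\-]+-\\d+\\.php"),
        "FLATPANELSHD", Pattern.compile("flatpanelshd\\.com/[\\w\\-]+\\.php")
    );

    // True when the URL contains the expected per-source pattern.
    static boolean matchesExpectedPattern(String url, String sourceType) {
        Pattern p = PATTERNS.get(sourceType);
        return p != null && p.matcher(url).find();
    }
}
```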
Scenario: LLM extracts "unbekannt", "-", "n/a", "unknown" as a specification value.
Handling (AbstractLlmExtractionClient):
```java
Set<String> FILLER_VALUES = Set.of("unbekannt", "-", "n/a", "unknown", "not specified", ...);

// After JSON parsing, strip fillers:
quickFacts.entrySet().removeIf(e -> FILLER_VALUES.contains(e.getValue().toLowerCase()));
```

Benefit: Filler values don't count as "covered" in the quality evaluation (GOOD/PARTIAL/EMPTY).
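A runnable version of that filter with an abridged filler set; the `stripFillers` wrapper is an illustrative name, not the actual method:

```java
import java.util.Map;
import java.util.Set;

public class FillerFilterDemo {
    // Abridged filler set from the snippet above.
    static final Set<String> FILLER_VALUES =
        Set.of("unbekannt", "-", "n/a", "unknown", "not specified");

    // Removes filler entries in place (case-insensitive match).
    static Map<String, String> stripFillers(Map<String, String> quickFacts) {
        quickFacts.entrySet()
                  .removeIf(e -> FILLER_VALUES.contains(e.getValue().toLowerCase()));
        return quickFacts;
    }
}
```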
Scenario: LLM writes "24"" instead of "24 Zoll" ("Zoll" = German for inches; common with non-ASCII quotes).
Handling (`AbstractLlmExtractionClient.sanitizeLlmOutput()`):

```java
// Before JSON parsing:
rawOutput = rawOutput.replaceAll("(\\d+)\"\"([,\\}])", "$1 Zoll$2");
```

Also documented in the PRODUCT_NAME system prompt to prevent the issue at the source:

"Verwende einfache ASCII-Leerzeichen zwischen Zahl und Einheit, nie Gänsefüßchen." ("Use plain ASCII spaces between number and unit, never quotation marks.")
ExtractionQualityEvaluator assigns a grade to each lookup result:
| Grade | Condition | Action |
|---|---|---|
| GOOD | ≥60% SYSTEM fields populated + `icecatId` valid (for ICECAT) | Stop, persist COMPLETE |
| PARTIAL | >0% but <60% SYSTEM fields | Try next source |
| EMPTY | 0% SYSTEM fields | Try next source |
| FAILED_NO_CRITERIA | No SYSTEM fields configured for category | Try next source |
System fields: Named attributes (RAM, CPU, Display Size, etc.) from CategorySpecPreference.
User fields: Search keywords ("OLED", "Core i7") — excluded from quality check.
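The grading rules reduce to a small coverage calculation. This sketch uses illustrative names and omits the `icecatId` validity check that applies to ICECAT results:

```java
public class QualityGradeSketch {
    enum Grade { GOOD, PARTIAL, EMPTY, FAILED_NO_CRITERIA }

    // Coverage thresholds from the table above (60% / >0% / 0%);
    // the ICECAT icecatId check is omitted for brevity.
    static Grade evaluate(int systemFieldsConfigured, int systemFieldsPopulated) {
        if (systemFieldsConfigured == 0) return Grade.FAILED_NO_CRITERIA;
        double coverage = (double) systemFieldsPopulated / systemFieldsConfigured;
        if (coverage >= 0.60) return Grade.GOOD;
        if (coverage > 0.0) return Grade.PARTIAL;
        return Grade.EMPTY;
    }
}
```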
`ProductLookup` entity caches lookup results by `lookupTerm`, with per-status retention:

`COMPLETE` lookups:
- Persisted once, never expire
- Only deleted via Settings cleanup or manual SQL
- Safe: product specs don't change post-release

`FAILED` lookups:
- Cached for 24 hours (configurable: `AppConfig` key `product.lookup.failed.ttl.hours`)
- After expiry: the next lookup request retries the multi-source loop
- Handles: transient network blips, temporary source outages

Transient errors:
- Cached for 10 minutes (configurable: `product.lookup.error.ttl.minutes`)
- Example: Jsoup timeout on HTML fetch
- After expiry: retry

No configured source (virtual status, no DB entry):
- Every call re-checks `CategorySearchSource` entries
- Handles: dynamic category configuration changes

`QUOTA_EXCEEDED` lookups:
- Checked against the current quota at request time
- Not persisted; status re-evaluated when the period rolls over
- Safe: quota limits reset on schedule (daily/monthly)

`RATE_LIMITED` lookups:
- Not persisted; an async retry is scheduled
- Frontend receives the SSE `lookup-result` event when the retry completes
- Handles: transient API overload
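The per-status retention rules above can be sketched as a TTL lookup. The `FAILED`/`ERROR` status labels are inferred from the config keys (`product.lookup.failed.ttl.hours`, `product.lookup.error.ttl.minutes`), and the method shapes are assumptions:

```java
import java.time.Duration;
import java.time.Instant;

public class LookupTtlSketch {
    // TTLs from the docs: FAILED -> 24 h, transient errors -> 10 min,
    // everything else (e.g. COMPLETE) never expires.
    static Duration ttlFor(String status) {
        return switch (status) {
            case "FAILED" -> Duration.ofHours(24);
            case "ERROR" -> Duration.ofMinutes(10);
            default -> null; // no expiry
        };
    }

    // A cached entry is stale once its TTL (if any) has elapsed.
    static boolean isExpired(String status, Instant persistedAt, Instant now) {
        Duration ttl = ttlFor(status);
        return ttl != null && persistedAt.plus(ttl).isBefore(now);
    }
}
```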
Configured per provider in `application.yml`:

```yaml
querchecker:
  api:
    providers:
      brave:
        free-limit: 1000
        free-limit-period: MONTHLY
      groq:
        free-limit: 25000
        free-limit-period: DAILY
      icecat:
        free-limit: 0  # No quota (free access)
```

`QuotaService.checkQuota(provider)` returns:
| Status | When | Action |
|---|---|---|
| OK | Usage < 80% | Proceed |
| WARNING | 80% ≤ Usage < 100% | Proceed + show icon in Settings |
| BLOCKED | Usage ≥ 100% | Reject, return QUOTA_EXCEEDED |
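The threshold logic from the table as a sketch (illustrative class, not the real `QuotaService`); `free-limit: 0` is treated as "no quota", matching the icecat entry above:

```java
public class QuotaCheckSketch {
    enum QuotaStatus { OK, WARNING, BLOCKED }

    // 80% -> WARNING, 100% -> BLOCKED, per the table above.
    static QuotaStatus checkQuota(long used, long freeLimit) {
        if (freeLimit <= 0) return QuotaStatus.OK; // 0 = no quota configured
        double usage = (double) used / freeLimit;
        if (usage >= 1.0) return QuotaStatus.BLOCKED;
        if (usage >= 0.80) return QuotaStatus.WARNING;
        return QuotaStatus.OK;
    }
}
```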
Settings → Usage Monitor:
- Provider cards with current period usage
- Call counts (this period + today)
- Token budgets (IN / OUT)
- Visual gradient bar (green → yellow → red)
Scenario: User rapidly clicks detail panel (5 listings in quick succession).
Queue Behavior:
- First extraction: `INIT` → queued
- Second click (400 ms debounce): scheduling delayed, extraction still running
- Third click: new `INIT` run added to the queue (now 2 waiting)
- Fourth click: new `INIT` run (now 3 waiting)
- Fifth click: exceeds the limit (default 10) → `pollLast()` removes the lowest-priority run → marked `CANCELLED` + saved
Resolution: User re-opens the cancelled listing → openDetail() calls scheduleExtraction() → new INIT run created (retry).
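The bounded-queue behaviour can be approximated with a deque. The real queue is priority-ordered, so this FIFO sketch only mimics the `pollLast()` eviction step, where the caller would persist the evicted run as `CANCELLED`:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ExtractionQueueSketch {
    // In this FIFO approximation the newest entry is also the "last" one;
    // the real priority deque may evict a different (lowest-priority) run.
    private final Deque<String> queue = new ArrayDeque<>();
    private final int limit;

    ExtractionQueueSketch(int limit) {
        this.limit = limit;
    }

    // Returns the evicted run id (to be marked CANCELLED), or null.
    String offer(String runId) {
        queue.addLast(runId);
        return queue.size() > limit ? queue.pollLast() : null;
    }

    int depth() {
        return queue.size();
    }
}
```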
`existsByItemTextAndModelConfigAndStatusIn([DONE, INIT, PENDING])`:
- If found: skip (don't queue again)
- Exception: `CANCELLED` status is NOT in the skip list → a new `INIT` run is created (retry)
Why? CANCELLED runs must be retryable without user intervention.
Frontend:
- `EventSourceServerService` detects the SSE close
- Calls `health.notifyServerError()` → polling switches to a rapid 3 s retry
- When the backend comes back: SSE reconnects automatically
- Token validation: a new token is issued, an SSE event updates the frontend state
- MatSnackBar: "Server neugestartet — Verbindung wiederhergestellt" ("Server restarted — connection restored")
Benefit: User keeps context (scroll position, search filters, form state).
Scenario: Network hangs (no data, no error).
Handling (EventSourceServerService):
- SSE idle for >40s → close connection + reconnect
- Prevents half-dead connections from blocking new messages
Connection pooling uses Spring Boot's default HikariCP with sensible settings:
- Max pool size: 10
- Idle timeout: 10 minutes
- Connection timeout: 30 seconds
- Read-only queries use `@Transactional(readOnly = true)` (a hint to the DB)
- Write operations are auto-wrapped in `@Transactional`
- No manual rollback logic (Spring handles exceptions)
Status enums are stored as VARCHAR (not a PG native enum) → safer schema migrations (Flyway-compatible).
Example: `ExtractionStatus` enum:

```
INIT | PENDING | DONE | FAILED | NO_IMPLEMENTATION | RE_EVALUATE | CANCELLED
```

Enable for troubleshooting:

```shell
export LOGGING_LEVEL_AT_QUERCHECKER=TRACE
```

`AbstractLlmExtractionClient` logs full interpolated prompts at TRACE level:
```
[TRACE] System Prompt: [full prompt text]
[TRACE] User Prompt: [full prompt text]
```
Useful for debugging LLM responses.
All services log relevant context: `whListingId`, `listingId`, `lookupTerm`, `provider`, `retryAfterSeconds`, etc.
Easier to trace cross-service request flows.
- SSE stream health: check backend logs for `SseHub.broadcast()` calls
- API quota usage: monitor the Settings page or the `ApiUsageLog` table
- Rate limits: alert when `Retry-After` > 5 minutes (provider overload)
- Queue depth: monitor `DlExtractionRun` rows with status = INIT (how many waiting?)
- Cache hit rate: compare `COMPLETE`/`FAILED` lookups (high reuse = good)
- Error rate: track FAILED extractions by model (degraded LLM quality?)
- Database size: the `ProductLookup` table grows over time (monitor retention)