Conversation


@HardMax71 HardMax71 commented Jan 25, 2026


Summary by cubic

Typed core models and streamlined the rate limit service and metrics to reduce complexity and make admin usage stats consistent across algorithms. Updated Grafana, OpenAPI, and docs to match the new metrics and types.

  • Refactors

    • Rewrote RateLimitService with Pydantic TypeAdapter-based config, simpler Redis usage, and compact sliding window/token bucket logic.
    • Introduced typed models: EndpointUsageStats, ReplayError, ExecutionResultSummary, KeyStrategy for idempotency. Moved ResourceUsageDomain to events. ExecutionResultDomain.metadata is now EventMetadata.
    • DLQBatchRetryResponse.details now returns DLQRetryResult. Admin user rate limits current_usage returns EndpointUsageStats.
    • Metrics simplified to request/allowed/rejected/bypass counters with minimal attributes. Grafana dashboard updated accordingly.
    • Event statistics now use typed lists (EventTypeCount, ServiceEventCount) instead of dicts.
  • Migration

    • Import ResourceUsageDomain from app.domain.events; treat ExecutionResultDomain.metadata as EventMetadata; stop using model_dump() in producers/handlers.
    • Update clients: read EndpointUsageStats.remaining and algorithm in current_usage; parse DLQ retry details as DLQRetryResult; handle EventStatistics lists shape.
    • Use KeyStrategy enum instead of string values for idempotency key_strategy.
    • Remove references to deprecated metrics (check_duration, redis_duration, algorithm_duration, quota_usage, token_bucket_*). Update dashboards/alerts to use the new counters.
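As a sketch of what the KeyStrategy migration looks like for callers; the member names and helper below are illustrative, not necessarily the PR's actual values:

```python
from enum import Enum


class KeyStrategy(str, Enum):
    """Illustrative stand-in for the new idempotency key strategy enum."""
    EVENT_ID = "event_id"
    CONTENT_HASH = "content_hash"


def make_idempotency_key(strategy: KeyStrategy, event_id: str) -> str:
    # Callers now pass an enum member instead of a raw string, so an
    # unknown strategy fails at the type level rather than deep in Redis.
    return f"idempotency:{strategy.value}:{event_id}"


# Before: manager.check(key_strategy="event_id", ...)
# After:  manager.check(key_strategy=KeyStrategy.EVENT_ID, ...)
key = make_idempotency_key(KeyStrategy.EVENT_ID, "evt-123")
```

Because the enum mixes in str, existing serialized values (e.g. "content_hash") still round-trip via KeyStrategy("content_hash").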

Written for commit b8dda69. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Per-endpoint rate-limit usage shown with algorithm and remaining counts; dashboard panels simplified for clarity.
    • New "mark all as read" notification event and richer notification event types.
    • DLQ batch retry and replay errors return structured, typed details.
  • Bug Fixes

    • Execution metadata and replay error reporting made more consistent.
    • Pod/container status events now deliver structured status data (not raw strings).
  • Documentation

    • Rate-limiting docs updated with Usage Statistics and per-user override options.



coderabbitai bot commented Jan 25, 2026

📝 Walkthrough

Consolidates typing and domain models (ResourceUsageDomain moved; new ExecutionResultSummary, ReplayError, EndpointUsageStats), refactors rate-limit metrics and service, replaces DLQ response assembly with pydantic model_validate using DLQRetryResult, and modernizes annotations/typing across many modules.

Changes

  • DLQ & Schemas
    Files: backend/app/api/routes/dlq.py, backend/app/schemas_pydantic/dlq.py, docs/reference/openapi.json
    DLQBatchRetryResponse.details now typed as DLQRetryResult list; route constructs response via DLQBatchRetryResponse.model_validate(result, from_attributes=True); OpenAPI and frontend types updated.

  • Rate limit metrics & service
    Files: backend/app/core/metrics/rate_limit.py, backend/app/services/rate_limit_service.py, backend/app/domain/rate_limit/*.py, backend/grafana/provisioning/dashboards/rate-limiting-dashboard.json, docs/architecture/rate-limiting.md
    Metrics counters renamed to underscored internals and four new record_* methods added; RateLimitService refactored to use EndpointUsageStats, TypeAdapter-based config, and revised check/get_usage flows; Grafana dashboard and docs updated.

  • Domain model reorganization
    Files: backend/app/domain/events/typed.py, backend/app/domain/execution/models.py, backend/app/domain/events/event_models.py, backend/app/domain/replay/*.py, backend/app/domain/admin/*.py
    ResourceUsageDomain moved into events domain; added ExecutionResultSummary, ReplayError, EndpointUsageStats, EventTypeCount/ServiceEventCount; replay/admin models and repositories updated to use structured typed lists instead of raw dict maps.

  • Event bus & notification events
    Files: backend/app/services/event_bus.py, backend/app/services/notification_service.py
    EventBus switched to accept/publish DomainEvent/BaseEvent objects; NotificationService now publishes typed notification events (including NotificationAllReadEvent) with EventMetadata.

  • Idempotency key strategy enum
    Files: backend/app/domain/idempotency/models.py, backend/app/services/idempotency/*, backend/app/services/*/*consumer_wrappers*.py, backend/tests/**/idempotency/*.py
    Introduces KeyStrategy enum and replaces string key_strategy parameters/usages across idempotency manager, middleware, wrappers, workers, orchestrators, and tests.

  • Replay & event stats repository changes
    Files: backend/app/db/repositories/admin/admin_events_repository.py, backend/app/db/repositories/event_repository.py
    Aggregation outputs now return lists of typed count objects (EventTypeCount, ServiceEventCount) instead of dict maps; repository code builds those models.

  • Type hint modernization & annotations
    Files: many modules (e.g., backend/app/core/*, backend/app/schemas_pydantic/*, backend/app/services/*, backend/app/db/docs/*)
    Widespread migration from typing generics to built-in generics and PEP 604 unions; added from __future__ import annotations in core modules; many signature/type refinements.

  • Miscellaneous behavioral tweaks
    Files: backend/app/services/execution_service.py, backend/app/core/providers.py, backend/app/services/result_processor/processor.py, backend/app/services/pod_monitor/event_mapper.py
    Terminal execution metadata set to None (not {}); Redis teardown uses close(); idempotency default strategies switched to enum values; PodRunningEvent.container_statuses now a list of ContainerStatusInfo objects; result processor emits EventMetadata object instead of dumped dict.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement

Poem

🐰 I hopped through types with nimble paws,
Moved resource data to kinder laws,
Metrics learned new songs to sing,
DLQ replies now match the schema ring,
A tiny rabbit cheers these tidy paws!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 35.96%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title Check (❓ Inconclusive): the title "new types, updated rate limit service" is vague and generic, using non-descriptive terms that do not convey the main scope of this significant refactoring. Resolution: consider a more specific title that highlights the primary change, such as "Refactor rate limit service with typed models and simplified metrics" or "Introduce typed domain models and modernize rate limit implementation".

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.





@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 32 files


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/services/sse/redis_bus.py (1)

37-41: Use aclose() instead of deprecated close() for PubSub cleanup.

In redis-py 7.1.0, PubSub.close() is deprecated and marked with "Use aclose() instead". Although close() internally calls aclose() (so it still works), the code should use await self._pubsub.aclose() to follow current best practices and align with how the asyncio Redis client handles cleanup.

backend/app/domain/execution/models.py (1)

35-43: Metadata field is not persisted to the database and will be silently lost.

Line 42 changes ExecutionResultDomain.metadata to EventMetadata | None, but the write_terminal_result method (backend/app/db/repositories/execution_repository.py, line 48) does not persist the metadata field. While resource_usage is correctly serialized with .model_dump() and error_type is set, metadata is completely omitted from the database update. This causes metadata to be silently discarded whenever an ExecutionResultDomain is saved. Either add metadata persistence to write_terminal_result (which requires adding a metadata field to ExecutionDocument) or remove the field from ExecutionResultDomain if it's not intended for storage.

🤖 Fix all issues with AI agents
In `@backend/app/core/providers.py`:
- Around line 118-121: The async Redis client cleanup uses the deprecated
close() alias; update the finally block that currently does "await
client.close()" to call "await client.aclose()" instead so the
redis.asyncio.Redis client is closed with the modern API (locate the provider
that yields client and replace the await client.close() call with await
client.aclose()).
🧹 Nitpick comments (3)
backend/app/domain/replay/models.py (1)

11-19: Consider using datetime type for timestamp field.

The timestamp field is typed as str but stores ISO-formatted datetime strings. Other models in this file (e.g., ReplaySessionState.created_at, started_at, completed_at) use the datetime type directly. Using str here creates inconsistency and loses type safety benefits.

♻️ Suggested refactor
 class ReplayError(BaseModel):
     """Error details for replay operations."""

     model_config = ConfigDict(from_attributes=True)

-    timestamp: str
+    timestamp: datetime
     error: str
     type: str | None = None  # Present for session-level errors
     event_id: str | None = None  # Present for event-level errors

This would also require updating the call sites in replay_service.py to pass datetime objects directly instead of calling .isoformat().

backend/app/services/event_replay/replay_service.py (1)

174-177: Consider including type for event-level errors as well.

For consistency with _handle_session_error (line 149) and improved debugging, consider populating the type field with the exception class name.

♻️ Suggested refactor
         err = ReplayError(
-            timestamp=datetime.now(timezone.utc).isoformat(), event_id=str(event.event_id), error=str(error)
+            timestamp=datetime.now(timezone.utc).isoformat(),
+            event_id=str(event.event_id),
+            error=str(error),
+            type=type(error).__name__,
         )
backend/app/settings.py (1)

124-124: Document the requirement or hardcode the Redis decode setting.

Using Literal[True] is supported in pydantic-settings 2.5 and enforces the value at both type-check and runtime. However, this introduces a validation-time breaking change: any existing .env with REDIS_DECODE_RESPONSES=false will fail on startup.

Since decode_responses=True appears required by the Redis client instantiation in backend/core/providers.py, either:

  1. Add a comment explaining why it must be True (e.g., downstream code assumes decoded strings).
  2. Remove it as a configurable setting and hardcode decode_responses=True directly in the Redis client setup.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/infrastructure/kafka/mappings.py (1)

38-45: Add missing NOTIFICATION_ALL_READ event type to mapping.

The EventType.NOTIFICATION_ALL_READ is defined in the enum (line 44 of events.py) but not included in EVENT_TYPE_TO_TOPIC. This causes it to fall back to SYSTEM_EVENTS (line 99), inconsistent with all other NOTIFICATION_* events that route to NOTIFICATION_EVENTS.

Proposed fix
     EventType.NOTIFICATION_READ: KafkaTopic.NOTIFICATION_EVENTS,
+    EventType.NOTIFICATION_ALL_READ: KafkaTopic.NOTIFICATION_EVENTS,
     EventType.NOTIFICATION_CLICKED: KafkaTopic.NOTIFICATION_EVENTS,
🤖 Fix all issues with AI agents
In `@backend/app/db/repositories/admin/admin_events_repository.py`:
- Line 146: The list comprehension building events_by_type can crash on unknown
DB values because EventType(t["_id"]) raises for invalid enums; update the logic
in the events_by_type construction (where EventTypeCount and EventType are used
with top_types) to safely convert or skip unknowns — e.g., check
EventType._value2member_map_.get(t["_id"]) (or use a try/except around
EventType(...)) and only create EventTypeCount for valid enum members, skipping
or logging legacy/unknown types so the stats endpoint doesn't fail.

In `@backend/app/db/repositories/event_repository.py`:
- Around line 247-254: The previously suggested membership check against
EventType.__members__.values() risks filtering incorrectly; check against
the enum's string values instead (e.g. k in [e.value for e in EventType])
before calling EventType(k), or simply rely on the existing Pydantic
validation (EventDocument(**data)) and leave the current EventType(k)
conversion in the events_by_type mapping as-is. Either way, update the
events_by_type logic (and any related EventTypeCount construction) to use a
correct membership check so invalid strings are filtered rather than
raising.
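Both repository fixes above come down to the same move: converting raw aggregation strings to EventType members without raising on unknown values. A minimal stdlib sketch (the enum members here are an illustrative subset, not the project's real ones):

```python
from __future__ import annotations

from enum import Enum


class EventType(str, Enum):
    # Illustrative subset of the real enum.
    EXECUTION_STARTED = "execution_started"
    EXECUTION_COMPLETED = "execution_completed"


def to_event_type(raw: str) -> EventType | None:
    """Return the matching member, or None for legacy/unknown values."""
    # _value2member_map_ maps values to members without raising,
    # unlike EventType(raw), which throws ValueError on unknowns.
    return EventType._value2member_map_.get(raw)


rows = [{"_id": "execution_started", "count": 5}, {"_id": "legacy", "count": 2}]
counts = [
    (member, row["count"])
    for row in rows
    if (member := to_event_type(row["_id"])) is not None
]
```

Unknown or legacy values are simply skipped, so the stats endpoint keeps working when old documents remain in the database.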

In `@backend/app/services/user_settings_service.py`:
- Around line 72-76: Replace the wildcard subscription pattern with the exact
event type: update the call to bus.subscribe currently using
f"{EventType.USER_SETTINGS_UPDATED}*" to subscribe to
EventType.USER_SETTINGS_UPDATED directly; keep the existing _handle function and
its isinstance(evt, UserSettingsUpdatedEvent) check and ensure the returned
subscription id is still assigned to self._subscription_id and that _handle
continues to call self.invalidate_cache(evt.user_id) as before.

In `@docs/reference/openapi.json`:
- Around line 8494-8505: The examples for EventStatistics and related schemas
use uppercase EventType strings that don't match the EventType enum (which uses
lowercase); update all example values in the EventStatistics examples and any
EventTypeCountSchema/ServiceEventCountSchema example objects to use the
lowercase enum values (e.g., "request", "response", etc.) so generated SDKs and
clients get correct casing; search for examples referencing EventType in
EventStatistics, EventTypeCountSchema and ServiceEventCountSchema and replace
uppercase strings with the corresponding lowercase enum values consistently
across the file.

In `@frontend/src/lib/api/types.gen.ts`:
- Around line 3676-3680: The type for the field container_statuses has changed
from string to Array<ContainerStatusInfo>, so update any code that reads or
assigns this property to handle an array of ContainerStatusInfo objects rather
than a string; locate usages of container_statuses and adjust parsing/iteration
logic (e.g., map/filter loops, type annotations, and any JSON parsing) and
update any variable/type declarations or function signatures that consume it to
expect ContainerStatusInfo[] instead of string so callers of the affected types
(including functions referencing ContainerStatusInfo or container_statuses)
compile and handle the new structure correctly.
♻️ Duplicate comments (1)
backend/app/core/providers.py (1)

118-121: Use aclose() instead of close() for async Redis client cleanup.

This was previously flagged: close() is deprecated since redis-py 5.0.1 and is an alias for aclose(). The code should use the modern API.

🔧 Suggested fix
         try:
             yield client
         finally:
-            await client.close()
+            await client.aclose()
🧹 Nitpick comments (4)
backend/tests/e2e/services/rate_limit/test_rate_limit_service.py (1)

141-142: Consider testing via public API rather than private attributes.

Asserting on _requests (a private attribute by Python convention) couples this test to internal implementation details. If the metrics internals are refactored, this test could break even though the public behavior remains correct.

Since the metrics refactor introduced public recording methods (record_request, etc.), consider verifying wiring by invoking the public API or simply removing this assertion if the other tests already exercise the metrics path.

♻️ Alternative approach
-    # Verify metrics object has _requests counter for checks
-    assert hasattr(svc.metrics, '_requests')
+    # Verify metrics object exposes public recording API
+    assert callable(getattr(svc.metrics, 'record_request', None))

Or remove entirely if metric recording is already validated by other tests.

backend/app/schemas_pydantic/user_settings.py (1)

20-20: Avoid mutable default list for channels.
A list literal can be shared/mutated across instances; prefer a default_factory for isolation.

♻️ Suggested change
-    channels: list[NotificationChannel] = [NotificationChannel.IN_APP]
+    channels: list[NotificationChannel] = Field(default_factory=lambda: [NotificationChannel.IN_APP])
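The hazard behind this nitpick is the classic shared-mutable-default pitfall, shown here with plain stdlib constructs (the function and class names are hypothetical, not from the PR):

```python
from dataclasses import dataclass, field


def add_channel(channel: str, channels: list[str] = []) -> list[str]:
    # Anti-pattern: the default list is created once at definition time,
    # so every call without an explicit argument mutates the same object.
    channels.append(channel)
    return channels


@dataclass
class Prefs:
    # default_factory builds a fresh list per instance, the same isolation
    # Field(default_factory=...) provides in pydantic.
    channels: list[str] = field(default_factory=lambda: ["in_app"])
```

Calling add_channel("in_app") and then add_channel("email") yields ["in_app", "email"] on the second call even though only one element was passed; two Prefs() instances, by contrast, never share state.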
backend/tests/e2e/services/events/test_event_bus.py (1)

28-28: Prefer explicit enum values for the pattern/key.
Avoids reliance on Enum __str__/encode behavior and keeps the intent unambiguous.

♻️ Suggested change
-    await bus.subscribe(f"{EventType.USER_SETTINGS_UPDATED}*", handler)
+    await bus.subscribe(f"{EventType.USER_SETTINGS_UPDATED.value}*", handler)
...
-            key=EventType.USER_SETTINGS_UPDATED.encode("utf-8"),
+            key=EventType.USER_SETTINGS_UPDATED.value.encode("utf-8"),

Also applies to: 45-45

backend/app/services/rate_limit_service.py (1)

310-344: Consider handling missing rules in token bucket stats.

For sliding window stats (line 332-337), when no matching rule is found, limit is set to 0, resulting in remaining = max(0, 0 - count) = 0. However, for token bucket stats (lines 338-344), the bucket's stored tokens are used directly without considering whether a rule exists.

This inconsistency could be intentional (token bucket stores actual remaining tokens), but consider adding a comment explaining the difference, or returning None for endpoints without matching rules if that's more appropriate.
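The asymmetry exists because the two algorithms store different state: the sliding window derives "remaining" from a rule's limit minus a timestamp count, while the token bucket stores the remaining budget directly. An in-memory sketch of both (the real service keeps this state in Redis; class and method names are illustrative):

```python
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit: int, window: float) -> None:
        self.limit, self.window = limit, window
        self._hits: deque[float] = deque()

    def _evict(self, now: float) -> None:
        # Drop timestamps that fell out of the window.
        while self._hits and self._hits[0] <= now - self.window:
            self._hits.popleft()

    def allow(self, now: float) -> bool:
        self._evict(now)
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True

    def remaining(self, now: float) -> int:
        # Derived: the rule's limit minus the observed count,
        # so a missing rule (limit 0) yields remaining 0.
        self._evict(now)
        return max(0, self.limit - len(self._hits))


class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float) -> None:
        self.capacity, self.refill_rate = capacity, refill_rate
        self.tokens, self.updated = float(capacity), 0.0

    def _refill(self, now: float) -> None:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now

    def allow(self, now: float) -> bool:
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def remaining(self, now: float) -> int:
        # The stored token count is itself the remaining budget; no rule
        # lookup happens, which is the inconsistency noted above.
        self._refill(now)
        return int(self.tokens)
```

Timestamps are passed in explicitly to keep the sketch deterministic; a real implementation would use a clock and, per the PR, Redis-side atomicity.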


@HardMax71 HardMax71 merged commit 5e71fbf into main Jan 26, 2026
24 checks passed
@HardMax71 HardMax71 deleted the fix/more-typed-objs branch January 26, 2026 17:37