
[DRAFT] Cache organization usage counts for limit enforcement #3528

@dkrizan

Description


1. Problem

When a user saves a translation (empty → text), StringKeysLimitListeners runs an org-wide COUNT(DISTINCT ...) query to check if the organization is within its plan limits. This query joins translation → key → project across the entire organization.

Even with proper indexes, this query is inherently O(org size) and can slow down requests for large organizations. Running a full count on every translation or key creation is inefficient — the limit check should be O(1).

Today's baseline: every translation/key creation pays for a full org-wide count query in the request path. This affects the PUT /v2/projects/{projectId}/translations endpoint, key creation, and imports.

2. Appetite

2 weeks for one developer.

The incremental tracking logic already exists (getUsageIncreaseAmount() returns +1/-1 correctly). The work is: build a cached count layer, wire it into the existing listeners, add background reconciliation, and test edge cases (imports, deletions, branching).

3. Solution

Cached usage counts (Redis + DB fallback)

Store the current key count and translation count per organization in Redis. The limit check reads from Redis instead of running the COUNT query.

Element 1: Redis counter per organization

  • Keys: org:{orgId}:key_count, org:{orgId}:translation_count
  • On translation/key create: INCR
  • On translation/key delete: DECR
  • The existing getUsageIncreaseAmount() already calculates +1/-1 correctly — wire it to Redis instead of (or before) the DB count
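The increment path can be sketched as follows. This is a hypothetical sketch, not Tolgee's actual code: `OrgUsageCounter` and its method names are invented, and a `ConcurrentHashMap` of `AtomicLong`s stands in for Redis, whose INCR/DECR are likewise atomic per key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: per-organization counters keyed the same way as the
// proposed Redis keys (org:{orgId}:key_count, org:{orgId}:translation_count).
// The in-memory map stands in for Redis; each addAndGet models one INCRBY.
public class OrgUsageCounter {
    private final Map<String, AtomicLong> store = new ConcurrentHashMap<>();

    private String key(long orgId, String kind) {
        return "org:" + orgId + ":" + kind + "_count";
    }

    // Apply the delta from getUsageIncreaseAmount(): +1 on create, -1 on delete.
    public long applyDelta(long orgId, String kind, long delta) {
        return store.computeIfAbsent(key(orgId, kind), k -> new AtomicLong())
                    .addAndGet(delta);
    }

    public long get(long orgId, String kind) {
        AtomicLong v = store.get(key(orgId, kind));
        return v == null ? 0 : v.get();
    }
}
```

With real Redis the same shape maps to `INCRBY org:{orgId}:translation_count 1` on create and `-1` on delete.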

Element 2: Limit check path

  • StringKeysLimitListeners.onActivity() reads from Redis instead of organizationStatsService.getTranslationCount()
  • If Redis is unavailable, fall back to allowing the operation (performance > exactness)
  • The transaction-scoped bean pattern stays — within one transaction, the Redis read happens once, then increments are tracked in memory
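The check itself reduces to a few lines. A hypothetical sketch of the fail-open behavior described above; `LimitChecker` and `LimitExceededException` are invented names, and the `LongSupplier` stands in for the Redis read:

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of the check in StringKeysLimitListeners.onActivity():
// read the cached count instead of running the org-wide COUNT query; if the
// cache (Redis) is unavailable, allow the operation rather than block it.
public class LimitChecker {
    public static class LimitExceededException extends RuntimeException {}

    /**
     * @param cachedCount reads the Redis counter; may throw when Redis is down
     * @param increase    delta of the current operation (getUsageIncreaseAmount())
     * @param planLimit   maximum allowed count for the organization's plan
     */
    public static void checkLimit(LongSupplier cachedCount, long increase, long planLimit) {
        long current;
        try {
            current = cachedCount.getAsLong();
        } catch (RuntimeException redisUnavailable) {
            return; // fail open: performance > exactness
        }
        if (current + increase > planLimit) {
            throw new LimitExceededException();
        }
    }
}
```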

Element 3: Background reconciliation

  • A scheduled job runs periodically (e.g., every 5-10 minutes)
  • Executes the actual COUNT query per organization (can be spread out, not all at once)
  • Compares result to Redis value and corrects if there's drift
  • This ensures the cache doesn't drift far from reality over time
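The per-organization reconciliation step can be sketched like this. `Reconciler` is an invented name; the suppliers stand in for the authoritative COUNT query and the Redis read, and the consumer for a Redis SET:

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical sketch of reconciliation for one organization: run the real
// COUNT query, compare it to the cached value, and overwrite the cache when
// they have drifted. Returns the drift so it can be logged and monitored.
public class Reconciler {
    public static long reconcile(LongSupplier dbCount, LongSupplier cachedCount,
                                 LongConsumer cacheWriter) {
        long actual = dbCount.getAsLong();     // authoritative COUNT(DISTINCT ...) result
        long cached = cachedCount.getAsLong(); // current Redis value
        long drift = cached - actual;
        if (drift != 0) {
            cacheWriter.accept(actual);        // correct the cache (e.g. Redis SET)
        }
        return drift;
    }
}
```

Persistent nonzero drift in the logs would point at a missed increment path (e.g. a deletion cascade that bypasses the listeners).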

Element 4: Cache initialization

  • On first access (cache miss in Redis), run the COUNT query once and store the result
  • After that, only incremental updates
  • On application startup or Redis flush, counts get lazily re-initialized on first access
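Lazy initialization is essentially a compute-if-absent over the counter key. A hypothetical sketch (invented class name, in-memory map standing in for Redis; with real Redis a SETNX-style "set if absent" would play the same role):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch of lazy initialization: on a cache miss the expensive
// COUNT query runs once to seed the counter; afterwards reads hit the cache.
public class LazyCountCache {
    private final Map<String, Long> cache = new ConcurrentHashMap<>(); // Redis stand-in
    private int dbQueries = 0; // exposed for the sketch: how often the COUNT ran

    public long getCount(String key, LongSupplier expensiveDbCount) {
        return cache.computeIfAbsent(key, k -> {
            dbQueries++;
            return expensiveDbCount.getAsLong(); // runs only on a miss
        });
    }

    public int dbQueryCount() { return dbQueries; }
}
```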

Element 5: Usage reporting (StringsKeysUsageListener)

  • The usage storing listener can also read from Redis for the count it stores in UsageEvent
  • Or it can continue using the DB count since it only fires for ADD/DEL (less frequent)

4. Rabbit holes

  • Imports: A large import may create thousands of translations in one batch. The existing EntityPreCommitEvent fires per entity, so Redis would receive thousands of individual INCRs. Redis handles that throughput fine, but the existing StringKeysLimitListeners is transaction-scoped and caches the base count, so the Redis read should also be cached per transaction: read once at the start, increment in memory, and write the final delta to Redis at commit.

  • Multi-pod consistency: Redis is shared across pods, so INCR/DECR is atomic and consistent. No pod-local cache divergence.

  • Transaction rollback: If a transaction increments Redis but then rolls back, the Redis count will be wrong until reconciliation fixes it. The safer option is @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT), so Redis is only updated after a successful commit.

  • Branching count correctness: The cached count should reflect the corrected query from #3527 ("Organization stats count queries are slow and incorrectly count branch keys"), which excludes branch keys when branching is disabled. The reconciliation job must use that corrected query.
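The two transaction-related points above (caching the base count per transaction, and touching Redis only after a successful commit) can be sketched together. Everything here is hypothetical: `TxScopedCount` is an invented stand-in for the transaction-scoped bean, a `long[]` stands in for Redis, and `commit()`/`rollback()` model AFTER_COMMIT semantics.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical sketch of the transaction-scoped pattern: read the base count
// from Redis once, accumulate the delta in memory while the transaction runs,
// and flush a single INCRBY on commit. On rollback nothing is written, so the
// cache never sees aborted work (rollback drift cannot occur).
public class TxScopedCount {
    private final long base;            // single Redis read at transaction start
    private long delta = 0;             // in-memory, per transaction
    private final LongConsumer redisIncrBy;

    public TxScopedCount(LongSupplier redisRead, LongConsumer redisIncrBy) {
        this.base = redisRead.getAsLong();
        this.redisIncrBy = redisIncrBy;
    }

    public void track(long increase) { delta += increase; } // e.g. +1 per created translation

    public long currentCount() { return base + delta; }     // what the limit check reads

    public void commit() {                                   // AFTER_COMMIT: one INCRBY
        if (delta != 0) redisIncrBy.accept(delta);
    }

    public void rollback() { /* intentionally nothing: Redis stays untouched */ }
}
```

A 10,000-row import then costs one Redis read and one Redis write per transaction instead of 10,000 INCRs.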

5. No-gos

  • No real-time exact counts — the cache may be slightly off between reconciliation runs, and that's acceptable
  • No per-project counts — only org-level counts are cached (that's what the limit check needs)
  • No Redis Cluster setup — use the existing Redis instance
  • No migration of historical usage events — only the live count cache

6. Success criteria

  • PUT /translations (empty → text) limit check becomes O(1) regardless of org size
  • Limit enforcement still blocks users who exceed their plan (within a small margin)
  • No regression in billing accuracy (reconciliation keeps counts aligned)

Labels: pitch (the pitch according to the Shape Up approach)