1. Problem
Today, every AI translation is a stateless one-shot: one key, one target language, one prompt, one LLM call. The model gets a fixed template with whatever context we inject (project description, TM suggestions, glossary) and returns a single translation. No reasoning, no tool use, no ability to look things up.
This creates three concrete problems:
- Quality ceiling. The model cannot search for related translations, check existing patterns, or look up terminology beyond what we pre-inject. For large projects, the fixed context window is not enough to capture everything relevant.
- Speed and cost at scale. Auto-translating 50 keys into 10 languages means 500 independent LLM calls. Each call repeats the same project context, same language notes, same glossary. Batching keys and languages into fewer calls would cut both latency and token spend.
- No self-correction. The model cannot verify its own output. If it produces a translation that violates QA checks (length, placeholders, formatting), we only catch that after the fact. An agentic loop could run QA checks as tools and fix violations before committing.
2. Appetite
4 weeks, one developer. This is R&D: the goal is a working prototype and validated approach, not a production-ready feature. If any research area turns out to be a dead end, we document findings and move on. We do not need to deploy anything to customers.
3. Solution
Four research areas, ordered by expected impact:
3.1 Tool-calling (agentic translation)
Give the LLM tools it can call to fetch additional context from the platform before (or during) translation:
- search_translations - search existing translations in the project (similar to TM but model-driven)
- get_key_context - fetch screenshots, descriptions, tags for a key
- search_glossary - look up specific terms in the project glossary
- run_qa_check - validate a candidate translation against QA rules (length, placeholders, ICU format)
The model decides which tools to call based on the translation task. For simple strings it may call none. For complex strings with domain terminology, it may search the glossary and existing translations first.
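The four tools above could be declared roughly as follows. This is a sketch in the JSON-schema shape that tool-use APIs (e.g. Anthropic's `tools` parameter) expect; the parameter names are illustrative, not a finalized contract:

```typescript
// Sketch: tool declarations for the agentic translation loop.
// Parameter names (query, key_id, term, translation) are assumptions.
interface ToolDef {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, object>;
    required: string[];
  };
}

const translationTools: ToolDef[] = [
  {
    name: "search_translations",
    description:
      "Search existing translations in the project for similar source strings.",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string" },
        target_language: { type: "string" },
      },
      required: ["query", "target_language"],
    },
  },
  {
    name: "get_key_context",
    description: "Fetch screenshots, descriptions, and tags for a key.",
    input_schema: {
      type: "object",
      properties: { key_id: { type: "string" } },
      required: ["key_id"],
    },
  },
  {
    name: "search_glossary",
    description: "Look up specific terms in the project glossary.",
    input_schema: {
      type: "object",
      properties: { term: { type: "string" } },
      required: ["term"],
    },
  },
  {
    name: "run_qa_check",
    description:
      "Validate a candidate translation against QA rules (length, placeholders, ICU format).",
    input_schema: {
      type: "object",
      properties: {
        key_id: { type: "string" },
        translation: { type: "string" },
      },
      required: ["key_id", "translation"],
    },
  },
];
```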
Research questions: How much does tool-calling slow down translation? Thinking tokens are required for tool use, so measure the latency/cost trade-off. Can we offer this as an optional "high quality" mode? Also research the token economics: measure per-key cost for agentic vs. one-shot translation, model the impact on our credit system, and evaluate whether tokens should be treated more as a customer acquisition cost for larger clients.
3.2 QA-check feedback loop
After the model produces a translation, run QA checks automatically and feed violations back as tool results. The model can then fix the translation and resubmit.
- Catches placeholder mismatches, length violations, ICU format errors before they reach the user
- The existing QA check infrastructure can be exposed as a tool
- Limit to 1-2 correction rounds to prevent infinite loops
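The bounded correction loop can be sketched as follows. `runQaChecks` and `askModelToFix` are hypothetical stand-ins for the existing QA infrastructure and the LLM call, injected so the loop itself stays testable:

```typescript
// Sketch of the QA feedback loop with a hard round limit.
// The function signatures here are assumptions, not the real service API.
type QaViolation = { rule: string; message: string };

function correctWithQa(
  initial: string,
  runQaChecks: (translation: string) => QaViolation[],
  askModelToFix: (translation: string, violations: QaViolation[]) => string,
  maxRounds = 2, // limit to 1-2 correction rounds to prevent infinite loops
): { translation: string; violations: QaViolation[] } {
  let translation = initial;
  let violations = runQaChecks(translation);
  for (let round = 0; round < maxRounds && violations.length > 0; round++) {
    translation = askModelToFix(translation, violations);
    violations = runQaChecks(translation);
  }
  // Any violations still present after maxRounds surface to the user as today.
  return { translation, violations };
}
```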
Research question: Do current Opus/Sonnet models reliably fix QA violations when given structured feedback? What percentage of violations get auto-resolved?
3.3 Multi-language single prompt
Instead of one LLM call per target language, send all target languages in a single prompt for auto-translation. Include all language notes in the prompt and ask the model to return translations for all languages at once.
- Modern context windows (up to 1M tokens on recent Claude models) easily fit all language notes + source text for any project
- The model sees all languages together, which can improve cross-language consistency
- For auto-translation of a single key into N languages, this turns N calls into 1
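A multi-language response needs a completeness check: if the model silently skips a language, that language should fall back to a single-language call. A minimal sketch, assuming the response is shaped as `{ [languageCode]: translation }` (that shape is an assumption, not a finalized contract):

```typescript
// Sketch: find requested languages that are missing or empty in a
// multi-language response, so the caller can fall back per-language.
function missingLanguages(
  requested: string[],
  response: Record<string, string>,
): string[] {
  return requested.filter(
    (lang) => typeof response[lang] !== "string" || response[lang].length === 0,
  );
}
```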
Research question: Does translation quality degrade when asking for multiple languages at once? Measure against single-language baseline on a test set.
3.4 Key batching
When a user imports many keys (or auto-translates a batch), group them into chunks of 20-50 keys per LLM call instead of translating one at a time.
- Keys from the same import are usually related (same feature, same screen), so batching preserves context naturally
- The model does one research pass for the batch instead of repeating context for each key
- Combined with multi-language: a batch of 30 keys x 10 languages could be 1 call instead of 300
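The batching arithmetic can be sketched directly; batch size is the tuning knob, and 20-50 is the starting range:

```typescript
// Sketch: split imported keys into per-call batches.
function chunkKeys<T>(keys: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    batches.push(keys.slice(i, i + batchSize));
  }
  return batches;
}

// Combined with multi-language prompts, call count depends only on the
// number of batches, not on the number of target languages.
function estimateCalls(keyCount: number, batchSize: number): number {
  return Math.ceil(keyCount / batchSize);
}
```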
Research question: What is the optimal batch size? Too large and the model starts dropping or confusing keys. Find the sweet spot for accuracy vs. throughput.
4. Rabbit Holes
- Thinking tokens are mandatory for tool use. We currently disable thinking for speed. Enabling it will increase latency. The batching and multi-language optimizations should offset this by reducing total call count. Measure net effect.
- Structured output parsing. Multi-language and batched responses need reliable JSON parsing. Use structured output / tool_use response format rather than free-text parsing. If the model returns malformed output, fall back to single-key translation.
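The fallback path above can be sketched as a defensive parser: any malformed or incomplete batch response returns null, signalling the caller to retry key-by-key. The response shape (`{ [keyId]: translation }`) is an assumption:

```typescript
// Sketch: parse a structured batch response; null means "fall back to
// single-key translation" rather than trusting partial output.
function parseBatchResponse(
  raw: string,
  expectedKeys: string[],
): Record<string, string> | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return null;
  }
  const record = parsed as Record<string, unknown>;
  const result: Record<string, string> = {};
  for (const key of expectedKeys) {
    const value = record[key];
    if (typeof value !== "string") return null; // a dropped key fails the whole batch
    result[key] = value;
  }
  return result;
}
```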
- Don't over-engineer the tool set. Start with 2-3 simple tools (search, QA check). We can always add more later. The research should validate the pattern, not build a comprehensive toolkit.
- Existing prompt system is reusable. Stepan designed PromptService to be general-purpose. Tool-calling extends it; it does not replace it. Handlebars templates remain the prompt definition layer.
- Parallelization of batch jobs. We may not be hitting Anthropic/OpenAI rate limits. Check current tier limits and increase parallelism if headroom exists. This is an easy win independent of the agentic work.
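Raising parallelism safely needs a concurrency cap that can be tuned to whatever rate-limit headroom the tier check reveals. A minimal sketch (a generic worker-pool mapper, not tied to any provider SDK):

```typescript
// Sketch: run batch jobs with a fixed parallelism cap.
// `limit` would be tuned against the provider's actual rate limits.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // JS is single-threaded, so reading and incrementing `next` before
    // the await is race-free.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}
```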
5. No-gos
- No production deployment. This is R&D. Results are documented, prototyped, and validated, not shipped.
- Minimal UI only: showing which tools the model used in AI Playground (tool call trace). No full UI redesign.
- No organization-level AI settings migration (separate pitch if needed).
- No style guide / file upload feature (separate, more complex problem).
- No changes to the existing single-key interactive translation flow. Agentic mode is for batch/auto-translation only in this phase.
6. Success Criteria
- Documented findings on which tools the model actually uses and which provide measurable quality improvement
- Benchmark data: quality comparison (BLEU or human eval) of agentic vs. current one-shot on a test project
- Token cost analysis: per-key cost comparison across modes (one-shot, agentic, batched, multi-language), with recommendations for pricing/credit model adjustments
- Benchmark data: latency and token cost comparison for batch sizes of 1, 10, 30, 50 keys
- A clear recommendation for what to productionize in the next cycle