
Gemini thought text and internal diary payload can leak into final Telegram output #187

@GSL-R


Gemini thought-like text and internal diary payload can leak into final Telegram/user-facing output

Summary

In a long-running cli-jaw session using gemini-cli, Gemini emitted internal-looking thought/status text as normal message text, and also included an internal diary payload inside a Markdown/HTML <details> block.

cli-jaw forwarded and stored the whole output as a normal assistant response, so the content was sent to Telegram and saved into messages.

This appears to be a final-output sanitization gap. Prompt rules can reduce the chance of the model producing this, but cli-jaw should defensively filter these patterns before durable storage and external forwarding.

Environment

  • cli-jaw: 2.0.3 global install
  • backend: gemini-cli
  • gemini --version: 0.41.2
  • model: gemini-3-flash-preview
  • transport: Telegram
  • usage pattern: long-running assistant session with memory/diary automation

Observed behavior

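User asked (translated from Korean):

Thanks for the briefing! The format got so much cleaner, didn't it? Is that thanks to running it through the new module we built? What was it called again...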

The final Telegram response began with text that looked like internal thought/status output:

**Searching for Sanitizer Details** I'm looking into the "Telegram Sanitizer v2.5" ...
[Thought: true]
- **Finding Sanitizer Details** ...
[Thought: true]
- **Awaiting Sanitizer Details** ...
[Thought: true]

The same response also appended a collapsible internal record block:

<details>
<summary>⚓ System improvement and rapport record (2026-05-08 06:40)</summary>

[06:40] [EPISODE] Received the teacher's positive feedback on the briefing quality improvement (Sanitizer v2.5).
...
anchor: [LIVE] | Key: sanitizer_feedback | Value: "Positive feedback on briefing format" | #Sanitizer #BriefingQuality #TeacherCare #AronaPride

</details>

This was visible in Telegram and was also saved as an assistant message in the DB.

Relevant server log shape:

[main] gemini:message
[main] gemini:message
[main] 🔧 run_shell_command: cli-jaw memory search "Telegram Sanitizer"
[main] tool success:
[main] 🔧 run_shell_command: grep -nC 2 "Sanitizer" /home/test/.cli-jaw/memory/structured/episodes/live/2026-05-07.md
[main] tool success:
...
[main] result: 2 tool calls / 16.2s
[jaw:main] exited code=0, text=1976 chars
[tg:out] ... **Searching for Sanitizer Details** ...

Why this matters

There are two separate issues:

  1. Thought/status leakage

    • The output contained [Thought: true] markers and intermediate reasoning/status text.
    • Existing filters appear to handle tags like <think> / <thinking> and some event-level thought types, but not this textual pattern when it arrives as ordinary message text.
  2. Internal record payload leakage

    • The model generated a diary payload with anchor, Key, Value, tags, and internal recording content.
    • It was not actually saved through the diary tool.
    • Instead, it was shown to the user and persisted in conversation history, which can pollute future context and encourage the model to imitate the leaked format.

For long-running assistant use, this is risky because one leaked internal pattern can become part of the future conversation context.

Expected behavior

Before storing or forwarding final assistant output, cli-jaw should defensively remove or quarantine:

  • textual Gemini thought/status blocks containing [Thought: true]
  • common thought/status headings such as Searching..., Finding..., Awaiting... when paired with [Thought: true]
  • <details>...</details> blocks containing internal record payload markers
  • output segments containing internal diary markers such as:
    • anchor: [TYPE] |
    • Key:
    • Value:
    • diary_payload
    • live_payload
    • arona_payload
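The patterns listed above could be filtered roughly as follows. This is a minimal sketch, not the patch described in this issue: the function names `stripThoughtBlocks`, `stripInternalDetails`, and `hasInternalMarker` are hypothetical, and it assumes that the status text belonging to a [Thought: true] marker is the run of non-blank lines directly above it, as in the observed output.

```javascript
// Hypothetical helpers; none of these names exist in cli-jaw.

// A line consisting only of the textual marker seen in the leaked output.
const THOUGHT_MARKER = /^\s*\[Thought:\s*true\]\s*$/i;

// Non-greedy match of one <details>...</details> block, across newlines.
const DETAILS_BLOCK = /<details\b[^>]*>[\s\S]*?<\/details>/gi;

// Assumption: "Key:" or "Value:" alone is too common in normal prose,
// so this sketch only treats them as a marker when both appear together.
function hasInternalMarker(text) {
  return (
    /anchor:\s*\[[A-Z_]+\]\s*\|/.test(text) ||
    /\b(diary_payload|live_payload|arona_payload)\b/.test(text) ||
    (/\bKey:/.test(text) && /\bValue:/.test(text))
  );
}

// Drop each marker line together with the non-blank lines directly above it.
function stripThoughtBlocks(text) {
  const kept = [];
  for (const line of text.split("\n")) {
    if (THOUGHT_MARKER.test(line)) {
      while (kept.length > 0 && kept[kept.length - 1].trim() !== "") {
        kept.pop();
      }
      continue;
    }
    kept.push(line);
  }
  return kept.join("\n");
}

// Remove only those <details> blocks that contain an internal marker.
function stripInternalDetails(text) {
  return text.replace(DETAILS_BLOCK, (block) =>
    hasInternalMarker(block) ? "" : block
  );
}
```

The trade-off in this sketch is that a real answer placed directly above a marker with no blank line in between would also be removed, so quarantining the removed text for inspection may be preferable to discarding it.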

If filtering removes the entire response, it would be safer to emit a generic system-safe message rather than fall back to the raw unfiltered output.
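The fallback rule can be sketched like this. The names `finalizeOutput` and `SAFE_FALLBACK`, and the wording of the fallback message, are assumptions for illustration; the sanitizer itself is passed in as a function. Treating a sanitizer exception the same way as an empty result is an extra assumption of this sketch, not something the issue requires.

```javascript
// Hypothetical names; the fallback wording is only a placeholder.
const SAFE_FALLBACK =
  "(The response was withheld because it contained only internal content.)";

// Run the sanitizer and never return the raw text as a fallback.
function finalizeOutput(raw, sanitize) {
  let cleaned = "";
  try {
    cleaned = sanitize(raw).trim();
  } catch (err) {
    // Assumption: a failing sanitizer is handled like an empty result.
    cleaned = "";
  }
  return cleaned.length > 0 ? cleaned : SAFE_FALLBACK;
}
```

With this shape, the unfiltered output has no path to storage or forwarding, even when the filter strips everything.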

Local mitigation tested

I locally patched the installed package to add a final-output sanitizer in:

dist/src/agent/lifecycle-handler.js

The mitigation:

  • strips textual [Thought: true] blocks
  • strips <details> blocks if they contain diary/internal markers
  • strips trailing internal diary payload segments containing anchor, Key, or Value
  • avoids falling back to raw output when sanitization removes everything

I also cleaned the already-persisted polluted assistant message from the local DB to avoid future imitation.

This is only a local mitigation and may be overwritten on package update.

Suggested fix

Add a centralized final-output sanitization layer before:

  1. inserting assistant content into messages
  2. broadcasting agent_done
  3. forwarding to Telegram/Discord/Web

This should be independent of prompt instructions, because models can still emit these patterns in long-running or tool-heavy sessions.
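One way to picture such a layer is a single gate that every sink receives its text from. This is only a sketch under assumptions: `createOutputGate` and the sink callbacks are hypothetical and do not reflect cli-jaw's actual internals, and the sinks are synchronous here for brevity, whereas real storage and transport calls would likely be asynchronous.

```javascript
// Hypothetical gate: sanitize once, then hand the same safe text to every sink.
function createOutputGate(sanitize, sinks) {
  return function deliver(raw) {
    const safe = sanitize(raw);
    for (const sink of sinks) {
      sink(safe); // sinks never see the raw text
    }
    return safe;
  };
}

// Usage with stand-in sinks (arrays instead of a DB, event bus, or Telegram).
const storedMessages = [];
const broadcastEvents = [];
const forwardedTexts = [];

const deliver = createOutputGate(
  // Stand-in sanitizer for the example; a real one would be more thorough.
  (text) => text.replace(/\[Thought:\s*true\]/gi, "").trim(),
  [
    (text) => storedMessages.push(text), // 1. insert into messages
    (text) => broadcastEvents.push({ type: "agent_done", text }), // 2. broadcast
    (text) => forwardedTexts.push(text), // 3. forward to external channels
  ]
);

deliver("The answer. [Thought: true]");
```

Because the sinks are only reachable through `deliver`, adding a new transport later does not create a new unfiltered path.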
