Skip to content

fix: 增强 JSON 解析逻辑,兼容带 Markdown 格式或前后对话的模型输出#4

Open
MartinMa2026 wants to merge 1 commit intoTendo33:mainfrom
MartinMa2026:main
Open

fix: 增强 JSON 解析逻辑,兼容带 Markdown 格式或前后对话的模型输出#4
MartinMa2026 wants to merge 1 commit intoTendo33:mainfrom
MartinMa2026:main

Conversation

@MartinMa2026
Copy link
Copy Markdown

@MartinMa2026 MartinMa2026 commented Apr 1, 2026

PR 带来的改动 (What does this PR do?)

src/ideago/llm/chat_model.py 中优化了 _extract_content_text 方法。当 LLM 输出的内容并非纯正的 JSON 字符串时,系统会自动利用正则匹配,过滤掉开头的对话文本并剥离类似 ```json ... ``` 的 Markdown 代码块,从中提取受支持的 JSON 片段。

为什么需要这个改动? (Why is this necessary?)

当前项目在调用 OpenAI 兼容协议的大模型(例如某些国产大模型如 MiniMax 等)时,模型经常会返回带有闲聊包裹的格式(例如:"好的,这是您需要的结果:\n json\n{...}\n "),这会导致底层 json.loads 触发 Expecting value: line 1 column 1 (char 0) 崩溃报错。
通过这个提交,极大提高了底层服务对各路大模型格式的泛化兼容性,让非 GPT 系模型也能非常稳定地跑通分析管道 (Pipeline)。

验证情况 (How it works & Verifications)

  • 使用了具备回退机制的正则来扒取 JSON 格式的 {}[] 树。
  • 全量跑通了 uv run ruff checkuv run mypy 检查。
  • 全量 590 个 pytest 测试用例全部顺利通过。

Summary by CodeRabbit

  • Bug Fixes
    • Improved content parsing to properly handle and extract structured data (JSON/text) from AI model responses, including better handling of Markdown-formatted code blocks and whitespace normalization.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

The _extract_content_text helper function is enhanced to build normalized strings from all content types and extract embedded JSON objects/arrays from Markdown-fenced blocks using regex and substring matching, replacing the previous simpler concatenation logic.

Changes

Cohort / File(s) Summary
Content Text Extraction Logic
src/ideago/llm/chat_model.py
Enhanced _extract_content_text to normalize strings, unwrap Markdown-fenced JSON/text blocks, and extract embedded JSON objects or arrays via regex and fallback substring matching. Added whitespace stripping and intelligent block detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through JSON dreams,
Unwrapping fenced blocks with cunning schemes,
With regex spells and substring sleuth,
Extracting truth from nested youth! 🎀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title is in Chinese and describes the main change: enhancing JSON parsing logic to support Markdown-formatted or conversational model outputs. This accurately reflects the core objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the _extract_content_text function to handle JSON extraction from model responses, including stripping markdown code blocks and identifying the outermost braces or brackets. The reviewer identified several issues with the current implementation, such as fragile manual stripping of backticks that fails with trailing text, a greedy regex that could incorrectly merge multiple code blocks, and an incomplete fallback logic. A more robust extraction flow was suggested to handle these edge cases more effectively.

Comment on lines +416 to +440
text = text.strip()
if text.startswith("```json"):
text = text[7:]
elif text.startswith("```"):
text = text[3:]
if text.endswith("```"):
text = text[:-3]
text = text.strip()

if not text.startswith("{") and not text.startswith("["):
match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
if match:
return match.group(1).strip()
s_brace = text.find("{")
e_brace = text.rfind("}")
s_brak = text.find("[")
e_brak = text.rfind("]")
s_idx, e_idx = -1, -1
if s_brace != -1 and e_brace != -1:
s_idx, e_idx = s_brace, e_brace
if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
s_idx, e_idx = s_brak, e_brak
if s_idx != -1 and e_idx > s_idx:
return text[s_idx : e_idx + 1]
return text
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current extraction logic has a few issues that could lead to parsing failures:

  1. Redundant and Fragile Stripping: The manual stripping of backticks (lines 417-422) is fragile. If the LLM output contains trailing text after the code block (e.g., ```json {...} ```\nExplanation), the prefix is stripped but the suffix check fails because the string doesn't end with backticks. This results in a string that starts with { but contains trailing backticks and text, which will cause json.loads to fail.
  2. Greedy Regex: The regex at line 426 uses a greedy match (.*) with re.DOTALL. If the model output contains multiple separate Markdown code blocks, this regex will incorrectly merge them into a single invalid string.
  3. Incomplete Fallback: The if not text.startswith(...) check at line 425 prevents the more robust extraction logic from handling cases where the JSON starts at the beginning but has trailing garbage.

I suggest a more robust flow: first try a clean match, then try to extract the first valid-looking Markdown block non-greedily, and finally fall back to finding the outermost braces/brackets in the entire text.

    text = text.strip()
    if not text:
        return ""

    # If it's already a clean JSON structure, return it immediately.
    if text.startswith(("{", "[")) and text.endswith(("}", "]")):
        return text

    # 1. Try to extract from the first Markdown code block (e.g., ```json ... ```).
    # Non-greedy match on backticks ensures we don't merge multiple separate blocks.
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        candidate = match.group(1).strip()
        if candidate.startswith(("{", "[")):
            return candidate

    # 2. Fallback: Find the outermost JSON-like structure (braces or brackets).
    # This handles conversational text around the JSON when no code blocks are used.
    s_brace, e_brace = text.find("{"), text.rfind("}")
    s_brak, e_brak = text.find("["), text.rfind("]")

    s_idx, e_idx = -1, -1
    if s_brace != -1 and e_brace != -1:
        s_idx, e_idx = s_brace, e_brace
    if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
        s_idx, e_idx = s_brak, e_brak

    if s_idx != -1 and e_idx > s_idx:
        return text[s_idx : e_idx + 1]

    return text

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ideago/llm/chat_model.py`:
- Around line 417-427: The markdown-fence handling for model outputs is
case-sensitive and misses uppercase variants like ```JSON; update the checks
that inspect the message string (variable text) to use a case-insensitive test
(e.g., text.lower().startswith("```json") or equivalent) and change the regex
search call that currently uses re.DOTALL to include re.IGNORECASE as well so
the pattern r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```" matches JSON fences
regardless of case; modify the code paths around the existing
text.startswith(...) branches and the re.search(...) invocation to apply these
changes.
- Around line 425-439: The current fallback that picks from the first `{`/`[` to
the last `}`/`]` (the block using s_brace, e_brace, s_brak, e_brak, s_idx/e_idx)
can over-capture; replace it with a balanced-brace/bracket scanner: locate the
first opening brace or bracket (s_brace/s_brak), then iterate forward
maintaining a nesting counter (increment on '{'/'[', decrement on '}'/']') and
stop when the counter returns to zero to get the matching end index; only return
the substring if a properly balanced JSON span was found, otherwise return None
(or explicitly log/raise) so failures are handled rather than silently returning
an incorrect slice.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ace563f-95f3-4afb-a64e-eb335376f9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 8a79c34 and 1f5484c.

📒 Files selected for processing (1)
  • src/ideago/llm/chat_model.py

Comment on lines +417 to +427
if text.startswith("```json"):
text = text[7:]
elif text.startswith("```"):
text = text[3:]
if text.endswith("```"):
text = text[:-3]
text = text.strip()

if not text.startswith("{") and not text.startswith("["):
match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
if match:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Markdown fence matching is too strict for common model outputs.

Line 417 and Line 426 only match lowercase json. Outputs like JSON ... won’t be unwrapped by these branches.

Proposed fix
-    if text.startswith("```json"):
+    if text.lower().startswith("```json"):
         text = text[7:]
@@
-        match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
+        match = re.search(
+            r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```",
+            text,
+            re.DOTALL | re.IGNORECASE,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ideago/llm/chat_model.py` around lines 417 - 427, The markdown-fence
handling for model outputs is case-sensitive and misses uppercase variants like
```JSON; update the checks that inspect the message string (variable text) to
use a case-insensitive test (e.g., text.lower().startswith("```json") or
equivalent) and change the regex search call that currently uses re.DOTALL to
include re.IGNORECASE as well so the pattern
r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```" matches JSON fences regardless of case;
modify the code paths around the existing text.startswith(...) branches and the
re.search(...) invocation to apply these changes.

Comment on lines +425 to +439
if not text.startswith("{") and not text.startswith("["):
match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
if match:
return match.group(1).strip()
s_brace = text.find("{")
e_brace = text.rfind("}")
s_brak = text.find("[")
e_brak = text.rfind("]")
s_idx, e_idx = -1, -1
if s_brace != -1 and e_brace != -1:
s_idx, e_idx = s_brace, e_brace
if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
s_idx, e_idx = s_brak, e_brak
if s_idx != -1 and e_idx > s_idx:
return text[s_idx : e_idx + 1]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

JSON substring fallback can over-capture invalid spans.

At Line 429–Line 439, selecting from first {/[ to last }/] can include non-JSON prose when multiple brace/bracket groups exist, causing avoidable parse failures.

Proposed fix
-        s_brace = text.find("{")
-        e_brace = text.rfind("}")
-        s_brak = text.find("[")
-        e_brak = text.rfind("]")
-        s_idx, e_idx = -1, -1
-        if s_brace != -1 and e_brace != -1:
-            s_idx, e_idx = s_brace, e_brace
-        if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
-            s_idx, e_idx = s_brak, e_brak
-        if s_idx != -1 and e_idx > s_idx:
-            return text[s_idx : e_idx + 1]
+        for start_index, opening, closing in (
+            *[(i, "{", "}") for i, ch in enumerate(text) if ch == "{"],
+            *[(i, "[", "]") for i, ch in enumerate(text) if ch == "["],
+        ):
+            depth = 0
+            in_string = False
+            escaped = False
+            for end_index in range(start_index, len(text)):
+                ch = text[end_index]
+                if in_string:
+                    if escaped:
+                        escaped = False
+                    elif ch == "\\":
+                        escaped = True
+                    elif ch == '"':
+                        in_string = False
+                    continue
+                if ch == '"':
+                    in_string = True
+                elif ch == opening:
+                    depth += 1
+                elif ch == closing:
+                    depth -= 1
+                    if depth == 0:
+                        return text[start_index : end_index + 1]

As per coding guidelines "Handle failures explicitly; avoid silent failures".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ideago/llm/chat_model.py` around lines 425 - 439, The current fallback
that picks from the first `{`/`[` to the last `}`/`]` (the block using s_brace,
e_brace, s_brak, e_brak, s_idx/e_idx) can over-capture; replace it with a
balanced-brace/bracket scanner: locate the first opening brace or bracket
(s_brace/s_brak), then iterate forward maintaining a nesting counter (increment
on '{'/'[', decrement on '}'/']') and stop when the counter returns to zero to
get the matching end index; only return the substring if a properly balanced
JSON span was found, otherwise return None (or explicitly log/raise) so failures
are handled rather than silently returning an incorrect slice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant