fix: 增强 JSON 解析逻辑，兼容带 Markdown 格式或前后对话的模型输出 by MartinMa2026 · Pull Request #4 · Tendo33/IdeaGo

MartinMa2026 · 2026-04-01T05:02:09Z

PR 带来的改动 (What does this PR do?)

在 src/ideago/llm/chat_model.py 中优化了 _extract_content_text 方法。当 LLM 输出的内容并非纯正的 JSON 字符串时，系统会自动利用正则匹配，过滤掉开头的对话文本并剥离类似 ```json ... ``` 的 Markdown 代码块，从中提取受支持的 JSON 片段。

为什么需要这个改动？ (Why is this necessary?)

当前项目在调用 OpenAI 兼容协议的大模型（例如某些国产大模型如 MiniMax 等）时，模型经常会返回带有闲聊包裹的格式（例如："好的，这是您需要的结果：\n json\n{...}\n "），这会导致底层 json.loads 触发 Expecting value: line 1 column 1 (char 0) 崩溃报错。
通过这个提交，极大提高了底层服务对各路大模型格式的泛化兼容性，让非 GPT 系模型也能非常稳定地跑通分析管道 (Pipeline)。

验证情况 (How it works & Verifications)

使用了具备回退机制的正则来扒取 JSON 格式的 {} 与 [] 树。
全量跑通了 uv run ruff check 和 uv run mypy 检查。
全量 590 个 pytest 测试用例全部顺利通过。

Summary by CodeRabbit

Bug Fixes
- Improved content parsing to properly handle and extract structured data (JSON/text) from AI model responses, including better handling of Markdown-formatted code blocks and whitespace normalization.

… wrappers

coderabbitai · 2026-04-01T05:02:24Z

📝 Walkthrough

Walkthrough

The _extract_content_text helper function is enhanced to build normalized strings from all content types and extract embedded JSON objects/arrays from Markdown-fenced blocks using regex and substring matching, replacing the previous simpler concatenation logic.

Changes

Cohort / File(s)	Summary
Content Text Extraction Logic `src/ideago/llm/chat_model.py`	Enhanced `_extract_content_text` to normalize strings, unwrap Markdown-fenced JSON/text blocks, and extract embedded JSON objects or arrays via regex and fallback substring matching. Added whitespace stripping and intelligent block detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through JSON dreams,
Unwrapping fenced blocks with cunning schemes,
With regex spells and substring sleuth,
Extracting truth from nested youth! 🎀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title is in Chinese and describes the main change: enhancing JSON parsing logic to support Markdown-formatted or conversational model outputs. This accurately reflects the core objective of the PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request updates the _extract_content_text function to handle JSON extraction from model responses, including stripping markdown code blocks and identifying the outermost braces or brackets. The reviewer identified several issues with the current implementation, such as fragile manual stripping of backticks that fails with trailing text, a greedy regex that could incorrectly merge multiple code blocks, and an incomplete fallback logic. A more robust extraction flow was suggested to handle these edge cases more effectively.

gemini-code-assist · 2026-04-01T05:04:27Z

src/ideago/llm/chat_model.py

+    text = text.strip()
+    if text.startswith("```json"):
+        text = text[7:]
+    elif text.startswith("```"):
+        text = text[3:]
+    if text.endswith("```"):
+        text = text[:-3]
+    text = text.strip()
+
+    if not text.startswith("{") and not text.startswith("["):
+        match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+        s_brace = text.find("{")
+        e_brace = text.rfind("}")
+        s_brak = text.find("[")
+        e_brak = text.rfind("]")
+        s_idx, e_idx = -1, -1
+        if s_brace != -1 and e_brace != -1:
+            s_idx, e_idx = s_brace, e_brace
+        if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
+            s_idx, e_idx = s_brak, e_brak
+        if s_idx != -1 and e_idx > s_idx:
+            return text[s_idx : e_idx + 1]
+    return text


The current extraction logic has a few issues that could lead to parsing failures:

Redundant and Fragile Stripping: The manual stripping of backticks (lines 417-422) is fragile. If the LLM output contains trailing text after the code block (e.g., ```json {...} ```\nExplanation), the prefix is stripped but the suffix check fails because the string doesn't end with backticks. This results in a string that starts with { but contains trailing backticks and text, which will cause json.loads to fail.

Greedy Regex: The regex at line 426 uses a greedy match (.*) with re.DOTALL. If the model output contains multiple separate Markdown code blocks, this regex will incorrectly merge them into a single invalid string.

Incomplete Fallback: The if not text.startswith(...) check at line 425 prevents the more robust extraction logic from handling cases where the JSON starts at the beginning but has trailing garbage.

I suggest a more robust flow: first try a clean match, then try to extract the first valid-looking Markdown block non-greedily, and finally fall back to finding the outermost braces/brackets in the entire text.

text = text.strip() if not text: return "" # If it's already a clean JSON structure, return it immediately. if text.startswith(("{", "[")) and text.endswith(("}", "]")): return text # 1. Try to extract from the first Markdown code block (e.g., ```json ... ```). # Non-greedy match on backticks ensures we don't merge multiple separate blocks. match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL) if match: candidate = match.group(1).strip() if candidate.startswith(("{", "[")): return candidate # 2. Fallback: Find the outermost JSON-like structure (braces or brackets). # This handles conversational text around the JSON when no code blocks are used. s_brace, e_brace = text.find("{"), text.rfind("}") s_brak, e_brak = text.find("["), text.rfind("]") s_idx, e_idx = -1, -1 if s_brace != -1 and e_brace != -1: s_idx, e_idx = s_brace, e_brace if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx): s_idx, e_idx = s_brak, e_brak if s_idx != -1 and e_idx > s_idx: return text[s_idx : e_idx + 1] return text

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ideago/llm/chat_model.py`:
- Around line 417-427: The markdown-fence handling for model outputs is
case-sensitive and misses uppercase variants like ```JSON; update the checks
that inspect the message string (variable text) to use a case-insensitive test
(e.g., text.lower().startswith("```json") or equivalent) and change the regex
search call that currently uses re.DOTALL to include re.IGNORECASE as well so
the pattern r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```" matches JSON fences
regardless of case; modify the code paths around the existing
text.startswith(...) branches and the re.search(...) invocation to apply these
changes.
- Around line 425-439: The current fallback that picks from the first `{`/`[` to
the last `}`/`]` (the block using s_brace, e_brace, s_brak, e_brak, s_idx/e_idx)
can over-capture; replace it with a balanced-brace/bracket scanner: locate the
first opening brace or bracket (s_brace/s_brak), then iterate forward
maintaining a nesting counter (increment on '{'/'[', decrement on '}'/']') and
stop when the counter returns to zero to get the matching end index; only return
the substring if a properly balanced JSON span was found, otherwise return None
(or explicitly log/raise) so failures are handled rather than silently returning
an incorrect slice.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ace563f-95f3-4afb-a64e-eb335376f9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 8a79c34 and 1f5484c.

📒 Files selected for processing (1)

src/ideago/llm/chat_model.py

coderabbitai · 2026-04-01T05:05:44Z

src/ideago/llm/chat_model.py

+    if text.startswith("```json"):
+        text = text[7:]
+    elif text.startswith("```"):
+        text = text[3:]
+    if text.endswith("```"):
+        text = text[:-3]
+    text = text.strip()
+
+    if not text.startswith("{") and not text.startswith("["):
+        match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
+        if match:


⚠️ Potential issue | 🟡 Minor

Markdown fence matching is too strict for common model outputs.

Line 417 and Line 426 only match lowercase json. Outputs like JSON ... won’t be unwrapped by these branches.

Proposed fix

- if text.startswith("```json"): + if text.lower().startswith("```json"): text = text[7:] @@ - match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL) + match = re.search( + r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", + text, + re.DOTALL | re.IGNORECASE, + )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/ideago/llm/chat_model.py` around lines 417 - 427, The markdown-fence handling for model outputs is case-sensitive and misses uppercase variants like ```JSON; update the checks that inspect the message string (variable text) to use a case-insensitive test (e.g., text.lower().startswith("```json") or equivalent) and change the regex search call that currently uses re.DOTALL to include re.IGNORECASE as well so the pattern r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```" matches JSON fences regardless of case; modify the code paths around the existing text.startswith(...) branches and the re.search(...) invocation to apply these changes.

coderabbitai · 2026-04-01T05:05:44Z

src/ideago/llm/chat_model.py

+    if not text.startswith("{") and not text.startswith("["):
+        match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+        s_brace = text.find("{")
+        e_brace = text.rfind("}")
+        s_brak = text.find("[")
+        e_brak = text.rfind("]")
+        s_idx, e_idx = -1, -1
+        if s_brace != -1 and e_brace != -1:
+            s_idx, e_idx = s_brace, e_brace
+        if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
+            s_idx, e_idx = s_brak, e_brak
+        if s_idx != -1 and e_idx > s_idx:
+            return text[s_idx : e_idx + 1]


⚠️ Potential issue | 🟠 Major

JSON substring fallback can over-capture invalid spans.

At Line 429–Line 439, selecting from first {/[ to last }/] can include non-JSON prose when multiple brace/bracket groups exist, causing avoidable parse failures.

Proposed fix

- s_brace = text.find("{") - e_brace = text.rfind("}") - s_brak = text.find("[") - e_brak = text.rfind("]") - s_idx, e_idx = -1, -1 - if s_brace != -1 and e_brace != -1: - s_idx, e_idx = s_brace, e_brace - if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx): - s_idx, e_idx = s_brak, e_brak - if s_idx != -1 and e_idx > s_idx: - return text[s_idx : e_idx + 1] + for start_index, opening, closing in ( + *[(i, "{", "}") for i, ch in enumerate(text) if ch == "{"], + *[(i, "[", "]") for i, ch in enumerate(text) if ch == "["], + ): + depth = 0 + in_string = False + escaped = False + for end_index in range(start_index, len(text)): + ch = text[end_index] + if in_string: + if escaped: + escaped = False + elif ch == "\\": + escaped = True + elif ch == '"': + in_string = False + continue + if ch == '"': + in_string = True + elif ch == opening: + depth += 1 + elif ch == closing: + depth -= 1 + if depth == 0: + return text[start_index : end_index + 1]

As per coding guidelines "Handle failures explicitly; avoid silent failures".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/ideago/llm/chat_model.py` around lines 425 - 439, The current fallback that picks from the first `{`/`[` to the last `}`/`]` (the block using s_brace, e_brace, s_brak, e_brak, s_idx/e_idx) can over-capture; replace it with a balanced-brace/bracket scanner: locate the first opening brace or bracket (s_brace/s_brak), then iterate forward maintaining a nesting counter (increment on '{'/'[', decrement on '}'/']') and stop when the counter returns to zero to get the matching end index; only return the substring if a properly balanced JSON span was found, otherwise return None (or explicitly log/raise) so failures are handled rather than silently returning an incorrect slice.

fix(llm): enhance json parsing for markdown blocks and conversational…

1f5484c

… wrappers

gemini-code-assist bot reviewed Apr 1, 2026

View reviewed changes

coderabbitai bot reviewed Apr 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: 增强 JSON 解析逻辑，兼容带 Markdown 格式或前后对话的模型输出#4

fix: 增强 JSON 解析逻辑，兼容带 Markdown 格式或前后对话的模型输出#4
MartinMa2026 wants to merge 1 commit intoTendo33:mainfrom
MartinMa2026:main

MartinMa2026 commented Apr 1, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 1, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 1, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Apr 1, 2026

Uh oh!

coderabbitai bot Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MartinMa2026 commented Apr 1, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR 带来的改动 (What does this PR do?)

为什么需要这个改动？ (Why is this necessary?)

验证情况 (How it works & Verifications)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MartinMa2026 commented Apr 1, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 1, 2026 •

edited

Loading