fix: 增强 JSON 解析逻辑,兼容带 Markdown 格式或前后对话的模型输出#4
fix: 增强 JSON 解析逻辑,兼容带 Markdown 格式或前后对话的模型输出#4MartinMa2026 wants to merge 1 commit intoTendo33:mainfrom
Conversation
📝 WalkthroughWalkthroughThe Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the _extract_content_text function to handle JSON extraction from model responses, including stripping markdown code blocks and identifying the outermost braces or brackets. The reviewer identified several issues with the current implementation, such as fragile manual stripping of backticks that fails with trailing text, a greedy regex that could incorrectly merge multiple code blocks, and an incomplete fallback logic. A more robust extraction flow was suggested to handle these edge cases more effectively.
| text = text.strip() | ||
| if text.startswith("```json"): | ||
| text = text[7:] | ||
| elif text.startswith("```"): | ||
| text = text[3:] | ||
| if text.endswith("```"): | ||
| text = text[:-3] | ||
| text = text.strip() | ||
|
|
||
| if not text.startswith("{") and not text.startswith("["): | ||
| match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL) | ||
| if match: | ||
| return match.group(1).strip() | ||
| s_brace = text.find("{") | ||
| e_brace = text.rfind("}") | ||
| s_brak = text.find("[") | ||
| e_brak = text.rfind("]") | ||
| s_idx, e_idx = -1, -1 | ||
| if s_brace != -1 and e_brace != -1: | ||
| s_idx, e_idx = s_brace, e_brace | ||
| if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx): | ||
| s_idx, e_idx = s_brak, e_brak | ||
| if s_idx != -1 and e_idx > s_idx: | ||
| return text[s_idx : e_idx + 1] | ||
| return text |
There was a problem hiding this comment.
The current extraction logic has a few issues that could lead to parsing failures:
- Redundant and Fragile Stripping: The manual stripping of backticks (lines 417-422) is fragile. If the LLM output contains trailing text after the code block (e.g.,
```json {...} ```\nExplanation), the prefix is stripped but the suffix check fails because the string doesn't end with backticks. This results in a string that starts with{but contains trailing backticks and text, which will causejson.loadsto fail. - Greedy Regex: The regex at line 426 uses a greedy match (
.*) withre.DOTALL. If the model output contains multiple separate Markdown code blocks, this regex will incorrectly merge them into a single invalid string. - Incomplete Fallback: The
if not text.startswith(...)check at line 425 prevents the more robust extraction logic from handling cases where the JSON starts at the beginning but has trailing garbage.
I suggest a more robust flow: first try a clean match, then try to extract the first valid-looking Markdown block non-greedily, and finally fall back to finding the outermost braces/brackets in the entire text.
text = text.strip()
if not text:
return ""
# If it's already a clean JSON structure, return it immediately.
if text.startswith(("{", "[")) and text.endswith(("}", "]")):
return text
# 1. Try to extract from the first Markdown code block (e.g., ```json ... ```).
# Non-greedy match on backticks ensures we don't merge multiple separate blocks.
match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
if match:
candidate = match.group(1).strip()
if candidate.startswith(("{", "[")):
return candidate
# 2. Fallback: Find the outermost JSON-like structure (braces or brackets).
# This handles conversational text around the JSON when no code blocks are used.
s_brace, e_brace = text.find("{"), text.rfind("}")
s_brak, e_brak = text.find("["), text.rfind("]")
s_idx, e_idx = -1, -1
if s_brace != -1 and e_brace != -1:
s_idx, e_idx = s_brace, e_brace
if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
s_idx, e_idx = s_brak, e_brak
if s_idx != -1 and e_idx > s_idx:
return text[s_idx : e_idx + 1]
return textThere was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ideago/llm/chat_model.py`:
- Around line 417-427: The markdown-fence handling for model outputs is
case-sensitive and misses uppercase variants like ```JSON; update the checks
that inspect the message string (variable text) to use a case-insensitive test
(e.g., text.lower().startswith("```json") or equivalent) and change the regex
search call that currently uses re.DOTALL to include re.IGNORECASE as well so
the pattern r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```" matches JSON fences
regardless of case; modify the code paths around the existing
text.startswith(...) branches and the re.search(...) invocation to apply these
changes.
- Around line 425-439: The current fallback that picks from the first `{`/`[` to
the last `}`/`]` (the block using s_brace, e_brace, s_brak, e_brak, s_idx/e_idx)
can over-capture; replace it with a balanced-brace/bracket scanner: locate the
first opening brace or bracket (s_brace/s_brak), then iterate forward
maintaining a nesting counter (increment on '{'/'[', decrement on '}'/']') and
stop when the counter returns to zero to get the matching end index; only return
the substring if a properly balanced JSON span was found, otherwise return None
(or explicitly log/raise) so failures are handled rather than silently returning
an incorrect slice.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0ace563f-95f3-4afb-a64e-eb335376f9e9
📒 Files selected for processing (1)
src/ideago/llm/chat_model.py
| if text.startswith("```json"): | ||
| text = text[7:] | ||
| elif text.startswith("```"): | ||
| text = text[3:] | ||
| if text.endswith("```"): | ||
| text = text[:-3] | ||
| text = text.strip() | ||
|
|
||
| if not text.startswith("{") and not text.startswith("["): | ||
| match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL) | ||
| if match: |
There was a problem hiding this comment.
Markdown fence matching is too strict for common model outputs.
Line 417 and Line 426 only match lowercase json. Outputs like JSON ... won’t be unwrapped by these branches.
Proposed fix
- if text.startswith("```json"):
+ if text.lower().startswith("```json"):
text = text[7:]
@@
- match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL)
+ match = re.search(
+ r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```",
+ text,
+ re.DOTALL | re.IGNORECASE,
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ideago/llm/chat_model.py` around lines 417 - 427, The markdown-fence
handling for model outputs is case-sensitive and misses uppercase variants like
```JSON; update the checks that inspect the message string (variable text) to
use a case-insensitive test (e.g., text.lower().startswith("```json") or
equivalent) and change the regex search call that currently uses re.DOTALL to
include re.IGNORECASE as well so the pattern
r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```" matches JSON fences regardless of case;
modify the code paths around the existing text.startswith(...) branches and the
re.search(...) invocation to apply these changes.
| if not text.startswith("{") and not text.startswith("["): | ||
| match = re.search(r"```(?:json)?\s*(\{.*\}|\[.*\])\s*```", text, re.DOTALL) | ||
| if match: | ||
| return match.group(1).strip() | ||
| s_brace = text.find("{") | ||
| e_brace = text.rfind("}") | ||
| s_brak = text.find("[") | ||
| e_brak = text.rfind("]") | ||
| s_idx, e_idx = -1, -1 | ||
| if s_brace != -1 and e_brace != -1: | ||
| s_idx, e_idx = s_brace, e_brace | ||
| if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx): | ||
| s_idx, e_idx = s_brak, e_brak | ||
| if s_idx != -1 and e_idx > s_idx: | ||
| return text[s_idx : e_idx + 1] |
There was a problem hiding this comment.
JSON substring fallback can over-capture invalid spans.
At Line 429–Line 439, selecting from first {/[ to last }/] can include non-JSON prose when multiple brace/bracket groups exist, causing avoidable parse failures.
Proposed fix
- s_brace = text.find("{")
- e_brace = text.rfind("}")
- s_brak = text.find("[")
- e_brak = text.rfind("]")
- s_idx, e_idx = -1, -1
- if s_brace != -1 and e_brace != -1:
- s_idx, e_idx = s_brace, e_brace
- if s_brak != -1 and e_brak != -1 and (s_idx == -1 or s_brak < s_idx):
- s_idx, e_idx = s_brak, e_brak
- if s_idx != -1 and e_idx > s_idx:
- return text[s_idx : e_idx + 1]
+ for start_index, opening, closing in (
+ *[(i, "{", "}") for i, ch in enumerate(text) if ch == "{"],
+ *[(i, "[", "]") for i, ch in enumerate(text) if ch == "["],
+ ):
+ depth = 0
+ in_string = False
+ escaped = False
+ for end_index in range(start_index, len(text)):
+ ch = text[end_index]
+ if in_string:
+ if escaped:
+ escaped = False
+ elif ch == "\\":
+ escaped = True
+ elif ch == '"':
+ in_string = False
+ continue
+ if ch == '"':
+ in_string = True
+ elif ch == opening:
+ depth += 1
+ elif ch == closing:
+ depth -= 1
+ if depth == 0:
+ return text[start_index : end_index + 1]As per coding guidelines "Handle failures explicitly; avoid silent failures".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ideago/llm/chat_model.py` around lines 425 - 439, The current fallback
that picks from the first `{`/`[` to the last `}`/`]` (the block using s_brace,
e_brace, s_brak, e_brak, s_idx/e_idx) can over-capture; replace it with a
balanced-brace/bracket scanner: locate the first opening brace or bracket
(s_brace/s_brak), then iterate forward maintaining a nesting counter (increment
on '{'/'[', decrement on '}'/']') and stop when the counter returns to zero to
get the matching end index; only return the substring if a properly balanced
JSON span was found, otherwise return None (or explicitly log/raise) so failures
are handled rather than silently returning an incorrect slice.
PR 带来的改动 (What does this PR do?)
在
src/ideago/llm/chat_model.py中优化了_extract_content_text方法。当 LLM 输出的内容并非纯正的 JSON 字符串时,系统会自动利用正则匹配,过滤掉开头的对话文本并剥离类似```json ... ```的 Markdown 代码块,从中提取受支持的 JSON 片段。为什么需要这个改动? (Why is this necessary?)
当前项目在调用 OpenAI 兼容协议的大模型(例如某些国产大模型如 MiniMax 等)时,模型经常会返回带有闲聊包裹的格式(例如:"好的,这是您需要的结果:\n
json\n{...}\n"),这会导致底层json.loads触发Expecting value: line 1 column 1 (char 0)崩溃报错。通过这个提交,极大提高了底层服务对各路大模型格式的泛化兼容性,让非 GPT 系模型也能非常稳定地跑通分析管道 (Pipeline)。
验证情况 (How it works & Verifications)
{}与[]树。uv run ruff check和uv run mypy检查。pytest测试用例全部顺利通过。Summary by CodeRabbit