
UnicodeEncodeError with surrogate characters when accessing pages with Chinese content on Windows #359

@D-remh

Before submitting

  • I searched existing issues for duplicates.
  • I ran browser-harness --doctor and read the output.
  • I read the troubleshooting section of install.md.
  • This is a reproducible bug in browser-harness — not a question, feature request, or cloud.browser-use.com issue.

Summary

On Windows, browser-harness raises a UnicodeEncodeError when a script reads text from a page with Chinese content. The reported cause is surrogate characters (U+DC80-U+DCFF) in the CDP response data. The exception surfaces at the exec() call in run.py, so text extracted with js() cannot be retrieved or printed, although page_info() on the same page still works.

The CDP (Chrome DevTools Protocol) returns page data (titles, URLs, text content) that may contain surrogate characters when the page uses certain encodings or special characters. Lone surrogate code points cannot be encoded as UTF-8, so Python raises a UnicodeEncodeError as soon as it tries to encode a string that contains them.
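
The encoding failure itself can be reproduced without a browser. The following is a minimal sketch that assumes nothing about browser-harness internals; the `sample` string is made up for illustration and simply embeds two code points from the reported range.

```python
# Hypothetical sample: two lone surrogates from the reported range
# (U+DC80-U+DCFF) embedded in otherwise ordinary text.
sample = "abc\udc80\udc81def"

try:
    sample.encode("utf-8")
    message = None
except UnicodeEncodeError as exc:
    # Same "surrogates not allowed" reason as in the traceback.
    message = str(exc)

print(message)

# Error handlers let the encode succeed, at the cost of replacing or
# escaping the offending code points.
replaced = sample.encode("utf-8", errors="replace")
escaped = sample.encode("utf-8", errors="backslashreplace")
print(replaced, escaped)
```

U+DC80-U+DCFF is also the range Python's `surrogateescape` error handler uses for bytes it could not decode, so the surrogates may be introduced by a decoding step on the Python side rather than by CDP itself. That is a guess; the traceback does not confirm it.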

Affected Operations

  • js() - Cannot execute JavaScript that returns Chinese text content ❌
  • page_info() - Works correctly ✅
  • Basic navigation - Works correctly ✅
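
Until the root cause is fixed, a script could sanitize strings before printing them. The sketch below is a possible workaround, not part of browser-harness: `scrub_surrogates` is a hypothetical helper, and it only helps if the value returned by js() actually reaches the script and the failure happens when that text is encoded for output.

```python
import re

# In a Python str, valid astral characters (such as emoji) are single
# code points, so anything in U+D800-U+DFFF is a lone surrogate.
_LONE_SURROGATE = re.compile("[\ud800-\udfff]")


def scrub_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace lone surrogates so the text can be encoded as UTF-8."""
    return _LONE_SURROGATE.sub(replacement, text)


# Hypothetical value standing in for what js() might return.
raw = "标题\udc80\udc81 🐴"
clean = scrub_surrogates(raw)
encoded = clean.encode("utf-8")  # no longer raises
print(clean)
```

Replacing with U+FFFD loses whatever the surrogates stood for, so this hides the symptom rather than recovering the original text.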

Repro

  1. Start Chrome with remote debugging enabled on Windows 11
  2. Run the following command to test basic page info (this works):
browser-harness <<'PY'
goto_url("https://cloud.tencent.com/developer/article/2663247")
wait_for_load()
info = page_info()
print(f"Title: {info['title']}")
PY

Result: ✅ Success - prints the page title correctly:

Title: 🐴 600行代码统治浏览器:Browser Harness如何让AI智能体获得完全自由-腾讯云开发者社区-腾讯云

  3. Now try to extract article content using js():
browser-harness <<'PY'
goto_url("https://cloud.tencent.com/developer/article/2663247")
wait_for_load()

# Try to extract article content
content = js("""
    const article = document.querySelector('.article-content, .content, article');
    if (article) {
        return article.innerText.substring(0, 500);
    }
    return "No article content found";
""")

print(content[:200])
PY

Result: ❌ Error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Administrator\.local\bin\browser-harness.exe\__main__.py", line 10, in <module>
    sys.exit(main())
             ~~~~^^
  File "C:\Users\Administrator\browser-harness\src\browser_harness\run.py", line 140, in main
    exec(code, exec_globals)
    ~~~~^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 125-126: surrogates not allowed
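
The traceback ends at the exec() call and shows no frame from inside the executed snippet. One case that produces exactly that shape, offered here as a possibility rather than a diagnosis: Python raises the same error when the source string handed to exec() itself contains lone surrogates, because the source is encoded to UTF-8 before it is compiled. The `source` string below is invented for illustration.

```python
# Hypothetical snippet text containing two lone surrogates.
source = "label = 'abc\udc80\udc81'\nprint(label)\n"

try:
    exec(source, {})
    error = None
except UnicodeEncodeError as exc:
    # Raised by exec() itself, before any line of `source` runs.
    error = str(exc)

print(error)
```

If that is what happens here, the surrogates would be in the script text read from the heredoc rather than in the page data. Inspecting `repr(code)` just before the exec() call in run.py would confirm or rule this out.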

Environment

OS: Windows 11 Pro 10.0.26200

Chrome version: 148.0.7778.168

browser-harness --version: 0.1.0 (git)

browser-harness --doctor output:

browser-harness doctor
  platform          Windows 11
  python            3.14.3
  version           0.1.0 (git)
  latest release    (could not reach github)
  [ok  ] chrome running
  [ok  ] daemon alive
  [ok  ] active browser connections — 1
        default — active page: 🐴 600行代码统治浏览器:Browser Harness如何让AI智能体获得完全自由-腾讯云开发者社区-腾讯云 — https://cloud.tencent.com/developer/article/2663247
  [FAIL] profile-use installed — optional: curl -fsSL https://browser-use.com/profile.sh | sh
  [FAIL] BROWSER_USE_API_KEY set — optional: needed only for cloud browsers / profile sync
