From 8afc610747d9ec85ae805275bf0ce4fe7bd4ca0c Mon Sep 17 00:00:00 2001
From: Forge <forge@local.invalid>
Date: Sun, 15 Mar 2026 06:28:23 +0800
Subject: [PATCH] Add blog post on Unicode string comparison traps

---
 .../index.md                                  | 508 ++++++++++++++++++
 .../index.md                                  | 435 +++++++++++++++
 .../index.md                                  | 425 +++++++++++++++
 static/img/2023/0924-unicode-string-traps.svg |  31 ++
 4 files changed, 1399 insertions(+)
 create mode 100644 blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
 create mode 100644 i18n/en/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
 create mode 100644 i18n/ja/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
 create mode 100644 static/img/2023/0924-unicode-string-traps.svg

diff --git a/blog/2023/09-24-strings-look-same-but-still-dont-match/index.md b/blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
new file mode 100644
index 00000000000..65c1ec579a3
--- /dev/null
+++ b/blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
@@ -0,0 +1,508 @@
+---
+slug: strings-look-same-but-still-dont-match
+title: 看起來一樣，為什麼字串還是比對失敗？
+authors: Z. Yuan
+date: 2023-09-24T09:56:27+08:00
+tags: [unicode, python, javascript, text-processing, debugging]
+image: /img/2023/0924-unicode-string-traps.svg
+description: 字串看起來一樣，不代表它們真的一樣。問題通常出在 Unicode、不可見字元、正規化，以及你對電腦的過度信任。
+---
+
+你看到兩個字串。
+
+它們看起來一樣。
+
+你用 `==` 一比。
+
+失敗。
+
+這時候人通常會進入三個階段：
+
+1. 先懷疑自己眼花
+2. 再懷疑編碼壞掉
+3. 最後開始懷疑整個宇宙
+
+其實大多數情況下，宇宙沒有針對你。
+
+只是字串這種東西，**長得像**跟**真的是同一串位元**，本來就是兩回事。
+
+這篇想拆幾個最常見的坑：
+
+1. Unicode 組成不同，但畫面一樣
+2. 混進不可見字元
+3. 全形半形、不同 dash、不同空白
+4. 你以為 trim 過就沒事，其實沒有
+5. 該在什麼時候正規化，什麼時候不要亂正規化
+
+我會用 Python 和 JavaScript 都示範一次，因為這兩邊都很會坑人，只是坑法略有地方特色。
+
+<!-- truncate -->
+
+## 先講結論：字串一樣，不等於 code point 一樣
+
+如果你現在正卡在：
+
+- 資料庫查不到
+- API 簽名對不上
+- 搜尋結果怪怪的
+- 使用者說「我明明貼的一樣」
+
+那先不要急著 blame encoding。
+
+先記住這句：
+
+> **你看到的是字形，電腦比的是位元序列或 code point 序列。**
+
+兩者常常不是同一回事。
+
+最經典的例子是字母 `é`。
+
+它可以是：
+
+- 一個單一字元：`U+00E9`
+- 也可以是：`e` + 結合重音 `U+0301`
+
+畫面上都像 `é`。
+
+但底層不一樣。
+
+### Python
+
+```python
+s1 = "\u00e9"   # precomposed é (U+00E9)
+s2 = "e\u0301"  # e + combining acute accent (U+0301)
+
+print(s1 == s2)          # False
+print(len(s1), len(s2))  # 1 2
+print([hex(ord(c)) for c in s1])
+print([hex(ord(c)) for c in s2])
+```
+
+輸出大概會像這樣：
+
+```text
+False
+1 2
+['0xe9']
+['0x65', '0x301']
+```
+
+### JavaScript
+
+```js
+const s1 = "\u00e9";   // precomposed é (U+00E9)
+const s2 = "e\u0301";  // e + combining acute accent (U+0301)
+
+console.log(s1 === s2); // false
+console.log(s1.length, s2.length); // 1 2
+console.log([...s1].map(ch => ch.codePointAt(0).toString(16)));
+console.log([...s2].map(ch => ch.codePointAt(0).toString(16)));
+```
+
+如果你只看畫面，你會覺得這根本同一個字。
+
+電腦不這麼想。
+
+它比較冷酷，也比較誠實。
+
+## 解法一：用 Unicode normalization 先把字串整理成同一種形式
+
+這不是萬靈丹，但通常是第一步。
+
+常見形式有：
+
+- `NFC`: 偏向組合後的標準形式
+- `NFD`: 偏向拆解後的形式
+- `NFKC`: 相容性正規化，會做更激進的折疊
+- `NFKD`: 相容性拆解版本
+
+大多數一般文字比對，先考慮 `NFC`。
+
+### Python
+
+```python
+import unicodedata
+
+s1 = "é"
+s2 = "e\u0301"
+
+n1 = unicodedata.normalize("NFC", s1)
+n2 = unicodedata.normalize("NFC", s2)
+
+print(n1 == n2)  # True
+```
+
+### JavaScript
+
+```js
+const s1 = "é";
+const s2 = "e\u0301";
+
+console.log(s1.normalize("NFC") === s2.normalize("NFC")); // true
+```
+
+### 什麼時候用 `NFKC`？
+
+當你做的是：
+
+- 使用者輸入的寬鬆搜尋
+- 帳號、代號、標籤這類想收斂輸入形式的欄位
+- 想把全形英數折成半形英數
+
+例如：
+
+```python
+import unicodedata
+
+print(unicodedata.normalize("NFKC", "ＡＢＣ１２３"))
+# ABC123
+```
+
+這很方便。
+
+也很危險。
+
+因為 `NFKC` 不只是整理，還會做**相容性折疊**。
+
+也就是說，它有時不是「保留原文但換個標準形式」，而是「我幫你把看起來差不多的東西直接壓成同一類」。
+
+對搜尋很有用。
+
+對密碼、簽名、法務文本、原文保存，很可能是災難。
+
+所以規則很簡單：
+
+- **搜尋 / 寬鬆比對**：可以考慮 `NFKC`
+- **資料保存 / 安全敏感比對**：通常只做 `NFC`，甚至保留原文另存
+
+## 解法二：把不可見字元抓出來，不要靠肉眼 debug
+
+另一種常見翻車點是：
+
+- 零寬空白 `U+200B`
+- 不換行空白 `U+00A0`
+- word joiner `U+2060`
+- BOM `U+FEFF`
+- tab、carriage return、奇怪換行
+
+這些字元很喜歡混進：
+
+- 從網頁複製的文字
+- Excel / Word 匯出的內容
+- IME 輸入結果
+- OCR 後處理文本
+- 外部 API 回傳資料
+
+例如：
+
+```python
+s1 = "token=abc123"
+s2 = "token=abc123\u200b"
+
+print(s1 == s2)  # False
+print(repr(s2))
+```
+
+輸出：
+
+```text
+False
+'token=abc123\u200b'
+```
+
+如果你不用 `repr()`，你甚至很難發現那個字元存在。
+
+### 我常用的 debug 方式
+
+#### Python
+
+```python
+def inspect_string(s: str):
+    for i, ch in enumerate(s):
+        print(i, hex(ord(ch)), repr(ch))
+```
+
+#### JavaScript
+
+```js
+function inspectString(s) {
+  [...s].forEach((ch, i) => {
+    console.log(i, "U+" + ch.codePointAt(0).toString(16).toUpperCase(), JSON.stringify(ch));
+  });
+}
+```
+
+這種做法很土。
+
+但有效。
+
+debug 時，我寧可土，也不要高雅地浪費兩小時。
+
+## `trim()` 很有用，但不要把它當神
+
+很多人一看到字串問題就先：
+
+- Python：`s.strip()`
+- JavaScript：`s.trim()`
+
+這可以解掉一部分問題。
+
+但不是全部。
+
+因為：
+
+1. 它只處理頭尾，不處理中間
+2. 零寬空白 `U+200B` 這類格式字元不被視為空白，`strip()` 和 `trim()` 都不會移除它
+3. 它不會替你處理 composed/decomposed 的問題
+
+例如這種：
+
+```text
+Hello\u0000World
+Hello\u000bWorld
+Hello\rWorld
+Hello\nWorld
+```
+
+或：
+
+```text
+2025-09-01
+2025‑09‑01
+2025–09–01
+2025—09—01
+```
+
+你眼裡都是 dash。
+
+電腦眼裡不是。
+
+## 長得像 dash，不代表就是 `-`
+
+實務上很常出現這些：
+
+- Hyphen-minus: `-` (`U+002D`)
+- Non-breaking hyphen: `‑` (`U+2011`)
+- En dash: `–` (`U+2013`)
+- Em dash: `—` (`U+2014`)
+- Minus sign: `−` (`U+2212`)
+
+如果你的 parser、正則、split、檔名規則只接受 ASCII `-`，那這些都會讓你翻車。
+
+### Python 範例
+
+```python
+samples = ["2025-09-01", "2025‑09‑01", "2025–09–01", "2025—09—01"]
+
+for s in samples:
+    print(s, [hex(ord(c)) for c in s if not c.isdigit()])
+```
+
+### 實務做法
+
+如果欄位本質上就只該接受 ASCII，例如：
+
+- slug
+- 檔名規格
+- internal ID
+- command option
+
+那就不要裝寬容。
+
+**明確限制輸入集合**，通常比事後猜測字元意圖穩很多。
+
+## 資料清洗的正確姿勢：保留原文，再做 canonical form
+
+這是我比較推薦的做法。
+
+不要一進系統就把使用者原文亂折。
+
+比較穩的流程通常是：
+
+1. **保留原始輸入**
+2. 建立一個 **canonical form** 供搜尋 / 去重 / 比對使用
+3. 規則寫死，並且可重現
+
+例如 Python：
+
+```python
+import unicodedata
+
+
+def canonicalize(text: str) -> str:
+    text = unicodedata.normalize("NFC", text)
+    text = text.replace("\u00A0", " ")      # nbsp -> normal space
+    text = text.replace("\u200B", "")       # zero width space -> remove
+    text = text.strip()
+    return text
+```
+
+如果你需要更寬鬆搜尋：
+
+```python
+import re
+import unicodedata
+
+
+def search_key(text: str) -> str:
+    text = unicodedata.normalize("NFKC", text)
+    text = text.casefold()
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+```
+
+這兩個函式不該混為一談。
+
+- `canonicalize()`：偏保守
+- `search_key()`：偏搜尋導向
+
+把兩者混在一起，後面通常會補 bug 補到心情不太穩定。
+
+## `lower()` 不夠，文字比對通常該考慮 `casefold()`
+
+如果你在做不分大小寫的 Unicode 文字比對，Python 裡通常 `casefold()` 比 `lower()` 更合適。
+
+```python
+print("Straße".lower())
+print("Straße".casefold())
+```
+
+輸出：
+
+```text
+straße
+strasse
+```
+
+這在某些歐洲語系場景尤其重要。
+
+JavaScript 沒有直接對等的 `casefold()`，通常只能靠：
+
+- `toLowerCase()` / `toLocaleLowerCase()`
+- 再搭配你自己的正規化規則
+
+也就是說，如果你做的是跨語系的嚴肅全文檢索，前端順手比一比可以，真正的 canonicalization 最好放在後端做。
+
+## 千萬別對密碼、簽名、token 亂做 normalization
+
+這一點值得單獨拉出來講。
+
+有些工程師一看到字串問題，就會想：
+
+> 「那我把所有輸入都 normalize 一下，不就天下太平？」
+
+不。
+
+那通常只是把 bug 從「顯性」變成「更難查」。
+
+以下資料通常**不能亂做寬鬆正規化**：
+
+- 密碼
+- HMAC / API signature
+- JWT / token
+- 雜湊輸入
+- 法律或審計要求保真原文的欄位
+
+這些欄位要的是：
+
+- 明確位元一致
+- 規則穩定
+- 不偷偷替使用者解讀
+
+你可以在 UI 顯示提醒。
+
+你可以在輸入時檢測可疑字元。
+
+但不要擅自幫它「修正」。
+
+## 一套比較實用的排查順序
+
+如果你遇到「看起來一樣但比對失敗」，我通常這樣查：
+
+1. **先印 `repr()` / `JSON.stringify()`**
+2. **列出每個 code point**
+3. **檢查是否混入零寬或特殊空白**
+4. **對照 `NFC` 後結果是否一致**
+5. **確認欄位語意是否允許更激進的 `NFKC`**
+6. **把規則收斂成一個共用函式，不要每個地方各自亂洗**
+
+很多 bug 不是因為 Unicode 太複雜。
+
+而是因為團隊裡：
+
+- A 用 `trim()`
+- B 用 `lower()`
+- C 用 `NFKC`
+- D 什麼都不做
+
+最後大家都說自己是對的。
+
+技術上來說，這很民主。
+
+系統上來說，這很難維運。
+
+## Python / JavaScript 各給一個實用版本
+
+### Python
+
+```python
+import re
+import unicodedata
+
+ZERO_WIDTH = {
+    "\u200b",  # zero width space
+    "\u200c",  # zero width non-joiner
+    "\u200d",  # zero width joiner
+    "\ufeff",  # BOM / zero width no-break space
+}
+
+
+def clean_for_search(text: str) -> str:
+    text = unicodedata.normalize("NFKC", text)
+    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
+    text = text.casefold()
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+```
+
+### JavaScript
+
+```js
+function cleanForSearch(text) {
+  return text
+    .normalize("NFKC")
+    .replace(/[\u200B\u200C\u200D\uFEFF]/g, "")
+    .toLocaleLowerCase("en-US")
+    .replace(/\s+/g, " ")
+    .trim();
+}
+```
+
+這不是宇宙真理。
+
+但對「搜尋、標籤、一般使用者輸入比對」這類場景，通常已經比裸用 `==` 靠譜很多。
+
+## 小結
+
+字串問題最煩的地方，在於它常常**看起來像資料沒問題**。
+
+但只要底層 code point 不同、混入不可見字元、或 normalization 策略不一致，系統就會開始表演。
+
+所以真正有用的原則不是：
+
+- 「看到怪字就 trim 一下」
+- 「全都 lower 一下」
+- 「全部丟進 NFKC」
+
+而是：
+
+1. **先看清楚底層字元是什麼**
+2. **依欄位語意決定清洗強度**
+3. **保留原文，另外建立 canonical form**
+4. **把規則集中管理，不要每段程式各自發揮**
+
+畢竟，電腦其實沒有那麼難搞。
+
+它只是拒絕替你腦補。
+
+這點雖然冷酷，但老實說，挺專業的。
diff --git a/i18n/en/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md b/i18n/en/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
new file mode 100644
index 00000000000..c180d56041f
--- /dev/null
+++ b/i18n/en/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
@@ -0,0 +1,435 @@
+---
+slug: strings-look-same-but-still-dont-match
+title: They Look the Same. Why Does String Matching Still Fail?
+authors: Z. Yuan
+date: 2023-09-24T09:56:27+08:00
+tags: [unicode, python, javascript, text-processing, debugging]
+image: /img/2023/0924-unicode-string-traps.svg
+description: Two strings can look identical and still fail comparison. The usual suspects are Unicode normalization, invisible characters, and excessive trust in your own eyes.
+---
+
+You look at two strings.
+
+They look identical.
+
+You compare them with `==`.
+
+It fails.
+
+At that point people usually go through three stages:
+
+1. suspect their eyesight
+2. suspect the encoding
+3. suspect the universe
+
+Usually the universe is innocent.
+
+The real problem is simpler:
+
+> **Visual equality is not the same thing as identical code points or identical bytes.**
+
+This post covers the usual traps:
+
+1. same glyph, different Unicode composition
+2. invisible characters mixed into the text
+3. full-width vs half-width characters, strange dashes, strange spaces
+4. why `trim()` helps less than people hope
+5. when to normalize, and when normalization is the wrong move
+
+Examples use both Python and JavaScript, because both can hurt you here. They just have different manners.
+
+<!-- truncate -->
+
+## First principle: same appearance does not mean same code points
+
+A classic example is `é`.
+
+It can be represented as:
+
+- one character: `U+00E9`
+- or `e` plus a combining acute accent: `U+0301`
+
+They render the same.
+
+They are not the same sequence.
+
+### Python
+
+```python
+s1 = "\u00e9"   # precomposed é (U+00E9)
+s2 = "e\u0301"  # e + combining acute accent (U+0301)
+
+print(s1 == s2)          # False
+print(len(s1), len(s2))  # 1 2
+print([hex(ord(c)) for c in s1])
+print([hex(ord(c)) for c in s2])
+```
+
+### JavaScript
+
+```js
+const s1 = "\u00e9";   // precomposed é (U+00E9)
+const s2 = "e\u0301";  // e + combining acute accent (U+0301)
+
+console.log(s1 === s2); // false
+console.log(s1.length, s2.length); // 1 2
+console.log([...s1].map(ch => ch.codePointAt(0).toString(16)));
+console.log([...s2].map(ch => ch.codePointAt(0).toString(16)));
+```
+
+If you only inspect the rendered text, this looks absurd.
+
+From the computer’s perspective, it is perfectly normal.
+
+## Fix 1: normalize Unicode before comparison
+
+This is not magic, but it is often the first correct step.
+
+Common forms are:
+
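+- `NFC`: canonical composition; combining sequences are merged into precomposed characters where possible
+- `NFD`: canonical decomposition; precomposed characters are split into a base character plus combining marks
+- `NFKC`: compatibility decomposition followed by canonical composition; more aggressive
+- `NFKD`: compatibility decomposition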
+
+For ordinary text matching, start with `NFC`.
+
+### Python
+
+```python
+import unicodedata
+
+s1 = "é"
+s2 = "e\u0301"
+
+n1 = unicodedata.normalize("NFC", s1)
+n2 = unicodedata.normalize("NFC", s2)
+
+print(n1 == n2)  # True
+```
+
+### JavaScript
+
+```js
+const s1 = "é";
+const s2 = "e\u0301";
+
+console.log(s1.normalize("NFC") === s2.normalize("NFC")); // true
+```
+
+### When should you use `NFKC`?
+
+Useful cases include:
+
+- forgiving search
+- usernames, labels, or identifiers where you want to collapse input variants
+- folding full-width Latin letters and digits into ASCII forms
+
+Example:
+
+```python
+import unicodedata
+
+print(unicodedata.normalize("NFKC", "ＡＢＣ１２３"))
+# ABC123
+```
+
+Convenient, yes.
+
+Also dangerous.
+
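+`NFKC` does more than standardize representation. It also replaces compatibility characters (full-width letters, ligatures such as `ﬁ`, superscripts such as `²`) with their plain equivalents, which discards distinctions that were present in the original.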
+
+That is great for search.
+
+It can be terrible for passwords, signatures, legal text, and anything that must preserve the exact original input.
+
+So the short rule is:
+
+- **search / loose matching**: `NFKC` can be reasonable
+- **storage / security-sensitive comparison**: usually `NFC` only, and keep the original stored separately when it must be preserved
+
+## Fix 2: inspect invisible characters instead of trusting your eyes
+
+Another common failure mode is hidden characters such as:
+
+- zero-width space `U+200B`
+- no-break space `U+00A0`
+- word joiner `U+2060`
+- BOM `U+FEFF`
+- tabs, carriage returns, and odd line separators
+
+These often arrive from:
+
+- copied web content
+- Excel or Word exports
+- IMEs
+- OCR pipelines
+- third-party APIs
+
+Example:
+
+```python
+s1 = "token=abc123"
+s2 = "token=abc123\u200b"
+
+print(s1 == s2)  # False
+print(repr(s2))
+```
+
+If you do not print `repr()`, you may not even notice the extra character.
+
+### Debug helpers
+
+#### Python
+
+```python
+def inspect_string(s: str):
+    for i, ch in enumerate(s):
+        print(i, hex(ord(ch)), repr(ch))
+```
+
+#### JavaScript
+
+```js
+function inspectString(s) {
+  [...s].forEach((ch, i) => {
+    console.log(i, "U+" + ch.codePointAt(0).toString(16).toUpperCase(), JSON.stringify(ch));
+  });
+}
+```
+
+This is not elegant.
+
+It is effective.
+
+In debugging, effective beats elegant very quickly.
+
+## `trim()` helps, but it is not a religion
+
+A lot of people respond to string bugs with:
+
+- Python: `s.strip()`
+- JavaScript: `s.trim()`
+
+Useful, yes.
+
+Sufficient, no.
+
+Why not?
+
+1. it only touches the edges, not the middle
+2. it does not solve composed vs decomposed Unicode
+3. it does not convert look-alike dashes or spaces, and it leaves zero-width characters such as `U+200B` alone because they are not classified as whitespace
+
+## Not every dash is `-`
+
+In practice you will see all of these:
+
+- Hyphen-minus: `-` (`U+002D`)
+- Non-breaking hyphen: `‑` (`U+2011`)
+- En dash: `–` (`U+2013`)
+- Em dash: `—` (`U+2014`)
+- Minus sign: `−` (`U+2212`)
+
+Humans read “dash”.
+
+Parsers do not.
+
+If your regex, parser, file naming rule, or split logic only accepts ASCII `-`, the others will break it.
+
+If a field is supposed to be ASCII-only by design, such as:
+
+- slugs
+- internal IDs
+- command-line options
+- file naming conventions
+
+then do not pretend it is flexible. Reject invalid input explicitly.
+
+That is usually cheaper than trying to infer user intent after the fact.
+
+## A safer data-cleaning pattern: preserve the original, derive a canonical form
+
+This is the pattern I trust more.
+
+Do not immediately rewrite user input into something else and hope for the best.
+
+A more stable approach is:
+
+1. **preserve the raw original input**
+2. create a **canonical form** for search, deduplication, or loose matching
+3. make the rules explicit and reproducible
+
+Python example:
+
+```python
+import unicodedata
+
+
+def canonicalize(text: str) -> str:
+    text = unicodedata.normalize("NFC", text)
+    text = text.replace("\u00A0", " ")      # no-break space -> regular space
+    text = text.replace("\u200B", "")       # zero-width space -> removed
+    text = text.strip()
+    return text
+```
+
+If you need a looser search key:
+
+```python
+import re
+import unicodedata
+
+
+def search_key(text: str) -> str:
+    text = unicodedata.normalize("NFKC", text)
+    text = text.casefold()
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+```
+
+These are different tools for different jobs.
+
+Mix them carelessly and you get a bug farm.
+
+## For case-insensitive Unicode matching, `casefold()` is usually better than `lower()`
+
+In Python, `casefold()` is often more appropriate than `lower()` for case-insensitive text matching.
+
+```python
+print("Straße".lower())
+print("Straße".casefold())
+```
+
+Output:
+
+```text
+straße
+strasse
+```
+
+JavaScript does not give you a direct `casefold()` equivalent. Usually you are limited to:
+
+- `toLowerCase()`
+- `toLocaleLowerCase()`
+- plus your own normalization rules
+
+So for serious multilingual matching, do not rely on ad hoc frontend logic alone. Put canonicalization rules in the backend and keep them consistent.
+
+## Do not normalize passwords, signatures, or tokens just because you can
+
+This deserves its own section.
+
+A common overreaction is:
+
+> “Fine, I will normalize everything.”
+
+That usually turns a visible bug into a subtler one.
+
+The following data should not be “helpfully” normalized in a loose way:
+
+- passwords
+- HMAC or API signatures
+- JWTs or tokens
+- hash inputs
+- fields where legal or audit requirements demand the exact original text
+
+For those fields you want:
+
+- exact byte stability
+- explicit rules
+- no silent reinterpretation
+
+You can warn the user.
+
+You can detect suspicious characters.
+
+You should not quietly rewrite the input and pretend that was safe.
+
+## A practical debugging checklist
+
+When two strings look the same but comparison fails, I usually do this:
+
+1. print `repr()` or `JSON.stringify()`
+2. list every code point
+3. check for zero-width and unusual whitespace characters
+4. compare again after `NFC`
+5. decide whether the field semantics allow `NFKC`
+6. centralize the rule in one shared function instead of re-inventing it everywhere
+
+A lot of Unicode bugs are not caused by Unicode being impossibly complicated.
+
+They are caused by teams doing four different “small fixes” in four different places.
+
+That is democratic.
+
+It is also how string handling becomes an operational problem.
+
+## Two practical helpers
+
+### Python
+
+```python
+import re
+import unicodedata
+
+ZERO_WIDTH = {
+    "\u200b",  # zero width space
+    "\u200c",  # zero width non-joiner
+    "\u200d",  # zero width joiner
+    "\ufeff",  # BOM / zero width no-break space
+}
+
+
+def clean_for_search(text: str) -> str:
+    text = unicodedata.normalize("NFKC", text)
+    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
+    text = text.casefold()
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+```
+
+### JavaScript
+
+```js
+function cleanForSearch(text) {
+  return text
+    .normalize("NFKC")
+    .replace(/[\u200B\u200C\u200D\uFEFF]/g, "")
+    .toLocaleLowerCase("en-US")
+    .replace(/\s+/g, " ")
+    .trim();
+}
+```
+
+This is not universal truth.
+
+It is, however, a lot better than pretending raw `==` is a text-processing strategy.
+
+## Summary
+
+String bugs are annoying because the data often looks fine.
+
+But once code points differ, invisible characters slip in, or normalization rules vary across the stack, your system starts improvising.
+
+The useful rules are not:
+
+- “just trim it”
+- “just lowercase it”
+- “just normalize everything with `NFKC`”
+
+The useful rules are:
+
+1. **inspect the actual characters**
+2. **choose cleaning strength based on field semantics**
+3. **preserve the original and derive a canonical form separately**
+4. **centralize the rule instead of scattering string folklore across the codebase**
+
+Computers are not being dramatic here.
+
+They are simply refusing to guess what you meant.
+
+Cold behavior, perhaps.
+
+Also professional.
diff --git a/i18n/ja/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md b/i18n/ja/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
new file mode 100644
index 00000000000..0ab3c5c3def
--- /dev/null
+++ b/i18n/ja/docusaurus-plugin-content-blog/2023/09-24-strings-look-same-but-still-dont-match/index.md
@@ -0,0 +1,425 @@
+---
+slug: strings-look-same-but-still-dont-match
+title: 同じに見えるのに、なぜ文字列比較は失敗するのか？
+authors: Z. Yuan
+date: 2023-09-24T09:56:27+08:00
+tags: [unicode, python, javascript, text-processing, debugging]
+image: /img/2023/0924-unicode-string-traps.svg
+description: 文字列は見た目が同じでも一致しないことがあります。原因はたいてい Unicode 正規化、不可視文字、そして人間の目への過信です。
+---
+
+2 つの文字列を見る。
+
+見た目は同じ。
+
+`==` で比較する。
+
+失敗する。
+
+この手の事故が起きると、人はだいたい次の順で壊れます。
+
+1. まず自分の目を疑う
+2. 次に文字コードを疑う
+3. 最後に宇宙の悪意を疑う
+
+たいてい宇宙は無実です。
+
+問題はもっと地味です。
+
+> **見た目が同じことと、code point や byte 列が同じことは別です。**
+
+この記事では、よくある落とし穴を整理します。
+
+1. 同じ字形でも Unicode の構成が違う
+2. 不可視文字が混ざる
+3. 全角・半角、ダッシュ、空白の違い
+4. `trim()` では足りない理由
+5. いつ正規化すべきか、いつ正規化してはいけないか
+
+例は Python と JavaScript の両方を使います。どちらもこの分野では十分に厄介です。
+
+<!-- truncate -->
+
+## まず原則：同じ見た目でも、同じ code point とは限らない
+
+代表例は `é` です。
+
+これは次の 2 通りで表現できます。
+
+- 1 文字の `U+00E9`
+- `e` と結合アクセント `U+0301`
+
+表示上は同じです。
+
+でも内部表現は違います。
+
+### Python
+
+```python
+s1 = "\u00e9"   # precomposed é (U+00E9)
+s2 = "e\u0301"  # e + combining acute accent (U+0301)
+
+print(s1 == s2)          # False
+print(len(s1), len(s2))  # 1 2
+print([hex(ord(c)) for c in s1])
+print([hex(ord(c)) for c in s2])
+```
+
+### JavaScript
+
+```js
+const s1 = "\u00e9";   // precomposed é (U+00E9)
+const s2 = "e\u0301";  // e + combining acute accent (U+0301)
+
+console.log(s1 === s2); // false
+console.log(s1.length, s2.length); // 1 2
+console.log([...s1].map(ch => ch.codePointAt(0).toString(16)));
+console.log([...s2].map(ch => ch.codePointAt(0).toString(16)));
+```
+
+画面だけ見ていると理不尽ですが、コンピュータ側の挙動としては完全に正常です。
+
+## 対処 1：Unicode 正規化で表現形式をそろえる
+
+万能薬ではありませんが、最初にやるべきこととしてはかなり正しいです。
+
+代表的な形式は次の通りです。
+
+- `NFC`: 合成寄りの標準形
+- `NFD`: 分解寄りの形式
+- `NFKC`: 互換正規化。より強く畳み込む
+- `NFKD`: 互換分解
+
+普通の文字列比較なら、まず `NFC` を考えます。
+
+### Python
+
+```python
+import unicodedata
+
+s1 = "é"
+s2 = "e\u0301"
+
+n1 = unicodedata.normalize("NFC", s1)
+n2 = unicodedata.normalize("NFC", s2)
+
+print(n1 == n2)  # True
+```
+
+### JavaScript
+
+```js
+const s1 = "é";
+const s2 = "e\u0301";
+
+console.log(s1.normalize("NFC") === s2.normalize("NFC")); // true
+```
+
+### `NFKC` はいつ使うべきか
+
+向いているのは、たとえば次のような場面です。
+
+- ゆるい検索
+- ユーザー入力の揺れを吸収したい識別子
+- 全角英数字を ASCII に寄せたい場合
+
+```python
+import unicodedata
+
+print(unicodedata.normalize("NFKC", "ＡＢＣ１２３"))
+# ABC123
+```
+
+便利です。
+
+同時に、雑に使うと危険です。
+
+`NFKC` は単なる形式統一ではなく、互換文字をより積極的に畳み込みます。
+
+検索には向いています。
+
+パスワード、署名、法的原文、厳密な入力保持には向かないことがあります。
+
+なので雑にまとめるとこうです。
+
+- **検索・あいまい比較**: `NFKC` は有力
+- **保存・セキュリティ上厳密な比較**: まずは `NFC`、あるいは原文そのものを保持
+
+## 対処 2：不可視文字を目視ではなくコードで暴く
+
+もう 1 つの定番事故が不可視文字です。
+
+- zero-width space `U+200B`
+- no-break space `U+00A0`
+- word joiner `U+2060`
+- BOM `U+FEFF`
+- タブ、復帰、変な改行
+
+これらは次の経路で平然と混ざります。
+
+- Web ページからのコピー
+- Excel / Word の出力
+- IME
+- OCR 後処理
+- 外部 API
+
+例：
+
+```python
+s1 = "token=abc123"
+s2 = "token=abc123\u200b"
+
+print(s1 == s2)  # False
+print(repr(s2))
+```
+
+`repr()` を出さないと、そもそも異物があることに気づけないことがあります。
+
+### デバッグ用の定番
+
+#### Python
+
+```python
+def inspect_string(s: str):
+    for i, ch in enumerate(s):
+        print(i, hex(ord(ch)), repr(ch))
+```
+
+#### JavaScript
+
+```js
+function inspectString(s) {
+  [...s].forEach((ch, i) => {
+    console.log(i, "U+" + ch.codePointAt(0).toString(16).toUpperCase(), JSON.stringify(ch));
+  });
+}
+```
+
+上品ではありません。
+
+でも効きます。
+
+デバッグで大事なのは、だいたい品格ではなく再現性です。
+
+## `trim()` は便利だが、救世主ではない
+
+文字列の不具合を見ると、すぐにこうしたくなります。
+
+- Python: `s.strip()`
+- JavaScript: `s.trim()`
+
+役には立ちます。
+
+しかし十分ではありません。
+
+理由は単純です。
+
+1. 先頭と末尾しか触らない
+2. Unicode の合成・分解問題は解決しない
+3. ダッシュや空白の種類違いまでは吸収しない
+
+## ダッシュは全部 `-` ではない
+
+実務では次のような文字が混ざります。
+
+- Hyphen-minus: `-` (`U+002D`)
+- Non-breaking hyphen: `‑` (`U+2011`)
+- En dash: `–` (`U+2013`)
+- Em dash: `—` (`U+2014`)
+- Minus sign: `−` (`U+2212`)
+
+人間は全部「ダッシュっぽいもの」と読みます。
+
+パーサはそうしません。
+
+もし正規表現、split、ファイル名ルール、slug 規則が ASCII の `-` だけを期待しているなら、他は普通に事故要因です。
+
+フィールドの意味として ASCII しか許さないべきなら、最初からそう制約した方が安いです。
+
+後から「たぶんこういう意味だろう」と推測するのは、だいたい保守コストの前払いです。
+
+## クリーニングの基本：原文を保持し、別に canonical form を作る
+
+これはかなり重要です。
+
+ユーザー入力を受けた瞬間に書き換えてしまうのは、長期的にはあまり賢くありません。
+
+安定しやすい流れはこうです。
+
+1. **原文は保持する**
+2. 検索・重複判定・緩い比較用に **canonical form** を作る
+3. ルールを明文化し、再現可能にする
+
+Python の例：
+
+```python
+import unicodedata
+
+
+def canonicalize(text: str) -> str:
+    text = unicodedata.normalize("NFC", text)
+    text = text.replace("\u00A0", " ")      # no-break space -> regular space
+    text = text.replace("\u200B", "")       # zero-width space -> removed
+    text = text.strip()
+    return text
+```
+
+もっと緩い検索用なら：
+
+```python
+import re
+import unicodedata
+
+
+def search_key(text: str) -> str:
+    text = unicodedata.normalize("NFKC", text)
+    text = text.casefold()
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+```
+
+この 2 つを混ぜると、あとで静かに面倒が育ちます。
+
+## Python では `lower()` より `casefold()` を検討する
+
+Unicode を含む大小無視比較では、Python では `lower()` より `casefold()` の方が適切な場面があります。
+
+```python
+print("Straße".lower())
+print("Straße".casefold())
+```
+
+出力：
+
+```text
+straße
+strasse
+```
+
+JavaScript にはこれと完全に同等の `casefold()` はありません。通常は：
+
+- `toLowerCase()`
+- `toLocaleLowerCase()`
+- それに自前の正規化ルール
+
+という構成になります。
+
+なので、多言語を真面目に扱う比較ルールは、フロントで各自が気分で書くより、バックエンド側で一元化した方が安定します。
+
+## パスワード、署名、token に雑な正規化を入れない
+
+ここは分けて強調しておきます。
+
+文字列事故を見ると、ついこう考えがちです。
+
+> 「じゃあ全部 normalize すれば平和では？」
+
+平和にはなりません。
+
+むしろバグが見えにくくなることがあります。
+
+次のようなデータは、ゆるい正規化を勝手にかけるべきではありません。
+
+- パスワード
+- HMAC / API 署名
+- JWT / token
+- ハッシュ入力
+- 原文保持が必要な法務・監査系データ
+
+こういうものに必要なのは：
+
+- byte 単位の安定性
+- 明示的なルール
+- 入力内容の勝手な再解釈をしないこと
+
+警告は出していい。
+
+怪しい文字を検出していい。
+
+でも、黙って「直す」のはだいたい危ないです。
+
+## 実務で使う確認手順
+
+「同じに見えるのに比較が失敗する」とき、私はだいたい次の順で見ます。
+
+1. `repr()` / `JSON.stringify()` を出す
+2. code point を全部並べる
+3. zero-width や特殊空白の混入を疑う
+4. `NFC` 後に一致するか確認する
+5. フィールドの意味として `NFKC` を許せるか考える
+6. ルールを共通関数に寄せる
+
+Unicode が難しすぎるから事故る、というよりは、
+
+- A は `trim()`
+- B は `lower()`
+- C は `NFKC`
+- D は何もしない
+
+みたいなチーム構成で事故ることの方が多いです。
+
+民主的ではあります。
+
+運用しやすくはありません。
+
+## 実用ヘルパーを 2 つ
+
+### Python
+
+```python
+import re
+import unicodedata
+
+ZERO_WIDTH = {
+    "\u200b",  # zero width space
+    "\u200c",  # zero width non-joiner
+    "\u200d",  # zero width joiner
+    "\ufeff",  # BOM / zero width no-break space
+}
+
+
+def clean_for_search(text: str) -> str:
+    text = unicodedata.normalize("NFKC", text)
+    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
+    text = text.casefold()
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+```
+
+### JavaScript
+
+```js
+function cleanForSearch(text) {
+  return text
+    .normalize("NFKC")
+    .replace(/[\u200B\u200C\u200D\uFEFF]/g, "")
+    .toLocaleLowerCase("en-US")
+    .replace(/\s+/g, " ")
+    .trim();
+}
+```
+
+絶対解ではありません。
+
+ただ、素の `==` をそのまま信仰するよりは、かなり現実的です。
+
+## まとめ
+
+文字列バグが面倒なのは、見た目には問題なさそうに見えることです。
+
+でも、code point が違い、不可視文字が入り、正規化方針がスタック全体で揃っていないと、システムは静かに壊れます。
+
+役に立つ原則は次の 4 つです。
+
+1. **実際の文字を確認する**
+2. **フィールドの意味に応じて洗浄強度を決める**
+3. **原文を保持し、canonical form は別に作る**
+4. **ルールを一箇所に集める**
+
+コンピュータはここで意地悪をしているわけではありません。
+
+ただ、あなたの意図を勝手に補完しないだけです。
+
+冷たい態度です。
+
+でも、かなり仕事はできる態度でもあります。
diff --git a/static/img/2023/0924-unicode-string-traps.svg b/static/img/2023/0924-unicode-string-traps.svg
new file mode 100644
index 00000000000..7d4585bc997
--- /dev/null
+++ b/static/img/2023/0924-unicode-string-traps.svg
@@ -0,0 +1,31 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1200" height="630" viewBox="0 0 1200 630" role="img" aria-labelledby="title desc">
+  <title id="title">Unicode string traps</title>
+  <desc id="desc">Cover image for a blog post about Unicode normalization, invisible characters, and debugging string mismatches.</desc>
+  <defs>
+    <linearGradient id="bg" x1="0" y1="0" x2="1" y2="1">
+      <stop offset="0%" stop-color="#0b1020"/>
+      <stop offset="100%" stop-color="#111827"/>
+    </linearGradient>
+    <linearGradient id="accent" x1="0" y1="0" x2="1" y2="0">
+      <stop offset="0%" stop-color="#60a5fa"/>
+      <stop offset="100%" stop-color="#34d399"/>
+    </linearGradient>
+  </defs>
+  <rect width="1200" height="630" fill="url(#bg)"/>
+  <rect x="56" y="52" width="1088" height="526" rx="24" fill="#0f172a" stroke="#334155"/>
+  <text x="88" y="130" fill="#93c5fd" font-family="Menlo, Monaco, Consolas, monospace" font-size="26">Unicode / Normalization / Invisible chars</text>
+  <text x="88" y="220" fill="#f8fafc" font-family="-apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="56" font-weight="700">看起來一樣，為什麼</text>
+  <text x="88" y="290" fill="#f8fafc" font-family="-apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="56" font-weight="700">字串還是比對失敗？</text>
+  <text x="88" y="368" fill="#94a3b8" font-family="-apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="28">é  ≠  é   ·   zero-width space   ·   NFKC is not a toy</text>
+  <rect x="88" y="420" width="420" height="90" rx="16" fill="#020617" stroke="#1e293b"/>
+  <text x="110" y="455" fill="#e2e8f0" font-family="Menlo, Monaco, Consolas, monospace" font-size="24">"token=abc123"</text>
+  <text x="110" y="492" fill="#f87171" font-family="Menlo, Monaco, Consolas, monospace" font-size="24">"token=abc123\u00A0"</text>
+  <rect x="650" y="170" width="420" height="280" rx="20" fill="#020617" stroke="#1e293b"/>
+  <text x="680" y="220" fill="#e2e8f0" font-family="Menlo, Monaco, Consolas, monospace" font-size="28">s1 = "é"</text>
+  <text x="680" y="265" fill="#e2e8f0" font-family="Menlo, Monaco, Consolas, monospace" font-size="28">s2 = "e\u0301"</text>
+  <text x="680" y="330" fill="#f87171" font-family="Menlo, Monaco, Consolas, monospace" font-size="40">s1 == s2  // false</text>
+  <rect x="680" y="370" width="280" height="12" rx="6" fill="#1e293b"/>
+  <rect x="680" y="370" width="210" height="12" rx="6" fill="url(#accent)"/>
+  <text x="680" y="420" fill="#94a3b8" font-family="-apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="26">Normalize first. Trust your logs more than your eyes.</text>
+  <text x="88" y="554" fill="#64748b" font-family="-apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="22">docsaid.org</text>
+</svg>