fix: preserve original .md URL from llms.txt for markdown-availability checks #78

Merged: dacharyc merged 1 commit into main from fix/markdown-url-support-regression on May 2, 2026

dacharyc commented May 2, 2026

Summary

  • Fixes #77: markdown-url-support regressed from 100% → 0% on sites whose llms.txt links use a .html.md suffix (Plaid, and any site emitting <page>.html.md siblings). normalizePageUrl rewrote the URL to its HTML form for sitemap dedup, then toMdUrls() regenerated candidates that missed the URL the site actually published.
  • Carries the original .md/.mdx URL alongside the normalized URL through discovery as a sidecar map (originalMdUrls). markdown-url-support tries the original URL first, then the toMdUrls() candidates (with the existing mdFormPreference heuristic), then a parent-clean fallback (/foo/index.html.md → /foo.md) gated to URLs whose llms.txt original matched /index.html?\.md$.
  • toMdUrls() itself is unchanged. The parent-clean candidate lives in markdown-url-support only, so other consumers (llms-txt-directive-md, llms-txt-links-markdown, tabbed-content-serialization, get-markdown-content) cannot regress to the prior false-positive class where unrelated sibling .md files passed validation.
  • The mdFormPreference heuristic skips wins served via originalMdUrl so a run of .html.md sites doesn't skew the heuristic for other pages.
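
The ordering in the bullets above can be sketched roughly as follows. This is an illustration, not the code in this PR: buildCandidates and toMdUrlsSketch are hypothetical names, the stand-in for toMdUrls() assumes a much simpler rule than the real helper, and the gate regex escapes the dot in "index.html", which is slightly stricter than the pattern quoted above.

```typescript
// Hypothetical sketch; names and signatures are assumptions, not the project's code.

// Gate for the parent-clean fallback: originals ending in
// /index.htm.md or /index.html.md.
const INDEX_MD = /\/index\.html?\.md$/;

// Stand-in for toMdUrls(): swap a trailing .htm/.html for .md, else append .md.
function toMdUrlsSketch(pageUrl: string): string[] {
  const hasHtmlSuffix = /\.html?$/.test(pageUrl);
  return [hasHtmlSuffix ? pageUrl.replace(/\.html?$/, ".md") : `${pageUrl}.md`];
}

// Build the ordered list of markdown URLs to try for one page.
function buildCandidates(pageUrl: string, originalMdUrl?: string): string[] {
  const candidates: string[] = [];
  // 1. The URL the site actually published in llms.txt, if there is one.
  if (originalMdUrl) candidates.push(originalMdUrl);
  // 2. Candidates regenerated from the normalized page URL.
  candidates.push(...toMdUrlsSketch(pageUrl));
  // 3. Parent-clean fallback, only for /index.html.md originals:
  //    /foo/index.html.md -> /foo.md
  if (originalMdUrl && INDEX_MD.test(originalMdUrl)) {
    candidates.push(originalMdUrl.replace(INDEX_MD, ".md"));
  }
  // Drop duplicates while keeping first-seen order.
  return [...new Set(candidates)];
}
```

Because the fallback is built here rather than inside the toMdUrls() stand-in, a caller that only uses the stand-in never sees the parent-clean URL, which mirrors the isolation described in the third bullet.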

Test plan

  • npm test — 1267/1267 passing
  • npm run lint — clean
  • New unit tests for originalMdUrls plumbing (.html.md, .mdx, plain .md, sitemap-only no-map, .md-form dedup, sampled-subset filtering, aggregate .txt walker propagation)
  • New behavior tests in markdown-url-support covering issue #77 plus a sibling-page trap, no-leak guards (sitemap-only / plain .md / .mdx / /page.html.md no-index), parent-clean fallback when /index.html.md 404s, bounded request count, mdFormPreference skew guard, and pageResults.mdUrl reflecting the served URL
  • Fix-B isolation guardrail tests in llms-txt-directive-md and llms-txt-links-markdown confirming the parent-clean candidate cannot leak via toMdUrls()
  • Smoke test against https://plaid.com/docs: markdown-url-support 0/155 → 10/10 (100%); every sample served via the original index.html.md URL

…y checks

When llms.txt linked to a .md/.mdx URL (notably Plaid's /index.html.md
form), normalizePageUrl rewrote it to its HTML equivalent for sitemap
dedup, then toMdUrls regenerated candidates from the HTML form that
missed the URL the site actually published. markdown-url-support
scored 0% on otherwise-compliant sites.
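
A small sketch of why the round trip lost the published URL. The two helpers are hypothetical stand-ins with assumed behavior; the real normalizePageUrl and toMdUrls may apply different rules.

```typescript
// Hypothetical stand-ins; assumed behavior only, not the project's helpers.

// Assumed: strip a trailing .md/.mdx so the URL matches its HTML page.
function normalizeSketch(url: string): string {
  return url.replace(/\.mdx?$/, "");
}

// Assumed: swap a trailing .htm/.html for .md, else append .md.
function regenerateSketch(pageUrl: string): string[] {
  const hasHtmlSuffix = /\.html?$/.test(pageUrl);
  return [hasHtmlSuffix ? pageUrl.replace(/\.html?$/, ".md") : `${pageUrl}.md`];
}

// example.com is a placeholder host.
const published = "https://example.com/docs/index.html.md";
const normalized = normalizeSketch(published);
const regenerated = regenerateSketch(normalized);

// Under these assumptions the regenerated list contains only
// .../docs/index.md, so the published .html.md URL is never requested.
const roundTrips = regenerated.includes(published);
```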

Carry the original .md URL alongside the normalized URL through
discovery as originalMdUrls. markdown-url-support tries it first,
then falls through to toMdUrls() candidates, then a parent-clean
fallback (gated to /index.html.md sources). toMdUrls itself is
unchanged so other checks (llms-txt-directive-md,
llms-txt-links-markdown) cannot regress to the prior false-positive
class.
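
One way to picture the sidecar map is the sketch below. It is a minimal illustration under stated assumptions: the Discovery shape, the discover function, and the suffix-stripping normalization are all hypothetical, and only the originalMdUrls name comes from this PR.

```typescript
// Hypothetical sketch of the sidecar idea; not the project's discovery code.

interface Discovery {
  pageUrls: string[]; // normalized, deduplicated page URLs
  originalMdUrls: Map<string, string>; // normalized URL -> .md/.mdx URL from llms.txt
}

function discover(llmsTxtLinks: string[], sitemapUrls: string[]): Discovery {
  const seen = new Set<string>();
  const originalMdUrls = new Map<string, string>();

  for (const link of llmsTxtLinks) {
    const isMarkdown = /\.mdx?$/.test(link);
    // Assumed normalization: strip the trailing .md/.mdx suffix.
    const normalized = isMarkdown ? link.replace(/\.mdx?$/, "") : link;
    seen.add(normalized);
    // Remember the first markdown original seen for each normalized page.
    if (isMarkdown && !originalMdUrls.has(normalized)) {
      originalMdUrls.set(normalized, link);
    }
  }

  // Pages known only from the sitemap get no sidecar entry.
  for (const url of sitemapUrls) seen.add(url);

  return { pageUrls: [...seen], originalMdUrls };
}
```

Keeping the original URL in a separate map leaves the normalized URL free to serve as the dedup key, so sitemap and llms.txt entries for the same page still collapse into one.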

Closes #77
dacharyc merged commit b3f0d59 into main on May 2, 2026
2 checks passed
dacharyc deleted the fix/markdown-url-support-regression branch on May 2, 2026, 02:31

Development

Successfully merging this pull request may close these issues.

markdown-url-support: regression — llms.txt URL normalization breaks detection for sites using .html.md suffix