fix: preserve original .md URL from llms.txt for markdown-availability checks by dacharyc · Pull Request #78 · agent-ecosystem/afdocs

dacharyc · 2026-05-02T02:30:22Z

Summary

Fixes #77 (markdown-url-support regression: llms.txt URL normalization breaks detection for sites using the .html.md suffix). markdown-url-support regressed from 100% → 0% on sites whose llms.txt links use a .html.md suffix (Plaid, and any site emitting <page>.html.md siblings). normalizePageUrl rewrote the URL to its HTML form for sitemap dedup, then toMdUrls() regenerated candidates that missed the URL the site actually published.
Carries the original .md/.mdx URL alongside the normalized URL through discovery as a sidecar map (originalMdUrls). markdown-url-support tries it first, then toMdUrls() candidates (with the existing mdFormPreference heuristic), then a parent-clean fallback (/foo/index.html.md → /foo.md) gated to URLs whose llms.txt original matched /index.html?\.md$.
toMdUrls() itself is unchanged. The parent-clean candidate lives in markdown-url-support only, so other consumers (llms-txt-directive-md, llms-txt-links-markdown, tabbed-content-serialization, get-markdown-content) cannot regress to the prior false-positive class where unrelated sibling .md files passed validation.
The mdFormPreference heuristic skips wins served via originalMdUrl so a run of .html.md sites doesn't skew the heuristic for other pages.
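
The candidate ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: buildCandidates, parentCleanCandidate, and countsTowardPreference are hypothetical names, toMdUrls is passed in as a parameter rather than imported, and the regex is one reading of the /index.html?\.md$ gate.

```typescript
// Hypothetical sketch of the lookup order in markdown-url-support.
// Only originalMdUrls and toMdUrls are names taken from the PR description.

// Assumed form of the gate: llms.txt originals ending in /index.htm.md or /index.html.md.
const INDEX_MD_SUFFIX = /\/index\.html?\.md$/;

// Parent-clean fallback: /foo/index.html.md -> /foo.md.
// Returns null when the original does not match the gate or has no parent path.
function parentCleanCandidate(originalMdUrl: string): string | null {
  const url = new URL(originalMdUrl);
  if (!INDEX_MD_SUFFIX.test(url.pathname)) return null;
  const parent = url.pathname.replace(INDEX_MD_SUFFIX, "");
  if (parent === "") return null; // a root-level index has no parent page
  url.pathname = `${parent}.md`;
  return url.toString();
}

// Order: published original first, then generated candidates, then the gated fallback.
function buildCandidates(
  pageUrl: string,
  originalMdUrls: Map<string, string>,
  toMdUrls: (url: string) => string[],
): string[] {
  const original = originalMdUrls.get(pageUrl);
  const ordered: string[] = [];
  if (original) ordered.push(original);
  ordered.push(...toMdUrls(pageUrl));
  if (original) {
    const fallback = parentCleanCandidate(original);
    if (fallback) ordered.push(fallback);
  }
  // A Set keeps insertion order, so duplicates collapse to their first position.
  return [...new Set(ordered)];
}

// Skew guard: a hit on the llms.txt original says nothing about which generated
// form the site prefers, so it is not counted toward the preference heuristic.
function countsTowardPreference(servedUrl: string, original?: string): boolean {
  return servedUrl !== original;
}
```

Because the fallback is built only inside this check-local helper, a consumer that calls toMdUrls() directly never sees the parent-clean URL, which is the isolation property the summary describes.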

Test plan

npm test — 1267/1267 passing
npm run lint — clean
New unit tests for originalMdUrls plumbing (.html.md, .mdx, plain .md, sitemap-only no-map, .md-form dedup, sampled-subset filtering, aggregate .txt walker propagation)
New behavior tests in markdown-url-support covering issue #77, plus the sibling-page trap, no-leak guards (sitemap-only / plain .md / .mdx / /page.html.md no-index), parent-clean fallback when /index.html.md 404s, bounded request count, the mdFormPreference skew guard, and a check that pageResults.mdUrl reflects the served URL
Fix-B isolation guardrail tests in llms-txt-directive-md and llms-txt-links-markdown confirming the parent-clean candidate cannot leak via toMdUrls()
Smoke test against https://plaid.com/docs: markdown-url-support 0/155 → 10/10 (100%); every sample served via the original index.html.md URL

…y checks

When llms.txt linked to a .md/.mdx URL (notably Plaid's /index.html.md form), normalizePageUrl rewrote it to its HTML equivalent for sitemap dedup, then toMdUrls regenerated candidates from the HTML form that missed the URL the site actually published. markdown-url-support scored 0% on otherwise-compliant sites.

Carry the original .md URL alongside the normalized URL through discovery as originalMdUrls. markdown-url-support tries it first, then falls through to toMdUrls() candidates, then a parent-clean fallback (gated to /index.html.md sources). toMdUrls itself is unchanged so other checks (llms-txt-directive-md, llms-txt-links-markdown) cannot regress to the prior false-positive class.

Closes #77
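
The discovery-side plumbing in the commit message could look something like the sketch below. It is an assumption-heavy illustration: collectPages is a hypothetical name, and normalizeForDedup is a simplified stand-in that only strips a trailing .md/.mdx, not the real normalizePageUrl.

```typescript
// Hypothetical stand-in for normalizePageUrl: strips a trailing .md or .mdx
// so that /docs/index.html.md and /docs/index.html dedupe to one page.
function normalizeForDedup(url: string): string {
  return url.replace(/\.mdx?$/, "");
}

interface DiscoveryResult {
  pages: string[]; // normalized, deduplicated page URLs
  originalMdUrls: Map<string, string>; // normalized URL -> URL as published in llms.txt
}

// Walk llms.txt links, dedupe on the normalized form, and remember the
// original URL whenever normalization changed it.
function collectPages(llmsTxtLinks: string[]): DiscoveryResult {
  const pages: string[] = [];
  const seen = new Set<string>();
  const originalMdUrls = new Map<string, string>();

  for (const link of llmsTxtLinks) {
    const normalized = normalizeForDedup(link);
    // Only .md/.mdx links get a sidecar entry; the first one seen wins.
    if (normalized !== link && !originalMdUrls.has(normalized)) {
      originalMdUrls.set(normalized, link);
    }
    if (!seen.has(normalized)) {
      seen.add(normalized);
      pages.push(normalized);
    }
  }
  return { pages, originalMdUrls };
}
```

Keeping the originals in a separate map leaves the normalized page list untouched for sitemap dedup, and pages that never had a .md link simply have no entry.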

dacharyc merged commit b3f0d59 into main May 2, 2026
2 checks passed

dacharyc deleted the fix/markdown-url-support-regression branch May 2, 2026 02:31

dacharyc merged 1 commit into main from fix/markdown-url-support-regression
