fix: preserve original .md URL from llms.txt for markdown-availability checks #78
Merged
…y checks

When llms.txt linked to a .md/.mdx URL (notably Plaid's /index.html.md form), normalizePageUrl rewrote it to its HTML equivalent for sitemap dedup, then toMdUrls regenerated candidates from the HTML form that missed the URL the site actually published. markdown-url-support scored 0% on otherwise-compliant sites.

Carry the original .md URL alongside the normalized URL through discovery as originalMdUrls. markdown-url-support tries it first, then falls through to toMdUrls() candidates, then a parent-clean fallback (gated to /index.html.md sources).

toMdUrls itself is unchanged so other checks (llms-txt-directive-md, llms-txt-links-markdown) cannot regress to the prior false-positive class.

Closes #77
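The candidate ordering described in this commit message can be illustrated with a minimal sketch. It is not the project's code: `toMdUrlsSketch`, `parentCleanCandidate`, and `markdownCandidates` are hypothetical names, the real `toMdUrls()` output is not shown in this PR, and the sketch assumes the normalized page URL is the `.html` form of the llms.txt link.

```typescript
// Hypothetical stand-in for toMdUrls(): derives .md candidates from the
// normalized (HTML-form) page URL. The real helper may generate other forms.
function toMdUrlsSketch(pageUrl: string): string[] {
  const base = pageUrl.replace(/\/$/, "").replace(/\.html?$/, "");
  return [base + ".md", base + "/index.md"];
}

// Parent-clean fallback: /foo/index.html.md -> /foo.md.
// Returns null unless the llms.txt original ends in /index.htm(l).md
// and has a parent path segment to map to.
function parentCleanCandidate(originalMdUrl: string): string | null {
  const m = /^(https?:\/\/[^/]+\/.+)\/index\.html?\.md$/.exec(originalMdUrl);
  return m ? m[1] + ".md" : null;
}

// Order: original llms.txt URL first, then the regenerated candidates,
// then the gated parent-clean fallback. Duplicates are dropped.
function markdownCandidates(
  pageUrl: string,
  originalMdUrls: Map<string, string>
): string[] {
  const candidates: string[] = [];
  const original = originalMdUrls.get(pageUrl);
  if (original !== undefined) candidates.push(original);
  candidates.push(...toMdUrlsSketch(pageUrl));
  if (original !== undefined) {
    const parentClean = parentCleanCandidate(original);
    if (parentClean !== null) candidates.push(parentClean);
  }
  return Array.from(new Set(candidates));
}
```

Because the fallback only runs when a page has an `/index.html.md` original in the map, sitemap-only pages and plain `.md` links get exactly the candidates they got before.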
Summary
- `markdown-url-support` regressed from 100% → 0% on sites whose llms.txt links use a `.html.md` suffix (Plaid, and any site emitting `<page>.html.md` siblings). `normalizePageUrl` rewrote the URL to its HTML form for sitemap dedup, then `toMdUrls()` regenerated candidates that missed the URL the site actually published.
- Carry the original `.md`/`.mdx` URL alongside the normalized URL through discovery as a sidecar map (`originalMdUrls`). `markdown-url-support` tries it first, then `toMdUrls()` candidates (with the existing `mdFormPreference` heuristic), then a parent-clean fallback (`/foo/index.html.md` → `/foo.md`) gated to URLs whose llms.txt original matched `/index.html?\.md$`.
- `toMdUrls()` itself is unchanged. The parent-clean candidate lives in `markdown-url-support` only, so other consumers (`llms-txt-directive-md`, `llms-txt-links-markdown`, `tabbed-content-serialization`, `get-markdown-content`) cannot regress to the prior false-positive class where unrelated sibling `.md` files passed validation.
- The `mdFormPreference` heuristic skips wins served via `originalMdUrl` so a run of `.html.md` sites doesn't skew the heuristic for other pages.

Test plan
- `npm test`: 1267/1267 passing
- `npm run lint`: clean
- `originalMdUrls` plumbing (`.html.md`, `.mdx`, plain `.md`, sitemap-only no-map, `.md`-form dedup, sampled-subset filtering, aggregate `.txt` walker propagation)
- `markdown-url-support` covering issue #77 ("markdown-url-support: regression - llms.txt URL normalization breaks detection for sites using .html.md suffix") plus sibling-page trap, no-leak guards (sitemap-only / plain `.md` / `.mdx` / `/page.html.md` no-index), parent-clean fallback when `/index.html.md` 404s, bounded request count, `mdFormPreference` skew guard, and `pageResults.mdUrl` reflects the served URL
- `llms-txt-directive-md` and `llms-txt-links-markdown` confirming the parent-clean candidate cannot leak via `toMdUrls()`
- `markdown-url-support` 0/155 → 10/10 (100%); every sample served via the original `index.html.md` URL
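
The `originalMdUrls` plumbing exercised by the test plan could look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: `normalizePageUrlSketch`, `collectFromLlmsTxt`, and `restrictToSample` are hypothetical names, and the normalizer here only strips a trailing `.md`/`.mdx`, which may differ from what the real `normalizePageUrl` does.

```typescript
interface DiscoveredPages {
  pageUrls: string[];                   // normalized URLs, deduped
  originalMdUrls: Map<string, string>;  // normalized URL -> llms.txt original
}

// Hypothetical normalizer: drop a trailing .md/.mdx so llms.txt links
// dedupe against sitemap URLs (assumed behavior, for illustration only).
function normalizePageUrlSketch(url: string): string {
  return url.replace(/\.mdx?$/, "");
}

// Walk llms.txt links, keeping the published markdown URL in a sidecar map.
// Links that were never markdown (for example sitemap-style URLs) get no entry.
function collectFromLlmsTxt(links: string[]): DiscoveredPages {
  const pageUrls: string[] = [];
  const originalMdUrls = new Map<string, string>();
  for (const link of links) {
    const normalized = normalizePageUrlSketch(link);
    if (pageUrls.indexOf(normalized) === -1) pageUrls.push(normalized);
    // First markdown form seen for a page wins when two links dedupe together.
    if (/\.mdx?$/.test(link) && !originalMdUrls.has(normalized)) {
      originalMdUrls.set(normalized, link);
    }
  }
  return { pageUrls, originalMdUrls };
}

// Keep only the entries for pages that survived sampling.
function restrictToSample(
  all: Map<string, string>,
  sampled: string[]
): Map<string, string> {
  const out = new Map<string, string>();
  for (const url of sampled) {
    const original = all.get(url);
    if (original !== undefined) out.set(url, original);
  }
  return out;
}
```

Keeping the originals in a separate map, rather than changing the page URL itself, leaves sitemap dedup and every other consumer of the normalized URL untouched.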