feat(extract): recognize PARA-numbered Obsidian dirs in link extraction #1186
Open
rwbaker wants to merge 1 commit into
DIR_PATTERN's fixed semantic-dir whitelist (people|companies|meetings|...) matched zero of the 947 wikilinks in a 1008-page Obsidian + PARA vault. PARA layouts use numeric-prefixed dirs to force sidebar order (10_Projects/, 20_Meetings/, 30_Resources/, 40_Areas/, 50_Pulse/, 80_Archived/), and the resulting graph had no edges.

Add a `\d+_word` alternative to DIR_PATTERN (using [A-Za-z] so PascalCase matches without forcing an `i` flag on QUALIFIED_WIKILINK_RE, whose source-id sub-expression is intentionally kebab-only). Extracted slugs are then run through slugifyPath so wikilink paths like `[[10_Projects/Meeting Transcripts/Foo|Foo]]` reduce to the lowercased, hyphen-segmented DB slug `10_projects/meeting-transcripts/foo` that `allSlugs.has()` expects.

After this patch, `gbrain extract links --source db` reports 45 links created against the timelycare vault (previously 0), and `gbrain graph-query` returns typed edges for PARA-style pages.

Tests: 6 new cases for canonical PARA dirs (10_projects, 20_meetings, 30_resources, 40_areas, 50_pulse, 80_archived), PascalCase normalization, spaced-segment normalization, and the markdown-link variant.

Tracked locally as TIM-27 in the Paperclip TimelyCare project.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Problem
`DIR_PATTERN` in `src/core/link-extraction.ts` matches a fixed semantic-dir whitelist (`people|companies|meetings|concepts|deal|...`). Any Obsidian vault that follows the Tiago-Forte PARA convention, which prefixes top-level dirs with a numeric sort key to control sidebar order (`10_Projects/`, `20_Meetings/`, `30_Resources/`, `40_Areas/`, `50_Pulse/`, `80_Archived/`), gets 0 extracted links even when the source markdown contains hundreds of valid wikilinks.

Concretely, on a 1008-page TimelyCare vault with 947 wikilinks in the source markdown, `gbrain extract links --source db` reported `Links: created 0` and every `gbrain graph-query` returned `No edges found`.

Fix
Two small additive changes to `src/core/link-extraction.ts` (each marked with a `[PARA-PATCH]` comment for easy review):

1. Extend `DIR_PATTERN` with a `\d+_[A-Za-z][A-Za-z0-9_-]*` alternative so PARA-numbered dirs match. The existing canonical / domain dirs are left untouched, so behaviour for non-PARA vaults is unchanged.

   I used explicit `[A-Za-z]` rather than adding an `i` flag to the regex on purpose: `QUALIFIED_WIKILINK_RE`'s source-id sub-expression is intentionally kebab-only (there's an existing test that pins that), and a global case-insensitive flag would relax it.

2. Normalize extracted slugs through `slugifyPath` in `extractEntityRefs` (the wikilink + markdown-link paths). PARA wikilinks in real vaults use PascalCase and spaces (`[[10_Projects/Meeting Transcripts/Foo|Foo]]`); the DB stores the lowercased, hyphen-segmented slug (`10_projects/meeting-transcripts/foo`). Without normalization, even with the regex extended, `allSlugs.has(slug)` misses every match. `slugifyPath` is a no-op for already-canonical refs like `people/alice-chen`.

False positives are bounded the same way they were before: every extracted ref is filtered through `allSlugs.has(slug)` later in the pipeline, so dirs that look like `\d+_word` but aren't actually pages in the brain are dropped at that boundary.

Tests
6 new cases in `test/link-extraction.test.ts`:

- Canonical PARA dirs (`10_projects`, `20_meetings`, `30_resources`, `40_areas`, `50_pulse`, `80_archived`)
- PascalCase normalization (`10_Projects/Foo` → `10_projects/foo`)
- Spaced-segment normalization (`20_Meetings/1-1s/Alice Chen` → `20_meetings/1-1s/alice-chen`)
- Markdown-link variant (`[Foo](10_projects/foo)`)

Full suite: 104 pass / 0 fail (was 98). The previously-passing kebab-only `QUALIFIED_WIKILINK` source-id test still passes; that was the reason I avoided the `i` flag.

Verified against a real PARA vault
After the patch, on the same 1008-page PARA vault that previously extracted 0:
(45 is the count of wikilink targets that actually exist as DB pages. The remaining 902 wikilinks point at pages the vault references but hasn't ingested — a vault-content gap, not a regex gap.)
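The split between created links and unresolved wikilinks comes from the existence filter described under Fix. The following is a minimal sketch of that boundary, not the code in `src/core/link-extraction.ts`: the names `allSlugsSketch` and `partitionByExistence` and the sample slugs are hypothetical and chosen only for illustration.

```typescript
// Hypothetical data: slugs of pages that were actually ingested into the DB.
const allSlugsSketch: ReadonlySet<string> = new Set<string>([
  "10_projects/foo",
  "people/alice-chen",
]);

// Split already-normalized refs into those that resolve to a known page
// (a link would be created) and those that do not (dropped at the boundary).
function partitionByExistence(
  refs: string[],
  known: ReadonlySet<string>,
): { created: string[]; dropped: string[] } {
  const created: string[] = [];
  const dropped: string[] = [];
  for (const ref of refs) {
    if (known.has(ref)) {
      created.push(ref);
    } else {
      dropped.push(ref);
    }
  }
  return { created, dropped };
}

const partitioned = partitionByExistence(
  ["10_projects/foo", "10_projects/not-ingested", "99_misc/ghost"],
  allSlugsSketch,
);
console.log(partitioned.created); // [ '10_projects/foo' ]
console.log(partitioned.dropped); // [ '10_projects/not-ingested', '99_misc/ghost' ]
```

Under this model, a ref that matches the `\d+_word` shape but has no corresponding page never becomes an edge, which is why widening the regex does not by itself inflate the link count.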
Why this is worth landing
The PARA convention is widespread in the Obsidian/Tiago-Forte community and is incompatible with GBrain out of the box today. The patch is additive (no behaviour change for non-PARA vaults), gated on the existing `allSlugs.has()` boundary (so no false-positive bloat), and fully covered by tests.

Happy to break this into smaller commits or take any reshaping you'd prefer.
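For reviewers who want to see the additive shape of the change in isolation, here is a rough, self-contained sketch. It is an assumption-laden stand-in, not the patched source: `SEMANTIC_DIRS_SKETCH` is an abbreviated whitelist, `WIKILINK_SOURCE_SKETCH` is a simplified wikilink pattern, and `slugifyPathSketch` only lowercases and hyphenates spaces, whereas the real `slugifyPath` may do more.

```typescript
// Abbreviated, hypothetical version of the existing whitelist.
const SEMANTIC_DIRS_SKETCH = "people|companies|meetings|concepts";
// The added alternative from this PR: digits, underscore, then a word.
// Explicit [A-Za-z] is used instead of an `i` flag.
const PARA_DIR_SKETCH = "\\d+_[A-Za-z][A-Za-z0-9_-]*";

// Simplified wikilink pattern: [[dir/rest]] or [[dir/rest|alias]].
const WIKILINK_SOURCE_SKETCH =
  `\\[\\[((?:${SEMANTIC_DIRS_SKETCH}|${PARA_DIR_SKETCH})/[^\\]|#]+)`;

// Assumed normalization: lowercase each segment, turn whitespace runs into hyphens.
function slugifyPathSketch(path: string): string {
  return path
    .split("/")
    .map((seg) => seg.trim().toLowerCase().replace(/\s+/g, "-"))
    .join("/");
}

function extractRefsSketch(markdown: string): string[] {
  // A fresh regex per call avoids sharing `lastIndex` state between calls.
  const re = new RegExp(WIKILINK_SOURCE_SKETCH, "g");
  const refs: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(markdown)) !== null) {
    refs.push(slugifyPathSketch(m[1] ?? ""));
  }
  return refs;
}

const sampleMarkdown =
  "See [[10_Projects/Meeting Transcripts/Foo|Foo]] and [[people/alice-chen]], not [[Random/Page]].";
console.log(extractRefsSketch(sampleMarkdown));
// [ '10_projects/meeting-transcripts/foo', 'people/alice-chen' ]
```

In this sketch the already-canonical ref `people/alice-chen` passes through unchanged, and `[[Random/Page]]` is ignored because its directory matches neither the whitelist nor the numeric-prefix alternative.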