feat(extract): recognize PARA-numbered Obsidian dirs in link extraction by rwbaker · Pull Request #1186 · garrytan/gbrain

rwbaker · 2026-05-19T00:48:57Z

Problem

DIR_PATTERN in src/core/link-extraction.ts matches a fixed semantic-dir whitelist (people|companies|meetings|concepts|deal|...). Any Obsidian vault that follows Tiago Forte's PARA convention, which prefixes top-level dirs with a numeric sort key to control sidebar order (10_Projects/, 20_Meetings/, 30_Resources/, 40_Areas/, 50_Pulse/, 80_Archived/), gets 0 extracted links even when the source markdown contains hundreds of valid wikilinks.

Concretely, on a 1008-page TimelyCare vault with 947 wikilinks in the source markdown, gbrain extract links --source db reported Links: created 0 and every gbrain graph-query returned No edges found.

Fix

Two small additive changes to src/core/link-extraction.ts (each marked with a [PARA-PATCH] comment for easy review):

1. Extend DIR_PATTERN with a \d+_[A-Za-z][A-Za-z0-9_-]* alternative so PARA-numbered dirs match. The existing canonical / domain dirs are left untouched, so behaviour for non-PARA vaults is unchanged.
```
- const DIR_PATTERN = '(?:people|companies|meetings|concepts|deal|civic|project|projects|source|media|yc|tech|finance|personal|openclaw|entities)';
+ const DIR_PATTERN = '(?:\\d+_[A-Za-z][A-Za-z0-9_-]*|people|companies|meetings|concepts|deal|civic|project|projects|source|media|yc|tech|finance|personal|openclaw|entities)';
```
I deliberately used an explicit [A-Za-z] rather than adding an i flag to the regex: QUALIFIED_WIKILINK_RE's source-id sub-expression is intentionally kebab-only (an existing test pins that), and a global case-insensitive flag would relax it.

2. Normalize extracted slugs through slugifyPath in extractEntityRefs (the wikilink + markdown-link paths). PARA wikilinks in real vaults use PascalCase and spaces ([[10_Projects/Meeting Transcripts/Foo|Foo]]); the DB stores the lowercased, hyphen-segmented slug (10_projects/meeting-transcripts/foo). Without normalization, even with the regex extended, allSlugs.has(slug) misses every match. slugifyPath is a no-op for already-canonical refs like people/alice-chen.
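
For reviewers who want to poke at the first change in isolation, here is a minimal sketch of how the new alternative behaves. It is not the code in link-extraction.ts: DIR_PATTERN_SKETCH is a trimmed copy of the whitelist, and WIKILINK_SOURCE, extractRawRefs and KEBAB_ONLY_SKETCH are hypothetical stand-ins written for this illustration. The last two lines show why a global i flag would relax a kebab-only sub-expression.

```typescript
// Trimmed, hypothetical copy of DIR_PATTERN: the new PARA alternative plus
// two of the existing whitelist entries.
const DIR_PATTERN_SKETCH = '(?:\\d+_[A-Za-z][A-Za-z0-9_-]*|people|companies)';

// Hypothetical wikilink matcher (assumption: the real regexes differ).
// Captures "dir/rest" and ignores an optional "|alias" part.
const WIKILINK_SOURCE =
  `\\[\\[(${DIR_PATTERN_SKETCH}/[^\\]|#]+)(?:\\|[^\\]]*)?\\]\\]`;

function extractRawRefs(markdown: string): string[] {
  // Build a fresh global regex per call so lastIndex state is never shared.
  const re = new RegExp(WIKILINK_SOURCE, 'g');
  const refs: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(markdown)) !== null) {
    refs.push(m[1]);
  }
  return refs;
}

const sample =
  'See [[10_Projects/Meeting Transcripts/Foo|Foo]], [[people/alice-chen]] ' +
  'and [[Daily Notes/2026-05-18]].';

// The PARA dir and the whitelisted dir match; "Daily Notes" does not.
const rawRefs = extractRawRefs(sample);
console.log(rawRefs);

// Hypothetical stand-in for a kebab-only sub-expression.
const KEBAB_ONLY_SKETCH = /^[a-z0-9-]+$/;
const strictAccepts = KEBAB_ONLY_SKETCH.test('MySource');
const relaxedAccepts = new RegExp(KEBAB_ONLY_SKETCH.source, 'i').test('MySource');
console.log(strictAccepts, relaxedAccepts);
```

The captured refs are still in their raw vault form here (PascalCase, spaces), which is why the second change is needed before the slug lookup.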

False positives are bounded the same way they were before: every extracted ref is filtered through allSlugs.has(slug) later in the pipeline, so dirs that look like \d+_word but aren't actually pages in the brain are dropped at that boundary.
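
The normalize-then-filter boundary can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: slugifyPathSketch is a hypothetical simplification of slugifyPath (it only lowercases each segment and turns whitespace runs into hyphens), and resolveRefs and knownSlugs are names invented for the example.

```typescript
// Hypothetical simplification of slugifyPath: lowercase each path segment and
// collapse whitespace runs into single hyphens. The real helper may do more.
function slugifyPathSketch(path: string): string {
  return path
    .split('/')
    .map((segment) => segment.trim().toLowerCase().replace(/\s+/g, '-'))
    .join('/');
}

// Normalize every raw ref, then keep only slugs that exist as pages.
function resolveRefs(refs: string[], allSlugs: Set<string>): string[] {
  return refs.map(slugifyPathSketch).filter((slug) => allSlugs.has(slug));
}

// Stand-in for the set of page slugs stored in the DB.
const knownSlugs = new Set<string>([
  '10_projects/meeting-transcripts/foo',
  'people/alice-chen',
]);

const resolved = resolveRefs(
  [
    '10_Projects/Meeting Transcripts/Foo', // PARA ref, needs normalization
    'people/alice-chen', // already canonical, passes through unchanged
    '99_Scratch/Not Ingested', // matches \d+_word but is not a page
  ],
  knownSlugs,
);
console.log(resolved);
```

In this sketch the third ref matches the new pattern but is dropped by the set lookup, which is the same bound on false positives described above.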

Tests

6 new cases in test/link-extraction.test.ts:

Canonical PARA tops (10_projects, 20_meetings, 30_resources, 40_areas, 50_pulse, 80_archived)
PascalCase normalization (10_Projects/Foo → 10_projects/foo)
Spaced-segment normalization (20_Meetings/1-1s/Alice Chen → 20_meetings/1-1s/alice-chen)
Markdown-link variant ([Foo](10_projects/foo))

Full suite: 104 pass / 0 fail (was 98). The previously-passing kebab-only QUALIFIED_WIKILINK source-id test still passes — that was the reason I avoided the i flag.

Verified against a real PARA vault

After the patch, on the same 1008-page PARA vault that previously extracted 0:

$ gbrain extract links --source db
Links: created 45 from 1008 pages

$ gbrain graph-query 20_meetings/10_people/design/julia-campbell-1-1 --depth 2
[depth 0] 20_meetings/10_people/design/julia-campbell-1-1
  --mentions-> 20_meetings/30_meeting-transcripts/1-1s/julia-_-richard-weekly-1_1-2026-1-15-thu (depth 1)
  --mentions-> 20_meetings/30_meeting-transcripts/1-1s/julia-_-richard-weekly-1_1-2026-1-8-thu (depth 1)
  --mentions-> 20_meetings/30_meeting-transcripts/2026/05/2026-05-08_julia-_-richard-weekly-1_1 (depth 1)

(45 is the count of wikilink targets that actually exist as DB pages. The remaining 902 wikilinks point at pages the vault references but hasn't ingested — a vault-content gap, not a regex gap.)

Why this is worth landing

The PARA convention is widespread in the Obsidian/Tiago-Forte community and is incompatible with GBrain out of the box today. The patch is additive (no behaviour change for non-PARA vaults), gated on the existing allSlugs.has() boundary (so no false-positive bloat), and fully covered by tests.

Happy to break this into smaller commits or take any reshaping you'd prefer.

DIR_PATTERN's fixed semantic-dir whitelist (people|companies|meetings|...) matched zero of the 947 wikilinks in a 1008-page Obsidian + PARA vault. PARA layouts use numeric-prefixed dirs to force sidebar order (10_Projects/, 20_Meetings/, 30_Resources/, 40_Areas/, 50_Pulse/, 80_Archived/), and the resulting graph had no edges.

Add a `\d+_word` alternative to DIR_PATTERN (using [A-Za-z] so PascalCase matches without forcing an `i` flag on QUALIFIED_WIKILINK_RE, whose source-id sub-expression is intentionally kebab-only). Extracted slugs are then run through slugifyPath so wikilink paths like `[[10_Projects/Meeting Transcripts/Foo|Foo]]` reduce to the lowercased, hyphen-segmented DB slug `10_projects/meeting-transcripts/foo` that `allSlugs.has()` expects.

After this patch, `gbrain extract links --source db` reports 45 links created against the TimelyCare vault (previously 0), and `gbrain graph-query` returns typed edges for PARA-style pages.

Tests: 6 new cases for canonical PARA dirs (10_projects, 20_meetings, 30_resources, 40_areas, 50_pulse, 80_archived), PascalCase normalization, spaced-segment normalization, and the markdown-link variant.

Tracked locally as TIM-27 in the Paperclip TimelyCare project.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

rwbaker wants to merge 1 commit into garrytan:master from rwbaker:fix/para-numbered-dirs
