feat(extract): recognize PARA-numbered Obsidian dirs in link extraction #1186

Open
rwbaker wants to merge 1 commit into garrytan:master from rwbaker:fix/para-numbered-dirs

@rwbaker rwbaker commented May 19, 2026

Problem

DIR_PATTERN in src/core/link-extraction.ts matches a fixed semantic-dir whitelist (people|companies|meetings|concepts|deal|...). Any Obsidian vault that follows the Tiago-Forte PARA convention — which prefixes top-level dirs with a numeric sort key to control sidebar order (10_Projects/, 20_Meetings/, 30_Resources/, 40_Areas/, 50_Pulse/, 80_Archived/) — gets 0 extracted links even when the source markdown contains hundreds of valid wikilinks.

Concretely, on a 1008-page TimelyCare vault with 947 wikilinks in the source markdown, gbrain extract links --source db reported Links: created 0 and every gbrain graph-query returned No edges found.

Fix

Two small additive changes to src/core/link-extraction.ts (each marked with a [PARA-PATCH] comment for easy review):

  1. Extend DIR_PATTERN with a \d+_[A-Za-z][A-Za-z0-9_-]* alternative so PARA-numbered dirs match. The existing canonical / domain dirs are left untouched, so behaviour for non-PARA vaults is unchanged.

    - const DIR_PATTERN = '(?:people|companies|meetings|concepts|deal|civic|project|projects|source|media|yc|tech|finance|personal|openclaw|entities)';
    + const DIR_PATTERN = '(?:\\d+_[A-Za-z][A-Za-z0-9_-]*|people|companies|meetings|concepts|deal|civic|project|projects|source|media|yc|tech|finance|personal|openclaw|entities)';

    I used explicit [A-Za-z] rather than adding an i flag to the regex on purpose — QUALIFIED_WIKILINK_RE's source-id sub-expression is intentionally kebab-only (there's an existing test that pins that), and a global case-insensitive flag would relax it.

  2. Normalize extracted slugs through slugifyPath in extractEntityRefs (the wikilink + markdown-link paths). PARA wikilinks in real vaults use PascalCase and spaces ([[10_Projects/Meeting Transcripts/Foo|Foo]]); the DB stores the lowercased, hyphen-segmented slug (10_projects/meeting-transcripts/foo). Without normalization, even with the regex extended, allSlugs.has(slug) misses every match. slugifyPath is a no-op for already-canonical refs like people/alice-chen.
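
To illustrate change 1, here is a minimal sketch of how the new alternative behaves. It is not the project's actual regex: the whitelist is truncated, and `WIKILINK_SKETCH` and `extractRawRefs` are hypothetical names used only for this example. The PARA alternative is the one from the diff above.

```typescript
// PARA alternative taken from the diff; the rest of the whitelist is truncated here.
const PARA_DIR = '\\d+_[A-Za-z][A-Za-z0-9_-]*';
const DIR_PATTERN_SKETCH = `(?:${PARA_DIR}|people|companies|meetings)`;

// Hypothetical simplified matcher for [[dir/rest]] and [[dir/rest|alias]].
// No `i` flag: PascalCase is admitted only by the explicit [A-Za-z] classes.
const WIKILINK_SKETCH = `\\[\\[(${DIR_PATTERN_SKETCH}/[^\\]|#]+)(?:\\|[^\\]]*)?\\]\\]`;

function extractRawRefs(markdown: string): string[] {
  const re = new RegExp(WIKILINK_SKETCH, 'g'); // fresh regex, so lastIndex starts at 0
  const refs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = re.exec(markdown)) !== null) {
    refs.push(match[1]); // capture group 1 is the path without the alias
  }
  return refs;
}
```

In this sketch, `[[10_Projects/Meeting Transcripts/Foo|Foo]]` and `[[people/alice-chen]]` are captured, while `[[Random/Page]]` and `[[People/Alice]]` are not, because the whitelist entries stay case-sensitive.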

False positives are bounded the same way they were before: every extracted ref is filtered through allSlugs.has(slug) later in the pipeline, so dirs that look like \d+_word but aren't actually pages in the brain are dropped at that boundary.
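
The normalize-then-filter step can be sketched roughly as follows. `slugifyPathSketch` is a hypothetical stand-in for the real `slugifyPath` (which may handle more cases), and `resolveRefs` is an illustrative name, not a function in the codebase.

```typescript
// Hypothetical stand-in for slugifyPath: lowercases each path segment and
// collapses whitespace runs into hyphens. Underscores and existing hyphens are kept.
function slugifyPathSketch(path: string): string {
  return path
    .split('/')
    .map((segment) => segment.trim().toLowerCase().replace(/\s+/g, '-'))
    .join('/');
}

// Normalize every raw ref, then keep only slugs that exist as pages.
// This mirrors the allSlugs.has(slug) boundary described above.
function resolveRefs(rawRefs: string[], allSlugs: Set<string>): string[] {
  return rawRefs.map(slugifyPathSketch).filter((slug) => allSlugs.has(slug));
}
```

Under these assumptions, `10_Projects/Meeting Transcripts/Foo` becomes `10_projects/meeting-transcripts/foo`, `people/alice-chen` passes through unchanged, and a ref such as `99_Nope/Bar` is dropped when its slug is not in `allSlugs`.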

Tests

6 new cases in test/link-extraction.test.ts:

  • Canonical PARA tops (10_projects, 20_meetings, 30_resources, 40_areas, 50_pulse, 80_archived)
  • PascalCase normalization (10_Projects/Foo → 10_projects/foo)
  • Spaced-segment normalization (20_Meetings/1-1s/Alice Chen → 20_meetings/1-1s/alice-chen)
  • Markdown-link variant ([Foo](10_projects/foo))

Full suite: 104 pass / 0 fail (was 98). The previously-passing kebab-only QUALIFIED_WIKILINK source-id test still passes — that was the reason I avoided the i flag.

Verified against a real PARA vault

After the patch, on the same 1008-page PARA vault that previously extracted 0:

$ gbrain extract links --source db
Links: created 45 from 1008 pages

$ gbrain graph-query 20_meetings/10_people/design/julia-campbell-1-1 --depth 2
[depth 0] 20_meetings/10_people/design/julia-campbell-1-1
  --mentions-> 20_meetings/30_meeting-transcripts/1-1s/julia-_-richard-weekly-1_1-2026-1-15-thu (depth 1)
  --mentions-> 20_meetings/30_meeting-transcripts/1-1s/julia-_-richard-weekly-1_1-2026-1-8-thu (depth 1)
  --mentions-> 20_meetings/30_meeting-transcripts/2026/05/2026-05-08_julia-_-richard-weekly-1_1 (depth 1)

(45 is the count of wikilink targets that actually exist as DB pages. The remaining 902 wikilinks point at pages the vault references but hasn't ingested — a vault-content gap, not a regex gap.)

Why this is worth landing

The PARA convention is widespread in the Obsidian/Tiago-Forte community and is incompatible with GBrain out of the box today. The patch is additive (no behaviour change for non-PARA vaults), gated on the existing allSlugs.has() boundary (so no false-positive bloat), and fully covered by tests.

Happy to break this into smaller commits or take any reshaping you'd prefer.

DIR_PATTERN's fixed semantic-dir whitelist (people|companies|meetings|...)
matched zero of the 947 wikilinks in a 1008-page Obsidian + PARA vault.
PARA layouts use numeric-prefixed dirs to force sidebar order
(10_Projects/, 20_Meetings/, 30_Resources/, 40_Areas/, 50_Pulse/,
80_Archived/), and the resulting graph had no edges.

Add a `\d+_word` alternative to DIR_PATTERN (using [A-Za-z] so PascalCase
matches without forcing an `i` flag on QUALIFIED_WIKILINK_RE, whose
source-id sub-expression is intentionally kebab-only). Extracted slugs
are then run through slugifyPath so wikilink paths like
`[[10_Projects/Meeting Transcripts/Foo|Foo]]` reduce to the lowercased,
hyphen-segmented DB slug `10_projects/meeting-transcripts/foo` that
`allSlugs.has()` expects.

After this patch, `gbrain extract links --source db` reports 45 links
created against the TimelyCare vault (previously 0), and
`gbrain graph-query` returns typed edges for PARA-style pages.

Tests: 6 new cases for canonical PARA dirs (10_projects, 20_meetings,
30_resources, 40_areas, 50_pulse, 80_archived), PascalCase normalization,
spaced-segment normalization, and the markdown-link variant.

Tracked locally as TIM-27 in the Paperclip TimelyCare project.

Co-Authored-By: Paperclip <noreply@paperclip.ing>