Storage Tiering: db-tracked vs db-only directories

Overview

GBrain supports storage tiering to separate version-controlled content from bulk machine-generated data. This prevents git repositories from becoming bloated with large amounts of automatically generated content while still preserving it in the database.

Note on naming: prior to v0.22.11 the keys were git_tracked / supabase_only. The canonical names are now db_tracked / db_only, which are engine-agnostic (they work on both PGLite and Postgres). The deprecated keys still load, with a once-per-process warning. An automated rename through gbrain doctor --fix is planned; use it once that path lands.
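
The key rename amounts to a small mapping applied to the parsed storage section. The following is an illustrative Python sketch, not gbrain's implementation: `rename_deprecated_keys`, the warning text, and the choice to let the canonical key win when both spellings are present are all assumptions. Only the four key names come from this document.

```python
import sys

# Deprecated key -> canonical key, as listed in the naming note.
DEPRECATED_KEYS = {"git_tracked": "db_tracked", "supabase_only": "db_only"}

_warned = False  # module-level flag so the warning is printed once per process


def rename_deprecated_keys(storage: dict) -> dict:
    """Return a copy of `storage` with deprecated keys mapped to canonical names."""
    global _warned
    renamed = {}
    used_deprecated = []
    for key, value in storage.items():
        canonical = DEPRECATED_KEYS.get(key, key)
        if canonical != key:
            used_deprecated.append(key)
            # Assumption: if the canonical key is already set, keep it.
            if canonical in renamed:
                continue
        renamed[canonical] = value
    if used_deprecated and not _warned:
        print(f"warning: deprecated storage keys: {', '.join(used_deprecated)}",
              file=sys.stderr)
        _warned = True
    return renamed
```

Called on `{"git_tracked": [...], "supabase_only": [...]}`, the sketch returns the same lists under `db_tracked` and `db_only`.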

Configuration

Add a storage section to your gbrain.yml file in the brain repository root:

storage:
  # Directories that are version-controlled (human-edited, committed to git).
  db_tracked:
    - people/
    - companies/
    - deals/
    - concepts/
    - yc/
    - ideas/
    - projects/

  # Directories persisted via the brain database only (bulk machine-generated
  # content). Written to disk as a local cache but not committed to git;
  # `gbrain sync` auto-manages .gitignore for these paths. `gbrain export
  # --restore-only` repopulates missing files from the database.
  db_only:
    - media/x/
    - media/articles/
    - meetings/transcripts/

Path requirements:

- Each directory must end with / (the canonical form). The validator auto-normalizes missing trailing slashes and shows a one-time info note with what changed.
- A directory cannot appear in both tiers. That is a tier-overlap error, and loadStorageConfig throws StorageConfigError. Edit gbrain.yml to remove the overlap and try again.

Behavior Changes

1. `gbrain sync` — automatic .gitignore management

When storage configuration is present, gbrain sync automatically manages .gitignore entries on every successful sync:

- Adds missing db_only directory patterns to .gitignore.
- Idempotent: re-running adds no duplicate entries.
- Writes a stable comment header so the managed block is grep-able.
- Skipped on --dry-run (preview mode must not mutate disk).
- Skipped on blocked_by_failures status (sync state is inconsistent).
- Skipped when the repo is a git submodule (.git is a file, not a directory), because submodule .gitignore changes don't survive parent updates. A warning explains the skip.
- Skipped entirely when GBRAIN_NO_GITIGNORE=1 is set (an escape hatch for shared-repo setups where a maintainer wants gbrain to leave .gitignore alone).
- Failures (write permission denied, etc.) are caught and logged; they never crash sync.
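
The skip rules above reduce to a single predicate. This is a minimal Python sketch under stated assumptions: `should_manage_gitignore` and its parameters are hypothetical names, and the order of the checks is a guess. The environment variable, the `blocked_by_failures` status, and the submodule test come from the list above.

```python
import os
from pathlib import Path


def should_manage_gitignore(repo: Path, dry_run: bool, sync_status: str) -> bool:
    """Decide whether a sync run may touch .gitignore."""
    if os.environ.get("GBRAIN_NO_GITIGNORE") == "1":
        return False  # explicit opt-out
    if dry_run:
        return False  # preview mode must not mutate disk
    if sync_status == "blocked_by_failures":
        return False  # sync state is inconsistent
    if (repo / ".git").is_file():
        return False  # submodule: .git is a file, not a directory
    return True
```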

Example .gitignore addition:

# Auto-managed by gbrain (db_only directories)
media/x/
media/articles/
meetings/transcripts/
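
One way to get the idempotent behavior described in this section is to compare the configured directories against the lines already in .gitignore and append only what is missing. The Python sketch below illustrates that idea; it is not gbrain's code, and `ensure_gitignore_entries` is a hypothetical name. As a simplification, entries added on a later run are appended at the end of the file rather than merged into the existing managed block.

```python
from pathlib import Path

HEADER = "# Auto-managed by gbrain (db_only directories)"


def ensure_gitignore_entries(repo: Path, db_only: list[str]) -> list[str]:
    """Append missing db_only patterns to .gitignore; return the ones added."""
    path = repo / ".gitignore"
    text = path.read_text() if path.exists() else ""
    existing = {line.strip() for line in text.splitlines()}
    # dict.fromkeys drops duplicates in the input while keeping order.
    missing = [d for d in dict.fromkeys(db_only) if d not in existing]
    if not missing:
        return []  # idempotent: nothing to write on a re-run
    block = [] if HEADER in existing else [HEADER]
    block.extend(missing)
    # Make sure the appended block starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    try:
        path.write_text(text + separator + "\n".join(block) + "\n")
    except OSError as err:
        # Mirrors the documented behavior: log the failure, never crash sync.
        print(f"warning: could not update .gitignore: {err}")
        return []
    return missing
```

Running the function twice with the same directories writes the block once and returns an empty list the second time.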

2. `gbrain export --restore-only` — repopulate missing db_only files

# Restore only missing db_only files from the database.
gbrain export --restore-only --repo /path/to/brain

# Filter by page type.
gbrain export --restore-only --type media --repo /path/to/brain

# Filter by slug prefix.
gbrain export --restore-only --slug-prefix media/x/ --repo /path/to/brain

# Combine filters.
gbrain export --restore-only --type media --slug-prefix media/x/ --repo /path/to/brain

The --restore-only flag:

- Resolves repoPath via the chain --repo → typed sources.getDefault() → hard error. Never falls through to the current directory.
- Only exports pages that match db_only patterns AND are missing from disk.
- Ideal for container restart recovery and fresh clones.
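
The two rules above (path resolution without a current-directory fallback, and "db_only AND missing from disk") can be sketched as follows. This is illustrative Python with hypothetical names. In particular, the mapping from a slug to a file on disk (`<slug>.md`) is an assumption made for the example, not something this document specifies.

```python
from pathlib import Path
from typing import Optional


def resolve_repo_path(repo_flag: Optional[str], default_source: Optional[str]) -> Path:
    """--repo wins, then the default source; never fall back to the cwd."""
    if repo_flag:
        return Path(repo_flag)
    if default_source:
        return Path(default_source)
    raise ValueError("no repo path: pass --repo or configure a default source")


def pages_to_restore(slugs: list[str], db_only: list[str], repo: Path) -> list[str]:
    """Return slugs under a db_only directory whose file is absent from disk."""
    selected = []
    for slug in slugs:
        in_db_only = any(slug.startswith(prefix) for prefix in db_only)
        # Assumption: a page with slug `media/x/foo` is stored as `media/x/foo.md`.
        on_disk = (repo / f"{slug}.md").exists()
        if in_db_only and not on_disk:
            selected.append(slug)
    return selected
```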

3. `gbrain storage status` — storage-tier health dashboard

# Human-readable status.
gbrain storage status --repo /path/to/brain

# JSON output for scripts and orchestrators.
gbrain storage status --repo /path/to/brain --json

Output includes:

- Total page counts by storage tier.
- Disk usage breakdown by tier.
- Missing files that need restoration (top 10 shown; full list in --json).
- Configuration validation warnings.
- Current tier directory listing.

Example output:

Storage Status
==============

Repository: /data/brain
Total pages: 15,243

Storage Tiers:
-------------
DB tracked:     2,156 pages
DB only:        12,887 pages
Unspecified:    200 pages

Disk Usage:
-----------
DB tracked:     45.2 MB
DB only:        2.1 GB

Missing Files (need restore):
-----------------------------
  media/x/tweet-1234567890
  media/x/tweet-0987654321
  ... and 47 more

Use: gbrain export --restore-only --repo "/data/brain"

Configuration:
--------------
DB tracked directories:
  - people/
  - companies/
  - deals/

DB-only directories:
  - media/x/
  - media/articles/
  - meetings/transcripts/
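
The per-tier page counts in a report like this follow from matching each page slug against the configured directory prefixes, with everything else counted as unspecified. Below is a hedged Python sketch of that classification; `tier_of` and `count_by_tier` are hypothetical helpers. Checking db_only before db_tracked is an arbitrary choice here, since this document does not say how nested directories across tiers are resolved.

```python
from collections import Counter


def tier_of(slug: str, db_tracked: list[str], db_only: list[str]) -> str:
    """Classify a page slug by the first tier whose directory prefix matches."""
    if any(slug.startswith(prefix) for prefix in db_only):
        return "db_only"
    if any(slug.startswith(prefix) for prefix in db_tracked):
        return "db_tracked"
    return "unspecified"


def count_by_tier(slugs: list[str], db_tracked: list[str], db_only: list[str]) -> Counter:
    """Tally pages per tier, as in the 'Storage Tiers' part of the report."""
    return Counter(tier_of(slug, db_tracked, db_only) for slug in slugs)
```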

Validation

loadStorageConfig runs normalizeAndValidateStorageConfig after parsing:

Auto-fixes (non-fatal, with a one-time info note showing what changed):
- Missing trailing / is added: 'media/x' → 'media/x/'.

Throws StorageConfigError (the caller sees a clean exit-1 with an actionable message):
- Same directory in both db_tracked and db_only (ambiguous routing).
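
The two validation outcomes can be illustrated with a short Python sketch. It is not the loadStorageConfig implementation: the class below is a stand-in for the StorageConfigError named above, `normalize_and_validate` is a hypothetical helper, and running the overlap check after normalization (so that 'media/x' and 'media/x/' collide) is an assumption.

```python
class StorageConfigError(Exception):
    """Stand-in for the error type named in this section."""


def normalize_and_validate(storage: dict) -> tuple[dict, list[str]]:
    """Return (normalized config, info notes); raise on tier overlap."""
    notes = []
    normalized = {}
    for tier in ("db_tracked", "db_only"):
        dirs = []
        for directory in storage.get(tier, []):
            fixed = directory if directory.endswith("/") else directory + "/"
            if fixed != directory:
                notes.append(f"{tier}: '{directory}' -> '{fixed}'")
            dirs.append(fixed)
        normalized[tier] = dirs
    # Assumption: overlap is detected on the normalized paths.
    overlap = sorted(set(normalized["db_tracked"]) & set(normalized["db_only"]))
    if overlap:
        raise StorageConfigError(f"directories in both tiers: {', '.join(overlap)}")
    return normalized, notes
```

A config with `db_only: ["media/x"]` comes back as `["media/x/"]` plus one note, while listing the same directory in both tiers raises the error.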

Use cases

Brain repository scaling

Well suited to brain repositories that are growing past the 50K to 200K+ file range, where:

- Core knowledge (people, companies, deals) remains git-tracked.
- Bulk data (tweets, articles, transcripts) moves to db_only.
- Development stays fast with smaller git repos.
- Full data remains available via the database.

Container-based deployments

Essential for ephemeral container environments:

- Git repo contains only essential files.
- Container restarts don't lose db_only data.
- gbrain export --restore-only quickly restores bulk files when needed.
- Local disk acts as a cache layer.

Multi-environment consistency

Enables consistent data access across environments:

- Development: small git clone, restore bulk data on demand.
- Production: full dataset via the database, selective local caching.
- CI/CD: fast tests with git-tracked data only.

Migration strategy

1. Assess current repository: use gbrain storage status to understand current distribution.
2. Plan directory structure: identify which directories should be db_tracked vs db_only.
3. Create gbrain.yml: add storage configuration to the repository root.
4. Test with dry-run: gbrain sync --dry-run to verify behavior; .gitignore is NOT touched on dry-run.
5. Run a real sync: gbrain sync updates .gitignore automatically on success.
6. Verify restore: test gbrain export --restore-only --repo . against a small db_only directory.
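
For the planning step, it can help to see where the bulk of the files actually lives before choosing tiers. The Python helper below is a hypothetical aid, not part of gbrain, and does not replace gbrain storage status. It counts files per directory prefix so that large machine-generated directories stand out as db_only candidates. The default depth of two path components is an arbitrary choice that happens to match paths like media/x/.

```python
from collections import Counter
from pathlib import Path


def files_per_directory(repo: Path, depth: int = 2) -> Counter:
    """Count files under each directory prefix of at most `depth` components."""
    counts = Counter()
    for path in repo.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(repo).parts
        if parts[0] == ".git":
            continue  # ignore git's own bookkeeping
        # Drop the filename, keep at most `depth` leading directories.
        prefix = "/".join(parts[:-1][:depth])
        counts[prefix + "/" if prefix else "./"] += 1
    return counts
```

Calling `files_per_directory(Path("/path/to/brain")).most_common(10)` lists the ten largest prefixes.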

Best practices

- Directory naming: end storage paths with / (canonical form). The validator normalizes if you forget.
- Start small: begin with clearly machine-generated directories in db_only.
- Address validation errors: tier overlap is an error, not a warning. Fix it before sync.
- Test restore: regularly test --restore-only in staging environments.
- Document decisions: comment your gbrain.yml to explain tier choices.

PGLite engine note

On the PGLite engine (gbrain's local-only embedded Postgres), the "DB" your db_only pages live in is the same local file gbrain uses for everything else. The .gitignore housekeeping still helps by keeping bulk content out of git history, but nothing is actually offloaded from the machine, so the offload-to-DB promise is technically vacuous. When gbrain detects this engine, a once-per-process soft warning explains the limitation. To get full tiering, migrate to Postgres with gbrain migrate --to supabase.

Compatibility

- Backward compatible: systems without gbrain.yml work unchanged.
- Progressive enhancement: add configuration when needed.
- Database unchanged: all data remains in the brain database regardless of tier.
- Existing workflows: all existing sync and export behavior is preserved.
- Deprecated keys: git_tracked / supabase_only still load with a once-per-process warning.
