Skip to content

docs: Plan for remote filesystem seeds#736

Merged
mikeknep merged 6 commits into
mainfrom
remote-filesystem-seeds/mknepper
Jun 16, 2026
Merged

docs: Plan for remote filesystem seeds#736
mikeknep merged 6 commits into
mainfrom
remote-filesystem-seeds/mknepper

Conversation

@mikeknep

@mikeknep mikeknep commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

📋 Summary

Plan doc for supporting an injectable filesystem layer so that existing seed readers can use remote data. This would make the Directory, FileContents, and AgentRollout seed sources/readers usable on the nemo-platform DD service.

🔗 Related Issue

N/A

🔄 Changes

Just a plan doc for review to start!

Most of this is behind the scenes—seed readers will default to local filesystem and no changes will be required of the end user. The biggest UX change proposed here is relaxing some of the seed source Pydantic validation, deferring it to validate calls. This is covered extensively in section A4 of the plan doc.

🧪 Testing

N/A (plan doc only)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@mikeknep mikeknep requested a review from a team as a code owner June 4, 2026 13:56
@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a design plan for making directory and file_contents seed sources usable in remote execution by introducing an injectable FileSystemProvider protocol into the upstream Data Designer library and having the NeMo plugin supply a fileset-backed implementation.

  • Part A (upstream): Adds the FileSystemProvider protocol with create_context / ensure_root_exists, a default LocalFileSystemProvider reproducing today's behavior, loosens root_path typing, and defers the Pydantic is_dir() / Path.resolve() validators to a provider-layer preflight — ensuring error quality is preserved rather than degraded at the validate() boundary.
  • Part B (NeMo): Implements FilesetFileSystemProvider, wires DirectorySeedReader / FileContentsSeedReader into RemoteDataDesignerContext, and extends validate_seed + _SUPPORTED_SEED_TYPES to admit the two new remote-capable types while keeping agent_rollout out of scope.

Confidence Score: 5/5

Documentation-only change; no production code is modified.

The PR adds a single plan document with no code changes. The design is internally consistent, the two-layer validation approach is well-reasoned, and the alternatives section is thorough. The only finding is a cosmetic local-path leak in the "Affected files" section.

No files require special attention.

Important Files Changed

Filename Overview
plans/remote-filesystem-seeds/remote-filesystem-seeds-plan.md New RFC introducing an injectable FileSystemProvider seam to enable directory/file_contents seed sources in remote execution; design is well-reasoned with clear rationale for each decision. One minor issue: the "Affected files" section embeds local developer absolute paths.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User
    participant Config as DirectorySeedSource<br/>(Pydantic)
    participant Context as RemoteDataDesignerContext
    participant VS as validate_seed (B3)
    participant Reader as DirectorySeedReader
    participant Provider as FilesetFileSystemProvider
    participant FS as FilesetFileSystem

    User->>Config: "DirectorySeedSource(path="ws/fs#dir")"
    Note over Config: Layer 1: structural checks only<br/>(no is_dir(), no Path.resolve())
    Config-->>User: source object

    User->>Context: validate(config)
    Context->>VS: validate_seed(source, workspace, sdk)
    VS->>FS: sdk.files.filesets.retrieve(...)
    FS-->>VS: 200 OK
    VS-->>Context: canonical root ref
    Context->>Context: _validated_filesystem_roots.add(root)

    User->>Context: get_seed_readers()
    Context->>Provider: FilesetFileSystemProvider(sdk, validated_roots)
    Context-->>User: "[... DirectorySeedReader(fs_provider=provider)]"

    User->>Reader: get_column_names() / read
    Reader->>Reader: _get_filesystem_context()
    Reader->>Provider: ensure_root_exists(runtime_path)
    Note over Provider: root in validated_roots → skip network call
    Provider-->>Reader: (no-op)
    Reader->>Provider: create_context(runtime_path)
    Provider->>FS: FilesetFileSystem(sdk)
    Provider-->>Reader: "SeedReaderFileSystemContext(fs=DirFileSystem(rooted))"
    Reader->>FS: context.fs.find(...) / context.fs.open(...)
    FS-->>Reader: file listing / contents
    Reader-->>User: seed DataFrame
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant User
    participant Config as DirectorySeedSource<br/>(Pydantic)
    participant Context as RemoteDataDesignerContext
    participant VS as validate_seed (B3)
    participant Reader as DirectorySeedReader
    participant Provider as FilesetFileSystemProvider
    participant FS as FilesetFileSystem

    User->>Config: "DirectorySeedSource(path="ws/fs#dir")"
    Note over Config: Layer 1: structural checks only<br/>(no is_dir(), no Path.resolve())
    Config-->>User: source object

    User->>Context: validate(config)
    Context->>VS: validate_seed(source, workspace, sdk)
    VS->>FS: sdk.files.filesets.retrieve(...)
    FS-->>VS: 200 OK
    VS-->>Context: canonical root ref
    Context->>Context: _validated_filesystem_roots.add(root)

    User->>Context: get_seed_readers()
    Context->>Provider: FilesetFileSystemProvider(sdk, validated_roots)
    Context-->>User: "[... DirectorySeedReader(fs_provider=provider)]"

    User->>Reader: get_column_names() / read
    Reader->>Reader: _get_filesystem_context()
    Reader->>Provider: ensure_root_exists(runtime_path)
    Note over Provider: root in validated_roots → skip network call
    Provider-->>Reader: (no-op)
    Reader->>Provider: create_context(runtime_path)
    Provider->>FS: FilesetFileSystem(sdk)
    Provider-->>Reader: "SeedReaderFileSystemContext(fs=DirFileSystem(rooted))"
    Reader->>FS: context.fs.find(...) / context.fs.open(...)
    FS-->>Reader: file listing / contents
    Reader-->>User: seed DataFrame
Loading

Reviews (4): Last reviewed commit: "Update with explicit notes about cwd beh..." | Re-trigger Greptile

Comment on lines +696 to +712
4. **A4 (Layer 1) — config refactor.** Drop `is_dir()` from the upstream
`validate_path`; make `runtime_path` a pass-through that preserves a `#`-ref.
Add a **precise root-existence assertion** in `FileSystemSeedReader` / the
provider so a missing root raises a clear "does not exist" error (local: via
`LocalFileSystem`; fileset: via SDK) instead of the vague "no files matched".
Verify upstream `DataDesigner.validate()` and the CLI still surface a precise
error for a missing directory (now via the provider, reached through
`get_column_names()`), and that existing tests pass.
5. **A4 (Layer 2, NeMo) + B3 — remote pre-flight validation.** Extend
`validate_seed` to check `directory` / `file_contents` / `agent_rollout`
fileset refs via the SDK; expand `_SUPPORTED_SEED_TYPES`; reject
local-default-path reliance for remote `agent_rollout`.
6. **SDK / OpenAPI** — regenerate only if request models change. They should
not: the seed source config types already exist in the `dd` config and are
serializable.

## Affected files (reference)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Step 2 depends on Step 4 but is listed first

The implementation plan is described as "ordered to keep each step independently testable," but step 2 ("NeMo B1 + partial B2 — integration-test against a real Fileset") cannot be meaningfully tested before step 4 (A4 config refactor). Without A4, constructing DirectorySeedSource(path="default/my-seeds#data/traces") still raises at Pydantic validation time because validate_path still calls is_dir() on the ref string, and runtime_path still calls Path(...).resolve() which mangles the #. The NeMo wiring added in step 2 is unreachable until the config relaxation lands. Either A4 (Layer 1) should move before step 2, or a note should be added that step 2's remote integration test is blocked until A4 is complete.

Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/remote-filesystem-seeds/remote-filesystem-seeds-plan.md
Line: 696-712

Comment:
**Step 2 depends on Step 4 but is listed first**

The implementation plan is described as "ordered to keep each step independently testable," but step 2 ("NeMo B1 + partial B2 — integration-test against a real Fileset") cannot be meaningfully tested before step 4 (A4 config refactor). Without A4, constructing `DirectorySeedSource(path="default/my-seeds#data/traces")` still raises at Pydantic validation time because `validate_path` still calls `is_dir()` on the ref string, and `runtime_path` still calls `Path(...).resolve()` which mangles the `#`. The NeMo wiring added in step 2 is unreachable until the config relaxation lands. Either A4 (Layer 1) should move before step 2, or a note should be added that step 2's remote integration test is blocked until A4 is complete.

How can I resolve this? If you propose a fix, please make it concise.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review of PR #736: Plan for remote filesystem seeds

Summary

Plan-only PR adding plans/remote-filesystem-seeds/remote-filesystem-seeds-plan.md (731 lines, no code changes). The plan proposes a FileSystemProvider injection seam in the upstream library so directory, file_contents, and agent_rollout seed sources can read from the NeMo Platform Files service when running remotely. The upstream library remains NeMo-agnostic; the platform plugin injects a FilesetFileSystem-backed provider.

This review focuses on plan completeness, feasibility, architectural alignment, and gaps — per the recipe contract for plans/-only changes (no linting/style scrutiny).

Strengths

  • Problem framing is precise. The four hardcoded coupling points to local disk (seed_reader.py:245-249, :58, seed_source.py:205-211, atif.py:195) are enumerated with line numbers and difficulty estimates, making the blast radius concrete.
  • Two-layer validation (A4) is the load-bearing section, and it earns it. The error-quality table that contrasts today vs. naive deferral vs. designed deferral is exactly the right artifact to evaluate this trade-off; the four explicit requirements (preflight not discovery, per-backend wording, stable classification at the boundary, cheap structural checks retained) leave little room for the "we deleted the check and the symptom is now opaque" failure mode.
  • Correct architectural posture per AGENTS.md. Upstream stays NeMo-agnostic; dependency direction interface → engine → config is preserved; injection mirrors the existing FilesetFileSeedReader(self._sdk) pattern. The FileSystemProvider protocol fits the registries-and-protocols design principle.
  • LocalFileSeedSource deferral is deliberately rejected with a paper trail. The "Considered and dropped" note for symmetric local deferral is the right call — local is unsupported remotely either way, and splitting unsupported-type rejection across before+after validators is less cohesive. Future readers won't relitigate this.
  • Alternatives are real, not strawmen. Alt 1 (extend FilesetFileSeedReader) is acknowledged as worth doing independently for the tabular path; Alt 2 (fsspec-URL-aware) is rejected for a concrete reason (SDK can't be encoded in a URL). Alt 3 (stage-in) is dismissed for clear reasons.

Findings

Major gaps

  1. No testing strategy. AGENTS.md states "No untested code paths — new logic requires tests." The plan describes implementation steps but never specifies the test layer. Concretely missing: (a) unit tests for LocalFileSystemProvider (default path) and the new ensure_root_exists path; (b) contract tests on the FileSystemProvider protocol that any provider must satisfy; (c) integration tests for FilesetFileSystemProvider against a real Fileset (or recorded fixtures); (d) regression tests for the relocated existence error (matching the wording in the table at lines 353-358); (e) how the agent rollout handler refactor (A3) is exercised — handler tests today read from tmp_path, so they need to switch to an injected fsspec memory filesystem. Each implementation-plan step (lines 690-716) should name its test deliverable.

  2. A3 may break out-of-tree handlers. AgentRolloutFormatHandler.parse_file signature change is described as "a breaking internal contract within the library." If AgentRolloutFormatHandler subclasses are discoverable via the plugin entry-point system (per AGENTS.md core concept "Plugin"), third-party handlers will break silently at instantiation. The plan should state (a) whether handlers are an extension point, (b) if so, the deprecation policy, and (c) whether this warrants a data_designer.config major-version bump rather than just a coordinated release.

  3. Provider lifecycle / resource ownership is unspecified. The snippet at lines 446-456 caches _filesystem_context on the reader for the reader's lifetime. With LocalFileSystem this is free; with FilesetFileSystem(self._sdk) this is a long-lived AsyncFileSystem bound to an SDK instance with potentially expiring credentials. The plan doesn't address: (a) when/how the cached context is invalidated, (b) whether multiple readers share one FilesetFileSystem or each gets its own (the B2 wiring at line 543-552 instantiates one provider but each reader's context is per-instance), (c) what happens on auth refresh, and (d) cleanup of fsspec resources (AsyncFileSystem typically owns event loop / aiohttp session state). For long-running remote jobs this matters.

  4. Error normalization is incomplete. AGENTS.md core design principle 3: "Errors normalize at boundaries. Third-party exceptions are wrapped into canonical project error types at module boundaries." The plan handles FileNotFoundErrorSeedReaderError (good), but the broader surface — SDK auth errors, network timeouts, 5xx, partial reads inside find() — is not addressed. A2 introduces context.open() as the blessed read path; that helper is the natural normalization seam, and the plan should specify exactly which SDK/fsspec exceptions get wrapped to which SeedReaderError/NDDError flavors.

Moderate gaps

  1. Implementation step 1 is incomplete relative to A4 Layer 2. Step 1 lands "A1 + A2" and verifies existing tests pass. But A1's protocol (lines 220-231) defines ensure_root_exists, and Layer 2 of A4 (lines 408-456) is what gives that method its purpose. If ensure_root_exists lands in step 1 as part of the protocol, its local implementation and reader integration must also land then — otherwise the protocol is half-implemented and step 4 has to amend it. Either ship ensure_root_exists together with the existence-check relocation in step 1, or land step 1 with create_context only and add ensure_root_exists in step 4. The plan should pick one and label it explicitly.

  2. Open question about agent rollout source_path semantics is real and may block A3. Lines 666-669 flag that ATIF records embed cwd/project_path/source_path and "Need to confirm downstream consumers treat these as opaque identifiers." This is not just a risk — it's a precondition for A3 being usable. If a downstream consumer (in this repo or in data_designer_nemo) calls Path(source_path).open(), A3 ships a regression. This should move from "Risks" to a blocking open question with an explicit owner/answer-by step in the implementation plan.

  3. validate_seed_source_for_execution_context interaction with the new types is one-sentence. B3 says "expand _SUPPORTED_SEED_TYPES" (line 558-560). But the section "Server-side request validation" (lines 577-612) explains that this validator runs mode="before" and indexes raw dicts. Worth a sentence on whether directory/file_contents/agent_rollout payloads have any raw-dict shape concerns server-side (e.g. df is special; do the new types have any non-JSON-serializable fields like enums or pathlib types?). A quick eyeball of the three configs in seed_source.py would suffice.

  4. Performance discussion stops at "out of scope" without bounds. Lines 676-679 acknowledge sequential per-file hydration over network RTT is a concern and that batching is future work. Fine, but for agent_rollout against a Fileset with thousands of trace files this may produce a job that nominally "succeeds" but takes 30+ minutes when the user expected seconds. The plan should at least promise (a) a soft file-count cap with a clear error or warning, or (b) explicit documentation of expected latency, so this isn't a footgun on day one.

Minor / nits

  1. Path grammar shift between local and remote is implicit. A DirectorySeedSource config built locally with path="/data/traces" and submitted remotely will be rejected by the before-validator; one built remotely with path="default/my-seeds#data/traces" will fail locally because Path(...).is_dir() is false on disk. The plan implicitly assumes configs are mode-specific, but should state this explicitly under "End-user interface" so users (and SDK consumers) understand a config is not portable across execution contexts.

  2. Affected files reference includes absolute home-directory paths (lines 720, 730: /Users/mknepper/code/...). Replace with repo-root-relative paths so the doc remains valid for other contributors.

  3. A4 requirement 3 ("stable error classification at the validate() boundary") implies new code that isn't in the affected-files list. The plan says compile_data_designer_config / DataDesigner.validate must classify SeedReaderError from existence checks as a config violation. That's a behavioral change in interface/data_designer.py and engine/compiler.py, neither of which appear under "Affected files (reference)" (lines 718-737). Add them, or note explicitly that the existing classification handles SeedReaderError already.

  4. Step 6 ("SDK / OpenAPI — regenerate only if request models change") is vague. State the preflight check ("we will diff DataDesignerJobConfig / PreviewSpec discriminated-union members; if no change, skip regen") so a future implementer doesn't accidentally skip a needed regeneration.

Architectural alignment notes

  • Layering. FileSystemProvider belongs in data_designer.engine.resources (line 217) — correct: it's an engine-side abstraction consumed by readers. The provider does not appear in data_designer.config. ✅
  • Lazy imports. A1's protocol references Path and DirFileSystem. DirFileSystem (from fsspec) is already a heavy import; confirm whether the provider module is on the cold-import path or lazy-loaded per data_designer.lazy_heavy_imports. AGENTS.md flags this as a structural invariant. The plan doesn't mention import cost; worth a sentence.
  • Plugin system. The FileSystemProvider is conceptually a plugin extension surface (third parties could in principle inject their own). The plan doesn't say whether this becomes a registered plugin type or stays an internal protocol passed via constructor. Today's pattern (constructor injection from the platform plugin) is fine for the immediate goal but worth stating as a deliberate non-goal so it's not later confused for a registry.

Verdict

Request changes (plan-only). The plan is unusually thorough on the load-bearing decision (validation deferral) and correctly aligned with the project's architectural invariants. The major gaps are: (1) missing testing strategy across all six implementation steps, (2) unaddressed lifecycle/error-normalization details for the long-lived remote provider, (3) an unresolved blocking question about source_path semantics for agent rollout that should not stay in "Risks," and (4) a minor inconsistency between A1's protocol shape and the implementation-plan ordering. These are addressable with edits to the plan itself before code starts; the design is sound enough that the changes are clarifications, not redirections.


- `validate_path` drops `is_dir()` and keeps only cheap structural checks
(non-empty, etc.). It no longer asserts existence.
- `runtime_path` stops unconditionally calling `Path(...).resolve()`; it becomes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also needs to preserve when local relative paths get anchored. Today FileSystemSeedSource.model_post_init resolves and caches an absolute path when the source is constructed, but the provider sketch resolves local paths later when the reader creates the context. If cwd changes between config load and validate/read, DirectorySeedSource(path="seed-dir") can point somewhere else. Maybe keep construction-time normalization for local-looking paths and only pass through fileset refs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explored a few ways to preserve construction-time anchoring for local paths while still letting remote fileset refs pass through. The tricky bit is that the core Data Designer library should not know how to identify Nemo Platform fileset refs; that grammar belongs to the Nemo Platform provider/plugin. We also considered storing a construction-time local cwd as a private attr and having only the local provider use it, but that creates a footgun: future generic core helpers could accidentally use that local anchoring path in a way that breaks platform refs.

I think the cleaner approach is to make this an intentional semantics change:

  • Config construction validates only structural shape, not existence.
  • path remains the user-authored declarative string.
  • Filesystem interpretation/resolution happens in the active provider at validate/read time.
  • The local provider resolves local relative paths against the cwd at validate/read time.
  • The NeMo provider interprets the raw path as its fileset ref grammar.
  • Users who need stable local anchoring can opt into it explicitly by passing an absolute path, e.g. path=str(Path("./seed-dir").resolve()).

That does change today’s behavior for local relative paths if the process changes cwd between config construction and validation/read. But I think it is defensible, and arguably better for portable/shared configs: path="./my-data" can mean “relative to where this config/job is run” instead of being frozen to one author’s machine at construction time.

We should make that accepted behavior change explicit in the plan/docs/changelog and add tests for it, rather than accidentally preserving old anchoring with hidden local-only state.

WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in a sync meeting and agreed on this approach; updated the plan accordingly in 4705407

#### B2. Wire it into `RemoteDataDesignerContext.get_seed_readers()`

```python
provider = FilesetFileSystemProvider(self._sdk)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This provider probably needs the same async-to-sync SDK handling as FilesetFileSeedReader. Remote preview builds the context with an AsyncNeMoPlatform, and FilesetFileSystem switches into async mode for that SDK type, while these seed readers call sync find/open/exists. Could convert AsyncNeMoPlatform with async_to_sync_sdk in the provider constructor before creating the filesystem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I checked the FilesetFileSystem implementation: it always normalizes the internal client to AsyncNeMoPlatform, but the constructor argument still controls fsspec mode. Passing AsyncNeMoPlatform sets asynchronous=True, which leaves fsspec's sync loop unset; sync calls like exists/find/open can raise "Loop is not running". Existing FilesetFileSeedReader and FilesetsPersonReader already convert AsyncNeMoPlatform to a sync NeMoPlatform before creating FilesetFileSystem, so the new provider should do the same.

Updated in 8744a9f


Two implementation notes:

- **Avoid a double round trip.** `ensure_root_exists` is free locally (`is_dir`)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: the double-round-trip note needs a concrete cache/reuse story. Remote validation will check the fileset via validate_seed, and _get_filesystem_context as sketched always calls ensure_root_exists again, so preview/job flows still look like they can hit Files twice unless the provider knows the path was already validated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, fleshed this out in 227914e

mikeknep added 3 commits June 16, 2026 12:39
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
@mikeknep mikeknep force-pushed the remote-filesystem-seeds/mknepper branch from a130978 to 8744a9f Compare June 16, 2026 18:56
Comment thread plans/remote-filesystem-seeds/remote-filesystem-seeds-plan.md Outdated
mikeknep added 2 commits June 16, 2026 14:08
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
johnnygreco
johnnygreco previously approved these changes Jun 16, 2026
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
@mikeknep mikeknep merged commit 7c2c15e into main Jun 16, 2026
61 checks passed
@mikeknep mikeknep deleted the remote-filesystem-seeds/mknepper branch June 16, 2026 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants