docs: Plan for remote filesystem seeds#736
Conversation
| 4. **A4 (Layer 1) — config refactor.** Drop `is_dir()` from the upstream | ||
| `validate_path`; make `runtime_path` a pass-through that preserves a `#`-ref. | ||
| Add a **precise root-existence assertion** in `FileSystemSeedReader` / the | ||
| provider so a missing root raises a clear "does not exist" error (local: via | ||
| `LocalFileSystem`; fileset: via SDK) instead of the vague "no files matched". | ||
| Verify upstream `DataDesigner.validate()` and the CLI still surface a precise | ||
| error for a missing directory (now via the provider, reached through | ||
| `get_column_names()`), and that existing tests pass. | ||
| 5. **A4 (Layer 2, NeMo) + B3 — remote pre-flight validation.** Extend | ||
| `validate_seed` to check `directory` / `file_contents` / `agent_rollout` | ||
| fileset refs via the SDK; expand `_SUPPORTED_SEED_TYPES`; reject | ||
| local-default-path reliance for remote `agent_rollout`. | ||
| 6. **SDK / OpenAPI** — regenerate only if request models change. They should | ||
| not: the seed source config types already exist in the `dd` config and are | ||
| serializable. | ||
|
|
||
| ## Affected files (reference) |
There was a problem hiding this comment.
Step 2 depends on Step 4 but is listed first
The implementation plan is described as "ordered to keep each step independently testable," but step 2 ("NeMo B1 + partial B2 — integration-test against a real Fileset") cannot be meaningfully tested before step 4 (A4 config refactor). Without A4, constructing DirectorySeedSource(path="default/my-seeds#data/traces") still raises at Pydantic validation time because validate_path still calls is_dir() on the ref string, and runtime_path still calls Path(...).resolve() which mangles the #. The NeMo wiring added in step 2 is unreachable until the config relaxation lands. Either A4 (Layer 1) should move before step 2, or a note should be added that step 2's remote integration test is blocked until A4 is complete.
Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/remote-filesystem-seeds/remote-filesystem-seeds-plan.md
Line: 696-712
Comment:
**Step 2 depends on Step 4 but is listed first**
The implementation plan is described as "ordered to keep each step independently testable," but step 2 ("NeMo B1 + partial B2 — integration-test against a real Fileset") cannot be meaningfully tested before step 4 (A4 config refactor). Without A4, constructing `DirectorySeedSource(path="default/my-seeds#data/traces")` still raises at Pydantic validation time because `validate_path` still calls `is_dir()` on the ref string, and `runtime_path` still calls `Path(...).resolve()` which mangles the `#`. The NeMo wiring added in step 2 is unreachable until the config relaxation lands. Either A4 (Layer 1) should move before step 2, or a note should be added that step 2's remote integration test is blocked until A4 is complete.
How can I resolve this? If you propose a fix, please make it concise.
Review of PR #736: Plan for remote filesystem seedsSummaryPlan-only PR adding This review focuses on plan completeness, feasibility, architectural alignment, and gaps — per the recipe contract for Strengths
FindingsMajor gaps
Moderate gaps
Minor / nits
Architectural alignment notes
VerdictRequest changes (plan-only). The plan is unusually thorough on the load-bearing decision (validation deferral) and correctly aligned with the project's architectural invariants. The major gaps are: (1) missing testing strategy across all six implementation steps, (2) unaddressed lifecycle/error-normalization details for the long-lived remote provider, (3) an unresolved blocking question about |
|
|
||
| - `validate_path` drops `is_dir()` and keeps only cheap structural checks | ||
| (non-empty, etc.). It no longer asserts existence. | ||
| - `runtime_path` stops unconditionally calling `Path(...).resolve()`; it becomes |
There was a problem hiding this comment.
I think this also needs to preserve when local relative paths get anchored. Today FileSystemSeedSource.model_post_init resolves and caches an absolute path when the source is constructed, but the provider sketch resolves local paths later when the reader creates the context. If cwd changes between config load and validate/read, DirectorySeedSource(path="seed-dir") can point somewhere else. Maybe keep construction-time normalization for local-looking paths and only pass through fileset refs.
There was a problem hiding this comment.
I explored a few ways to preserve construction-time anchoring for local paths while still letting remote fileset refs pass through. The tricky bit is that the core Data Designer library should not know how to identify Nemo Platform fileset refs; that grammar belongs to the Nemo Platform provider/plugin. We also considered storing a construction-time local cwd as a private attr and having only the local provider use it, but that creates a footgun: future generic core helpers could accidentally use that local anchoring path in a way that breaks platform refs.
I think the cleaner approach is to make this an intentional semantics change:
- Config construction validates only structural shape, not existence.
pathremains the user-authored declarative string.- Filesystem interpretation/resolution happens in the active provider at validate/read time.
- The local provider resolves local relative paths against the cwd at validate/read time.
- The NeMo provider interprets the raw path as its fileset ref grammar.
- Users who need stable local anchoring can opt into it explicitly by passing an absolute path, e.g.
path=str(Path("./seed-dir").resolve()).
That does change today’s behavior for local relative paths if the process changes cwd between config construction and validation/read. But I think it is defensible, and arguably better for portable/shared configs: path="./my-data" can mean “relative to where this config/job is run” instead of being frozen to one author’s machine at construction time.
We should make that accepted behavior change explicit in the plan/docs/changelog and add tests for it, rather than accidentally preserving old anchoring with hidden local-only state.
WDYT?
There was a problem hiding this comment.
Discussed in a sync meeting and agreed on this approach; updated the plan accordingly in 4705407
| #### B2. Wire it into `RemoteDataDesignerContext.get_seed_readers()` | ||
|
|
||
| ```python | ||
| provider = FilesetFileSystemProvider(self._sdk) |
There was a problem hiding this comment.
This provider probably needs the same async-to-sync SDK handling as FilesetFileSeedReader. Remote preview builds the context with an AsyncNeMoPlatform, and FilesetFileSystem switches into async mode for that SDK type, while these seed readers call sync find/open/exists. Could convert AsyncNeMoPlatform with async_to_sync_sdk in the provider constructor before creating the filesystem.
There was a problem hiding this comment.
Good catch. I checked the FilesetFileSystem implementation: it always normalizes the internal client to AsyncNeMoPlatform, but the constructor argument still controls fsspec mode. Passing AsyncNeMoPlatform sets asynchronous=True, which leaves fsspec's sync loop unset; sync calls like exists/find/open can raise "Loop is not running". Existing FilesetFileSeedReader and FilesetsPersonReader already convert AsyncNeMoPlatform to a sync NeMoPlatform before creating FilesetFileSystem, so the new provider should do the same.
Updated in 8744a9f
|
|
||
| Two implementation notes: | ||
|
|
||
| - **Avoid a double round trip.** `ensure_root_exists` is free locally (`is_dir`) |
There was a problem hiding this comment.
suggestion: the double-round-trip note needs a concrete cache/reuse story. Remote validation will check the fileset via validate_seed, and _get_filesystem_context as sketched always calls ensure_root_exists again, so preview/job flows still look like they can hit Files twice unless the provider knows the path was already validated.
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
a130978 to
8744a9f
Compare
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
📋 Summary
Plan doc for supporting an injectable filesystem layer so that existing seed readers can use remote data. This would make the
Directory,FileContents, andAgentRolloutseed sources/readers usable on the nemo-platform DD service.🔗 Related Issue
N/A
🔄 Changes
Just a plan doc for review to start!
Most of this is behind the scenes—seed readers will default to local filesystem and no changes will be required of the end user. The biggest UX change proposed here is relaxing some of the seed source Pydantic validation, deferring it to
validatecalls. This is covered extensively in sectionA4of the plan doc.🧪 Testing
N/A (plan doc only)
✅ Checklist