Skip to content

ModuleCoordinator & Blueprint.py changes#1724

Draft
jeff-hykin wants to merge 118 commits intodevfrom
jeff/fix/docker4
Draft

ModuleCoordinator & Blueprint.py changes#1724
jeff-hykin wants to merge 118 commits intodevfrom
jeff/fix/docker4

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

Draft b/c: contains misc, and isn't up to date with dev

Problem

Closes DIM-XXX

Solution

Breaking Changes

How to Test

Contributor License Agreement

  • I have read and approved the CLA.

spomichter and others added 30 commits January 23, 2026 07:34
… Unitree Go2 Navigation & Exploration Beta

Pre-Release v0.0.8: Unitree Go2 Navigation & Exploration Beta, Transport Updates, Documentation updates, Rerun fixes, Person follow, Readme updates

## What's Changed
* Small docs clarification about stream getters by @leshy in #1043
* Fix split view on wide monitors by @jeff-hykin in #1048
* Docs: Install & Develop  by @jeff-hykin in #1022
* Add uv to nix and fix resulting problems by @jeff-hykin in #1021
* v0.0.8 by @paul-nechifor in #1050
* Style changes in docs by @paul-nechifor in #1051
* Revert "Add uv to nix and fix resulting problems" by @leshy in #1053
* Transport benchmarks + Raw ros transport by @leshy in #1038
* feat: default to rerun-web and auto-open browser on startup (browser … by @Nabla7 in #1019
* bbox detections visual check by @leshy in #1017
* fix: only auto-open browser for rerun-web viewer backend by @Nabla7 in #1066
* move slow tests to integration by @paul-nechifor in #1063
* Streamline transport start/stop methods by @Kaweees in #1062
* Person follow skill with EdgeTAM by @paul-nechifor in #1042
* fix: increase costmap floor z_offset to avoid z-fighting by @Nabla7 in #1073
* Fixed issue #1074 by @alexlin2 in #1075
* ROS transports initial by @leshy in #1057
* Fix System Config Values for LCM on MacOS and Refactor by @jeff-hykin in #1065
* SHM Transport basic fixes by @leshy in #1041
* commented out Mem Transport test case by @leshy in #1077
* Docs/advanced streams update 2 by @leshy in #1078
* Fix more tests by @paul-nechifor in #1071
* feat: navigation docker updates from bona_local_dev by @baishibona in #1081
* Fix missing dependencies by @Kaweees in #1085
* Release readme fixes by @spomichter in #1076

## New Contributors
* @baishibona made their first contribution in #1081

**Full Changelog**: v0.0.7...v0.0.8
…HTTPS from SSH, get_data change, LFS changes

v0.0.9 Release Patch: Git clone change to HTTPS from SSH, get_data change, LFS changes
Release v0.0.10: Manipulation Stack, MuJoCo Simulation, DDS Transport, Web and Native Visualization via Rerun


## Highlights

88+ commits, 20 contributors, 700+ files changed.

The TLDR: **a complete manipulation stack**, **MuJoCo simulation**, **DDS transport**, and **a rewritten visualization pipeline**. Agents are no longer bolted on top — they're refactored as native modules with direct stream access. The entire ROS message dependency has been removed from core DimOS, and we've added VR, phone, and arm teleoperation stacks. You can now vibecode a pick-and-place task from natural language to motor commands. Installation has been significantly streamlined — no more direnv, simpler setup, and the web viewer is now the default.

---

## 🚀 New Features

### Simulation
- **MuJoCo simulation module** — Run any DimOS blueprint in simulation with no hardware. Supports xArm and Unitree embodiments, parses MJCF/URDF for robot properties, monotonic clock timing (no `time.sleep`). `dimos --simulation run unitree-go2` ([#1035](#1035)) by @jca0
- **Simulation teleop blueprints** — Added simulation teleop blueprints for Piper, xArm6, and xArm7. ([#1308](#1308)) by @mustafab0

### Manipulation
- **Modular manipulation stack** — Full planning stack with Drake: FK/IK solvers (Jacobian + Drake optimization), RRT path planning, world model with obstacle monitoring, multi-robot management. xArm6/7 and Piper support. ([#1079](#1079)) by @mustafab0
- **Joint servo and cartesian controllers** — Joint position/velocity controllers and cartesian IK task with Pinocchio solver. PoseStamped stream input for real-time control. ([#1116](#1116)) by @mustafab0
- **GraspGen integration** — Grasp generation via Docker-hosted GPU model. Lazy container startup, thread-safe init, RPC `generate_grasps()` returns ranked PoseArray. ([#1119](#1119), [#1234](#1234)) by @JalajShuklaSS
- **Gripper control** — Gripper RPC methods on control coordinator, exposed adapter property for custom implementations. ([#1213](#1213)) by @mustafab0
- **Detection3D and Object support** — Object input topics, TF support on manipulation module, pointcloud-to-convex-hull for Drake imports. ([#1236](#1236)) by @mustafab0
- **Agentic pick and place** — Reimplemented manipulation skills for agent-driven pick-and-place workflows. ([#1237](#1237)) by @mustafab0

### Teleoperation
- **Quest VR teleoperation** — Full WebXR + Deno bridge stack. Quest controller data (pose, trigger, grip) streamed to DimOS modules. Monitor-style locking for control loops. ([#1215](#1215)) by @ruthwikdasyam
- **Phone teleoperation** — Control Go2 from your phone with a web-based teleop interface. ([#1280](#1280)) by @ruthwikdasyam
- **Arm teleop with Pinocchio IK** — Single and dual arm teleoperation using Pinocchio inverse kinematics. Blueprints for xArm, Piper, and dual configurations. ([#1246](#1246)) by @ruthwikdasyam

### Transports & Infrastructure
- **DDS transport protocol** — CycloneDDS transport with configurable QoS (high-throughput and reliable profiles). Optional install, benchmark integration. ([#1174](#1174)) by @Kaweees
- **Pubsub pattern subscriptions** — Glob and regex pattern matching for topic subscriptions. `subscribe_all()` for bridge-style consumers. Topic type encoding in channel strings (`/topic#module.ClassName`). ([#1114](#1114)) by @leshy
- **LCM raw bytes passthrough** — Skip `lcm_encode()` when message is already bytes. ([#1223](#1223)) by @leshy
- **Unified TimeSeriesStore** — Pluggable backends (InMemory, SQLite, Pickle, PostgreSQL) with SortedKeyList for O(log n) operations. Replaces the old replay system and TimestampedCollection. Collection API with slice, range, and streaming methods. ([#1080](#1080)) by @leshy
- **DimosROS benchmark tests** — Benchmark suite for ROS transport performance. ([#1087](#1087)) by @leshy

### Navigation
- **FASTLIO2 support** — Hardware-verified localization with arm64 support. Docker deployment with FAR Planner, terrain analysis, and bagfile playback mode. Builds or-tools from source on arm64. ([#1149](#1149)) by @baishibona
- **Native Livox + FASTLIO2 module** — First-class DimOS native module for Livox Mid-360 lidar with FASTLIO2 localization. ([#1235](#1235)) by @leshy

### Visualization
- **RerunBridge module and CLI** — New bridge that subscribes to all LCM messages and logs those with `to_rerun()` to Rerun viewer. GlobalConfig singleton, web viewer support. Replaces the old rerun initialization system. ([#1154](#1154)) by @leshy
- **Webcam rerun visualization** — Camera module logs to Rerun with pinhole projection for 3D visualization. ([#1117](#1117)) by @ruthwikdasyam
- **Default viewer switched to rerun-web** — Browser-based viewer is now the default for broader compatibility. No native viewer install needed. ([#1324](#1324)) by @spomichter

### Agents
- **Agent refactor** — Restructured agent module with cleaner imports and global config integration. ([#1211](#1211)) by @paul-nechifor
- **Timestamp knowledge** — Agents now have timestamp awareness in prompts for temporal reasoning. ([#1093](#1093)) by @ClaireBookworm
- **Observe skill** — Go2 can now observe (capture and describe) its environment via agent skill. ([#1109](#1109)) by @paul-nechifor

### Platform & Hardware
- **G1 without ROS** — Unitree G1 blueprints decoupled from ROS dependency. Lazy imports for fast startup. ([#1221](#1221)) by @jeff-hykin
- **ARM (aarch64) support** — DimOS runs on ARM hardware. Platform-conditional dependencies, open3d source builds for arm64. ([#1229](#1229)) by @jeff-hykin
- **Universal joint/hardware schema** — `HardwareComponent` dataclass with `JointState`, `JointName` type aliases. Backend registry with auto-discovery for SDK adapters. ([#1040](#1040), [#1067](#1067)) by @mustafab0

---

## 🔧 Improvements

- **Optional Dask** — Start without Dask using `--no-dask` flag. Startup time reduced from ~60s to ~45s. ([#1111](#1111), [#1232](#1232)) by @paul-nechifor
- **RPC rework** — Renamed `ModuleBlueprint` → `_BlueprintAtom`, `ModuleBlueprintSet` → `Blueprint`, `ModuleConnection` → `Stream`. Added `ModuleRef`, improved type hints throughout. ([#1143](#1143)) by @jeff-hykin
- **Image class simplification** — Rewritten as pure NumPy dataclass. Removed CUDA backend, unused methods (solve_pnp, csrt_tracker), and image_impls/ directory. ([#1161](#1161)) by @leshy
- **Odometry message cleanup** — Simplified Odometry message type. ([#1256](#1256)) by @leshy
- **Remove all ROS message dependencies** — Purged ROS message types from core DimOS. Refactored rosnav to use ROSTransport. Removed dead ROS bridge code. ([#1230](#1230)) by @alexlin2
- **Removed bad function serialization** — Eliminated unnecessary serialization of Python functions. ([#1121](#1121)) by @paul-nechifor
- **Benchmark IEC units** — Switched bandwidth benchmarks from SI to IEC units for accuracy. ([#1147](#1147)) by @leshy
- **Pubsub typing improvements** — Thread-safety locks on `subscribe_new_topics` and `subscribe_all`. Proper type params across pubsub stack. ([#1153](#1153)) by @leshy
- **Autogenerated blueprint list** — Blueprints are now auto-discovered and listed. ([#1100](#1100)) by @paul-nechifor
- **Generic Buttons message** — Renamed `QuestButtons` to `Buttons` with generic field names for cross-platform teleop. ([#1261](#1261)) by @ruthwikdasyam
- **Dev container uses ros-dev image** — `./bin/dev` now runs the ROS-enabled dev image. ([#1170](#1170)) by @leshy
- **LSP support** — Added python-lsp-server and python-lsp-ruff to dev dependencies. ([#1169](#1169)) by @leshy
- **Lazy-load pyrealsense2** — RealSense camera module uses lazy imports to avoid errors in simulation environments without the SDK. ([#1309](#1309)) by @spomichter
- **Removed unused mmcv and mmengine** — Dead Detic dependencies removed, eliminating slow source builds from install. ([#1319](#1319)) by @spomichter
- **Simplified installation** — Removed direnv requirement, streamlined install instructions across all platforms. ([#1315](#1315)) by @spomichter
- **DDS extra excluded from --all-extras** — `cyclonedds` requires a source build, so `dds` is now excluded from `uv sync --all-extras` by default. ([#1318](#1318)) by @spomichter
- **Nix pre-commit skip** — Skip pre-commit install if hooks already exist. ([#1162](#1162)) by @leshy
- **Removed base-requirements** — Consolidated dependency management. ([#1098](#1098)) by @paul-nechifor
- **Removed old graspnet** — Cleaned up deprecated graspnet version. ([#1248](#1248)) by @paul-nechifor
- **Code cleanup** — Removed `tofix` markers ([#1216](#1216)), fixed ruff issues ([#1112](#1112)), removed old README_installation.md ([#1101](#1101)) by @paul-nechifor

---

## 🐛 Bug Fixes

- Fix LFS updating (move from .local to venv) ([#1090](#1090)) by @jeff-hykin
- Launch hotfixes: git clone HTTPS, get_data main branch ([#1091](#1091)) by @spomichter
- Fix camera demo not showing in Rerun ([#1148](#1148)) by @jeff-hykin
- Default to rerun native viewer ([#1099](#1099)) by @Nabla7
- Fix exploration blocking agent loop ([#1258](#1258)) by @paul-nechifor
- Fix person-follow blocking agent loop ([#1278](#1278)) by @paul-nechifor
- Skip metric3d tests on unsupported xformers GPUs (Blackwell compute capability >9.0) ([#1225](#1225)) by @leshy
- Fix manipulation tests ([#1218](#1218), [#1247](#1247)) by @jeff-hykin, @paul-nechifor
- Fix control coordinator e2e test ([#1212](#1212)) by @mustafab0
- Fix xarm7-sim broken e2e tests ([#1294](#1294)) by @paul-nechifor
- Pin langchain to restore supported providers ([#1241](#1241)) by @spomichter
- Fix missing library dependencies in Nix flake ([#1240](#1240)) by @Kaweees
- Fix discord invite link ([#1122](#1122)) by @spomichter
- macOS edgecase fix ([#1096](#1096)) by @jeff-hykin
- Fix second N in logo ([#1250](#1250)) by @jeff-hykin
- Fix Unitree Go2 minor issues ([#1307](#1307)) by @paul-nechifor
- Fix broken tests ([#1305](#1305)) by @ruthwikdasyam
- Fix `uv sync` for some macOS systems ([#1322](#1322)) by @jeff-hykin
- Fix mmcv install ([#1313](#1313)) by @paul-nechifor
- Fix mypy issues ([#1150](#1150), [#1167](#1167), [#1257](#1257)) by @leshy, @paul-nechifor, @jeff-hykin
- Fix Nix install uv pip extras ([#1321](#1321)) by @spomichter

---

## 📚 Documentation

- **Major docs overhaul** — New README with feature grid, hardware table, quickstart. Navigation, transports, data streams, and agent docs. ([#1295](#1295)) by @leshy
- **Day 1 docs** — Comprehensive getting started guide, development docs, contributing guide, architecture overview. Executable blueprint docs via md-babel-py. ([#1064](#1064)) by @jeff-hykin
- **Arm integration guide** — How-to for integrating new robotic arms with DimOS. ([#1238](#1238)) by @mustafab0
- **MCP documentation update** — Updated MCP install and usage instructions. ([#1251](#1251)) by @Kaweees
- **Docker docs** — First pass on Docker deployment documentation. ([#1151](#1151)) by @leshy
- **Transports documentation** — Encode/decode mixins, SHM examples, ROS/DDS transport docs. ([#1107](#1107)) by @leshy
- **Rerun API examples** — Updated examples for the new RerunBridge API. ([#1262](#1262)) by @jeff-hykin
- **PR template** added ([#1172](#1172)) by @christiefhyang
- **Simplified install instructions** — Removed direnv, streamlined across all platforms. ([#1315](#1315)) by @spomichter
- **Python example restored** — Added back the Python usage example. ([#1317](#1317)) by @jeff-hykin
- **Nix install updated** — Replaced uv with pip for Nix compatibility. ([#1326](#1326)) by @ruthwikdasyam
- **README improvements** ([#1311](#1311)) by @paul-nechifor
- **Simplified writing docs** — Consolidated writing_docs to a single markdown file. ([#1254](#1254)) by @jeff-hykin

---

## 🏗️ CI & Build

- **ci-complete gate** — Dynamic branch protection via single aggregated status check. MD-only PRs no longer blocked. ([#1279](#1279)) by @spomichter
- **Path-based test filtering** — Test jobs fully skip (no container spin-up) when no relevant code changed. ([#1284](#1284), [#1286](#1286)) by @spomichter
- **Navigation docker build workflow** — CI builds for the ROS navigation stack. ([#1259](#1259)) by @spomichter
- **CUDA test marker** — `@pytest.mark.cuda` for GPU-dependent tests. ([#1220](#1220)) by @jeff-hykin
- **e2e test marker** — Marked end-to-end tests for selective CI runs. ([#1110](#1110)) by @paul-nechifor
- **pytest stdin fix** — Added `-s` to default addopts for LCM autoconf compatibility. ([#1320](#1320)) by @spomichter

---

## ⚠️ Breaking Changes

- **RPC renames**: `ModuleBlueprint` → `_BlueprintAtom`, `ModuleBlueprintSet` → `Blueprint`, `ModuleConnection` → `Stream` ([#1143](#1143))
- **Image class rewrite**: `CudaImage` and `NumpyImage` removed. Image is now a pure NumPy dataclass. Methods like `solve_pnp`, `csrt_tracker`, `from_depth`, `to_depth_meters` removed. ([#1161](#1161))
- **ROS messages removed from core**: All `to_ros`/`from_ros` conversion methods removed. Use `ROSTransport` instead. ([#1230](#1230))
- **QuestButtons → Buttons**: Renamed with generic field names. ([#1261](#1261))
- **RerunBridge replaces old rerun init**: `dimos.dashboard.rerun_init` removed. Use `RerunBridgeModule` or the `rerun-bridge` CLI. ([#1154](#1154))
- **Unitree directory restructuring**: `unitree_go2` → `unitree/go2`, `unitree_g1` → `unitree/g1`. Blueprint names updated. ([#1221](#1221))
- **Default viewer is now rerun-web**: Use `--viewer-backend rerun` to restore native viewer. ([#1324](#1324))

---

## Quickstart

```bash
# Install
uv pip install dimos[base,unitree]

# Try it (no hardware needed)
# NOTE: First run downloads ~2.4 GB from LFS
dimos --replay run unitree-go2

# Simulate
uv pip install dimos[base,unitree,sim]
dimos --simulation run unitree-go2
```

---

## New Contributors 🎉

- @ruthwikdasyam — Quest VR teleoperation, phone teleop, arm teleop, webcam rerun viz
- @JalajShuklaSS — GraspGen integration
- @jca0 — MuJoCo simulation module
- @christiefhyang — PR template

---

**Full Changelog**: [v0.0.9...v0.0.10](v0.0.9...v0.0.10)
docs(go2): add getting started guide (#1339)
jeff-hykin and others added 23 commits March 19, 2026 16:49
…cker3

# Conflicts:
#	dimos/core/global_config.py
#	dimos/mapping/occupancy/path_map.py
Resolve conflicts:
- path_map.py: remove stale blank line (take dev)
- vl/create.py: keep __all__ export, remove duplicate import
- test_process_crash_triggers_stop: call mod.stop() after watchdog cleanup
  to release LCM transport and event loop threads (fixes CI thread leak error)
- docker pull: remove stderr=subprocess.PIPE so both stdout and stderr
  are visible during pulls (progress bars, layer downloads)
test_sim_module.py was replaced by test_sim_adapter.py in PR #1639
(MuJoCo sim support for Manipulation). Accept dev's deletion.
When docker pull fails, the error message now includes the actual
output to help diagnose auth/network/registry issues.

Revert: git revert HEAD
Test file used ExceptionGroup without importing it, causing NameError
on Python < 3.11. Import from safe_thread_map where it's polyfilled.

Revert: git revert HEAD
Keep both the sleep (for watchdog thread cleanup) and explicit stop()
(for LCM transport cleanup).
@jeff-hykin jeff-hykin marked this pull request as draft March 31, 2026 23:40
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 31, 2026

Greptile Summary

This PR is a significant architectural refactor of the DimOS module deployment pipeline. It consolidates Docker tooling (docker_build.py + docker_runner.pydocker_module.py), splits the monolithic WorkerManager into WorkerManagerDocker and WorkerManagerPython, and introduces a clean compile-then-execute pattern in Blueprint.build() via the new DeploySpec intermediate representation. New utilities (ThreadSafeVal, ModuleThread, AsyncModuleThread, ModuleProcess, safe_thread_map) provide well-tested thread-safety primitives used throughout the new managers.

Key changes and concerns:

  • P1 — FieldInfo instead of dict in RPCClient.__init__: getattr(default_config, "rpc_timeouts", ...) where default_config is a Pydantic model class returns a FieldInfo descriptor, not the factory default. This will cause dict(FieldInfo) to fail inside PubSubRPCMixin.__init__ for every Python-worker module. Acknowledged in changes.md as "not addressed".
  • P1 — g=global_config forwarded into Docker module config: Both WorkerManagerDocker.deploy() and deploy_parallel() pass g=GlobalConfig() as a kwarg to DockerModuleOuter, which then passes all kwargs to config_class(**kwargs). Unless BaseConfig silently ignores unknown fields, this is a Pydantic ValidationError for every Docker module deployment.
  • P2 — Build hash misses DimOS footer: _compute_build_hash hashes the raw Dockerfile but build_image builds the DimOS-converted file. Footer / module-install.sh changes won't trigger a rebuild.
  • P2 — Undocumented breaking defaults: docker_gpus changed from "all" to None and docker_restart_policy from "on-failure:3" to "no" with no migration guidance in the PR description.
  • P2 — changes.md committed: Developer scratch notes ("Reviewer was wrong on", "Not addressed") should not be part of the repository history.

Confidence Score: 4/5

  • Not safe to merge as-is — two P1 defects will break RPCClient construction for every module and Docker module deployment.
  • The architectural direction is sound and the thread utilities are excellent, but the FieldInfo/Pydantic class-vs-instance issue in rpc_client.py and the spurious g=global_config forwarding in worker_manager_docker.py are active runtime defects that would surface immediately in any end-to-end test exercising Python workers or Docker modules respectively.
  • dimos/core/rpc_client.py (FieldInfo bug) and dimos/core/worker_manager_docker.py (g=global_config forwarding) require fixes before merge.

Important Files Changed

Filename Overview
dimos/core/rpc_client.py Adds ModuleProxyProtocol, per-module RPC timeout support, and a build() stub on ModuleProxy; but RPCClient.__init__ reads rpc_timeouts from a Pydantic model class (returns FieldInfo, not a dict), which will break at runtime.
dimos/core/worker_manager_docker.py New manager for Docker-backed modules; deploy() and deploy_parallel() both pass g=global_config into DockerModuleOuter.__init__, which forwards it to config_class(**kwargs) — a Pydantic validation error for any module with strict extra-field handling.
dimos/core/docker_module.py Merges docker_build.py into docker_runner.py and renames to docker_module.py; adds content-hash rebuild detection, reconnect mode, DockerModuleOuter/DockerModuleInner split, and __main__ module-path resolution; breaking defaults for docker_gpus and docker_restart_policy undocumented; build hash omits DimOS footer.
dimos/core/blueprints.py Refactors Blueprint to a compile-then-execute pattern with DeploySpec; logic correctness preserved via equivalent if-elif chains; _compile_rpc_wiring now stores method names instead of callables (deferred to execution phase).
dimos/core/module_coordinator.py Split WorkerManager into WorkerManagerDocker + WorkerManagerPython; adds _build_all_modules() lifecycle step; raises ValueError for empty deployments which may be overly strict; stop() now suppresses exceptions per-module, logging them instead.
dimos/core/worker_manager_python.py New file: replaces WorkerManager with dynamic worker scaling (worker_to_module_ratio), proper cleanup on partial deploy failure, and health_check delegated to worker PIDs; looks correct.
dimos/utils/thread_utils.py New file: introduces ThreadSafeVal (RLock-backed), ModuleThread, AsyncModuleThread, ModuleProcess, and safe_thread_map; well-designed with RLock re-entrancy fix, idempotent stops, and comprehensive test coverage.
dimos/protocol/rpc/spec.py Extracts DEFAULT_RPC_TIMEOUT/DEFAULT_RPC_TIMEOUTS constants; refactors call_sync timeout resolution to consult per-method rpc_timeouts map (full topic name, then bare method name fallback); protocol updated accordingly.
dimos/core/module.py Adds build() RPC method (24h timeout default), per-module RPC timeout fields on ModuleConfig, and configurable loop-thread join timeout; removes configure_stream in favour of the unified set_transport path.
dimos/utils/typing_utils.py New file: polyfills ExceptionGroup for Python < 3.11 and TypeVar for Python < 3.13; clean and correct.
changes.md Developer working notes file; should not be committed to the repository.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Blueprint.build()"] --> B["Phase 1: Config\nglobal_config.update()"]
    B --> C["Phase 2: Validate\n_check_requirements()"]
    C --> D["Phase 3: Compile\nDeploySpec (pure)"]
    D --> D1["_compile_module_specs()"]
    D --> D2["_compile_stream_wiring()"]
    D --> D3["_compile_rpc_wiring()"]
    D --> D4["_compile_module_ref_wiring()"]
    D1 & D2 & D3 & D4 --> E["Phase 4: Execute\nModuleCoordinator(deploy_spec)"]
    E --> F["deploy_parallel()"]
    F --> G{Docker module?}
    G -- Yes --> H["WorkerManagerDocker\n.deploy_parallel()"]
    G -- No --> I["WorkerManagerPython\n.deploy_parallel()"]
    H --> J["DockerModuleOuter.build()\n(pull/build image, start container,\nwait for RPC)"]
    I --> K["Worker.deploy_module()\n+ RPCClient"]
    J & K --> L["_wire_streams()"]
    L --> M["_wire_rpc_methods()"]
    M --> N["_wire_module_refs()"]
    N --> O["_build_all_modules()\n(calls build() RPC on each)"]
    O --> P["start_all_modules()\n(calls start() RPC on each)"]
Loading

Comments Outside Diff (2)

  1. dimos/core/rpc_client.py, line 1808-1811 (link)

    P1 FieldInfo returned instead of dict for class-level Pydantic access

    default_config here is a Pydantic model class (e.g. ModuleConfig), not an instance. In Pydantic v2, accessing a field with default_factory at the class level returns the FieldInfo descriptor object, not the factory's result. The getattr(default_config, "rpc_timeouts", dict(DEFAULT_RPC_TIMEOUTS)) fallback only fires when the attribute is absent, but rpc_timeouts is defined on ModuleConfig, so it will always be present as a FieldInfo. Passing that FieldInfo to LCMRPC(rpc_timeouts=FieldInfo(...)) and then calling dict(FieldInfo) inside PubSubRPCMixin.__init__ will raise a TypeError.

    This is acknowledged in changes.md under "Not addressed", but it will break RPCClient construction for any module that uses the default ModuleConfig.

    A safe fix is to instantiate a temporary config object to read the defaults:

    default_config_cls = getattr(actor_class, "default_config", None)
    try:
        _tmp = default_config_cls() if default_config_cls is not None else None
    except Exception:
        _tmp = None
    self.rpc = LCMRPC(
        rpc_timeouts=getattr(_tmp, "rpc_timeouts", dict(DEFAULT_RPC_TIMEOUTS)),
        default_rpc_timeout=getattr(_tmp, "default_rpc_timeout", DEFAULT_RPC_TIMEOUT),
    )
  2. dimos/core/docker_module.py, line 1214-1224 (link)

    P2 Build hash covers original Dockerfile, not the DimOS-converted one

    _compute_build_hash hashes cfg.docker_file.read_bytes() — the raw, unmodified Dockerfile — but build_image actually builds from the converted file that includes DIMOS_FOOTER. This means the stored hash will never reflect changes to DIMOS_FOOTER, docker/python/module-install.sh, or the DimOS source tree itself. An updated DimOS version would not trigger a rebuild even though the resulting image would be different.

    changes.md acknowledges this under "Not addressed". At minimum the hash should include the DIMOS_FOOTER string so footer changes force a rebuild. Ideally it would also hash module-install.sh and the DimOS source, but that's a bigger change.

Reviews (1): Last reviewed commit: "split ModuleCoordinator from Blueprint" | Re-trigger Greptile

Comment on lines +37 to +40
self._deployed: list[DockerModuleOuter] = []

def should_manage(self, module_class: type) -> bool:
# inlined to prevent circular dependency
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 g=global_config leaks into Docker module config constructor

DockerModuleOuter.__init__ accepts **kwargs and passes them all directly to config_class(**kwargs). So this call adds g=GlobalConfig(...) to the kwargs dict that becomes arguments to DockerModuleConfig(...) (or a subclass). Unless BaseConfig / ModuleConfig is configured with model_config = ConfigDict(extra='ignore'), Pydantic will raise a ValidationError for the unknown g field — making every Docker module deployment fail.

The # type: ignore[arg-type] annotation hints the author is aware of a typing mismatch, but the runtime failure risk is real. The global_config argument does not belong in the Docker module's config; it should be used only to configure the manager itself (as WorkerManagerPython does).

# In deploy(), use config only to route; don't forward g= into the module config
mod = DockerModuleOuter(module_class, **kwargs)  # type: ignore[arg-type]

Comment on lines +55 to +57
from dimos.core.docker_module import DockerModuleOuter

mod = DockerModuleOuter(module_class, g=global_config, **kwargs) # type: ignore[arg-type]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Same g=global_config issue in deploy_parallel

Same problem as in deploy()g=spec[1] (which is a GlobalConfig) is forwarded into DockerModuleOuter.__init__ and ultimately into config_class(g=GlobalConfig(), ...).

mod = DockerModuleOuter(spec[0], **spec[2])  # type: ignore[arg-type]

Comment on lines +675 to +677
context = cfg.docker_build_context or cfg.docker_file.parent
cmd = [cfg.docker_bin, "build", "-t", cfg.docker_image, "-f", str(dockerfile)]
cmd.extend(["--label", f"{_BUILD_HASH_LABEL}={build_hash}"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent breaking defaults for GPU and restart policy

Two defaults have changed relative to the previous release:

  • docker_gpus: "all"None — disables GPU passthrough
  • docker_restart_policy: "on-failure:3""no" — containers no longer restart on crash

Both are acknowledged in changes.md as "intentional breaking change?" with no migration guidance. Users upgrading who relied on GPU access or automatic restarts will silently lose those behaviours. The PR description's Breaking Changes section says only "None".

These should either be documented as breaking changes or preserved as opt-in/opt-out flags with a deprecation path.

changes.md Outdated
Comment on lines +1 to +24
# PR #1431 (Docker Restoration) — Paul Review Fixes

## Commits (local, not pushed)

### 1. `317c487a2` — Include stdout/stderr in docker pull error
- Pull failures were silent — no diagnostic output
- Now includes both stdout and stderr in exception
- **Revert:** `git revert 317c487a2`

### 2. `91a13f1e7` — Import ExceptionGroup in test file
- Test used ExceptionGroup without import → NameError on Python < 3.11
- Now imports from safe_thread_map polyfill
- **Revert:** `git revert 91a13f1e7`

## Reviewer was wrong on
- `rpc_timeouts` class-level mutable dict — it's in ModuleConfig (pydantic) with `Field(default_factory=...)`, which is correct

## Not addressed (need Jeff's input / bigger refactor)
- Container launch in `__init__` vs `start()` — lifecycle redesign
- Deterministic container naming (removed PID+timestamp) — collision risk
- `docker_gpus` default None (was "all") — intentional breaking change?
- `docker_restart_policy` default "no" (was "on-failure:3") — same
- Build hash includes original Dockerfile, not converted (with footer)
- `getattr(default_config, "rpc_timeouts", ...)` returns FieldInfo on class
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Developer scratch notes committed to the repository

changes.md appears to be a personal working notes file ("Commits (local, not pushed)", "Reviewer was wrong on", "Not addressed (need Jeff's input)"). Committing this kind of internal review dialogue to the repo history adds noise and may inadvertently expose in-flight design discussions. Consider removing it from the PR and keeping it in a local branch note, PR description body, or Jira comment instead.

…ordinator.py refactor

Conflicts resolved:
- blueprints.py, module_coordinator.py: kept docker4's DeploySpec architecture
- All other conflicts: took dev's versions
- Removed docker4-only files (thread_utils, typing_utils, worker_manager_python, worker_python)
- Restored dev's worker.py and worker_manager.py
@jeff-hykin jeff-hykin changed the title Jeff/fix/docker4 ModuleCoordinator & Blueprint.py changes Apr 1, 2026
- Added _DisabledModuleProxy class back to blueprints.py
- Added optional field to ModuleRef
- _compile_module_ref_wiring now detects disabled providers and creates no-op proxies
- DeploySpec carries disabled_ref_proxies
- ModuleCoordinator._wire_disabled_ref_proxies wires them after module refs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants