Skip to content

Refactor: collapse ChipWorker init/set_device → init(device_id, bins)#723

Open
ChaoWao wants to merge 1 commit intohw-native-sys:mainfrom
ChaoWao:refactor/chip-worker-fold-set-device
Open

Refactor: collapse ChipWorker init/set_device → init(device_id, bins)#723
ChaoWao wants to merge 1 commit intohw-native-sys:mainfrom
ChaoWao:refactor/chip-worker-fold-set-device

Conversation

@ChaoWao
Copy link
Copy Markdown
Collaborator

@ChaoWao ChaoWao commented May 8, 2026

Summary

ChipWorker exposed init / set_device / reset_device / finalize as four public lifecycle methods even though set_device existed only to satisfy CANN's per-thread aclrtSetDevice attach requirement (#715), and reset_device had zero call sites. The split was already the source of one bug (#715) and forced every caller through a redundant ceremony.

This PR collapses the lifecycle to a single attach point at init.

What changed

C++ (src/common/worker/chip_worker.{h,cpp})

  • ChipWorker::init now takes device_id and attaches the calling thread internally via simpler_init
  • ChipWorker::set_device and ::reset_device deleted
  • device_set_ flag and device_set() getter collapse into initialized_
  • finalize() simplified — calls finalize_device_fn_ directly

C-API (pto_runtime_c_api.h + a2a3/a5 onboard+sim implementations)

  • set_device export removed
  • simpler_init signature extended to (ctx, device_id, log_level, log_info_v) and now returns int; folds the per-thread attach in alongside log config

DeviceRunner attach surface (a2a3 + a5, onboard + sim)

  • attach_current_thread(int) is the centralized binder. Onboard: rtSetDevice. Sim: pto_cpu_sim_bind_device + pto_cpu_sim_acquire_device.
  • device_id_ stays a plain int. All ChipWorker callers run on the same thread that called init(), so the standard thread-spawn happens-before edge is sufficient — no atomic, no CAS, no per-op re-attach.

Python

  • _ChipWorker.init / ChipWorker.init take device_id (and the wrapper takes it before bins, since it's the simpler, more fundamental "where" parameter)
  • _ChipWorker.set_device / .reset_device bindings removed; device_set property removed
  • bootstrap_context no longer calls set_device — the device is attached at init time, the param is now used only as a defensive consistency check

Drive-by fix: ProfilerBase's constructor/destructor are now private with friend Derived (bugprone-crtp-constructor-accessibility). Newer clang-tidy versions on macOS would otherwise reject any TU that transitively includes profiler_base.h.

Callers updated

python/simpler/worker.py, simpler_setup/scene_test.py, tests/ut/py/test_chip_worker.py, tests/ut/py/test_worker/test_platform_comm.py, and test_bootstrap_context_{sim,hw}.py collapse init + set_device into a single init(device_id, bins) call.

Docs

docs/chip-level-arch.md, docs/dynamic-linking.md, docs/getting-started.md, plus the worker_malloc / hello_worker examples reflect the new flow.

Test plan

  • `pip install --no-build-isolation -e .` — clean build (a2a3sim + a5sim) on macOS
  • `pytest tests/ut/py` — 116 passed, 7 skipped (the only unrelated failures need `torch`, which is not installed in the local env)
  • `python examples/workers/l2/{hello_worker,worker_malloc}/main.py -p {a2a3sim,a5sim} -d 0` — passes
  • Onboard CI: `ut-a2a3` / `ut-a5` / `st-onboard-*` (Linux, hardware) — needs CI run
  • `ctest --test-dir build/ut_cpp` (Linux) — needs CI run

@ChaoWao ChaoWao force-pushed the refactor/chip-worker-fold-set-device branch from efc0058 to 5528c02 Compare May 8, 2026 09:28
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the ChipWorker API by integrating NPU device attachment into the init method and removing the explicit set_device and reset_device calls. Device operations such as malloc, free, and copy now internally handle thread attachment to ensure safety across different calling threads. Documentation, examples, and tests have been updated to reflect these changes. Feedback identifies potential data races in the initialization and attachment logic, along with a resource leak in the free_tensor error path.

Comment thread src/common/worker/chip_worker.cpp Outdated
Comment thread src/a2a3/platform/sim/host/device_runner.cpp
Comment thread src/a5/platform/sim/host/device_runner.cpp
Comment thread src/a2a3/platform/onboard/host/device_runner.cpp Outdated
Comment thread src/a5/platform/onboard/host/device_runner.cpp Outdated
@ChaoWao ChaoWao force-pushed the refactor/chip-worker-fold-set-device branch 2 times, most recently from 00889af to f48e496 Compare May 9, 2026 08:40
@ChaoWao ChaoWao changed the title Refactor: collapse ChipWorker init/set_device, fold per-thread attach into device-ops Refactor: collapse ChipWorker init/set_device → init(device_id, bins) May 9, 2026
ChipWorker exposed init / set_device / reset_device / finalize as four
public lifecycle methods even though set_device existed only to satisfy
CANN's per-thread aclrtSetDevice attach requirement (PR hw-native-sys#715), and
reset_device had zero call sites. The split was the source of one bug
(hw-native-sys#715) and forced every caller through a redundant ceremony.

Collapse into a single attach point:

- ChipWorker::init now takes device_id and attaches the calling thread
  internally via simpler_init; ChipWorker::set_device and ::reset_device
  are deleted, and device_set_ / device_set() collapse into initialized_.
- C-API set_device export is removed; simpler_init grows a device_id
  parameter and returns int. The per-thread attach is centralized inside
  DeviceRunner::attach_current_thread (sim DeviceRunner gains the same
  method, driving pto_cpu_sim_bind_device + idempotent acquire_device).
  All ChipWorker callers run on the same thread that called init, so
  device_id_ stays a plain int — no atomic / no CAS needed.
- Python `ChipWorker.init` wrapper takes (device_id, bins) — device_id
  first since it's the simpler, more fundamental "where" parameter.
  bootstrap_context's device_id arg becomes a defensive consistency
  check rather than the attach point.
- Callers updated: python/simpler/worker.py, simpler_setup/scene_test.py,
  tests/ut/py/test_chip_worker.py, tests/ut/py/test_worker/test_platform_comm.py
  and test_bootstrap_context_{sim,hw}.py.
- Docs (chip-level-arch, dynamic-linking, getting-started) and the
  worker_malloc / hello_worker examples reflect the new flow.

Drive-by: ProfilerBase's constructor/destructor are now private with
`friend Derived` (bugprone-crtp-constructor-accessibility); newer
clang-tidy versions on macOS would otherwise reject any TU that
transitively includes profiler_base.h.
@ChaoWao ChaoWao force-pushed the refactor/chip-worker-fold-set-device branch from f48e496 to 9939bd3 Compare May 9, 2026 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant