
Feat: prepared callable — register + run(cid) on a unified ABI#710

Open

poursoul wants to merge 28 commits into hw-native-sys:main from poursoul:feat-callable

Conversation

poursoul (Collaborator) commented May 6, 2026

Summary

Introduces the prepared callable dispatch path and unifies the L2 / L3+
API on register() + run(cid) / submit_*(cid). Replaces per-launch
dlclose + dlopen of the orch SO on the AICPU with a one-time-per-cid
upload, then removes the legacy run_runtime ABI altogether.

  • New host↔AICPU protocol (src/common/task_interface/callable_protocol.h):
    AICPU keeps a fixed orch_so_table_[MAX_REGISTERED_CALLABLE_IDS] (cap 64);
    host registers each callable once, AICPU dispatches by callable_id.
  • C ABI: adds prepare_callable / run_prepared / unregister_callable
    on every variant (a2a3 + a5, both host_build_graph and
    tensormap_and_ringbuffer); drops run_runtime / init_runtime_impl and
    the RUNTIME_HAS_CALLABLE_ID / RUNTIME_HOST_ORCH compile-time macros.
  • DeviceRunner (onboard + sim) gains prepared_callables_ keyed by cid,
    an orch_so_dedup_ table that refcounts identical SO bytes by Build-ID,
    and aicpu_seen_callable_ids_ so the AICPU is registered once per cid.
  • Python / nanobind: Worker.register(target) -> cid is the single entry
    point for sub-fn / orch-fn / ChipCallable at every level; Worker.run,
    orch.submit_next_level, orch.submit_sub now take cid. L3+ forbids
    post-init() registration so forked chip / sub children inherit the
    registry via COW; L2 still allows post-init register and pre-warms on the
    spot.
  • Mailbox carries the cid (Stage 3 protocol); _chip_process_loop
    consolidates args parsing in C++ and walks the raw blob path.
  • Examples (vector_add, child_memory, ffn_tp_parallel,
    multi_chip_dispatch, allreduce_distributed, async-notify demos)
    migrated to the cid API. Getting-started doc updated.
  • Tests: adds prepared_callable ST suite under all four
    {a2a3, a5} × {host_build_graph, tensormap_and_ringbuffer} variants,
    plus tests/ut/cpp/common/test_orch_so_file.cpp and an
    aicpu_dlopen_count getter to assert the one-load-per-cid invariant.

Backwards-compatibility shims and dual paths are removed in the same PR
(Phases 3–4 commits) so there is no --legacy flag to maintain.
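For concreteness, a minimal sketch of the unified C ABI surface this PR standardises on. Only the three entry-point names (prepare_callable / run_prepared / unregister_callable) come from this PR; the parameter lists below are illustrative assumptions.

```cpp
// Hypothetical signatures for the unified per-variant C ABI; the three
// symbol names are real, the parameters are assumed.
#include <cstddef>
#include <cstdint>

extern "C" {

// One-time per cid: uploads the orch SO (and kernels) and registers the
// slot with the AICPU. Returns 0 on success, -1 on failure.
int prepare_callable(int32_t callable_id,
                     const void* orch_so_bytes, size_t orch_so_size);

// Per launch: binds the cached per-cid state to a fresh Runtime and runs
// without re-uploading any bytes.
int run_prepared(int32_t callable_id,
                 const void* task_args, size_t task_args_size);

// Drops the slot; SO images deduplicated by Build-ID are released once
// the last cid sharing them is unregistered.
int unregister_callable(int32_t callable_id);

}  // extern "C"
```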


gemini-code-assist (Bot) left a comment

Code Review

This pull request implements a 'prepared callable' mechanism to reduce orchestration overhead by caching kernel binaries and SO handles across repeated launches. It transitions the system to use registered callable_id integers instead of raw pointers in submit_next_level and run calls. The changes span the Python API, C++ bindings, and the AICPU executor, which now maintains a per-ID orchestration table. Review feedback identifies a performance issue in the Python chip loop due to TaskArgs instantiation and points out missing bounds checks for callable_id in both the host-side registration and the AICPU-side dispatch logic.

Review threads were opened on:
- python/simpler/worker.py (outdated)
- src/a2a3/platform/onboard/host/device_runner.cpp
- src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (outdated)
poursoul force-pushed the feat-callable branch 2 times, most recently from 51d2d8f to c500507 on May 7, 2026 03:36
poursoul added 3 commits May 9, 2026 16:25
Foundation for the callable.md design: lift the per-run dlclose+dlopen
on the AICPU (caused by alternating callables) to a one-time-per-callable_id
load. Adds active_callable_id_/register_new_callable_id_ to the Runtime
struct and a 64-slot orch_so_table_ on the AICPU executor.

active_callable_id_ < 0 keeps the legacy single-slot path (governed by
has_new_orch_so_) untouched, so existing run_runtime() callers and all
six other variants continue to work without changes.

Verified:
  - tests/ut/py/test_chip_worker.py: 12/12 pass on a2a3sim
  - examples/.../vector_example: pass on a2a3sim
Implement Layer 3 of the per-callable_id dispatch protocol described in
docs/callable.md. Splits the legacy run_runtime path into a one-time
prepare phase (uploads orch SO + kernels, builds the per-cid metadata)
and a per-call run phase (binds cached state to a fresh Runtime, then
launches without re-uploading bytes).

- Extract prepare_callable_impl / bind_prepared_to_runtime_impl out of
  init_runtime_impl in trb runtime_maker.cpp so the c_api layer can
  drive the prepare/run split independently.
- DeviceRunner (onboard + sim) gains prepared_callables_ keyed by
  callable_id, an orch_so_dedup_ table that refcounts identical SO
  bytes by Build-ID hash, and aicpu_seen_callable_ids_ to drive
  register_new_callable_id_ on first sighting per cid.
- prepare_orch_so resolves the active callable_id when present and
  short-circuits the H2D upload to the cached buffer; legacy callers
  with cid<0 still take the original pending_orch_so path.
- New ABI exported from pto_runtime_c_api.{h,cpp} on both platforms.
  Variants without callable.md support (host_build_graph,
  aicpu_build_graph) export stubs that return -1, gated by
  RUNTIME_HAS_CALLABLE_ID defined only in the trb runtime.h, so the
  shared device_runner.cpp compiles cleanly across all six variants.
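A rough sketch of the DeviceRunner bookkeeping this commit introduces; the three member names come from the commit message, while the field layouts and types are assumptions.

```cpp
// Assumed shapes for the per-cid state; only the member names
// prepared_callables_, orch_so_dedup_, and aicpu_seen_callable_ids_
// are from the commit message.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct PreparedCallableState {
  std::vector<uint8_t> orch_so_bytes;  // staged image, uploaded once per cid
  uint64_t dev_orch_so_addr = 0;       // device-side copy after the H2D
};

struct OrchSoDedupEntry {
  uint64_t dev_addr = 0;  // one device buffer shared by identical images
  int refcount = 0;       // cids currently bound to this image
};

class DeviceRunner {
  // Per-cid prepared state, filled by the prepare phase.
  std::unordered_map<int32_t, PreparedCallableState> prepared_callables_;
  // Identical SO bytes (keyed by Build-ID hash) share one upload.
  std::unordered_map<std::string, OrchSoDedupEntry> orch_so_dedup_;
  // cids the AICPU has already seen; a first sighting drives
  // register_new_callable_id_ so the AICPU dlopens once per cid.
  std::unordered_set<int32_t> aicpu_seen_callable_ids_;
};
```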
… + Python

Layer 4 of the callable.md migration: drive the per-callable_id C ABI
(introduced in fc721150) end-to-end through ChipWorker, the nanobind
surface, and the Python wrapper, plus a sticky flag in DeviceRunner
that keeps finalize's "kernel still cached" leak signal honest now that
the prepared-callable path legitimately keeps kernels resident until
finalize.

- ChipWorker (src/common/worker): dlsym the new symbols and add
  prepare_callable / run_prepared / unregister_callable methods with
  device-not-set guards. Stubs in non-trb variants surface the runtime
  rejection as a thrown error.
- nanobind: bind the three methods on _ChipWorker so the Python wrapper
  can drive them without a separate raw-pointer path.
- Python wrapper (simpler.task_interface.ChipWorker): thin pass-through
  that mirrors run()'s **kwargs config-override pattern.
- DeviceRunner.finalize: distinguish legacy-path "still-cached kernels"
  leaks from prepared-callable kernels that live until finalize by
  design. Uses a sticky prepared_callable_path_used_ flag set by
  register_prepared_callable (never cleared, so a post-unregister
  finalize still routes to DEBUG instead of ERROR).
- tests/ut/py/test_chip_worker.py: 3 new state-machine guards covering
  the new methods before set_device.
- tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable: new e2e
  test that prepares two callable_ids sharing the vector_example orch
  SO, runs each via run_prepared (cid=0 twice to hit the dedup path),
  then unregisters both.

Verified:
- tests/ut/py/test_chip_worker.py: 15/15 PASSED
- prepared_callable test: PASSED on a2a3sim
- paged_attention_unroll on a2a3 hardware (--device 9): PASSED
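The ChipWorker side of this commit could look roughly like the sketch below; the three symbol names and the device-not-set guard are from the commit, while the typedefs, member names, and error text are assumptions.

```cpp
// Assumed wiring for ChipWorker's dlsym of the new C ABI symbols.
#include <dlfcn.h>
#include <cstdint>
#include <stdexcept>

class ChipWorker {
  using PrepareCallableFn = int (*)(int32_t, const void*, uint64_t);
  using RunPreparedFn = int (*)(int32_t, const void*, uint64_t);
  using UnregisterCallableFn = int (*)(int32_t);

  void* runtime_handle_ = nullptr;  // dlopen'd runtime variant SO
  PrepareCallableFn prepare_callable_fn_ = nullptr;
  RunPreparedFn run_prepared_fn_ = nullptr;
  UnregisterCallableFn unregister_callable_fn_ = nullptr;
  bool device_set_ = false;

 public:
  void bind_symbols() {
    prepare_callable_fn_ = reinterpret_cast<PrepareCallableFn>(
        dlsym(runtime_handle_, "prepare_callable"));
    run_prepared_fn_ = reinterpret_cast<RunPreparedFn>(
        dlsym(runtime_handle_, "run_prepared"));
    unregister_callable_fn_ = reinterpret_cast<UnregisterCallableFn>(
        dlsym(runtime_handle_, "unregister_callable"));
  }

  int run_prepared(int32_t cid, const void* args, uint64_t size) {
    // Device-not-set guard; non-trb variants stub the ABI to -1 and the
    // rejection surfaces to Python as a thrown error.
    if (!device_set_) throw std::runtime_error("set_device() not called");
    return run_prepared_fn_(cid, args, size);
  }
};
```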
poursoul added 24 commits May 9, 2026 16:25
Stage 2 of docs/callable.md: make the prepare/run_prepared/unregister
ABI uniform across all 5 valid runtime variants (3 a2a3 + 2 a5) so
ChipWorker dlsym is independent of which variant is loaded.

- a5/platform/{onboard,sim}/host/pto_runtime_c_api.cpp: add
  unconditional stubs (prepare_callable/run_prepared return -1 with
  LOG_ERROR; unregister_callable returns 0). a5 has no
  RUNTIME_HAS_CALLABLE_ID-aware path yet, so the stubs are the entire
  surface; full support is deferred until a5 picks up the per-cid
  orch SO dispatch.
- python/simpler/worker.py: add L2 facade methods on Worker that
  forward to the underlying ChipWorker. The ST framework's
  conftest.st_worker fixture wraps ChipWorker in Worker(level=2),
  so prepared_callable e2e tests (and any future caller going through
  Worker) need this thin pass-through. L3+ still raises
  NotImplementedError pending Stage 3 (mailbox protocol switch to cid).

a2a3/{host_build_graph,aicpu_build_graph} required no source changes:
the platform code is shared across the three a2a3 variants and was
already gated by `#ifdef RUNTIME_HAS_CALLABLE_ID`, which only
tensormap_and_ringbuffer's runtime.h defines. The non-trb variants
fall through to the existing `#else` stub branch automatically.

Verified on sim only:
- 5 variants compile clean (a2a3sim x3, a5sim x2; a5 has no
  aicpu_build_graph).
- UT test_chip_worker.py 15/15.
- a2a3sim ST sample: host_build_graph 4/4, aicpu_build_graph 3/3,
  tensormap_and_ringbuffer 4/4 (incl. prepared_callable e2e).
- a5sim ST: host_build_graph 1/1, tensormap_and_ringbuffer 10/10.
…pre-warm

Replace the NEXT_LEVEL raw ChipCallable* pointer path with a unified
callable_id (cid) protocol:

C++ core:
- Remove TaskSlotState::callable (uint64_t ptr) field; unify on callable_id
- Orchestrator::submit_next_level now takes int32_t callable_id
- dispatch_thread/dispatch_process write cid into mailbox for both
  NEXT_LEVEL and SUB worker types

Python runtime:
- Worker.register() accepts ChipCallable in addition to Python fns;
  returns cid from a single shared id space
- _chip_process_loop / _chip_process_loop_with_bootstrap: accept registry
  dict, read cid from mailbox, lazy-prepare + run_prepared
- New _CTRL_PREPARE (=4) control command for explicit pre-warming
- _start_hierarchical: after init(), pushes _CTRL_PREPARE to every chip
  child for each registered ChipCallable (fixes first-run latency spike)
- Orchestrator.submit_next_level raises TypeError on raw ChipCallable
  (migration guide: use Worker.register + pass cid)

Nanobind:
- _Orchestrator binding: submit_next_level takes int32_t callable_id
- _ChipWorker.run_prepared: add TaskArgs overload (chip child path)

Test infrastructure:
- conftest.py st_worker L3: register ChipCallable entries before init
- scene_test.py _create_standalone_worker: compile + register ChipCallable
  before init; CallableNamespace exposes cid (int) not ChipCallable
- Migrate 7 L3 examples/demos to register + cid pattern
- C++ UTs: submit_next_level(int32_t, ...) signatures

Verified: C++ UT 17/17, Python UT 70/70 (65+5), a2a3sim L3 ST 3/3,
a5sim ST 10/10, prepared_callable L2 e2e 1/1.
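The mailbox contract after this stage, as a sketch: the TaskSlotState change and the int32_t callable_id signature are from the commit; the surrounding layout is assumed.

```cpp
// Assumed shapes; only callable_id replacing the raw uint64_t pointer and
// the int32_t submit_next_level signature come from the commit.
#include <cstdint>

struct TaskSlotState {
  int32_t callable_id = -1;  // replaces the removed uint64_t callable ptr
  // ... task args blob location, worker type, etc. elided ...
};

struct ChipMailbox {
  int32_t callable_id;  // written by dispatch_thread/dispatch_process for
                        // both NEXT_LEVEL and SUB; read by the chip child
                        // before it calls run_prepared(cid, ...)
  // ... control command and args blob elided ...
};

// The orchestrator surface takes a cid, never a raw ChipCallable pointer.
void submit_next_level(int32_t callable_id /*, args ... */);
```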
Promote the Stage 3 cid contract to the L2 entry point so every level of
the hierarchy speaks the same dispatch surface.

Worker (level=2):
- register() now also accepts ChipCallable; returns a cid from the
  unified id space (callable.md §3.4).  May be called either before or
  after init() — L2 has no fork/COW constraint.  Pre-init registrations
  are batched and prepared at the end of init(); post-init registrations
  prepare on the device immediately.
- run(cid, args, cfg) routes through _chip_worker.run_prepared.
- _l2_use_prepared probe: when the bound runtime variant lacks
  prepare_callable support (the host_build_graph / aicpu_build_graph
  stubs return -1 — see Stage 2), the first prepare attempt flips the
  flag and every subsequent run() silently falls back to the legacy
  _chip_worker.run lower-level binding.

Rollback knob:
- PTO2_DISABLE_PREPARED_CALLABLE=1 forces L2 onto the legacy lower-level
  binding (skips prepare at init, resolves cid back to its ChipCallable
  at run time).  L3+ paths are unaffected — the cid mailbox protocol has
  no legacy fallback.

scene_test.py:
- _run_and_validate_l2 now register()s the compiled ChipCallable once
  per class (cached via _st_l2_cid) and calls Worker.run(cid, …).

Verified: Python UT 80/80 (15 chip + 65 worker), a2a3sim L2
host_build_graph 4/4 (auto fallback), aicpu_build_graph 3/3, trb
spmd_sync_start (with and without PTO2_DISABLE_PREPARED_CALLABLE=1),
prepared_callable e2e 1/1.
Expose a monotonic counter of distinct callable_ids the AICPU has been
asked to dlopen for, so tests can assert per-cid registration eliminates
redundant dlopens across repeated runs (callable.md §7 verification).

- DeviceRunner (a2a3 onboard + sim): track aicpu_dlopen_total_, bumped
  on first-sighting bind; not decremented by unregister so case D
  (unregister + re-prepare) reports +2
- C ABI: get_aicpu_dlopen_count exported by all 4 a2a3/a5 variants;
  a5 + non-trb a2a3 return 0 (no per-cid registration there)
- ChipWorker / nanobind / Python wrappers: aicpu_dlopen_count property
  on _ChipWorker, ChipWorker, and Worker (L2-only; non-L2 returns 0)
- tests/st prepared_callable: 4 new test methods asserting counter
  delta for same-cid repeat (1), two-cid interleaving (2), double
  prepare (RuntimeError), and unregister + re-prepare (2). Each test
  snapshots baseline on entry and unregisters on exit so the shared
  st_worker fixture stays clean between cases.
- Apply clang-format on src/a5/platform/sim/host/pto_runtime_c_api.cpp
  and src/common/worker/chip_worker.h (pre-commit fix).
- register_prepared_callable: enforce callable_id in [0, 64) in both
  a2a3 onboard and sim DeviceRunner so an out-of-range id fails fast on
  host instead of OOB-indexing the AICPU orch_so_table_ later.
- aicpu_executor: reject negative callable_id values other than the
  legacy -1 sentinel (mirrors the upper-bound guard).
- tests/st/explicit_fatal: migrate to Stage 4 register + run(cid) API
  so the negative ST works under the unified run(cid) entry point.
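A sketch of the fail-fast host guard this commit adds; the [0, 64) bound is the real protocol constraint (centralised into callable_protocol.h by the next commit), while the function shape is illustrative.

```cpp
#include <cstdint>
#include <cstdio>

int register_prepared_callable(int32_t callable_id /*, SO bytes ... */) {
  // Fail fast on the host instead of letting the AICPU index
  // orch_so_table_[callable_id] out of bounds later.
  if (callable_id < 0 || callable_id >= 64) {
    std::fprintf(stderr, "callable_id %d outside [0, 64)\n", callable_id);
    return -1;
  }
  // ... stage SO bytes, dedup by Build-ID, mark the cid for the AICPU ...
  return 0;
}
```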
Previously the upper bound was hard-coded as `64` in three independent
places (the a2a3 onboard/sim DeviceRunner host bounds checks and the
AICPU executor's `orch_so_table_[]` declaration), under inconsistent
names (`kMaxCallableId` vs `MAX_REGISTERED_CALLABLE_IDS`). They are the
same protocol constant — diverging would silently break the host↔AICPU
contract.

Move the constant into a new `src/common/task_interface/callable_protocol.h`
header (cstdint-only so the AICPU side can include it without dragging
in `<vector>`/`<stdexcept>` from `callable.h`) and have all three
call sites reference it.
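The resulting header is small by design; a plausible minimal version, with only the constant itself taken from the commit:

```cpp
// src/common/task_interface/callable_protocol.h (sketch). cstdint-only so
// the AICPU executor can include it without dragging in <vector> or
// <stdexcept> from callable.h.
#pragma once
#include <cstdint>

// Single source of truth for the host<->AICPU table size: host bounds
// checks and the AICPU's orch_so_table_[] declaration both reference it.
inline constexpr int32_t MAX_REGISTERED_CALLABLE_IDS = 64;
```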
…t lacks prepare_callable

a5/onboard's pto_runtime_c_api stubs `prepare_callable`/`run_prepared`
to -1 (Stage 1 ABI port deferred the implementation), which hard-broke
every L3+ test on a5/onboard once Stage 3 made the chip_process_loop
go through `prepare_callable` + `run_prepared` unconditionally.

Detect the stub at the very first prepare attempt: if the call raises
RuntimeError, set `prepared_unsupported` and route every subsequent
TASK_READY through the legacy `cw.run(callable_obj, args, cfg)` path
(callable_obj resolved from the COW-inherited registry by cid). This
keeps the L3+ mailbox protocol cid-only as designed while letting
variants that have not yet picked up per-cid orch SO dispatch keep
working in the meantime. Once all variants implement the prepared
path, the fallback shim and the legacy ChipWorker.run binding can go.

Mirror the same fallback in `_chip_process_loop_with_bootstrap`
(distributed/HCCL chips).
The onboard `create_orch_so_file` named the staged SO `libdevice_orch_<pid>.so`
based on the assumption that "only one runtime runs per device process,
so pid uniqueness is sufficient" (in 7e071c1 / before stage 4). Stage 4
broke that assumption: per-callable_id dispatch keeps multiple orch SO
images resident in the same AICPU process at once, one per cid in
`orch_so_table_[]`. The reload branch first creates `orch_so_table_[cid].handle`
without unlinking any pre-existing on-disk file (the unlink only fires
when *that same slot's* handle is non-null), so the second cid's
`open(..., O_TRUNC)` silently truncated and rewrote cid=0's file image.
The kernel still mapped the old inode for cid=0's dlopen'd code; the
next launch on cid=0 jumped into bytes that now belonged to cid=1 and
SIGBUS'd inside AICPU. The host saw it as
`rtStreamSynchronize (AICPU) failed: 507018`.

Repro: examples/workers/l3/ffn_tp_parallel — two cids (ffn_local +
allreduce) on a2a3/onboard. multi_chip_dispatch passed because it only
register()'d a single ChipCallable.

Fix:
- create_orch_so_file gains a callable_id parameter. Onboard variants
  embed it in the file name (`libdevice_orch_<pid>_<cid>.so`) when
  cid >= 0; the legacy single-slot path (cid == -1) keeps pid-only
  naming so variants that never adopt per-cid dispatch see no change.
- Sim variants embed cid for log readability only — mkstemps already
  guarantees uniqueness — keeping the contract symmetrical across all
  four implementations.
- aicpu_executor.cpp at both a2a3 and a5 forwards the active cid (a5
  passes -1 since it has no callable_id concept yet).

Regression test: tests/ut/cpp/common/test_orch_so_file.cpp asserts that
distinct cids produce distinct paths and the legacy sentinel preserves
pid-only naming. Compiles the a2a3 onboard implementation directly so
the ut catches the bug on no-hw runners too.
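The naming rule the fix lands on, sketched below; the two file-name patterns and the cid == -1 sentinel are from the commit, the helper shape is assumed.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unistd.h>

std::string orch_so_path(int32_t callable_id) {
  char buf[64];
  if (callable_id >= 0) {
    // Per-cid dispatch keeps several SO images resident at once; each cid
    // gets its own on-disk image so one cid's O_TRUNC rewrite can never
    // clobber another cid's still-mapped code.
    std::snprintf(buf, sizeof(buf), "libdevice_orch_%d_%d.so",
                  static_cast<int>(getpid()), callable_id);
  } else {
    // Legacy single-slot path (cid == -1) keeps pid-only naming.
    std::snprintf(buf, sizeof(buf), "libdevice_orch_%d.so",
                  static_cast<int>(getpid()));
  }
  return buf;
}
```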
- python/bindings: add TaskArgs overload for ChipWorker.run() so chip
  child loops on variants without prepare_callable can dispatch via the
  legacy TaskArgs path (fixes a5 multi_chip_dispatch failures).
- a2a3 sim/onboard device_runner: in upload_kernel_binary, hash the
  incoming bytes and re-upload when a cached func_id entry holds a
  different binary. Stage 4 wires multiple ChipCallables onto the same
  ChipWorker (and DeviceRunner) via prepare_callable, so different
  callables register distinct kernels under overlapping func_ids; the
  prior unconditional cache hit handed the AICore the previous
  callable's kernel and segfaulted (sim) or hung the AICPU dispatch
  spin-wait (onboard) on the next run.
- a2a3 sim device_runner: initialize Worker.l2_perf_records_addr in
  the per-core init loop (matches onboard); uninitialized garbage was
  being treated as a valid pointer when the L2 swimlane bit happened
  to be set in enable_profiling_flag, causing AICore segfaults.
- a2a3 onboard host_regs: restore placeholder-address fallback for
  AicoreRegKind::Ctrl on halMemCtl failure (the dispatch path does not
  dereference these); Pmu kind continues to propagate failure so the
  caller can disable PMU collection cleanly.
- a2a3 runtime aicpu_executor: replace stray DEV_ERROR (undefined in
  this branch's logging surface) with LOG_ERROR, and drop the spurious
  leading 0 argument on a LOG_INFO_V0 call (V0 is the verbosity-0 form,
  not LOG_INFO_V).
- a2a3 l2_perf_collector.h: drop unused #include "runtime.h" so
  clang-tidy can lint the header without per-runtime include paths.
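The upload_kernel_binary invalidation rule, sketched; hashing the incoming bytes and evicting on mismatch are from the commit messages, while the hash choice and the placeholder calls are not.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <unordered_map>

std::unordered_map<int32_t, size_t> func_id_to_hash_;

void upload_kernel_binary(int32_t func_id, const char* bytes, size_t size) {
  const size_t h = std::hash<std::string_view>{}({bytes, size});
  auto it = func_id_to_hash_.find(func_id);
  if (it != func_id_to_hash_.end()) {
    if (it->second == h) return;  // genuine cache hit: same binary
    // Same func_id, different bytes: another callable reused the id, so
    // the stale kernel must be evicted before uploading the new one.
    // evict_kernel(func_id);  // placeholder for the real eviction path
  }
  // h2d_upload(func_id, bytes, size);  // placeholder for the real upload
  func_id_to_hash_[func_id] = h;
}
```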
Add `active_callable_id_` and `register_new_callable_id_` fields plus
their setter/getter to the three runtime variants that lack them
(a2a3/host_build_graph, a5/tensormap_and_ringbuffer,
a5/host_build_graph). After this commit every runtime variant exposes
the same per-callable_id state shape that a2a3/tensormap_and_ringbuffer
already has — Phase 1+ wire AICPU and platform layers to read it.

Also gate a5/tensormap_and_ringbuffer with `#define
RUNTIME_HAS_CALLABLE_ID 1` so the shared a5 platform layer recognises
the protocol when compiled against this runtime; the macro is removed
once every variant implements the prepare/run_prepared path.

Behaviour is unchanged: the new fields are written but no caller reads
them yet. All four sim variants
(a2a3sim/{trb,hbg}, a5sim/{trb,hbg}) compile cleanly.
Mirrors the a2a3/tensormap_and_ringbuffer prepared_callable implementation
onto a5: AICPU executor gains a per-cid orch_so_table_, host device runner
gains register/unregister/has/bind methods + a hash-keyed orch SO buffer
dedup, and runtime_maker.cpp is split into prepare_callable_impl +
bind_prepared_to_runtime_impl with init_runtime_impl as a shim.

The a5 platform layer (onboard + sim) is shared between trb and hbg, so
callable-specific implementations are guarded by RUNTIME_HAS_CALLABLE_ID
to keep hbg compiling until Phase 2 lands its prepare/bind impls.
End-to-end coverage for prepare_callable / run_prepared / unregister_callable
on a5/tensormap_and_ringbuffer, structurally identical to the a2a3 test:
shared-orch double-cid run, same-cid repeat dlopen accounting, two-cid
interleaved dlopen accounting, double-prepare rejection, and unregister +
re-prepare counter monotonicity.

Reuses the orch_so_cache single-task orchestration and mixed_example
kernel_add_standalone so the test stays focused on the prepare/run ABI.
…cached host dlopen

- 4 hbg runtime.h (a2a3+a5): add RUNTIME_HAS_CALLABLE_ID + RUNTIME_HOST_ORCH
  defines and pending_host_dlopen_handle_/pending_host_orch_func_ptr_ host
  staging fields.
- 4 runtimes (trb+hbg): add replay_function_bin_addr(func_id, addr) — does
  not record into registered_kernel_func_ids_, lets platform replay prepared
  kernel bindings without triggering validate-time release. Unifies
  func_id_to_addr_ access via member function.
- 2 hbg runtime_maker.cpp: split init_runtime_impl into prepare_callable_impl
  (dlopen+dlsym → staging fields) and bind_prepared_to_runtime_impl (read
  fn_ptr, call orch_func, build graph). Legacy init_runtime_impl is now a
  shim (dlclose at end).
- 4 platform device_runner.{h,cpp} (a2a3/a5 × onboard/sim):
  PreparedCallableState extended with host_dlopen_handle/host_orch_func_ptr;
  new register_prepared_callable_host_orch + host_dlopen_count +
  host_dlopen_total_; unregister_prepared_callable branches on
  host_dlopen_handle (hbg → dlclose, trb → orch_so_dedup_ refcount);
  bind_prepared_callable_to_runtime uses replay_function_bin_addr; host orch
  fields restored under #ifdef RUNTIME_HOST_ORCH; prepare_orch_so early-
  returns for hbg (zeroes dev_orch_so to skip AICPU counting).
- 4 pto_runtime_c_api.cpp: prepare_callable uses std::unique_ptr<Runtime>
  (hbg Runtime holds 131072 Tasks ≈ tens of MB, too large for stack);
  routes to register_prepared_callable_host_orch under #ifdef
  RUNTIME_HOST_ORCH; exports get_host_dlopen_count.
- chip_worker.{h,cpp}: add host_dlopen_count() getter and dlsym binding.
- bindings/task_interface.cpp + python/simpler/{task_interface,worker}.py:
  expose host_dlopen_count attribute.

Verified: 4 sim binaries compile, 4 variants × 5 prepared_callable ST tests
pass (20 total), tests/ut/py/test_chip_worker.py 15 pass, a2a3/hbg
vector_example regression passes.
…ants

Mirror the trb prepared_callable ST suite to host_build_graph:

- tests/st/a2a3/host_build_graph/prepared_callable/test_prepared_callable.py
  reuses a2a3 vector_example kernel for the 5 prepared_callable scenarios
  (single-cid prepare→run, multi-cid alternation, repeated run, unregister,
  host_dlopen_count assertions).
- tests/st/a5/host_build_graph/prepared_callable/test_prepared_callable.py
  with self-contained dump_tensor-style kernels under kernels/{aiv,
  orchestration}/.

Both assert host_dlopen_count == distinct_registered_cids and
aicpu_dlopen_count == 0 (hbg path does not trigger AICPU dlopen).

Verified: 5 tests pass on each variant under sim.
…E_HOST_ORCH macros

All four runtime variants (a2a3/{trb,hbg}, a5/{trb,hbg}) now implement
prepare_callable / run_prepared / unregister_callable end-to-end, so the
build-time guards that picked between the real implementation and stubs
or between trb/hbg staging fields are no longer load-bearing.

Unify the public Runtime API across variants so the platform layer can
branch at runtime instead:

- trb runtime.h (a2a3+a5): add pending_host_dlopen_handle_ /
  pending_host_orch_func_ptr_ host-only fields (always nullptr on trb).
- hbg runtime.h (a2a3+a5): add device_orch_func_name_ /
  device_orch_config_name_ + set/get accessors (always empty on hbg).
- 4 device_runner.cpp: bind_prepared_callable_to_runtime now writes
  both host_dlopen and device_orch_func_name fields unconditionally;
  whichever set was populated by the corresponding register_*
  overload wins, the other stays at its default.
- 4 pto_runtime_c_api.cpp: prepare_callable picks the trb vs hbg path
  by inspecting r->pending_host_dlopen_handle_ at runtime instead of
  via #ifdef RUNTIME_HOST_ORCH.

Mechanical removals:

- 4 runtime.h: drop #define RUNTIME_HAS_CALLABLE_ID and (where present)
  RUNTIME_HOST_ORCH.
- 8 platform files (.h/.cpp): unwrap every #ifdef RUNTIME_HAS_CALLABLE_ID
  and RUNTIME_HOST_ORCH block, keeping the real implementation; delete
  the dlsym-stub #else branches in the c_api files (no variant needs
  them now).

Verified: 4 sim binaries compile, 4×5 prepared_callable ST tests pass
(20 total), tests/ut/py/test_chip_worker.py 15 pass.
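The runtime branch that replaces the #ifdef, sketched; pending_host_dlopen_handle_ is the real discriminator named in the commit, the rest is assumed.

```cpp
#include <cstdint>

struct Runtime {
  void* pending_host_dlopen_handle_ = nullptr;  // set only on the hbg path
  // ... remaining runtime state elided ...
};

int prepare_callable_entry(Runtime* r, int32_t callable_id) {
  if (r->pending_host_dlopen_handle_ != nullptr) {
    // hbg: orchestration runs on the host; cache the dlopen handle.
    // return register_prepared_callable_host_orch(r, callable_id);
  } else {
    // trb: ship the orch SO to the AICPU's orch_so_table_ slot.
    // return register_prepared_callable(r, callable_id);
  }
  return 0;
}
```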
Now that all four runtime variants implement prepare_callable /
run_prepared end-to-end, Worker no longer needs a fallback to the
legacy chip_worker.run(callable, args, cfg) lower-level binding when
the runtime returned -1 from the C ABI stub.

Worker.py removals:
- _PREPARED_CALLABLE_DISABLED_ENV / _prepared_callable_disabled() and
  the PTO2_DISABLE_PREPARED_CALLABLE env-var rollback knob.
- _l2_use_prepared field, _l2_prepare() method, and the conditional
  prepare-then-fallback dance in register() / _init_level2() / run().
- prepared_unsupported flag and _run_legacy() in both
  _chip_process_loop and _chip_process_loop_with_bootstrap. Both helpers
  now have a simpler _ensure_prepared() that always prepares-or-raises.

Worker.run(L2) and the chip_process loops now always go through
run_prepared. A registered ChipCallable that fails to prepare now
surfaces the underlying RuntimeError instead of silently rerouting.

Verified: tests/ut/py/test_chip_worker.py 15 pass,
tests/ut/py/test_worker/ 65 pass + 3 hardware skipped, hbg
prepared_callable ST 5×2 pass, a2a3/trb vector_example regression
passes.
…gacy ABI

Now that all four variants implement prepare_callable / run_prepared and
the Python fallback to the legacy callable-buffer path is gone, the
single-call C ABI it relied on is dead weight. ChipWorker::run becomes a
thin forwarder to run_prepared so the hierarchical IWorker contract is
preserved; the cid still arrives via worker_manager packing s.callable_id
into uint64.

C++ removals:
- 4 platform pto_runtime_c_api.cpp: drop run_runtime() definitions and the
  init_runtime_impl forward decls.
- 4 runtime_maker.cpp: drop the init_runtime_impl compatibility shim that
  bundled prepare_callable_impl + bind_prepared_to_runtime_impl.
- src/common/worker/pto_runtime_c_api.h: drop run_runtime declaration and
  refresh the file-header dlsym list / call-site references.
- src/common/worker/chip_worker.{h,cpp}:
  * IWorker::run(uint64_t, ...) now reinterprets the uint64 as cid and
    delegates to run_prepared.
  * Drop ChipWorker::run(const void*, const void*, ...) overload, the
    RunRuntimeFn typedef, and run_runtime_fn_ dlsym.

Python removals:
- python/bindings/task_interface.cpp: remove the four legacy nanobind
  overloads (run / run / run_raw / run_from_blob); keep run_prepared /
  prepare_callable / unregister_callable.
- python/simpler/task_interface.py: drop ChipWorker.run wrapper; usage
  doc updated to the prepare_callable + run_prepared idiom.
- tests/ut/py/test_chip_worker.py: drop test_run_before_set_device_raises
  (test_run_prepared_before_set_device_raises already covers the same
  state-machine guard).

Verified: 4 sim binaries compile, nanobind wheel rebuilds,
tests/ut/py/test_chip_worker.py 14 pass + tests/ut/py/test_worker/ 65
pass + 3 hardware skipped, 4 variants × 5 prepared_callable ST = 20 pass,
a2a3/trb vector_example + orch_so_cache regression pass.
…single slot

The single-slot orch SO cache and the callable_id==-1 fallback path
existed only to serve the now-deleted run_runtime() ABI. With every
caller routed through prepare_callable / run_prepared, callable_id is
always in [0, MAX_REGISTERED_CALLABLE_IDS) and AICPU dispatches via
orch_so_table_[callable_id] unconditionally.

Runtime structure:
- 4 runtime.h (a2a3+a5 × trb+hbg): drop has_new_orch_so_ field; simplify
  set_dev_orch_so to (dev_addr, size).
- 2 trb shared/runtime.cpp: drop has_new_orch_so() implementation; drop
  the dirty-flag init in reset.
- 4 platform device_runner.{h,cpp}: drop the third arg from every
  set_dev_orch_so call (5 sites per platform); update doc-comments that
  referenced has_new_orch_so_.

AICPU executor (2 trb aicpu_executor.cpp):
- Drop legacy single-slot fields (orch_so_handle_, orch_so_path_,
  orch_func_, orch_bind_runtime_, orch_config_func_) along with the
  destructor branch and deinit comment that preserved them.
- Replace the use_table-ternary fork with unconditional access into
  orch_so_table_[callable_id]; reload is governed by
  register_new_callable_id().
- Reject any callable_id outside [0, MAX_REGISTERED_CALLABLE_IDS) (the
  -1 escape hatch is gone).
- The run() teardown branch that called orch_bind_runtime_(nullptr) now
  reads the per-cid bind from the table.

Verified: 4 sim binaries compile, tests/ut/py/test_chip_worker.py 14
pass + tests/ut/py/test_worker/ 65 pass + 3 hardware skipped, 4 variants
× 5 prepared_callable ST = 20 pass.
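The post-cleanup AICPU dispatch, sketched; the unconditional bounds rule, the table access, and the first-sighting reload come from the commit, while the slot layout is an assumption.

```cpp
#include <cstdint>

constexpr int32_t MAX_REGISTERED_CALLABLE_IDS = 64;

struct OrchSoSlot {
  void* handle = nullptr;         // dlopen handle for this cid's orch SO
  void (*orch_func)() = nullptr;  // resolved entry point
};

OrchSoSlot orch_so_table_[MAX_REGISTERED_CALLABLE_IDS];

int dispatch(int32_t callable_id, bool register_new_callable_id) {
  // The -1 escape hatch is gone: every id must name a registered slot.
  if (callable_id < 0 || callable_id >= MAX_REGISTERED_CALLABLE_IDS) {
    return -1;
  }
  OrchSoSlot& slot = orch_so_table_[callable_id];
  if (register_new_callable_id) {
    // First sighting of this cid: dlopen the staged SO into the slot.
    // slot.handle = dlopen(path_for(callable_id), RTLD_NOW); ...
  }
  if (slot.orch_func == nullptr) return -1;  // never registered
  slot.orch_func();  // per-cid entry point, loaded exactly once
  return 0;
}
```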
… API

- vector_add: register chip_callable before init(), pass cid to worker.run
- child_memory: register before init(), pass cid to orch.submit_next_level
- Update vector_add README and docstring diagram to match the new flow

Resolves CI failures in st-sim-a2a3 (ubuntu/macos) on PR hw-native-sys#710.
Function grew to 104 statements (limit 100) after the callable refactor.
The function is structured as a single dispatch loop over the bootstrap +
control-mailbox protocol — splitting it would obscure the state machine,
so add PLR0915 to the existing PLR0912 noqa.

Resolves the pre-commit CI failure on PR hw-native-sys#710.
…n dedup

Root cause of CI a5 sim trb failures: tests/st/a5/.../prepared_callable used
the vector_example orchestration (which dispatches func_ids 0/1/2) but only
registered func_id=0. AICPU jumped to a NULL kernel address on func_id 1/2
and segfaulted, cascading through the pytest-xdist workers and dragging
spmd_*/orch_so_cache/mixed_example down with it.

Test fix: align tests/st/a5/.../prepared_callable verbatim with the a2a3
sibling — register all three vector_example AIV kernels (add/add_scalar/mul),
update the golden formula to match the orchestration's 5-task DAG.

Runtime parity (defensive — not exercised by current a5 CI but matches the
0715661 fix on a2a3 onboard so future cross-callable func_id reuse on a5
does not regress):
- src/a5/platform/onboard: add func_id_to_hash_ map, reject cached entry on
  hash mismatch, evict + re-upload on changed binary. finalize() and
  remove_kernel_binary() clear the parallel map.
- src/a5/platform/sim: compare cached CoreCallable bytes via memcmp on each
  upload (mirrors a2a3 sim — no separate hash map needed because the
  MappedKernel cache already retains the original bytes).
…tion

Stage 3 (5796321) introduced `_read_args_from_mailbox` to rebuild a
ChipStorageTaskArgs Python object from the mailbox blob in chip-child
processes (replacing the legacy raw-bytes `run_from_blob` path). The
unpacker read data/shapes/ndims/dtype but skipped the child_memory uint8
at offset 33, so every chip-child-side tensor came back with
child_memory=False (the make() default).

For tensors that carry a chip-owned device pointer — HCCL window slots
in allreduce_distributed, deferred_notify_demo, ffn_tp_parallel —
the bind_prepared_to_runtime_impl host path then treats the device
address as a host pointer, allocates a fresh device buffer, and H2D
copies from the (device) source: AICPU dispatches a task whose tensors
point at uninitialised allocations, so the task lands in ready_queue
with a kernel mask that scheduler/dispatch never advance, surfacing as
the "PTO2 timeout after 800001 idle iterations" hang we saw on a2a3
onboard.

multi_chip_dispatch passes because all of its tensors are host pointers
(child_memory=False), so the missing byte happens to round-trip
correctly. This is also why main is unaffected: there `run_from_blob`
hands the mailbox bytes straight to C++ via reinterpret_cast on the 40B
ContinuousTensor layout, which naturally preserves byte 33.

Read offset 33 explicitly and pass it through ContinuousTensor.make.
Layout matches src/common/task_interface/tensor_arg.h (40B with
child_memory at byte 33).

Verified on a2a3 onboard (devices 9,10):
- examples/workers/l3/allreduce_distributed:        PASS  (was hang)
- examples/a2a3/.../deferred_notify_demo:           PASS  (was hang)
- examples/workers/l3/multi_chip_dispatch:          PASS  (no regression)
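The on-wire invariant the fix depends on, expressed as a compile-time check; only sizeof == 40 and child_memory at byte 33 are stated in the commit, so the other field names and packing below are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative reconstruction of the 40-byte ContinuousTensor wire layout
// (the real layout lives in src/common/task_interface/tensor_arg.h).
struct ContinuousTensor {
  uint64_t data;         // tensor base pointer             (offset 0)
  uint64_t shape[3];     // assumed shape storage           (offset 8)
  uint8_t ndims;         // assumed                         (offset 32)
  uint8_t child_memory;  // the byte the Python parser lost (offset 33)
  uint8_t dtype;         // assumed                         (offset 34)
  uint8_t pad[5];        // explicit padding to 40 bytes
};

static_assert(sizeof(ContinuousTensor) == 40, "wire size contract");
static_assert(offsetof(ContinuousTensor, child_memory) == 33,
              "child_memory must sit at byte 33");
```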
…rgs parsing in C++

Stage 3 (5796321) made chip-child loops re-deserialise the mailbox
ChipStorageTaskArgs blob in Python via _read_args_from_mailbox before
forwarding to cw.run_prepared. The hand-written Python parser dropped
ContinuousTensor.child_memory at offset 33, which silently broke every
tensor carrying a chip-owned device pointer (HCCL window slots in
allreduce_distributed / deferred_notify_demo / ffn_tp_parallel) on
a2a3 onboard — the runtime treated the device address as a host pointer,
the submitted task stuck in ready_queue with kernel_id=-1 / state=0
forever, surfacing as 'PTO2 timeout after 800001 idle iterations'
on st-onboard-a2a3.

Root cause was duplicating the on-wire ContinuousTensor layout in
Python. Fix: keep the layout single-sourced in C++ and stop redoing
it in Python.

- Add _ChipWorker.run_prepared_from_blob(cid, ptr, capacity, config)
  nanobind overload. Internally calls read_blob (already used by every
  C++ caller) for a zero-copy TaskArgsView, then forwards to the
  existing run_prepared(view, ...) path. No new C-ABI symbol — just a
  Python-side overload over an existing C++ entry point.
- chip-child mailbox loops (_chip_process_loop and
  _chip_process_loop_with_bootstrap) drop the
  args = _read_args_from_mailbox(buf) round-trip and call
  run_prepared_from_blob with the mailbox address directly. The args
  was never inspected in Python, so the typed-object detour bought
  nothing and only added a place to lose fields.
- _read_args_from_mailbox is kept (still used by _sub_worker_loop and
  _child_worker_loop, where the destination is a Python callable) but
  its body collapses to a one-line delegation to the existing nanobind
  read_args_from_blob helper. The hand-rolled struct.unpack_from
  layout (which had to know sizeof(ContinuousTensor)==40 and per-field
  offsets) is gone.

Net effect on chip-child hot path: one Python->C++ call instead of
N+1 (per-tensor make() + add_tensor() + a final run_prepared()), no
intermediate Python TaskArgs / ContinuousTensor object construction.
And there is now exactly one place that knows the on-wire layout
(src/common/task_interface via read_blob), so adding a field to
ContinuousTensor cannot drop it on the chip-child path again.

Verified on a2a3 onboard (devices 9,10) and a2a3sim:
- examples/workers/l3/allreduce_distributed:   PASS  (was hang)
- examples/a2a3/.../deferred_notify_demo:      PASS  (was hang)
- examples/workers/l3/multi_chip_dispatch:     PASS  (no regression)
- examples/workers/l3/child_memory  [a2a3sim]: PASS
- tests/ut/py/test_chip_worker:                14/14 pass
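The binding could plausibly look like this; read_blob, TaskArgsView, and run_prepared are named in the commit, while the exact nanobind signature and the types here are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <nanobind/nanobind.h>

namespace nb = nanobind;

// Assumed pre-existing C++ entry points (names from the commit message).
struct TaskArgsView { /* zero-copy view over the mailbox blob */ };
TaskArgsView read_blob(const void* ptr, size_t capacity);

struct ChipWorker {
  int run_prepared(int32_t cid, const TaskArgsView& view, nb::dict config);
};

NB_MODULE(task_interface_sketch, m) {
  nb::class_<ChipWorker>(m, "_ChipWorker")
      .def("run_prepared_from_blob",
           [](ChipWorker& cw, int32_t cid, uintptr_t ptr, size_t capacity,
              nb::dict config) {
             // Parse the mailbox bytes once in C++ (the single owner of
             // the on-wire layout), then forward the zero-copy view.
             TaskArgsView view =
                 read_blob(reinterpret_cast<const void*>(ptr), capacity);
             return cw.run_prepared(cid, view, config);
           });
}
```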
- hbg DeviceRunner::finalize() now dlcloses any host orch handles
  callers forgot to unregister; the host process previously leaked one
  dlopen handle per re-created Worker (visible in long-running pytest).
- AICPU executor unlinks the on-disk libdevice_orch_<pid>_<cid>.so
  immediately after dlopen, so chip/sub/next-level children that exit
  via os._exit(0) no longer leave stale .so files in /tmp.
- ChipWorker docstring usage example now uses real keyword names
  (callable_id=, callable=, args=, config=) so the snippet parses as
  valid Python.
- Drop "callable.md" / "Stage N (callable.md)" pointers from comments
  and docstrings; keep the semantic content but remove references to
  the un-archived design doc, per .claude/rules/codestyle.md item 1.
Address four review findings on the callable_id refactor:

- scene_test.py: L2 _create_standalone_worker returns (worker, {}, {})
  to match the 3-tuple unpacking used by the L3 path; standalone L2
  runners no longer fail with ValueError.
- sdma_async_completion_demo: register the ChipCallable before init()
  and submit_next_level(chip_cid, ...); a raw ChipCallable is rejected
  by both the register-after-init guards and Orchestrator._require_cid.
- prepared_callable ST: each of the 4 test classes now owns an isolated
  L2 Worker via a directory-local conftest.py override so the cid table
  is empty on entry; cid 0/1 are renamed _CID_PRIMARY/_CID_SECONDARY to
  make the white-box intent explicit, and a stale comment claiming that
  unregister decrements the dlopen counter is removed.
- Docs: worker.py module docstring, docs/getting-started.md, and the
  L2/L3 example READMEs all show the full register -> cid -> run /
  submit_next_level pattern, including the must-register-before-init()
  rule for L>=3.
poursoul changed the title from "Feat callable" to "Feat: prepared callable — register + run(cid) on a unified ABI" on May 9, 2026