fast-interp: legacy exception handling (try/catch/rethrow/delegate/tag) #4949
matthargett wants to merge 16 commits
…terp
Enables WAMR_BUILD_EXCE_HANDLING=1 together with FAST_INTERP=1 for the
*throw-only* subset of the legacy wasm-eh proposal — modules that
declare tags and execute `throw`/`rethrow` but never define a same-
function `try`/`catch` handler. The throw escapes via the existing
`got_exception` bailout path, exactly like any other trap, and the
host sees the exception via `wasm_runtime_get_exception`.
This is the shape produced in the wild by Porffor (the JS-to-wasm
compiler used by Fastly's StarlingMonkey): its graphql-validation
benchmark, which we measure across runtimes, contains 561 `throw`
opcodes and zero in-wasm try/catch handlers. Every JS throw escapes
to the host JS engine, which is the typical Porffor /
static-JS-to-wasm pattern.
Three changes:
* `build-scripts/unsupported_combination.cmake` — lift the
EXCE_HANDLING + FAST_INTERP ban (with a comment explaining the
scope: throw-only is supported, in-function try/catch is the
natural follow-up).
* `core/iwasm/interpreter/wasm_loader.c` — when fast-interp parses
WASM_OP_THROW, emit the tag index as a uint32 immediate after
the auto-emitted THROW opcode. Same shape as how WASM_OP_CALL
emits its funcidx.
* `core/iwasm/interpreter/wasm_interp_fast.c`:
`HANDLE_OP(WASM_OP_THROW)` now reads the uint32 immediate, surfaces
a tag-bearing exception via `wasm_set_exception`, and falls through to
`got_exception`. The other legacy-EH ops (TRY / CATCH /
CATCH_ALL / RETHROW / DELEGATE / EXT_OP_TRY) keep the existing
"unsupported opcode" diagnostic — they're unreachable for
fast-interp-compiled code today (the loader's fast-interp path
treats TRY as a plain block via skip_label and never emits
CATCH-family opcodes into the IR), so the diagnostic only fires
if a future loader change starts emitting them.
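The immediate layout described above can be sketched as follows. This is a minimal illustration rather than WAMR's code: `ir_emit_u32` and `ir_read_u32` are hypothetical stand-ins for the loader's emit helper and the interpreter's read of the immediate, and the rewritten IR is modelled as a plain byte buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the loader side: append a 32-bit immediate
 * (here, the tag index) to the IR and return the advanced write cursor. */
static uint8_t *
ir_emit_u32(uint8_t *p, uint32_t value)
{
    memcpy(p, &value, sizeof(value)); /* memcpy avoids unaligned stores */
    return p + sizeof(value);
}

/* Hypothetical stand-in for the interpreter side: fetch the immediate
 * at *ip and advance *ip past it, as a THROW handler would. */
static uint32_t
ir_read_u32(const uint8_t **ip)
{
    uint32_t value;
    memcpy(&value, *ip, sizeof(value));
    *ip += sizeof(value);
    return value;
}
```

The point of the pairing is that the loader and the interpreter must agree on exactly how many bytes follow the opcode; the later commits in this series fix two bugs where that agreement broke.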
Validated end-to-end on aarch64-apple-darwin: a benchmark-core harness
loads Porffor's graphql-validation-porf.wasm, runs `m()` (the export
that drives the validation pipeline), and gets `result=0` — matching
the cross-runtime consensus from wasmtime / WasmEdge interpreter.
Before this PR the same workload failed at LOAD with "invalid section
id" (the tag section couldn't be parsed without EXCE_HANDLING=1).
Full same-function try/catch lowering — porting the classic
interpreter's `find_a_catch_handler` design to fast-interp's slot-
allocator + pre-decoded IR — is the natural follow-up.
Adds per-function `WASMFastEHEntry[]` (sized by the existing
`func->exception_handler_count` field, allocated in pass 2 of the
preprocess pass and freed in `wasm_loader_unload`) recording each
try-region's catch handler pcs in the rewritten fast-interp IR.
This is the data the upcoming runtime EH-frame stack will consult
when a `throw` walks for a matching catch handler — it is *not yet
used* in this commit.
Three pieces of plumbing on the loader side:
* `WASMFastEHCatch` / `WASMFastEHEntry` typedefs in `wasm.h`,
plus a `WASMFunction.exception_handlers` field. The struct is
gated on `WASM_ENABLE_EXCE_HANDLING && WASM_ENABLE_FAST_INTERP`
so classic-interp builds are byte-identical.
* `BranchBlock.eh_entry_idx` (loader-internal CSP slot) and
`WASMLoaderContext.cur_eh_entry_idx` (the source-order cursor).
These let CATCH / CATCH_ALL / DELEGATE / END handlers resolve
back to the right try-region without walking the CSP at
runtime — same pattern the existing fast-interp loader uses to
pre-patch BR / BR_IF / BR_TABLE targets.
* Pass-2-only populate logic on the existing CATCH, CATCH_ALL,
DELEGATE, and END cases. The pass-1 increment of
`exception_handler_count` is now gated on
`loader_ctx->p_code_compiled == NULL` so it doesn't double-count
when the loader re-scans the bytecode for the second traversal.
Runtime behavior is unchanged in this commit: CATCH / CATCH_ALL /
RETHROW / DELEGATE still hit the "unsupported opcode" stub from
the throw-only patch. The dispatch wiring lands in the next
commit; this one establishes the data layout reviewers will
sanity-check first.
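For reviewers who want a picture of the data layout, here is a hedged sketch. The struct and field names below are illustrative: `handler_pc`, `catch_all_pc`, `end_of_region_pc`, and `catch_count` are named elsewhere in this series, while `tag_index` and the `*Sketch` type names are assumptions made for the example. The lookup helper shows how a later walker could consult an entry; it is not part of this commit.

```c
#include <stddef.h>
#include <stdint.h>

/* One typed catch clause of a try-region (illustrative layout). */
typedef struct FastEHCatchSketch {
    uint32_t tag_index;        /* assumed field: tag this clause matches */
    const uint8_t *handler_pc; /* first IR byte of the catch body */
} FastEHCatchSketch;

/* One try-region, recorded in source order by the loader. */
typedef struct FastEHEntrySketch {
    uint32_t catch_count;
    FastEHCatchSketch *catches;
    const uint8_t *catch_all_pc;     /* NULL when there is no catch_all */
    const uint8_t *end_of_region_pc; /* normal-flow exit target */
} FastEHEntrySketch;

/* Return the handler pc for tag_index, falling back to catch_all,
 * or NULL when this region cannot handle the tag. */
static const uint8_t *
eh_entry_lookup(const FastEHEntrySketch *entry, uint32_t tag_index)
{
    for (uint32_t j = 0; j < entry->catch_count; j++) {
        if (entry->catches[j].tag_index == tag_index)
            return entry->catches[j].handler_pc;
    }
    return entry->catch_all_pc;
}
```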
Cost-model note: no changes to any hot-op handler (CALL, LOAD,
STORE) and the new struct fields are entirely behind the existing
WASM_ENABLE_EXCE_HANDLING guard, matching classic-interp's posture
where EH-on builds carry one byte store per PUSH_CSP and a small
per-frame allocation but leave hot ops untaxed.
Wires up the per-frame eh-stack that commit 1 laid the metadata for.
A program can now enter and exit a try-region without aborting; same-
function throw → catch dispatch still bails out via got_exception
(follow-up commit hooks that up).
Frame layout: one extra cell per try-region appended past the value
stack in the existing frame->operand[] allocation, sized by
cur_wasm_func->exception_handler_count. Functions without try blocks
pay zero cells. WASMInterpFrame gains a `uint32 eh_count` (the eh-
stack top), clustered next to the existing EH-gated
exception_raised/tag_index fields — same cache line, cold path only.
Hot-op invariants preserved:
* No new instructions in HANDLE_OP(WASM_OP_CALL),
HANDLE_OP(WASM_OP_*_LOAD_*), HANDLE_OP(WASM_OP_*_STORE_*).
* Dispatch table size is unchanged (slots 0x06 = WASM_OP_TRY, 0x07 =
WASM_OP_CATCH, 0x0b = WASM_OP_END, 0x19 = WASM_OP_CATCH_ALL just
get new bodies — they previously fell through to the
"unsupported opcode" stub).
* eh_count writes/reads only happen on TRY/CATCH/CATCH_ALL/END,
none of which are on the dispatch loop's hot path.
Loader changes (wasm_loader.c):
* WASM_OP_TRY no longer skip_labels; emits its `eh_idx:u32`
immediate after the auto-emitted opcode byte so the runtime push
handler can find the right exception_handlers[] entry.
* WASM_OP_CATCH / CATCH_ALL emit the same `eh_idx:u32` immediate;
the runtime handler reads it to find end_of_region_pc to branch
to on normal-flow exit.
* WASM_OP_END for try-regions keeps the END byte in the IR (with
the patch-list rewind dance to make `br N`-targeted PATCH_END
addresses land *on* the END byte so the pop runs for branches
too, not just fall-through).
Runtime handlers (wasm_interp_fast.c):
* HANDLE_OP(WASM_OP_TRY) pushes eh_idx onto frame_lp[eh_offset +
eh_count] and increments eh_count.
* HANDLE_OP(WASM_OP_CATCH) and HANDLE_OP(WASM_OP_CATCH_ALL) share a
body: decrement eh_count, set frame_ip to
func->exception_handlers[eh_idx].end_of_region_pc.
* HANDLE_OP(WASM_OP_END) moves out of the "unsupported opcode"
block when EXCE_HANDLING is enabled; decrements eh_count.
* WASM_OP_RETHROW / WASM_OP_DELEGATE / EXT_OP_TRY still route to
the diagnostic — wired up in a follow-up commit.
After this commit: programs with try-regions where no throw fires
inside the try body run correctly (the eh-stack is correctly
maintained through entry/exit). Throws inside try bodies still
escape via got_exception, matching the throw-only patch's behavior.
porf-accurate still errors at the first throw escape (its catch
handler does real work; full catch dispatch is the next commit).
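The push/pop discipline in this commit can be sketched in a few lines. This is a simplified model under stated assumptions: `FrameSketch` is hypothetical, `cells` stands in for `frame_lp`, and `eh_offset` is assumed to be the cell index where the eh-stack begins past the value stack. It uses the one-cell-per-entry layout of this commit.

```c
#include <stdint.h>

typedef struct FrameSketch {
    uint32_t eh_count;  /* eh-stack top: number of live try-regions */
    uint32_t eh_offset; /* assumed: first eh-stack cell in the frame */
    uint32_t *cells;    /* stands in for frame_lp */
} FrameSketch;

/* What a TRY handler does: record which try-region was entered. */
static void
eh_push(FrameSketch *f, uint32_t eh_idx)
{
    f->cells[f->eh_offset + f->eh_count] = eh_idx;
    f->eh_count++;
}

/* What CATCH / CATCH_ALL / END do on normal-flow exit: leave the region. */
static void
eh_pop(FrameSketch *f)
{
    f->eh_count--;
}
```

Because the pop only touches `eh_count`, a function that never enters a try-region never reads or writes the extra cells.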
Activates same-function and inter-function catch dispatch for the
*void-result* try-region shape (which is what graphql-validation-
porf-accurate emits — `06 40` = try-with-blocktype-void). Programs
that throw inside a void try body now land in the matching catch
handler (or catch_all) instead of escaping to the host trap path.
The eh-stack push/pop infrastructure from the prior commit gives us
the in-scope handlers; this commit adds the walk and the cross-frame
unwind.
Hot-op cost-model check:
* HANDLE_OP(WASM_OP_THROW) is itself a cold op: programs that
never throw never enter it. The walk runs in
find_a_catch_handler, also cold.
* The one new check on a path every wasm-to-wasm call return
visits is the `if (frame->exception_raised)` branch in
return_func. Predicted strongly not-taken (exceptions are
rare); two AArch64 instructions; identical in shape to
classic-interp's existing check at wasm_interp_classic.c:6877.
* The eh-stack cells share the cache line with the value stack
they're allocated next to, so the walk hits warm memory.
* CALL / LOAD / STORE handlers are byte-identical to the no-EH
path.
Mechanism:
* `find_a_catch_handler` is a labeled block reached either by
WASM_OP_THROW or by return_func when a callee stashed a tag
on this frame. It walks frame->eh_count entries top-down,
skipping entries whose top bit is set (state CATCH — already
in an active handler; throws raised inside skip outward).
On a tag match it ORs in EH_TRY_CATCH_STATE_BIT and dispatches
frame_ip to entry->catches[j].handler_pc (or
entry->catch_all_pc when no typed clause matches).
* On exhaustion, the walker stashes exception_tag_index on
prev_frame->tag_index, sets prev_frame->exception_raised = true,
and goes to return_func. return_func, after RECOVER_CONTEXT
has restored the caller's context, re-enters
find_a_catch_handler with the caller's frame in scope.
* At the top of the wasm stack (prev_frame->ip == NULL) the
walker takes the existing got_exception escape so the host
can read the trap message via wasm_runtime_get_exception.
* frame->exception_raised and frame->tag_index are pre-existing
fields originally added for classic-interp. exception_raised
must now be cleared on every fast-interp frame setup: ALLOC_FRAME
doesn't zero-init the header, and a stale non-zero byte trips the
return_func check on every call return.
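The same-frame part of the mechanism above can be sketched as a plain function. This is an approximation under assumptions: the real walker is a labeled block inside the dispatch function, and the type names, the `tag_index` field, and the value of `EH_STATE_BIT` here are illustrative. A `NULL` return models "exhausted", where the real code forwards the exception to the previous frame.

```c
#include <stddef.h>
#include <stdint.h>

#define EH_STATE_BIT 0x80000000u /* stands in for EH_TRY_CATCH_STATE_BIT */

typedef struct CatchSketch {
    uint32_t tag_index; /* assumed field name */
    const uint8_t *handler_pc;
} CatchSketch;

typedef struct EntrySketch {
    uint32_t catch_count;
    const CatchSketch *catches;
    const uint8_t *catch_all_pc; /* NULL when absent */
} EntrySketch;

/* Walk the eh-stack top-down and return the pc to dispatch to,
 * or NULL when no in-scope handler in this frame matches. */
static const uint8_t *
walk_eh_stack(uint32_t *eh_cells, uint32_t eh_count,
              const EntrySketch *handlers, uint32_t tag_index)
{
    for (uint32_t i = eh_count; i > 0; i--) {
        uint32_t *cell = &eh_cells[i - 1];
        if (*cell & EH_STATE_BIT)
            continue; /* already inside this region's handler */
        const EntrySketch *entry = &handlers[*cell];
        for (uint32_t j = 0; j < entry->catch_count; j++) {
            if (entry->catches[j].tag_index == tag_index) {
                *cell |= EH_STATE_BIT; /* region is now in CATCH state */
                return entry->catches[j].handler_pc;
            }
        }
        if (entry->catch_all_pc != NULL) {
            *cell |= EH_STATE_BIT;
            return entry->catch_all_pc;
        }
    }
    return NULL;
}
```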
Loader-side bug fix: the CATCH and CATCH_ALL emit_uint32(eh_idx)
calls used to live inside the `if (loader_ctx->p_code_compiled !=
NULL)` populate guard. That gating skipped them in pass 1 but ran
them in pass 2, so pass 2 wrote 4 bytes per catch *past* the
code_compiled buffer allocated based on pass 1's measurement. The
overrun corrupted whatever loader allocation the heap placed
immediately after — typically func->exception_handlers itself (the
first 4 bytes of entry[0], i.e. catch_count, was the usual victim).
Surfaced as "wasm exception thrown (tag 0)" on `test_local_throw`
where the typed-catch's catches[] array showed count=0 at runtime
even though the loader populated count=1 in pass 2 — the populate
itself wrote correctly, then a later opcode's reserve_block_ret
overran the buffer and zeroed catch_count. Moved both emit_uint32
calls outside the populate guard so both passes account for the
4-byte immediate.
State encoding: each eh-stack cell packs the loader's
exception_handlers[] index in the low 31 bits and a state bit
(EH_TRY_CATCH_STATE_BIT) in the top bit. No cell-count change vs
the prior commit; same per-frame allocation footprint.
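As a small illustration of that encoding, the helpers below pack and unpack a cell. The helper names and `EH_INDEX_MASK` are hypothetical; the bit value follows from "top bit" of a 32-bit cell as described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define EH_TRY_CATCH_STATE_BIT 0x80000000u /* top bit of the cell */
#define EH_INDEX_MASK 0x7FFFFFFFu          /* low 31 bits: handler index */

/* Index into the loader's exception_handlers[] array. */
static inline uint32_t
eh_cell_index(uint32_t cell)
{
    return cell & EH_INDEX_MASK;
}

/* True when the region is already running one of its catch bodies. */
static inline bool
eh_cell_in_catch(uint32_t cell)
{
    return (cell & EH_TRY_CATCH_STATE_BIT) != 0;
}

/* Flip a cell from TRY state to CATCH state, keeping the index. */
static inline uint32_t
eh_cell_mark_catch(uint32_t cell)
{
    return cell | EH_TRY_CATCH_STATE_BIT;
}
```

One consequence of this packing is that a function can have at most 2^31 try-regions before the index collides with the state bit, which is not a practical limit.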
Known limitation: try-regions with a non-void result-type are not
yet supported by the *normal-flow* path. The fix is a loader-side
try-body→block-dynamic-offset COPY emit at CATCH processing time
(mirrors how WASM_OP_ELSE aligns the if-body's result via
reserve_block_ret). See AGENTS.md's "Open follow-up — WAMR fast-
interp legacy exception handling" section. graphql-validation-porf-
accurate uses void-result try-blocks so it isn't blocked by this.
Verified by `crates/benchmark-core/src/bin/probe_eh_void.rs` (5
cases — typed catch, catch_all, inter-function unwind, nested,
no-throw — all PASS) and the existing run_graphql_validation_wamr
regression (AS / porf-fast / porf-accurate within run-to-run
variance vs the prior commit).
Activates the RETHROW opcode: re-raise the exception currently being
handled by the (depth+1)-th `state=CATCH` entry from the top of the
per-frame eh-stack. Source form `rethrow N` becomes `RETHROW <N:u32>`
in the rewritten IR; the runtime walker scans the eh-stack top-down,
skips state=TRY entries (they're not "catch handlers in progress"),
and on the (depth+1)-th state=CATCH match reads its stashed caught
tag and dispatches to `find_a_catch_handler` exactly as a fresh
throw with that tag would.
Storage shape: each eh-stack entry is now `EH_ENTRY_CELLS = 2` cells
wide. Cell 0 packs `eh_idx | EH_TRY_CATCH_STATE_BIT` (unchanged); cell
1 holds the wasm tag index of the exception currently being handled
on that entry (undefined while the entry is in TRY state — the throw
walker writes it on catch dispatch). Frame allocation grows by
`exception_handler_count * 2` cells per call; functions without try
blocks still pay zero cells.
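The rethrow lookup over two-cell entries might look like the sketch below. It is an assumption-laden model: the eh-stack is a flat `uint32_t` array, `rethrow_find_tag` is a hypothetical name, and the real code jumps to `find_a_catch_handler` with the recovered tag instead of returning it.

```c
#include <stdbool.h>
#include <stdint.h>

#define EH_ENTRY_CELLS 2
#define EH_STATE_BIT 0x80000000u /* stands in for EH_TRY_CATCH_STATE_BIT */

/* Find the (depth+1)-th CATCH-state entry from the top of the eh-stack
 * and report the tag it is currently handling. Returns false when there
 * are not enough active catch handlers. */
static bool
rethrow_find_tag(const uint32_t *eh_cells, uint32_t eh_count,
                 uint32_t depth, uint32_t *out_tag)
{
    uint32_t seen = 0;
    for (uint32_t i = eh_count; i > 0; i--) {
        const uint32_t *entry = &eh_cells[(i - 1) * EH_ENTRY_CELLS];
        if (!(entry[0] & EH_STATE_BIT))
            continue; /* TRY state: not a handler in progress */
        if (seen == depth) {
            *out_tag = entry[1]; /* cell 1: tag stashed on catch dispatch */
            return true;
        }
        seen++;
    }
    return false;
}
```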
Hot-op cost-model check:
* No new code in HANDLE_OP(WASM_OP_CALL) / LOAD_* / STORE_*.
* RETHROW is a cold op (only fires inside catch bodies); the walk
runs across at most the number of catches nested around the
rethrow site.
* TRY's push gained a no-op write (cell 1 stays undefined until
the throw walker overwrites it on dispatch) — same one indexed
store as before, just with a wider stride.
* `frame->exception_raised` init + the return_func hook are
unchanged from the prior commit; no new branches on any
return path.
Loader-side land-mine cleared: WAMR's shared `check_branch_block`
calls `emit_br_info` unconditionally, which for a typical
arity-zero catch target writes 4 bytes (arity) + 8 bytes (target
ptr placeholder via `add_label_patch_to_list`) into the IR between
the auto-emitted opcode label and the next op. RETHROW doesn't
*branch* to its target — it walks the eh-stack — so those br-info
bytes are dead weight, and worse: they shift our depth immediate
past where the runtime `read_uint32(frame_ip)` looks for it. The
RETHROW case in the loader now does its own depth + label-type
validation (manual `loader_ctx->frame_csp - depth - 1` lookup,
LABEL_TYPE_CATCH/CATCH_ALL check) and skips check_branch_block
entirely.
Verified by three new cases in
`crates/benchmark-core/tests/eh_correctness.rs`:
- `rethrow_depth_zero`: inner catch sets a flag, `rethrow 0`,
outer catch sees the same tag (= 11).
- `rethrow_preserves_tag`: two tags ($a, $b); throw $b → inner
catch $b → rethrow 0; outer catch $b wins over outer catch $a
(= 11).
- `rethrow_depth_one`: nested catches; from inside the
innermost (which caught $b), `rethrow 1` re-raises the
*outer* catch's tag ($a). All 23 cases in the EH correctness
suite pass; AS / porf-fast / porf-accurate benchmark medians
overlap the prior commit's range within run-to-run variance
(three runs each).
Wires up the runtime + loader for `try ... delegate N` so the throw
walker can re-raise the exception at the target block's location
without spending hot-op budget.
Loader (wasm_loader.c, WASM_OP_DELEGATE case):
Skip the shared `check_branch_block_for_delegate` — its
`emit_br_info` call would write 12 bytes of branch metadata
between the auto-emitted DELEGATE label and the next op, dead
weight at runtime and (worse) the same alignment-shift gotcha
that bit RETHROW. Do the depth read + bounds check inline.
In pass 2, count try/catch/catch_all blocks STRICTLY between the
delegate's frame and the target block — that count (`delta`) is
exactly how many eh-stack entries the runtime walker must skip
past, by spec.
Runtime (wasm_interp_fast.c):
* find_a_catch_handler: before catch-matching, check
`entry->delegate_target_depth`. If set, mark the delegate's
own eh-stack entry consumed (STATE bit) and do `i -= delta;
continue;` so the for-loop's natural i-- lands on the first
eh-stack entry strictly outside the target block. The
`delta + 1 >= i` guard catches "delegate to function block"
(target lies outside this function's eh-stack) and falls
through to the existing "no handler in this frame"
return_func path.
* WASM_OP_DELEGATE: split out of the "unsupported opcode" stub
into its own normal-flow handler — fires when the try body
completes without throwing; just `frame->eh_count--` and
advance.
Cost shape preserved: zero new bytes in CALL / LOAD / STORE; all
delegate work lives on the cold throw walker or the cold normal-
flow exit handler.
Wires up the loader + runtime path so a tagged exception with i32 /
i64 / v128 parameters delivers its payload to the matching catch
body's operand stack — same-function dispatch only. Cross-function
dispatch (callee throws, caller catches) still drops the payload;
that gap is now surfaced explicitly via the
`cross_function_tag_with_params` integration test (#[ignore]'d
with the same justification recorded in AGENTS.md).
WASMFastEHCatch grows two fields:
    uint32 param_cell_num;
    int16 *param_dst_offsets;
The dst-slots array is a loader-owned int16[] of length
`param_cell_num`, capturing the cell-wise frame_lp slot offsets
that the catch body's downstream ops will pop from. NULL for the
common tag-without-params case (Porffor's empty-payload tags, all
of the spec-test's `tag $err` declarations) — no heap allocation
and the runtime walker's copy loop is a trivial zero-iteration
no-op.
Loader (wasm_loader.c) — CATCH case:
* Swap `PUSH_TYPE` for `PUSH_OFFSET_TYPE` so the catch body's
incoming params get fresh `dynamic_offset` slots allocated +
emitted as int16 operands in the IR (right after the eh_idx
immediate). The PUSH_OFFSET_TYPE emits are dead bytes on the
normal-flow CATCH dispatch (which only reads eh_idx and
branches to end_of_region_pc), but they're necessary so the
catch body's POP_OFFSET_TYPEs find the right slot offsets in
frame_offset[].
* Pass 2 captures handler_pc AFTER the PUSH_OFFSET_TYPEs so the
throw walker's `frame_ip = handler_pc` lands at the first byte
of the catch body proper (skipping the dead dst-slot bytes).
* Pass 2 also bh_memcpy_s's frame_offset[]'s top
`param_cell_num` cells into a fresh int16[] on the catch's
WASMFastEHCatch — these are the destination offsets the
runtime walker will write payload values to.
* Free path in wasm_loader_unload extended to free the
per-catch dst-offsets array.
Loader — THROW case (wasm_loader.c):
* Moved the existing `emit_uint32(tag_index)` below the
tag-type lookup + validation so `tag_type->param_cell_num` is
available.
* After tag_index, emit `<param_cell_num:u32>` plus
`<src_offset_i:int16>` for i in 0..param_cell_num. The src
offsets are read directly off the top of `loader_ctx->
frame_offset[]` — the validation loop above pops frame_ref
but doesn't touch frame_offset, so they're stable. Both
traverses run the same emit to keep pass-1 / pass-2 size
accounting balanced.
Runtime (wasm_interp_fast.c) — new locals in the dispatch
function (cold-path only, same scope as `exception_tag_index`):
    uint32 throw_param_cell_num = 0;
    int16 *throw_src_offsets = NULL;
These get populated by HANDLE_OP(WASM_OP_THROW), which now reads
tag_index + param_cell_num + the src-offsets array off the IR
(advancing frame_ip past all three). The pair is consumed by
find_a_catch_handler's catch-match dispatch: on a typed-catch
match it does the cell-wise copy `frame_lp[dst[c]] =
frame_lp[src[c]]`. catch_all dispatch explicitly drops the
payload (per spec — catch_all binds no exception values). The
copy loop is fully cold (only THROW reaches here); CALL / LOAD
/ STORE handlers untouched.
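The cell-wise copy itself is small. A hedged sketch, with `copy_payload` as a hypothetical helper name and `frame_lp` modelled as an array of 32-bit cells:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a thrown payload from the throw site's source slots to the
 * matched catch's destination slots, one cell at a time. With
 * cell_num == 0 the loop never runs, so dst and src may be NULL. */
static void
copy_payload(uint32_t *frame_lp, const int16_t *dst, const int16_t *src,
             uint32_t cell_num)
{
    for (uint32_t c = 0; c < cell_num; c++)
        frame_lp[dst[c]] = frame_lp[src[c]];
}
```

Copying by cell rather than by value type keeps the walker independent of the tag's signature: an i64 is simply two consecutive cells, a v128 four.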
WASM_OP_RETHROW: extended to re-point throw_src_offsets at the
matched catch's `param_dst_offsets` before the goto
find_a_catch_handler, so rethrow from inside a typed catch carries
the same payload outward. The catch body can't mutate the dst slots
(they're allocated from `dynamic_offset`, separate from the
local-slot range that local.set writes to), so the values are
still the original ones at rethrow time. Rethrow from inside a
catch_all (whose `param_dst_offsets == NULL`) falls back to
zero-cell; this is documented as a known limitation.
return_func hook: the cross-frame branch zeros
throw_param_cell_num and throw_src_offsets before the goto
find_a_catch_handler, since the callee's source slots live in a
frame that's about to be torn down. These are the same
payload-dropping semantics as the existing
cross-function-no-payload case, but explicit instead of relying on
uninitialized stack.
Cost shape preserved: zero new bytes in CALL / LOAD / STORE.
EH_ENTRY_CELLS still 2; no extra cells per try-region. The two
new locals get spilled by the compiler since the hot loop
doesn't reference them.
Two bugs surfaced once same-function tag-with-params actually got
exercised by integration tests:
1. **`PUSH_OFFSET_TYPE` is offset-only.** The CATCH loader was
bumping `dynamic_offset` + `frame_offset[]` but never
`stack_cell_num`, leaving the operand and ref stacks out of
sync. The catch body's first consumer (e.g. `global.set $g`)
then hit `wasm_loader_pop_frame_offset`'s polymorphic
short-circuit — the CATCH block inherits the polymorphic flag
from THROW's `SET_CUR_BLOCK_STACK_POLYMORPHIC_STATE` and with
`available_stack_cell == 0` the pop silently returned without
emitting the source-slot operand bytes. The consumer's
runtime read then landed on heap garbage and crashed with
SIGBUS / SIGSEGV. Fix: pair `PUSH_OFFSET_TYPE` with `PUSH_TYPE`
(ref-only) so both stacks advance in lockstep.
2. **Multi-cell `frame_offset[]` entries are unreliable past
the first cell.** `wasm_loader_push_frame_offset` writes a
meaningful int16 only for the FIRST cell of a multi-cell
value (i64, f64, v128); the subsequent cell entries are left
uninitialized (just a pointer increment, no write). My pass-1
THROW src-offset emit and pass-2 CATCH dst-offset capture
were reading those uninitialized cells directly, producing
garbage offsets for any param wider than 32 bits.
Fix: walk params (not cells) and synthesize consecutive cell
offsets `(first, first+1, ..., first+N-1)` per param, where
`first = frame_offset[cell_so_far]`. Matches the runtime
invariant that an N-cell value occupies N consecutive
frame_lp cells.
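The per-param synthesis from fix 2 can be sketched as below. The function name and the `param_cells` array (cell count per param: 1 for i32, 2 for i64, 4 for v128) are assumptions for the example; `frame_offset` is modelled as the slice covering the payload, where only the first cell of each param holds a meaningful value.

```c
#include <stdint.h>

/* Walk params (not cells): read only the first cell's offset for each
 * param and synthesize first, first+1, ..., first+N-1 for the rest.
 * Returns the total number of cells written to out. */
static uint32_t
synthesize_cell_offsets(const int16_t *frame_offset,
                        const uint8_t *param_cells, uint32_t param_count,
                        int16_t *out)
{
    uint32_t cell_so_far = 0;
    for (uint32_t p = 0; p < param_count; p++) {
        int16_t first = frame_offset[cell_so_far];
        for (uint32_t k = 0; k < param_cells[p]; k++)
            out[cell_so_far + k] = (int16_t)(first + k);
        cell_so_far += param_cells[p];
    }
    return cell_so_far;
}
```

For an (i32, i64) payload whose first cells sit at offsets 10 and 11, this yields 10, 11, 12 no matter what the uninitialized third `frame_offset` entry contains.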
3 new integration tests cover the fixes:
* `tag_single_i64_param` — 2-cell payload
* `tag_mixed_i32_i64_params` — exercises per-param cell
synthesis (would fail if cell-walk offset by 1)
* `repeated_throw_with_payload` — confirms catch-allocated
dst slots get fresh writes every invocation
Plus a wat fix in `nested_try_with_params_inner_wins`: the
outer catch's body was `i32.const 999 / global.set $g`, leaving
the param on the operand stack at `end`. That was a latent bug
masked before tag-with-params support (PUSH_TYPE-only didn't
let the param "exist" for validation purposes). Now corrected
by adding an explicit `drop` so the catch body's stack
validates clean.
No hot-op cost change: all the new loader work is in the cold
CATCH / THROW preprocess paths, and the runtime walker copy
loop is unchanged.
`try (result T)` regions now route the try body's normal-flow
value into the block's `dynamic_offset` slot the same way `else`
routes the if-body's value via `reserve_block_ret`. The throw-
dispatch path's catch-body END already handled the catch's COPY
via the existing reserve_block_ret call; this patch fills the
remaining gap by injecting a COPY before each CATCH/CATCH_ALL
label so the normal-flow exit (try body completes without
throwing → falls through to CATCH → CATCH runtime handler jumps
to end_of_region_pc) also deposits the value at the right slot.
Loader (wasm_loader.c):
* WASM_OP_CATCH and WASM_OP_CATCH_ALL: before the existing
emit_uint32(eh_idx) emit, call `check_block_stack` on the
previous body (the try body on the first CATCH; the prior
catch body on subsequent ones) and emit an
EXT_OP_COPY_STACK_TOP / _I64 / _V128 if the body's last cell
isn't already at `cur_block->dynamic_offset`. The
`src != dst` predicate runs in both passes; the sign-stable
nature of dynamic_offset (≥ 0) vs const-pool slots (≤ -1)
keeps pass-1 size accounting and pass-2 writes aligned even
though const-pool slots get renumbered by the qsort/dedup at
the start of pass 2.
* Both cases now also `SET_CUR_BLOCK_STACK_POLYMORPHIC_STATE
(false)` after `RESET_STACK()`, matching how `WASM_OP_ELSE`
resets the if-body's polymorphic flag. Without this reset, a
catch body following a throw inherits the polymorphic state
and `check_block_stack` at END takes the polymorphic branch
(`POP_OFFSET_TYPE` → 2 bytes per return-cell emitted). Those
bytes land between the auto-emitted END label and the EH-END
branch's `skip_label()`, shifting the re-emitted END label
forward and leaving a corrupt handler-ptr at the recorded
`handler_pc` — SIGSEGV on the first dispatch.
Multi-return-value try-regions get an explicit "not yet
supported" error; they need `EXT_OP_COPY_STACK_VALUES` emit
support that's not in this commit. Single-return-value covers
every shape Porffor / AS / our 51-case integration suite emits.
6 new result-typed integration tests (single i32 / i64, with
and without throw, multi-catch picked by tag, catch_all
fallback, mixed-with-locals slot allocation). Plus a wat fix in
`multiple_catches_with_params_pick_by_tag`: the `catch $a` body
left its param on the operand stack before the catch-to-catch
transition. The previous loader didn't validate catch
transitions, so this latent imbalance was silently accepted;
now `check_block_stack` runs at every CATCH, catches the
unbalanced stack, and reports the spec-required `type mismatch:
block requires [] but stack has [i32]`. Added an explicit
`drop` in the catch body so the test's wat validates clean.
Verified end-to-end: 51/51 EH integration tests pass (was 45/45
before; +6 new result-typed cases). porf-accurate runs at 15.6
ms median (no regression vs the 17.3 ms baseline; small
improvement plausibly from the polymorphic-reset path no longer
emitting redundant POP_OFFSET_TYPE operands).
Adds a load-time warning when a br / br_if / br_table opcode
crosses one or more LABEL_TYPE_TRY / _CATCH / _CATCH_ALL
frames, because the runtime br doesn't pop the eh-stack — each
crossed try-region leaks one eh-stack entry that survives until
frame teardown.
The simple case (single br out of a try; e.g. the
`br_out_of_try_pops_eh_stack` integration test) is benign: the
per-frame eh-stack reservation
(`exception_handler_count * EH_ENTRY_CELLS` cells, covering
every static try-block in the function) leaves room for one
stale entry alongside any subsequent sibling try's push, and the
top-down walker iterates from `eh_count` down so sibling-try
throws still match the most recent push first. The stale entry
dies when the frame is freed at function return.
The pathological case, `loop { try { br_to_loop_top } catch }`,
leaks one entry per iteration and eventually overflows the
static reservation.
`bh_assert(eh_count < exception_handler_count)` would catch this,
but `bh_assert` is a no-op in release builds (`BH_DEBUG` is unset
there), so the out-of-bounds writes go through silently. The
warning surfaces the shape in load-time diagnostics so a real
embedder sees it before the hard-to-diagnose runtime corruption.
`count_try_blocks_crossed(cur_block, target_block)` walks csp
positions from cur_block down to target_block inclusive (target
included because br to a non-LOOP target lands AFTER target's
end, skipping it; LOOP targets aren't try-typed so the inclusive
vs exclusive distinction doesn't change the count). The check
fires only in pass 1 (`loader_ctx->p_code_compiled == NULL`) so
each br site logs once even though wasm_loader_prepare_bytecode
runs the bytecode twice. No hot-op cost — this is loader-time
only.
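The inclusive walk can be sketched over an array-backed control stack. This is a simplified model, not the loader's code: the real function takes block pointers, whereas the `_sketch` variant below takes indices into a hypothetical `csp` array where index 0 is the outermost block, and `LabelSketch` stands in for the loader's label-type enum.

```c
#include <stdint.h>

typedef enum LabelSketch {
    LABEL_SKETCH_BLOCK,
    LABEL_SKETCH_LOOP,
    LABEL_SKETCH_IF,
    LABEL_SKETCH_TRY,
    LABEL_SKETCH_CATCH,
    LABEL_SKETCH_CATCH_ALL
} LabelSketch;

/* Count try-typed frames from csp[cur_idx] down to csp[target_idx],
 * both ends included. Requires target_idx <= cur_idx. */
static uint32_t
count_try_blocks_crossed_sketch(const LabelSketch *csp, uint32_t cur_idx,
                                uint32_t target_idx)
{
    uint32_t count = 0;
    for (uint32_t i = cur_idx + 1; i > target_idx; i--) {
        LabelSketch t = csp[i - 1];
        if (t == LABEL_SKETCH_TRY || t == LABEL_SKETCH_CATCH
            || t == LABEL_SKETCH_CATCH_ALL)
            count++;
    }
    return count;
}
```

A non-zero result is what triggers the load-time warning; the count equals the number of eh-stack entries the branch would leave behind.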
Verified: porf-accurate doesn't trigger the warning (no
br-across-try patterns in the Porffor emit shape, consistent
with the PMU profile showing zero hot-op overhead from EH).
`br_out_of_try_pops_eh_stack` integration test triggers the
warning once and still passes.
… checks
Marks the four structurally-cold paths in WASM_OP_CALL_INDIRECT —
out-of-bounds table index, uninitialized element, unknown function
(post-table lookup), indirect-call type mismatch — with
`__builtin_expect(cond, 0)`. Well-formed wasm modules pass all four
on every dispatched CALL_INDIRECT; the hint lets the compiler:
(a) provide a static-bias fallback for the branch predictor on
unseen call sites (first-iteration impact only — Apple
Silicon's predictor learns the bias dynamically after a few
hits anyway);
(b) lay out the error-handling tail away from the hot path so
each pass-through case stays in straight-line I-cache.
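To show the shape of the hint, here is a reduced sketch covering two of the four checks. Everything in it is illustrative: `UNLIKELY`, `FuncSketch`, `call_indirect_sketch`, the trap codes, and the flat function-pointer table are assumptions, and the real handler also checks for unknown functions and signature mismatches.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x) /* no hint on other compilers */
#endif

typedef int32_t (*FuncSketch)(int32_t);

/* Trivial callee used to exercise the table. */
static int32_t
add_one_sketch(int32_t x)
{
    return x + 1;
}

/* Returns 0 on success or a non-zero hypothetical trap code. The
 * UNLIKELY hints tell the compiler the trap branches are cold. */
static int
call_indirect_sketch(FuncSketch const *table, uint32_t table_size,
                     uint32_t idx, int32_t arg, int32_t *result)
{
    if (UNLIKELY(idx >= table_size))
        return 1; /* out-of-bounds table index */
    if (UNLIKELY(table[idx] == NULL))
        return 2; /* uninitialized element */
    *result = table[idx](arg);
    return 0;
}
```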
Measured on iPhone 12 (A14, Icestorm E-cores) with the
graphql-validation workloads — bucket-share deltas are within
run-to-run noise on both Porffor and AS, but the Porffor
bottleneck is `Processing` (56.78%, backend / load-store
saturation) not branch prediction (4.19% Discarded). AS's E-core
shows the structural opportunity (27.22% Discarded) but that's
the goto-indirect-branch in FETCH_OPCODE_AND_DISPATCH, not the
direct branches inside CALL_INDIRECT.
Kept as documentation-as-code: the cold-path semantic is real
(spec-required traps that ~never fire on validated modules), and
the compiler-time cost is zero. Full PMU writeup in
out/eh-pmu-iphone12-2026-05-18.md (gitignored).
No correctness change. No hot-op runtime cost. Doesn't affect EH
code paths.
The legacy exception-handling spec test suite was previously hardcoded
to skip every running mode except classic-interp:
    if [[ "${RUNNING_MODE}" != "classic-interp" ]]; then
        echo "support exception handling in classic-interp"
        return 0
    fi
Now that fast-interp supports the full legacy-EH proposal (TRY / CATCH /
CATCH_ALL / RETHROW / DELEGATE / tag-with-params), the gate should
allow both modes. This matches the parallel `ENABLE_GC` block a few
lines down that already lists `classic-interp` AND `fast-interp` as
acceptable.
After this change, `./test_wamr.sh -t fast-interp -m exception-handling`
runs the upstream WebAssembly spec EH suite against the fast
interpreter — the same suite already validated against classic
interp.
When a throw from a nested try is caught by an OUTER handler, the
walker previously left the inner-try entries between the throw site
and the matched outer entry on the eh-stack. The matched entry got
its `EH_TRY_CATCH_STATE_BIT` set, but `frame->eh_count` stayed
unchanged. After the outer catch body's END decremented eh_count by
one, the inner-try slot remained at the top of the eh-stack with
the matched outer entry now sitting *under* it (in-progress bit
set).
A subsequent throw inside (or after) the outer catch body would
walk that stale state. The walker SKIPs entries with the state bit
set, so the outer entry was correctly ignored — but the inner-try
entry (no state bit) was treated as live. If the inner try's typed
catch happened to match the new tag, the walker dispatched against
that stale entry — an out-of-scope catch.
Worse, in a tight loop of `outer try { inner try { throw }
catch_other catch_outer { ... } }`, every iteration leaked one
inner-try entry. After more iterations than the function's
`exception_handler_count`, the next TRY push wrote past the static
eh-stack reservation (silently in release builds since `bh_assert`
is a no-op without `BH_DEBUG`).
Fix: at each match-and-dispatch site in `find_a_catch_handler` —
both the typed-catch branch and the catch_all branch — set
`frame->eh_count = i;` before jumping to the handler. `i` is the
loop counter, which equals the index of the matched entry plus
one. This pops the nested-try entries above the match in a single
indexed store. The matched entry stays at index i-1 with its state
bit set; the catch body's END pops it normally when the body
completes.
Cost shape: one extra indexed store on the cold throw path, only
when a typed catch or catch_all matches. CALL / LOAD / STORE
handlers are untouched.
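The fix amounts to one store at the dispatch site. A minimal sketch, assuming a hypothetical `UnwindFrameSketch` with a fixed-size one-cell-per-entry eh-stack (the real layout uses `EH_ENTRY_CELLS` cells per entry inside the frame allocation):

```c
#include <stdint.h>

#define EH_STATE_BIT 0x80000000u /* stands in for EH_TRY_CATCH_STATE_BIT */

typedef struct UnwindFrameSketch {
    uint32_t eh_count;
    uint32_t cells[8];
} UnwindFrameSketch;

/* Called when the walker's loop counter i (matched index + 1) found a
 * handler: mark the matched entry as in-progress and drop every
 * nested-try entry above it in a single store. */
static void
dispatch_to_match(UnwindFrameSketch *f, uint32_t i)
{
    f->cells[i - 1] |= EH_STATE_BIT;
    f->eh_count = i;
}
```

After the call, the matched entry is the top of the eh-stack, so the catch body's END pops exactly that entry and nothing stale remains above it.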
Test added in the external integration suite at
`crates/benchmark-core/tests/eh_correctness.rs::
outer_catch_unwinds_inner_eh_entries`. The test pattern is: outer
try catches `$err`; inner try has a catch for `$err2`. Inner throw
of `$err` is caught by outer. Outer catch body re-throws `$err2`,
which must propagate UNCAUGHT (inner try is out of scope). Pre-fix,
the walker found the stale inner catch and dispatched to it,
producing an Ok(99) instead of the trap; post-fix, the walker has
no in-scope entries and the throw escapes correctly.
Codex P1 review feedback on rebeckerspecialties/wasm-micro-
runtime PR #2: "Unwind skipped EH entries before dispatching
catches".
The walker's "no handler in this frame" path previously set
`prev_frame->exception_raised = true` and let `return_func` forward
the throw to the caller, regardless of payload size. This silently
lost the payload: the source cells (`throw_src_offsets`) live in
*this* frame's `frame_lp`, which `return_func` is about to tear down.
The caller's `find_a_catch_handler` then ran with
`throw_param_cell_num = 0`, which made any typed catch in the caller
bind uninitialized destination slots: the catch body would either
see garbage in its payload locals or, if the typed catch's slots
were used as struct-of-pointers, dereference freed memory.

Cross-function payload preservation would require a per-thread
scratch buffer to ferry the payload across the frame boundary
(callee's `frame_lp` → buffer → caller's `frame_lp`), plus a small
change to `return_func` to populate it before tearing down the
callee. That's a meaningful design lift and out of scope for this
commit.

Safe action for now: when a payload-bearing throw escapes its callee
(i.e. `throw_param_cell_num > 0` and we're about to return to a
caller frame), trap to the host with the diagnostic
`"cross-function exception payload not supported by fast-interp"`.
Same-function payload routing (the common Porffor / AS shape, where
a JS throw is caught by an in-function catch the JS-to-wasm compiler
emitted) is unaffected: that path dispatches via the same-function
match in the walker before this branch runs.

A `catch_all` in the caller would technically tolerate a
zero-payload bind, but the typed-vs-catch_all choice happens in the
caller's walker, which we can't peek into here without coupling the
frames. Trap unconditionally for payload-bearing cross-frame throws.

Tests:
* `cross_function_tag_with_params` stays `#[ignore]`: that's the
  eventual success case for when cross-frame payload routing is
  implemented.
* `cross_function_tag_with_params_traps` (new) asserts the current
  trap-with-expected-message contract on the same module shape.

Codex P1 review feedback on rebeckerspecialties/wasm-benchmark PR #3
(patch 0007 line 306): "Preserve cross-frame exception payloads".
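The trap-instead-of-forward rule described in this commit message can be sketched as a small standalone decision function. This is an illustration only, not the code in `wasm_interp_fast.c`: the enum, the function name `decide_unhandled_throw`, and the `has_caller_frame` flag are hypothetical; only `throw_param_cell_num` and the diagnostic string come from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcomes for the walker's "no handler in this frame" path. */
typedef enum {
    UNHANDLED_FORWARD,     /* keep the existing forwarding path */
    UNHANDLED_TRAP_TO_HOST /* refuse to forward: the payload would be lost */
} UnhandledThrowAction;

/* Diagnostic text quoted in the commit message. */
const char *const CROSS_FRAME_PAYLOAD_MSG =
    "cross-function exception payload not supported by fast-interp";

/*
 * Decide what to do with a throw that found no handler in the current frame.
 *   throw_param_cell_num: cells occupied by the tag's parameters (0 = none)
 *   has_caller_frame:     true when the throw would be handed to a wasm caller
 *   trap_msg:             receives the diagnostic on a trap, NULL otherwise
 */
UnhandledThrowAction
decide_unhandled_throw(uint32_t throw_param_cell_num, bool has_caller_frame,
                       const char **trap_msg)
{
    assert(trap_msg != NULL);
    *trap_msg = NULL;

    /* The payload's source cells live in the frame that is about to be torn
       down, so a payload-bearing throw cannot safely cross into the caller. */
    if (throw_param_cell_num > 0 && has_caller_frame) {
        *trap_msg = CROSS_FRAME_PAYLOAD_MSG;
        return UNHANDLED_TRAP_TO_HOST;
    }
    return UNHANDLED_FORWARD;
}
```

Under these assumptions a payload-free throw (`throw_param_cell_num == 0`) still takes the forwarding path, which matches the message's claim that only payload-bearing cross-frame throws change behavior.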
…egion
When a br skips over a try-region's END, the runtime br doesn't pop
eh-stack entries. For a one-shot br to a block / function-end /
catch, the leaked entry is absorbed by the static
`exception_handler_count * EH_ENTRY_CELLS` reservation and dies at
frame teardown — a load-time `LOG_WARNING` surfaces the shape for
embedders.
If the br target is a LOOP entry, however, every iteration's TRY
push adds one more entry to the eh-stack. After more iterations
than the function's `exception_handler_count`, the next TRY push
writes past the static reservation. `bh_assert(eh_count < count)`
catches this in debug builds, but is a no-op without `BH_DEBUG` —
release builds silently corrupt whatever sat past the reservation
in the frame allocation.
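To make the per-iteration leak concrete, here is a toy model of it. It is a sketch under simplifying assumptions: the loop body contains exactly one TRY, every iteration branches back to the loop entry without popping, and the reservation is tracked as an entry count rather than in cells. `ToyEhStack`, `toy_eh_push`, and `first_overflowing_iteration` are hypothetical names, not WAMR symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the per-frame eh-stack: only the entry count matters here. */
typedef struct {
    uint32_t capacity; /* stands in for exception_handler_count */
    uint32_t eh_count; /* entries currently pushed */
} ToyEhStack;

/* Returns false when a push would land past the static reservation. */
bool
toy_eh_push(ToyEhStack *s)
{
    assert(s != NULL);
    if (s->eh_count >= s->capacity)
        return false;
    s->eh_count++;
    return true;
}

/*
 * Model a loop whose body pushes one TRY entry per iteration and then
 * branches back to the loop entry, skipping the pop at the try-region's END.
 * Returns the 1-based iteration whose TRY push no longer fits.
 */
uint32_t
first_overflowing_iteration(uint32_t exception_handler_count)
{
    ToyEhStack s = { exception_handler_count, 0 };
    uint32_t iteration = 1;

    while (toy_eh_push(&s)) {
        /* br to loop entry: no pop happens, so the entry leaks. */
        iteration++;
    }
    return iteration;
}
```

In this model a function with `exception_handler_count == 1` overflows on its second iteration, which is the point where a release build without the `bh_assert` would write past the reservation.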
This commit changes that pathological shape from "log a warning
and accept" to "fail load with an explicit error". The check sits
next to the existing `count_try_blocks_crossed > 0` warning at all
three branch sites (BR, BR_IF, BR_TABLE) and only fires when
`frame_csp_tmp->label_type == LABEL_TYPE_LOOP`. The error message
is identical at each site modulo opcode name:
"br[_if|_table] to loop entry from inside try-region not
supported in fast interpreter (would leak eh-stack entries
per iteration)"
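A minimal sketch of what such a load-time check might look like, assuming the loader already knows how many try-regions a branch crosses and the label type of its target. The enum is a simplified stand-in for the loader's label kinds, and `check_br_out_of_try` and its parameters are hypothetical; only `LABEL_TYPE_LOOP` and the error text are taken from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the loader's label kinds. */
typedef enum {
    LABEL_TYPE_BLOCK,
    LABEL_TYPE_LOOP,
    LABEL_TYPE_FUNCTION
} ToyLabelType;

/*
 * Returns true when the branch is acceptable, false (with error_buf filled)
 * when it targets a loop entry from inside a try-region.
 * opcode_name is "br", "br_if" or "br_table".
 */
bool
check_br_out_of_try(const char *opcode_name, ToyLabelType target_label_type,
                    uint32_t try_blocks_crossed, char *error_buf,
                    size_t error_buf_size)
{
    assert(opcode_name != NULL && error_buf != NULL && error_buf_size > 0);
    error_buf[0] = '\0';

    if (try_blocks_crossed == 0)
        return true; /* the branch does not leave any try-region */

    if (target_label_type == LABEL_TYPE_LOOP) {
        snprintf(error_buf, error_buf_size,
                 "%s to loop entry from inside try-region not supported in "
                 "fast interpreter (would leak eh-stack entries per "
                 "iteration)",
                 opcode_name);
        return false;
    }

    /* One-shot branch: the leaked entry fits the static reservation, so the
       module is accepted (the real loader logs a warning at this point). */
    return true;
}
```

Keeping the check as one helper shared by the three branch sites is an assumption of this sketch; the message only says the check sits next to the existing warning at each site.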
Emitting a synthetic eh-stack pop at the br site would be the
other fix and would let valid modules with this shape run, but it
complicates the rewritten IR's br-info layout (the br dispatch
currently emits a single uint32 depth; a pop-count immediate
would need a per-target lookup) and the shape is rare in
practice. Rejecting at load is the conservative, App-Store-safe
choice — embedders see a deterministic error rather than silent
memory corruption.
Test added in the external integration suite: the previously-
ignored `br_out_of_try_inside_loop` became
`br_out_of_try_inside_loop_rejected`, which asserts the loader
fails with the expected error string.
Codex P1 review feedback on both PRs ("Reject branches that leak
EH entries" / "Reject branches that leak EH stack entries").
Windows MSVC build of upstream PR bytecodealliance#4949 failed with `LNK2019: unresolved external symbol __builtin_expect` because `__builtin_expect` is a GCC/Clang builtin and MSVC has nothing equivalent. The branch-predictor hints are an optimization, not correctness, so the simplest portable fix is a no-op fallback gated on `!defined(__GNUC__) && !defined(__clang__)`. Lives at the top of `wasm_interp_fast.c` rather than in `bh_platform.h` to avoid touching the shared header for a local cold-path concern.
Upstream PR bytecodealliance/wasm-micro-runtime#4949 failed every `build_iwasm` matrix entry on Windows MSVC with `LNK2019: unresolved external symbol __builtin_expect referenced in function wasm_interp_call_func_bytecode`. The cold-path hints we added in patch 0011 use the GCC/Clang `__builtin_expect` intrinsic; MSVC has no equivalent.

Drop-in no-op shim gated on `!defined(__GNUC__) && !defined(__clang__)`. The hints are branch-predictor optimization, not correctness, so dropping them on MSVC is fine.

Same change is on the upstream PR branch as commit `0411662d` (separate fixup commit; lands in the PR sequence right after patch 0011's equivalent).

Stack-position rationale: patch 0024 (after linmem 0023) inserts 9 lines near the top of `wasm_interp_fast.c` between the SIMDe include guards and `typedef int32 CellType_I32`. Putting it last in the apply-stack avoids shifting line-number anchors for any of the earlier patches.
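A no-op fallback of the kind described might look like the following. This is a sketch, not the exact 9 lines from the patch: the guard condition is the one quoted above, while the `TOY_UNLIKELY` wrapper and `toy_checked_get` are hypothetical names added to show a call site.

```c
#include <assert.h>
#include <stddef.h>

/* On compilers that are neither GCC nor Clang (e.g. MSVC), make the hint
   evaluate to its first argument so call sites compile and link unchanged. */
#if !defined(__GNUC__) && !defined(__clang__)
#define __builtin_expect(expr, expected) (expr)
#endif

/* Hypothetical convenience wrapper for cold-path conditions. */
#define TOY_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Bounds-checked read: returns 0 and stores the element on success,
   -1 when the index is out of range (the cold path). */
int
toy_checked_get(const int *arr, size_t len, size_t index, int *out)
{
    assert(arr != NULL && out != NULL);
    if (TOY_UNLIKELY(index >= len))
        return -1;
    *out = arr[index];
    return 0;
}
```

Because the fallback expands to the bare expression, the program's behavior is the same on every compiler; only the branch-layout hint is lost where the builtin is unavailable.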
Update: pushed Of the remaining single CI failure ( |
Lifts the cmake guard at
`build-scripts/unsupported_combination.cmake:67` that forbids `WAMR_BUILD_EXCE_HANDLING=1 WAMR_BUILD_FAST_INTERP=1` and adds the matching dispatch loop coverage in
`wasm_interp_fast.c`: loader-side EH metadata table, runtime EH-frame stack, catch-walk for
`throw`/`rethrow`, `delegate` forward-to-outer, tag-with-params payload routing, and result-typed
try-region COPY-at-CATCH alignment. I tried to keep each step bisectable.
Why we built this: we're replacing WasmEdge with WAMR fast-interp as the wasm
runtime in a pure-interpreter App-Store-eligible app, and a migration
blocker is
`graphql-validation` compiled by Porffor: JS-to-wasm output that lowers `try`/`catch`/`throw` to the wasm-exceptions section. Without EH enabled, fast-interp rejects the binary at load with
`invalid section id`; with EXCE_HANDLING + CLASSIC_INTERP it loads, but fast-interp is 1.3–1.8× faster on every benchmark we
ran.
Cross-microarch benchmarks: M4 Lion P / M4 Sawtooth E / A14 Icestorm (iPhone 12) / A12 Tempest (iPhone XS) /
S8 (Watch SE2) at
https://github.com/rebeckerspecialties/wasm-benchmark/blob/claude/relaxed-simd-diff-fuzz/README.md#cross-runtime-results-across-apple-silicon-e-cores.
Integration tests in our benchmark repo include a Porffor-compiled
`graphql-validation` workload that mirrors the real-world `try { visit(…) } catch (e) { if (e !== abortObj) throw e; }` shape and exercises every EH opcode the loader emits. ASan + UBSan builds are part of the local dev loop.
Companion PR: relaxed-SIMD fast-interp opcode lowering, posted separately
(`f32x4.relaxed_madd` etc). We validated that existing benchmarks perform nearly identically in terms of wall-clock time, throughput, cache behavior, and branch prediction, using the CPU bottleneck template in xctrace.