From ae1ed15022576735e813b1a6d7d42d843513e71c Mon Sep 17 00:00:00 2001 From: Marco Walz Date: Thu, 16 Apr 2026 18:20:48 +0200 Subject: [PATCH 1/2] docs: large Wasm guide --- docs/guides/canister-management/large-wasm.md | 257 +++++++++++++++++- 1 file changed, 245 insertions(+), 12 deletions(-) diff --git a/docs/guides/canister-management/large-wasm.md b/docs/guides/canister-management/large-wasm.md index 37a99340..6651fb0c 100644 --- a/docs/guides/canister-management/large-wasm.md +++ b/docs/guides/canister-management/large-wasm.md @@ -1,21 +1,254 @@ --- title: "Large Wasm Modules" -description: "Deploy canisters that exceed the 2MB Wasm limit using chunk store and compression" +description: "Deploy canisters that exceed the 2 MiB Wasm limit using chunk store and compression" sidebar: order: 9 --- -TODO: Write content for this page. +ICP enforces a 2 MiB message size limit that applies to Wasm modules uploaded via `install_code`. Canisters with complex business logic, embedded ML models, or large dependency trees often exceed this threshold. There are two complementary approaches: reduce the module size with compression and dead-code stripping, or bypass the limit entirely by uploading the module in chunks. - -Deploy canisters with Wasm modules larger than the 2MB limit. Cover the Wasm chunk store for splitting large modules, gzip compression for reducing size, the ic-wasm tool for stripping and optimizing, and Wasm64 support for 64-bit memory. Explain when and why you might need large modules (ML models, complex business logic). Include a section on WebAssembly SIMD — 200+ vector instructions for parallel computation that accelerate AI/ML inference, image processing, cryptographic operations, and other math-heavy workloads. SIMD is available on every ICP node. +This guide covers both approaches, explains Wasm64 for canisters that need extended memory, and introduces WebAssembly SIMD for computationally intensive workloads. 
- -- Portal: building-apps/developing-canisters/compile.mdx (large Wasm section) -- Examples: backend_wasm64 (Rust) -- icp-cli: --wasm-chunk-store flag +## Why Wasm modules grow large - -- guides/canister-management/optimization -- reducing Wasm size to avoid this entirely -- reference/execution-errors -- Wasm size errors -- guides/canister-management/lifecycle -- deployment with chunk store +A compiled Wasm binary grows for several reasons: + +- **Dense dependency trees** — Rust canisters that pull in many crates accumulate dead code that the compiler cannot always eliminate. +- **Embedded data** — ML model weights, large lookup tables, or static assets compiled into the binary. +- **Complex business logic** — feature-rich canisters with many update and query methods. +- **Debug symbols** — by default, Rust release builds include name sections and other debug metadata. + +Before reaching for the chunk store, consider whether [canister optimization](optimization.md) can reduce the binary enough to fit under 2 MiB. + +## Approach 1: gzip compression + +ICP's management canister understands gzip-compressed Wasm modules. When the `wasm_module` field of `install_code` starts with the gzip magic bytes `[0x1f, 0x8b, 0x08]`, the system decompresses it automatically before installation. + +Gzip compression typically reduces Wasm binary size by 50-70%, which is often enough to bring a large module under the 2 MiB threshold. + +### Using a recipe + +The Rust and prebuilt recipes expose a `compress` flag that gzip-compresses the output as the final build step: + +```yaml +canisters: + - name: backend + recipe: + type: "@dfinity/rust@v3.2.0" + configuration: + package: backend + shrink: true + compress: true +``` + +Setting `shrink: true` first removes unused functions and debug metadata, then `compress: true` gzip-compresses the result. Using both together gives the largest size reduction. 
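As a quick sanity check (a sketch, not part of any recipe; file names are illustrative), you can confirm that a build artifact is gzip-compressed by looking for the magic bytes the system checks for:

```shell
# Sketch: verify an artifact starts with the gzip magic bytes
# (0x1f 0x8b 0x08) that trigger automatic decompression.
# "backend.wasm" is an illustrative file name, not a required convention.
printf 'stand-in for real Wasm bytes' > backend.wasm
gzip --no-name backend.wasm                  # produces backend.wasm.gz
head -c 3 backend.wasm.gz | od -An -tx1 | tr -d ' \n'
# prints: 1f8b08
```

If the first three bytes are anything else, the module is installed as-is rather than decompressed first.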
+ +### Using a custom build script + +If you are not using a recipe, you can compress manually in your build steps: + +```yaml +canisters: + - name: backend + build: + steps: + - type: script + commands: + - cargo build --target wasm32-unknown-unknown --release + - cp target/wasm32-unknown-unknown/release/backend.wasm "$ICP_WASM_OUTPUT_PATH" + - ic-wasm "$ICP_WASM_OUTPUT_PATH" -o "$ICP_WASM_OUTPUT_PATH" shrink --keep-name-section + - gzip --no-name "$ICP_WASM_OUTPUT_PATH" + - mv "${ICP_WASM_OUTPUT_PATH}.gz" "$ICP_WASM_OUTPUT_PATH" +``` + +The `--keep-name-section` flag preserves function names for readable backtraces while still removing dead code. Omit it if you do not need stack traces. + +## Approach 2: the Wasm chunk store + +When compression alone is not enough, the Wasm chunk store lets you upload modules larger than 2 MiB by splitting them into chunks, then assembling and installing them in one atomic operation. + +### How the chunk store works + +1. **Upload chunks** — Call `upload_chunk` on the management canister to store up to 1 MiB chunks in the target canister's chunk store. Each call returns the SHA-256 hash of the stored chunk. +2. **Assemble and install** — Call `install_chunked_code` with the ordered list of chunk hashes. The system concatenates the chunks, verifies the aggregate hash matches `wasm_module_hash`, and installs the result as if you had called `install_code` directly. + +The chunk store is bounded: each chunk is at most 1 MiB, and there is a maximum number of chunks per store (`CHUNK_STORE_SIZE` in the IC interface spec). You can inspect stored chunks with `stored_chunks` and clear the store with `clear_chunk_store`. + +### icp-cli handles this automatically + +When you run `icp deploy` or `icp canister install` with a Wasm module larger than 2 MiB, icp-cli automatically uses the chunk store — no configuration required. The tool splits the module, uploads each chunk, and calls `install_chunked_code` behind the scenes. 
```bash
icp deploy
```

### Combining compression with the chunk store

You can combine gzip compression with the chunk store. A compressed module that is still larger than 2 MiB is split into chunks as usual, but fewer chunks are needed, which means fewer upload calls and lower cycle costs. Enable both `shrink` and `compress` in your recipe, and let icp-cli decide whether chunking is needed.

### Cycle costs

Storing each chunk costs cycles proportional to 1 MiB of storage (even if the chunk is smaller). Chunks are temporary staging data, but they are not deleted automatically: they remain in the store after `install_chunked_code` completes, so call `clear_chunk_store` afterwards to stop paying for them. The same applies if an installation attempt fails or is interrupted: clear the store to reclaim the storage cycles before retrying.

## Wasm64: 64-bit memory addressing

Standard ICP canisters use the `wasm32-unknown-unknown` target, which limits addressable memory to 4 GiB. For canisters that need more — for example, those holding large in-memory datasets or running inference on large models — ICP supports the `wasm64-unknown-unknown` target with up to 6 GiB of addressable memory.

Wasm64 is a separate concern from the chunk store. You might use one, the other, or both: the chunk store addresses the 2 MiB upload limit, while Wasm64 addresses the runtime memory limit.

### Building a Wasm64 canister

Wasm64 requires the Rust nightly toolchain and the `build-std` unstable feature, because the standard library must be compiled for the `wasm64-unknown-unknown` target rather than pulled from a precompiled artifact.
+ +Create a `build.sh` script in your project directory: + +```bash +#!/bin/bash + +# Ensure nightly toolchain and rust-src are available +rustup toolchain install nightly +rustup component add rust-src --toolchain nightly + +# Build for wasm64 +cargo +nightly build \ + -Z build-std=std,panic_abort \ + --target wasm64-unknown-unknown \ + --release \ + -p backend + +cp target/wasm64-unknown-unknown/release/backend.wasm target/backend.wasm +candid-extractor target/backend.wasm > backend/backend.did +``` + +Then reference the script in `icp.yaml`: + +```yaml +canisters: + - name: backend + build: + steps: + - type: script + commands: + - ./build.sh + - cp target/backend.wasm "$ICP_WASM_OUTPUT_PATH" + - ic-wasm "$ICP_WASM_OUTPUT_PATH" -o "${ICP_WASM_OUTPUT_PATH}" metadata "candid:service" -f 'backend/backend.did' -v public --keep-name-section +``` + +The canister code itself does not require changes — the same Rust CDK code works on both `wasm32` and `wasm64`: + +```rust +#[ic_cdk::query] +fn greet(name: String) -> String { + format!("Hello, {}!", name) +} + +ic_cdk::export_candid!(); +``` + +See the [backend_wasm64 example](https://github.com/dfinity/examples/tree/master/rust/backend_wasm64) for a complete working project. + +### Memory limits and Wasm64 + +Wasm64 canisters benefit from the `wasm_memory_limit` canister setting to cap WebAssembly heap usage, preventing runaway allocations: + +```yaml +canisters: + - name: backend + build: + steps: + - type: script + commands: + - ./build.sh + - cp target/backend.wasm "$ICP_WASM_OUTPUT_PATH" + settings: + wasm_memory_limit: 4gib +``` + +## WebAssembly SIMD + +WebAssembly SIMD (Single Instruction, Multiple Data) is a set of more than 200 vector instructions defined in the WebAssembly core specification. SIMD allows a single instruction to operate on multiple data elements in parallel, which significantly accelerates compute-heavy workloads. 
+ +SIMD is available on every ICP node and does not require any special canister configuration beyond enabling the target feature in your build. + +### When SIMD helps + +SIMD provides the largest gains for workloads with regular, data-parallel structure: + +- **AI/ML inference** — matrix multiplications, activation functions, convolutions +- **Image processing** — pixel transforms, filtering, encoding/decoding +- **Cryptographic operations** — hash computation, field arithmetic +- **Scientific computing** — numerical simulations, signal processing + +For "classical" canister operations — reward distribution, token accounting, query logic — the gains are smaller but still measurable. + +### Loop auto-vectorization + +The simplest way to benefit from SIMD is to enable the `simd128` target feature and let the Rust compiler auto-vectorize loops. This is a one-line change that often provides significant speedup without rewriting any code. + +Enable SIMD globally for your entire workspace by creating `.cargo/config.toml`: + +```toml +[build] +target = ["wasm32-unknown-unknown"] + +[target.wasm32-unknown-unknown] +rustflags = ["-C", "target-feature=+simd128"] +``` + +Or enable it only for a specific function: + +```rust +#[target_feature(enable = "simd128")] +#[ic_cdk::query] +fn compute_heavy_operation() -> u64 { + // The compiler auto-vectorizes eligible loops in this function + // ... + 0 +} +``` + +Auto-vectorization works best with tight numeric loops over contiguous arrays. The actual speedup depends on the algorithm, the compiler, and the input data. + +### SIMD intrinsics + +For maximum performance, you can use SIMD intrinsics directly. This gives full control over which vector instructions execute, at the cost of writing more complex code. + +The `wasm32` platform exposes SIMD intrinsics through the `core::arch::wasm32` module (available when `simd128` is enabled). 
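For instance, here is a minimal sketch of direct intrinsic use. The `add4` helper is illustrative, not an ICP or CDK API; the wasm path assumes `simd128` is enabled via `rustflags` as in the config above, and a scalar fallback keeps the same code compiling for non-Wasm targets:

```rust
// Sketch: add four f32 lanes with a single SIMD instruction.
// `add4` is an illustrative helper, not an ICP or CDK API.

#[cfg(target_arch = "wasm32")]
fn add4(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    // Assumes the crate is built with `-C target-feature=+simd128`.
    use core::arch::wasm32::{f32x4, f32x4_add, f32x4_extract_lane};
    // One f32x4_add processes all four lanes at once.
    let v = f32x4_add(f32x4(a[0], a[1], a[2], a[3]), f32x4(b[0], b[1], b[2], b[3]));
    [
        f32x4_extract_lane::<0>(v),
        f32x4_extract_lane::<1>(v),
        f32x4_extract_lane::<2>(v),
        f32x4_extract_lane::<3>(v),
    ]
}

// Scalar fallback so the same code compiles for native targets (and tests).
#[cfg(not(target_arch = "wasm32"))]
fn add4(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

fn main() {
    let sum = add4(&[1.0, 2.0, 3.0, 4.0], &[10.0, 20.0, 30.0, 40.0]);
    println!("{:?}", sum); // [11.0, 22.0, 33.0, 44.0]
}
```

In real code you would keep data in contiguous buffers and loop over them four lanes at a time, rather than constructing vectors element by element.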
For a complete working example comparing naive, optimized, auto-vectorized, and SIMD intrinsic implementations of matrix multiplication, see the [WebAssembly SIMD example](https://github.com/dfinity/examples/tree/master/rust/simd) in the examples repository. + +### Measuring SIMD performance + +Use the `ic0.performance_counter` system API to count Wasm instructions before and after a computation: + +```rust +#[ic_cdk::query] +fn benchmark_operation() -> u64 { + let before = ic_cdk::api::instruction_counter(); + // ... your computation ... + ic_cdk::api::instruction_counter() - before +} +``` + +Compare instruction counts with and without SIMD to measure the speedup. Lower instruction counts mean lower cycle costs and faster execution. The [`canbench`](https://github.com/dfinity/canbench) framework provides a more structured benchmarking workflow for tracking performance over time. + +## Troubleshooting + +**"Wasm module too large" error during install** — The module exceeds 2 MiB. Verify that icp-cli is up to date (automatic chunk store support was added in v0.2.x). If using a manual install flow, switch to the `install_chunked_code` management canister API. + +**"Wasm chunk store error" during install** — The canister may lack sufficient cycles to store chunks (each 1 MiB chunk incurs a storage cost). Top up the canister's cycles balance before retrying. If chunks from a previous failed attempt are occupying the store, call `clear_chunk_store` first. + +**Wasm64 build fails with missing target** — The `nightly` toolchain and `rust-src` component must both be installed. Run: + +```bash +rustup toolchain install nightly +rustup component add rust-src --toolchain nightly +``` + +**SIMD instructions have no measurable effect** — Some loops cannot be auto-vectorized. Check that the loop body is tight, operates on a contiguous slice, and does not contain branches or function calls that prevent vectorization. 
Profile with `ic_cdk::api::instruction_counter` to confirm the function is a bottleneck before investing in SIMD intrinsics. + +## Next steps + +- [Canister optimization](optimization.md) — reduce Wasm size before reaching for the chunk store +- [Execution errors reference](../../reference/execution-errors.md) — Wasm size and chunk store error codes +- [Canister lifecycle](lifecycle.md) — deployment modes and install options + + From 91a9036a50895e1ea13932d04807500e48066963 Mon Sep 17 00:00:00 2001 From: Marco Walz Date: Thu, 16 Apr 2026 20:47:14 +0200 Subject: [PATCH 2/2] docs(large-wasm): address review feedback on shrink description and accuracy - Fix shrink: true description to note --keep-name-section preserves function names - Soften unsourced 50-70% compression estimate to "significantly" - Clarify 6 GiB is an ICP platform limit, not a Wasm64 architectural limit - Add TODO flag for unverifiable automatic chunking behavior claim - Add management-canister reference link for CHUNK_STORE_SIZE --- docs/guides/canister-management/large-wasm.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/guides/canister-management/large-wasm.md b/docs/guides/canister-management/large-wasm.md index 6651fb0c..93326576 100644 --- a/docs/guides/canister-management/large-wasm.md +++ b/docs/guides/canister-management/large-wasm.md @@ -24,7 +24,7 @@ Before reaching for the chunk store, consider whether [canister optimization](op ICP's management canister understands gzip-compressed Wasm modules. When the `wasm_module` field of `install_code` starts with the gzip magic bytes `[0x1f, 0x8b, 0x08]`, the system decompresses it automatically before installation. -Gzip compression typically reduces Wasm binary size by 50-70%, which is often enough to bring a large module under the 2 MiB threshold. +Gzip compression typically reduces Wasm binary size significantly, which is often enough to bring a large module under the 2 MiB threshold. 
### Using a recipe @@ -41,7 +41,7 @@ canisters: compress: true ``` -Setting `shrink: true` first removes unused functions and debug metadata, then `compress: true` gzip-compresses the result. Using both together gives the largest size reduction. +Setting `shrink: true` first removes unused functions and debug info while preserving function names for readable backtraces, then `compress: true` gzip-compresses the result. Using both together gives the largest size reduction. ### Using a custom build script @@ -72,11 +72,11 @@ When compression alone is not enough, the Wasm chunk store lets you upload modul 1. **Upload chunks** — Call `upload_chunk` on the management canister to store up to 1 MiB chunks in the target canister's chunk store. Each call returns the SHA-256 hash of the stored chunk. 2. **Assemble and install** — Call `install_chunked_code` with the ordered list of chunk hashes. The system concatenates the chunks, verifies the aggregate hash matches `wasm_module_hash`, and installs the result as if you had called `install_code` directly. -The chunk store is bounded: each chunk is at most 1 MiB, and there is a maximum number of chunks per store (`CHUNK_STORE_SIZE` in the IC interface spec). You can inspect stored chunks with `stored_chunks` and clear the store with `clear_chunk_store`. +The chunk store is bounded: each chunk is at most 1 MiB, and there is a maximum number of chunks per store (`CHUNK_STORE_SIZE`, defined in the IC interface spec — see the [management canister reference](../../reference/management-canister.md) for the exact value). You can inspect stored chunks with `stored_chunks` and clear the store with `clear_chunk_store`. ### icp-cli handles this automatically -When you run `icp deploy` or `icp canister install` with a Wasm module larger than 2 MiB, icp-cli automatically uses the chunk store — no configuration required. The tool splits the module, uploads each chunk, and calls `install_chunked_code` behind the scenes. 
+When you run `icp deploy` or `icp canister install` with a Wasm module larger than 2 MiB, icp-cli automatically uses the chunk store — no configuration required. The tool splits the module, uploads each chunk, and calls `install_chunked_code` behind the scenes. <!-- TODO: verify this automatic chunking behavior against the icp-cli implementation -->
 
 ```bash
 icp deploy
 ```
 
 ### Combining compression with the chunk store
@@ -92,7 +92,7 @@ Storing each chunk costs cycles proportional to 1 MiB of storage (even if the ch
 
 ## Wasm64: 64-bit memory addressing
 
-Standard ICP canisters use the `wasm32-unknown-unknown` target, which limits addressable memory to 4 GiB. For canisters that need more — for example, those holding large in-memory datasets or running inference on large models — ICP supports the `wasm64-unknown-unknown` target with up to 6 GiB of addressable memory.
+Standard ICP canisters use the `wasm32-unknown-unknown` target, which limits addressable memory to 4 GiB. For canisters that need more — for example, those holding large in-memory datasets or running inference on large models — ICP supports the `wasm64-unknown-unknown` target with up to 6 GiB of addressable heap memory (an ICP platform limit).
 
 Wasm64 is a separate concern from the chunk store. You might use one, the other, or both: the chunk store addresses the 2 MiB upload limit, while Wasm64 addresses the runtime memory limit.