Issue with Current Probe Design
In the current design, estimate_kernel_peak_memory probes a kernel’s peak memory usage and returns peak_bytes:
peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)
kernel_bpt = peak_bytes // num_tokens # if needed
This derives a bytes-per-token (BPT) estimate by assuming memory scales linearly with the number of tokens.
Limitation
This assumption only holds for kernels whose memory footprint scales linearly with num_tokens.
However, for some kernels, scaling may be non-linear (e.g., quadratic).
As a result, the computed kernel_bpt can be inaccurate or misleading for such kernels.
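To make the limitation concrete, here is a minimal sketch of how extrapolating from a single probe behaves. The two memory models (`linear_peak`, `quadratic_peak`), the helper `extrapolate_from_probe`, and the byte constants are hypothetical stand-ins chosen for illustration. They are not the real probe or any real kernel.

```python
from typing import Callable


def linear_peak(num_tokens: int) -> int:
    # Hypothetical kernel whose peak memory grows linearly: 4096 bytes per token.
    return 4096 * num_tokens


def quadratic_peak(num_tokens: int) -> int:
    # Hypothetical kernel whose peak memory grows quadratically: 4 bytes per token pair.
    return 4 * num_tokens * num_tokens


def extrapolate_from_probe(
    peak_fn: Callable[[int], int], probe_tokens: int, target_tokens: int
) -> int:
    # Mirrors the current design: probe once, divide by num_tokens,
    # then assume that bytes-per-token value holds at any other size.
    peak_bytes = peak_fn(probe_tokens)
    kernel_bpt = peak_bytes // probe_tokens
    return kernel_bpt * target_tokens


if __name__ == "__main__":
    probe_tokens, target_tokens = 1024, 8192
    for name, fn in (("linear", linear_peak), ("quadratic", quadratic_peak)):
        predicted = extrapolate_from_probe(fn, probe_tokens, target_tokens)
        actual = fn(target_tokens)
        print(f"{name}: predicted={predicted} actual={actual}")
```

Under these assumed models, the linear kernel is predicted exactly, while the quadratic kernel is underestimated by a factor of `target_tokens / probe_tokens` (8x here), because the BPT value is only valid at the probed size.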
Summary
The current BPT estimation is:
- ✅ Reasonable for linear-scaling kernels
- ❌ Unreliable for kernels with non-linear or structured memory behavior