feat: add kernel profiler with scalene integration by vgene · Pull Request #47 · aws-neuron/nkipy

vgene · 2026-03-27T06:17:15Z

Summary

Adds nkipy.tools.profiler — kernel-level profiler correlating NeuronCore device traces with Python source lines
Separate NRT (host submit, ~165μs/call) and NC (device compute, ~317μs/call) time attribution with NC/NRT ratio for identifying sync vs async execution
Device utilization analysis with idle gap breakdown across 5 categories (D2H transfers, H2D transfers, memory management, exec overhead, sync waits)
Scalene integration: CPU-device overlap ("Unused Device %"), combined CPU+device visualization via scalene GUI
Distributed support: scalene on rank 0 via torchrun + redirect_python, kernel-only merge for multi-rank
Profiling integrated into each model's test.sh --profile
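
As a rough illustration of how the NC/NRT ratio in the summary can be read, the sketch below computes it from the approximate per-call figures quoted above (~317μs NC, ~165μs NRT). The function names and the 1.0 threshold are assumptions made for this example; they are not taken from nkipy.tools.profiler.

```python
def nc_nrt_ratio(nc_us: float, nrt_us: float) -> float:
    """Ratio of device compute time (NC) to host submit time (NRT)."""
    if nrt_us <= 0:
        raise ValueError("NRT time must be positive")
    return nc_us / nrt_us


def classify(ratio: float, threshold: float = 1.0) -> str:
    # Assumption: a ratio above the threshold means the host call returned
    # before the device finished (async); at or below it, the host call
    # appears to cover the device work (sync). The 1.0 cut-off is illustrative.
    return "async" if ratio > threshold else "sync"


# Approximate per-call figures from the summary.
ratio = nc_nrt_ratio(nc_us=317.0, nrt_us=165.0)
print(f"NC/NRT = {ratio:.2f} ({classify(ratio)})")
```

With these inputs the ratio comes out near 1.9, which matches the async figure quoted under known issues.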

Usage

# Qwen3-30B distributed (TP=4)
cd examples/models/qwen3 && bash test.sh --profile

# Qwen3-Embedding single process
cd examples/models/qwen3_embedding && bash test.sh --profile

# Merge CLI
python -m nkipy.tools.profiler scalene.json kernel_profile.json merged.json
python -m nkipy.tools.profiler --kernel-only kernel_profile.json merged.json

Known issues

LNC>1 double-counts parallel NC time

With lnc=2, each SpikeModel.__call__ triggers 2 nc_exec_running events (one per logical NeuronCore) that execute in parallel. The profiler currently sums all event durations, reporting core-time instead of wall-time. This inflates total NC time by ~2x and makes utilization % incorrect. Affects Qwen3-Embedding (LNC=2); Qwen3-30B TP=4 (LNC=1 per rank) is unaffected.

Fix: merge overlapping NC host intervals so parallel cores count once, and group LNC consecutive NC events per kernel call for per-line attribution.
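
A minimal sketch of the interval-merging idea in the proposed fix, assuming each NC event is available as a (start, end) pair in microseconds. The helper names and the sample timestamps are hypothetical and only show the difference between core-time and wall-time; this is not the profiler's actual implementation.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_us, end_us)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so parallel cores are counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def core_time(intervals: List[Interval]) -> float:
    """Sum of all event durations (what the profiler reports today)."""
    return sum(end - start for start, end in intervals)


def wall_time(intervals: List[Interval]) -> float:
    """Total time during which at least one core was running."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Hypothetical lnc=2 trace: two kernel calls, two parallel events each.
events = [(0.0, 317.0), (5.0, 320.0), (400.0, 717.0), (404.0, 720.0)]
print(core_time(events), wall_time(events))
```

For this sample, core-time is roughly twice wall-time, which is the ~2x inflation described above.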

Scalene signal overhead inflates NRT durations

When scalene CPU profiling is active, its signal-based sampling (SIGVTALRM/SIGPROF) adds overhead to nrt_execute host calls. This shifts the NC/NRT ratio from ~1.9 (async, true hardware behavior) to ~0.6 (appears sync). NC time is unaffected — only host-side NRT is inflated.

Workaround: use --no-scalene for accurate NRT measurements, or compare NRT with/without scalene to quantify the overhead.
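
To get a feel for the size of the overhead, the sketch below back-calculates the NRT time implied by each ratio, assuming NC time stays at the ~317μs/call quoted in the summary. The function name is hypothetical, and the result is only a back-of-the-envelope estimate from the approximate figures in this description, not a measurement.

```python
def implied_nrt_us(nc_us: float, ratio: float) -> float:
    """Host-side NRT time implied by a given NC/NRT ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return nc_us / ratio


NC_US = 317.0  # approximate NC time per call; assumed unaffected by scalene

baseline_nrt = implied_nrt_us(NC_US, 1.9)  # without scalene sampling
inflated_nrt = implied_nrt_us(NC_US, 0.6)  # with scalene sampling
overhead = inflated_nrt - baseline_nrt

print(f"baseline NRT ~{baseline_nrt:.0f}us, "
      f"with scalene ~{inflated_nrt:.0f}us, "
      f"implied overhead ~{overhead:.0f}us per call")
```

Measuring NRT directly with and without scalene, as suggested in the workaround, remains the reliable way to quantify the overhead.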

Test plan

35 unit tests covering trace parsing, NRT/NC separation, merge, rank filtering, CPU overlap
Verified on hardware: Qwen3-30B TP=4 with scalene on rank 0
Verified on hardware: Qwen3-Embedding single process with scalene
Zero changes to spike/ — all profiling code lives in nkipy/tools/profiler/

…nalysis

Adds nkipy.tools.profiler, a kernel-level profiler that correlates NeuronCore device traces with Python source lines, integrates with scalene for combined CPU + device profiling, and outputs scalene-compatible JSON for visualization.

Key capabilities:
- Per-line NRT (host submit) and NC (device compute) time attribution
- NC/NRT ratio to identify sync vs async execution patterns
- Device utilization analysis with idle gap breakdown (D2H/H2D transfers, memory mgmt, exec overhead, sync waits)
- CPU-device overlap analysis (populates scalene "Unused Device %")
- Per-kernel-name aggregation in scalene function profile table
- Distributed support: scalene on rank 0 via torchrun + redirect_python
- Kernel-only merge mode for device profiles without scalene

Profiling is integrated into each model's test.sh via the --profile flag:

bash examples/models/qwen3/test.sh --profile
bash examples/models/qwen3_embedding/test.sh --profile

vgene requested a review from a team March 27, 2026 06:17

vgene wants to merge 1 commit into main from feat/scalene
