feat: add kernel profiler with scalene integration #47

Open
vgene wants to merge 1 commit into main from feat/scalene

@vgene vgene commented Mar 27, 2026

Summary

  • Adds nkipy.tools.profiler — kernel-level profiler correlating NeuronCore device traces with Python source lines
  • Attributes time separately to NRT (host submit, ~165 μs/call) and NC (device compute, ~317 μs/call), and reports the NC/NRT ratio to distinguish sync from async execution
  • Device utilization analysis with idle gap breakdown across 5 categories (D2H/H2D transfers, memory mgmt, exec overhead, sync waits)
  • Scalene integration: CPU-device overlap ("Unused Device %"), combined CPU+device visualization via scalene GUI
  • Distributed support: scalene runs on rank 0 via torchrun + redirect_python; multi-rank runs use a kernel-only merge
  • Profiling integrated into each model's test.sh --profile
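
To make the device utilization bullet concrete, here is a minimal sketch of how utilization and idle gaps could be derived from device-busy intervals. It is not the profiler's actual code: the `(start_us, end_us)` tuple representation and the names `idle_gaps` and `utilization` are assumptions for illustration, and it does not attempt to classify gaps into the five categories, since that depends on trace events not described here.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (start_us, end_us) of one device-busy span.
Interval = Tuple[float, float]


def idle_gaps(busy: Iterable[Interval], window_start: float,
              window_end: float) -> List[Interval]:
    """Return the spans inside the window where no busy interval is active."""
    gaps: List[Interval] = []
    cursor = window_start  # everything before the cursor is already accounted for
    for start, end in sorted(busy):
        # Clip each busy span to the profiling window.
        start, end = max(start, window_start), min(end, window_end)
        if end <= start:
            continue  # span lies entirely outside the window
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # overlapping spans only advance the cursor once
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


def utilization(busy: Iterable[Interval], window_start: float,
                window_end: float) -> float:
    """Fraction of the window during which the device was busy."""
    idle = sum(end - start for start, end in idle_gaps(busy, window_start, window_end))
    return 1.0 - idle / (window_end - window_start)
```

For example, busy spans `(10, 40)`, `(30, 60)`, and `(80, 100)` in a 0 to 100 μs window leave idle gaps `(0, 10)` and `(60, 80)`, for a utilization of 0.7.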

Usage

# Qwen3-30B distributed (TP=4)
cd examples/models/qwen3 && bash test.sh --profile

# Qwen3-Embedding single process
cd examples/models/qwen3_embedding && bash test.sh --profile

# Merge CLI
python -m nkipy.tools.profiler scalene.json kernel_profile.json merged.json
python -m nkipy.tools.profiler --kernel-only kernel_profile.json merged.json

Known issues

LNC>1 double-counts parallel NC time

With lnc=2, each SpikeModel.__call__ triggers 2 nc_exec_running events (one per logical NeuronCore) that execute in parallel. The profiler currently sums all event durations, reporting core-time instead of wall-time. This inflates total NC time by ~2x and makes utilization % incorrect. Affects Qwen3-Embedding (LNC=2); Qwen3-30B TP=4 (LNC=1 per rank) is unaffected.

Fix: merge overlapping NC host intervals so that parallel cores count once, and group each run of LNC consecutive NC events into a single kernel call for per-line attribution.
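
The proposed fix amounts to a standard interval union. The sketch below shows one way it might look; it is not part of this PR, and the `(start_us, end_us)` tuples and the names `merge_intervals`, `wall_time`, and `group_by_call` are hypothetical. `group_by_call` additionally assumes the events arrive ordered so that each kernel call contributes exactly LNC consecutive events.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical representation: (start_us, end_us) of one nc_exec_running event.
Interval = Tuple[float, float]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Union overlapping intervals so parallel cores are counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def wall_time(intervals: Iterable[Interval]) -> float:
    """Wall-clock NC time: total length of the merged intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def group_by_call(events: Sequence[Interval], lnc: int) -> List[List[Interval]]:
    """Group each run of `lnc` consecutive events into one kernel call."""
    return [list(events[i:i + lnc]) for i in range(0, len(events), lnc)]
```

With lnc=2 and events `(0, 100)`, `(5, 105)`, `(200, 300)`, `(202, 298)`, summing durations gives 396 μs of core-time, while `wall_time` gives 205 μs, and `group_by_call` yields two kernel calls.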

Scalene signal overhead inflates NRT durations

When scalene CPU profiling is active, its signal-based sampling (SIGVTALRM/SIGPROF) adds overhead to nrt_execute host calls. This shifts the NC/NRT ratio from ~1.9 (async, the true hardware behavior) to ~0.6 (which looks sync). NC time is unaffected; only host-side NRT is inflated.

Workaround: use --no-scalene for accurate NRT measurements, or compare NRT with/without scalene to quantify the overhead.
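
As a rough illustration of the comparison suggested above, the following sketch computes the NC/NRT ratio and the per-call NRT overhead from two runs. The function names are hypothetical and not part of the profiler. The 165 μs and 317 μs figures come from the summary; the 528 μs with-scalene NRT value is not a measurement, only the value implied by a ~0.6 ratio at unchanged NC time.

```python
def nc_nrt_ratio(nc_us: float, nrt_us: float) -> float:
    """Ratio of device compute time to host submit time per call."""
    if nrt_us <= 0:
        raise ValueError("NRT time must be positive")
    return nc_us / nrt_us


def scalene_nrt_overhead(nrt_with_us: float, nrt_without_us: float) -> float:
    """Per-call host-side overhead attributable to scalene sampling."""
    return nrt_with_us - nrt_without_us


# Figures from the summary: NC ~317 us/call, NRT ~165 us/call without scalene.
baseline = nc_nrt_ratio(317.0, 165.0)

# Illustrative only: an NRT value consistent with the ~0.6 ratio reported above.
with_scalene = nc_nrt_ratio(317.0, 528.0)
overhead_us = scalene_nrt_overhead(528.0, 165.0)
```

Because NC time is unaffected by scalene, any change in the ratio between the two runs can be attributed to the host-side NRT term alone.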

Test plan

  • 35 unit tests covering trace parsing, NRT/NC separation, merge, rank filtering, CPU overlap
  • Verified on hardware: Qwen3-30B TP=4 with scalene on rank 0
  • Verified on hardware: Qwen3-Embedding single process with scalene
  • Zero changes to spike/ — all profiling code lives in nkipy/tools/profiler/

…nalysis

Adds nkipy.tools.profiler — a kernel-level profiler that correlates
NeuronCore device traces with Python source lines, integrates with
scalene for combined CPU + device profiling, and outputs scalene-
compatible JSON for visualization.

Key capabilities:
- Per-line NRT (host submit) and NC (device compute) time attribution
- NC/NRT ratio to identify sync vs async execution patterns
- Device utilization analysis with idle gap breakdown (D2H/H2D
  transfers, memory mgmt, exec overhead, sync waits)
- CPU-device overlap analysis (populates scalene "Unused Device %")
- Per-kernel-name aggregation in scalene function profile table
- Distributed support: scalene on rank 0 via torchrun + redirect_python
- Kernel-only merge mode for device profiles without scalene

Profiling is integrated into each model's test.sh via --profile flag:
  bash examples/models/qwen3/test.sh --profile
  bash examples/models/qwen3_embedding/test.sh --profile
@vgene vgene requested a review from a team March 27, 2026 06:17