Description
I am currently working to reproduce the Section 5.2 results from your paper “PyTorchSim: A Comprehensive, Fast, and Accurate NPU Simulation Framework”.
To fully understand the multi‑model tenancy behavior you describe, I am attempting to replicate the experiment involving ResNet‑18 (batch size 8) and BERT‑base/EncoderBlock (batch size 4) co‑located on a single NPU.
However, several details are unclear from both the paper and the repository, and I am unable to reproduce the reported latency and DRAM bandwidth results (ResNet‑18: 188→165 GB/s, BERT‑base: 303→424 GB/s). I would greatly appreciate clarification on the following points:
- Exact NPU Configuration Used in Section 5.2
The paper states that both models run on a “single NPU,” but the exact JSON config is not specified.
Could you confirm which configuration file was used (`systolic_ws_128x128_c1_simple_noc_tpuv3.json`, `c1_simple_half`, `c2_simple`, or `c2_simple_partition`) and whether any modifications were applied?
- Interpretation of Batch Sizes
Section 5.2 uses:
ResNet‑18 → batch size 8
BERT‑base → batch size 4
Should these be interpreted as:
(a) a single request with batch‑8 or batch‑4 tensors, or
(b) multiple independent requests scheduled concurrently?
This distinction significantly affects scheduling and DRAM behavior.
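For concreteness, here is a minimal sketch of the two interpretations as I understand them, written in plain PyTorch (the input shapes are my assumptions, not taken from the repository or the paper):

```python
import torch
from torchvision.models import resnet18

model = resnet18().eval()

# Interpretation (a): one request carrying a single batch-8 tensor
single_request = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    out = model(single_request)

# Interpretation (b): eight independent batch-1 requests scheduled concurrently
requests = [torch.randn(1, 3, 224, 224) for _ in range(8)]
with torch.no_grad():
    outs = [model(r) for r in requests]  # in the simulator these would be separate jobs
```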
- DRAM Bandwidth Sharing Mechanism
The paper reports specific bandwidth values for single‑tenant and co‑located runs.
Was this behavior purely emergent from Ramulator2, or were additional bandwidth partitioning or scheduling constraints applied?
- Model Used for “BERT‑base”
The repository provides an `EncoderBlock` implementation in `tests/test_transformer.py`.
Did Section 5.2 use:
the full HuggingFace BERT model, or
the provided `EncoderBlock` module?
This affects both compute intensity and DRAM usage.
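If it was the full HuggingFace model, I am currently assuming a setup roughly like the sketch below (the checkpoint name, sequence length, and dummy inputs are my guesses, not details from the paper):

```python
import torch
from transformers import BertModel

# Assumed setup: bert-base-uncased, batch size 4, sequence length 128
model = BertModel.from_pretrained("bert-base-uncased").eval()
input_ids = torch.randint(0, model.config.vocab_size, (4, 128))
attention_mask = torch.ones(4, 128, dtype=torch.long)
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```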
- Capturing stdout Results
Most simulation statistics (latency, DRAM BW, utilization, etc.) are printed directly to stdout.
Is there a recommended method to:
automatically log these values,
redirect them to a file, or
extract them programmatically?
This is important for reproducibility and comparison with Figure 7b.
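In the absence of a built-in logging option, I have been considering a wrapper along the lines below; the command line and the stat-line pattern it greps for are placeholders, since I do not know the exact output format:

```python
import re
import subprocess

# Hypothetical wrapper: run the simulator, save stdout to a log file,
# and pull out lines that look like "<stat name>: <numeric value>".
cmd = ["python", "run_simulation.py"]  # placeholder command, not the actual entry point
result = subprocess.run(cmd, capture_output=True, text=True)

with open("sim_stdout.log", "w") as f:
    f.write(result.stdout)

stats = {}
for line in result.stdout.splitlines():
    m = re.match(r"\s*([\w .]+):\s*([\d.]+)", line)
    if m:
        stats[m.group(1).strip()] = float(m.group(2))

print(stats)
```

A pointer to the intended way of collecting these statistics would make this workaround unnecessary.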
- Scripts or Pipeline Used for Section 5.2
If possible, could you share:
the exact script(s) used to run the co‑location experiment, or
a minimal example showing how you generated the results in Figure 7b?
Even a brief outline would be extremely helpful.
Thank you very much for your time and for developing such an impressive simulation framework. Any guidance you can provide will help ensure reproducibility of your results.