
Clarification Needed to Reproduce Section 5.2 (Multi‑Model Tenancy) Results #211

@MzHameed

Description


I am currently working to reproduce the Section 5.2 results from your paper “PyTorchSim: A Comprehensive, Fast, and Accurate NPU Simulation Framework”.
To fully understand the multi‑model tenancy behavior you describe, I am attempting to replicate the experiment involving ResNet‑18 (batch size 8) and BERT‑base/EncoderBlock (batch size 4) co‑located on a single NPU.

However, several details are unclear from both the paper and the repository, and I am unable to reproduce the reported latency and DRAM bandwidth results (ResNet‑18: 188→165 GB/s, BERT‑base: 303→424 GB/s). I would greatly appreciate clarification on the following points:

  1. Exact NPU Configuration Used in Section 5.2
    The paper states that both models run on a “single NPU,” but the exact JSON config is not specified.
    Could you confirm which configuration file was used (`systolic_ws_128x128_c1_simple_noc_tpuv3.json`, `c1_simple_half`, `c2_simple`, or `c2_simple_partition`) and whether any modifications were applied?

  2. Interpretation of Batch Sizes
    Section 5.2 uses:

    - ResNet‑18 → batch size 8
    - BERT‑base → batch size 4

    Should these be interpreted as:

    - a single request with batch‑8 or batch‑4 tensors, or
    - multiple independent requests scheduled concurrently?

    This distinction significantly affects scheduling and DRAM behavior.
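To make the two readings concrete, here is a minimal sketch of what I mean; the request/dict structure and shapes are purely illustrative on my part, not PyTorchSim's actual API:

```python
# Two possible readings of "ResNet-18, batch size 8"; the request
# structure below is illustrative only, not the framework's real API.

# (a) a single request carrying one batch-8 input tensor
single_request = [{"model": "resnet18", "input_shape": (8, 3, 224, 224)}]

# (b) eight independent batch-1 requests scheduled concurrently
concurrent_requests = [
    {"model": "resnet18", "input_shape": (1, 3, 224, 224)} for _ in range(8)
]

def total_images(requests):
    """Total images described by a list of requests."""
    return sum(r["input_shape"][0] for r in requests)

# Both describe the same total work, but (b) gives the scheduler eight
# independent units to interleave with the BERT tenant, which can change
# tiling, on-chip reuse, and therefore the observed DRAM bandwidth.
assert total_images(single_request) == total_images(concurrent_requests) == 8
```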

  3. DRAM Bandwidth Sharing Mechanism
    The paper reports specific bandwidth values for single‑tenant and co‑located runs.
    Was this behavior purely emergent from Ramulator2, or were additional bandwidth partitioning or scheduling constraints applied?

  4. Model Used for “BERT‑base”
    The repository provides an `EncoderBlock` implementation in `tests/test_transformer.py`.
    Did Section 5.2 use:

    - the full HuggingFace BERT model, or
    - the provided `EncoderBlock` module?

    This affects both compute intensity and DRAM usage.

  5. Capturing stdout Results
    Most simulation statistics (latency, DRAM bandwidth, utilization, etc.) are printed directly to stdout.
    Is there a recommended method to:

    - automatically log these values,
    - redirect them to a file, or
    - extract them programmatically?

    This is important for reproducibility and comparison with Figure 7b.
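For context, this is the kind of harness I have been experimenting with; the stat-line format (`Name: value unit`) is a guess on my part, since I could not find the exact output format documented in the repo:

```python
import re
import subprocess

# Hypothetical stat-line pattern ("DRAM BW: 303.2 GB/s"); the actual
# PyTorchSim stdout format may differ.
STAT_PATTERN = re.compile(r"^(?P<name>[\w\s]+?):\s*(?P<value>[\d.]+)\s*(?P<unit>\S*)$")

def parse_stats(text: str) -> dict:
    """Extract 'name: value unit' lines from simulator stdout."""
    stats = {}
    for line in text.splitlines():
        m = STAT_PATTERN.match(line.strip())
        if m:
            stats[m.group("name").strip()] = (float(m.group("value")), m.group("unit"))
    return stats

def run_and_log(cmd: list, log_path: str) -> dict:
    """Run the simulator, save its stdout to a file, and return parsed stats."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    with open(log_path, "w") as f:
        f.write(result.stdout)
    return parse_stats(result.stdout)
```

If there is an official stats file or a programmatic hook instead, that would obviously be preferable to scraping stdout.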

  6. Scripts or Pipeline Used for Section 5.2
    If possible, could you share:

    - the exact script(s) used to run the co‑location experiment, or
    - a minimal example showing how you generated the results in Figure 7b?

    Even a brief outline would be extremely helpful.

Thank you very much for your time and for developing such an impressive simulation framework. Any guidance you can provide will help ensure reproducibility of your results.
