[Bug]: --enforce-eager reduces performance in -tp >1 significantly compared to vLLM


### 🐛 Describe the bug

Tested on 1x or 2x RTX 3060 12GB. The prompt was a single one-sentence coding instruction for a sample program.

vLLM also shows a speed reduction in -tp 2 mode, but it is comparable to the -tp 1 reduction, i.e. in the single-digit percent range.

I would like to use -q FP6 quantization, which currently forces eager mode according to https://github.com/aphrodite-engine/aphrodite-engine/issues/1087, and eager mode slashes performance for unknown reasons.

tp 2 + eager
`aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --enforce-eager --served-model-name model -tp 2`
will result in
`INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.`

tp 2 + no eager
`aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model -tp 2`
will result in
`INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.`

tp 1 + no eager
`aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model`
will result in
`INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.`

tp 1 + eager
`aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model --enforce-eager`
will result in
`INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.`
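
To put the four runs side by side, here is a small sketch that computes the relative slowdown caused by `--enforce-eager` for each tensor-parallel setting. The throughput values are copied from the log lines above; the function and dictionary names are hypothetical helpers for illustration and are not part of Aphrodite or vLLM.

```python
import re

# Generation throughput in tokens/s, copied from the four log lines in this report.
# Keys are (tensor-parallel setting, mode); "default" means --enforce-eager was not passed.
THROUGHPUT = {
    ("tp1", "default"): 52.4,
    ("tp1", "eager"): 49.4,
    ("tp2", "default"): 57.8,
    ("tp2", "eager"): 31.4,
}


def parse_generation_throughput(log_line: str) -> float:
    """Extract the 'Avg generation throughput' value from one metrics log line."""
    match = re.search(r"Avg generation throughput: ([\d.]+) tokens/s", log_line)
    if match is None:
        raise ValueError("no generation throughput found in log line")
    return float(match.group(1))


def eager_slowdown_pct(tp: str) -> float:
    """Percentage of throughput lost when --enforce-eager is added."""
    base = THROUGHPUT[(tp, "default")]
    eager = THROUGHPUT[(tp, "eager")]
    return 100.0 * (base - eager) / base


if __name__ == "__main__":
    for tp in ("tp1", "tp2"):
        print(f"{tp}: {eager_slowdown_pct(tp):.1f}% slower with --enforce-eager")
```

With these numbers, eager mode costs roughly 6% at -tp 1 but roughly 46% at -tp 2, which is the gap this issue is about.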

### Your current environment

<details>
<summary>The output of `python env.py`</summary>
```text

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3060
GPU 1: NVIDIA GeForce RTX 3060
GPU 2: NVIDIA GeForce RTX 3060
GPU 3: NVIDIA GeForce RTX 3060
GPU 4: NVIDIA GeForce RTX 3060
GPU 5: NVIDIA GeForce RTX 3060

Nvidia driver version: 565.57.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               12
On-line CPU(s) list:                  0-11
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 5 2600X Six-Core Processor
CPU family:                           23
Model:                                8
Thread(s) per core:                   2
Core(s) per socket:                   6
Socket(s):                            1
Stepping:                             2
Frequency boost:                      enabled
CPU(s) scaling MHz:                   73%
CPU max MHz:                          3600.0000
CPU min MHz:                          2200.0000
BogoMIPS:                             7199.59
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization:                       AMD-V
L1d cache:                            192 KiB (6 instances)
L1i cache:                            384 KiB (6 instances)
L2 cache:                             3 MiB (6 instances)
L3 cache:                             16 MiB (2 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-11
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: 0.6.5
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PIX     PIX     PHB     PHB     0-11    0               N/A
GPU1    PIX      X      PIX     PIX     PHB     PHB     0-11    0               N/A
GPU2    PIX     PIX      X      PIX     PHB     PHB     0-11    0               N/A
GPU3    PIX     PIX     PIX      X      PHB     PHB     0-11    0               N/A
GPU4    PHB     PHB     PHB     PHB      X      PHB     0-11    0               N/A
GPU5    PHB     PHB     PHB     PHB     PHB      X      0-11    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

```
</details>
