[Bug]: --enforce-eager reduces performance in -tp >1 significantly compared to vLLM #1114

@markouustalu

🐛 Describe the bug

Ran on 1x or 2x RTX 3060 12GB. The prompt was a single one-sentence coding instruction asking for a sample program.

vLLM also shows a speed reduction with --enforce-eager in -tp 2 mode, but there it is comparable to the -tp 1 reduction, i.e. in the single-digit percent range.

I would like to use -q FP6 quantization, which for now enforces eager mode according to #1087, and eager mode slashes performance in -tp 2 for unknown reasons.

tp 2 + eager
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --enforce-eager --served-model-name model -tp 2
will result in
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.

tp 2 + no eager
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model -tp 2
will result in
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.

tp 1 + no eager
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model
will result in
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.

tp 1 + eager
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model --enforce-eager
will result in
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
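
To put a number on the gap, here is a minimal sketch that pulls the generation throughput out of the four log lines above and computes how much slower eager mode is at each -tp setting. The dictionary keys and helper names are made up for illustration and are not part of Aphrodite; the regex only assumes the log format shown in this report.

```python
import re

# Generation-throughput fragments copied from the four log lines above.
# The (tp, mode) keys are illustrative labels, not Aphrodite options.
RUNS = {
    ("tp1", "no-eager"): "Avg generation throughput: 52.4 tokens/s",
    ("tp1", "eager"): "Avg generation throughput: 49.4 tokens/s",
    ("tp2", "no-eager"): "Avg generation throughput: 57.8 tokens/s",
    ("tp2", "eager"): "Avg generation throughput: 31.4 tokens/s",
}

PATTERN = re.compile(r"Avg generation throughput: ([\d.]+) tokens/s")


def generation_throughput(log_line: str) -> float:
    """Extract the generation throughput (tokens/s) from one log line."""
    match = PATTERN.search(log_line)
    if match is None:
        raise ValueError(f"no generation throughput in: {log_line!r}")
    return float(match.group(1))


def eager_slowdown_percent(tp: str) -> float:
    """Relative throughput loss of eager mode versus non-eager mode, in percent."""
    base = generation_throughput(RUNS[(tp, "no-eager")])
    eager = generation_throughput(RUNS[(tp, "eager")])
    return 100.0 * (base - eager) / base


for tp in ("tp1", "tp2"):
    print(f"{tp}: eager is {eager_slowdown_percent(tp):.1f}% slower")
# tp1: eager is 5.7% slower
# tp2: eager is 45.7% slower
```

So on this setup eager mode costs about 6% with a single GPU but about 46% with -tp 2.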

Your current environment

<details>
<summary>The output of `python env.py`</summary>

```text

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3060
GPU 1: NVIDIA GeForce RTX 3060
GPU 2: NVIDIA GeForce RTX 3060
GPU 3: NVIDIA GeForce RTX 3060
GPU 4: NVIDIA GeForce RTX 3060
GPU 5: NVIDIA GeForce RTX 3060

Nvidia driver version: 565.57.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 2600X Six-Core Processor
CPU family: 23
Model: 8
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 73%
CPU max MHz: 3600.0000
CPU min MHz: 2200.0000
BogoMIPS: 7199.59
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization: AMD-V
L1d cache: 192 KiB (6 instances)
L1i cache: 384 KiB (6 instances)
L2 cache: 3 MiB (6 instances)
L3 cache: 16 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: 0.6.5
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PIX     PIX     PHB     PHB     0-11            0               N/A
GPU1    PIX      X      PIX     PIX     PHB     PHB     0-11            0               N/A
GPU2    PIX     PIX      X      PIX     PHB     PHB     0-11            0               N/A
GPU3    PIX     PIX     PIX      X      PHB     PHB     0-11            0               N/A
GPU4    PHB     PHB     PHB     PHB      X      PHB     0-11            0               N/A
GPU5    PHB     PHB     PHB     PHB     PHB      X      0-11            0               N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

```

</details>
