
Are LLMs Price Sensitive?

A conjoint experiment measuring whether open-source LLMs exhibit price sensitivity and coherent economic preferences when choosing between hotel rooms.

Key Finding

LLMs have dramatically different economic "personalities." Same task, same prompt --- very different price-quality tradeoffs.

(Figure: price sensitivity by model)

Qwen3 4B is an aggressive price optimizer with near-constant elasticity: doubling the price hurts by roughly the same amount, whether from $50 to $100 or from $2,000 to $4,000 --- a textbook log-price consumer.

Gemma3 4B shows a threshold effect: it's insensitive to price below ~$300/night, then drops sharply. A $50 room and a $200 room are treated as essentially equivalent.

Llama 3.2 3B sits in between, with moderate sensitivity kicking in around $200.

Why It Matters

AI agents are starting to book travel, compare products, and make purchases on behalf of users. Even with guardrails ("stay under $300, at least 3 stars"), the model's default preferences determine the marginal choice among qualifying options.

We benchmark LLMs on intelligence (MMLU, reasoning, code). We don't benchmark them on what they choose when there's no right answer --- just tradeoffs. That's exactly the situation a shopping agent faces.

Method

Conjoint Design

3,600 binary hotel room comparisons, each varying four attributes:

| Attribute | Range | Distribution |
|---|---|---|
| Price | $20 -- $10,000/night | Log-uniform |
| Star rating | 2, 3, 4, 5 | Uniform |
| Room size | 150 -- 800 sq ft | Uniform (25 sq ft steps) |
| Bed type | single, double, queen, king | Uniform |

Every task is scored twice per model (original + swapped option order) to cancel position bias, yielding 7,200 observations per model.
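The design above can be sketched in a few lines; the column naming (`a_price`, `b_price`, ...) mirrors the prompt placeholders, but the repo's exact CSV schema is an assumption.

```python
# Sketch of the conjoint task generation per the design table above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 3600

def sample_option(n: int) -> pd.DataFrame:
    """Draw one option's attributes from the distributions in the table."""
    return pd.DataFrame({
        # Log-uniform price between $20 and $10,000 per night
        "price": np.round(np.exp(rng.uniform(np.log(20), np.log(10_000), n))).astype(int),
        "stars": rng.choice([2, 3, 4, 5], n),
        # 150-800 sq ft in 25 sq ft steps
        "sqft": rng.choice(np.arange(150, 801, 25), n),
        "bed": rng.choice(["single", "double", "queen", "king"], n),
    })

# One row per binary comparison; prefixes match the prompt placeholders
tasks = sample_option(N).add_prefix("a_").join(sample_option(N).add_prefix("b_"))
```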

Estimation

The utility function is estimated via binary logit. We use non-parametric price deciles --- instead of assuming a functional form (log or linear), we assign prices to decile bins and let each bin's coefficient speak for itself. This reveals the shape of the price response without imposing it.

We also estimate parametric specifications (log-price and linear-price) for comparison.

Models

All models run locally via Ollama with deterministic settings (temperature=0, top_k=1, seed=42).

| Model | Params | Quantization | Developer |
|---|---|---|---|
| Qwen3 4B | 4B | Q8_0 | Alibaba |
| Gemma3 4B | 4B | Default | Google DeepMind |
| Llama 3.2 3B | 3B | Default | Meta |
| Mistral 7B | 7B | FP16 | Mistral AI |
| Qwen3 4B (Q4) | 4B | Q4 | Alibaba |

Prompt Template

```
You are booking a hotel room in New York, New York for a one-night stay.
You must choose between the following two options.

Option A:
- Star rating: {a_stars} stars
- Room size: {a_sqft} square feet
- Bed type: {a_bed}
- En-suite bathroom: Yes
- Price per night: ${a_price}

Option B:
- Star rating: {b_stars} stars
- Room size: {b_sqft} square feet
- Bed type: {b_bed}
- En-suite bathroom: Yes
- Price per night: ${b_price}

Which option do you choose? Reply with only the letter A or B.
```

Reproducing

Prerequisites: Python (numpy, pandas, statsmodels, matplotlib), Ollama

```shell
# 1. Generate tasks (or use existing conjoint_tasks.csv)
python generate_conjoint_tasks.py

# 2. Score a model (original + swap)
python run_conjoint_llm.py --model gemma3:latest
python run_conjoint_llm.py --model gemma3:latest --swap

# 3. Register in config.py, then estimate and plot
python estimation/parametric.py
python estimation/nonparametric.py
python plots/decile_coefficients.py
python plots/parametric_vs_nonparametric.py
python plots/by_star_tier.py
python plots/by_room_size.py
```

All scripts auto-adapt to the number of models registered in config.py.
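A registry that scripts can iterate over might look like the sketch below; the variable name, model tags, and fields are illustrative assumptions, not the repo's actual `config.py` contents.

```python
# Hypothetical shape of the model registry in config.py
MODELS = {
    "qwen3:4b":      {"label": "Qwen3 4B (Q8)"},
    "gemma3:latest": {"label": "Gemma3 4B"},
    "llama3.2:3b":   {"label": "Llama 3.2 3B"},
}

def registered_labels() -> list[str]:
    """What a plotting script would loop over to auto-adapt its panels."""
    return [m["label"] for m in MODELS.values()]
```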

Project Structure

```
config.py                            # Model registry & shared helpers
generate_conjoint_tasks.py           # Creates conjoint_tasks.csv
run_conjoint_llm.py                  # Scores tasks via Ollama
estimation/
  parametric.py                      # Log-price & linear-price logit
  nonparametric.py                   # Price decile dummies logit
plots/
  decile_coefficients.py             # Price response curves (all models)
  parametric_vs_nonparametric.py     # Parametric vs non-parametric overlay
  by_star_tier.py                    # Split by 2-3 vs 4-5 star hotels
  by_room_size.py                    # Split by small vs large rooms
  hero.py                            # Polished dark-theme plot (3 models)
figures/                             # Committed output plots
```

Full Results (All 5 Models)

The plots below include all evaluated models. Mistral 7B exhibits zero sensitivity to any attribute --- it picks "Option A" 99.9% of the time regardless of content (pure position bias). The Q4 quantization of Qwen3 shows substantially attenuated price sensitivity compared to Q8, with the curve plateauing at higher prices.
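A position-bias diagnosis like the Mistral one can be made mechanical: if a model picks the same *letter* in both the original and swapped runs of a task, it actually chose two different underlying options. A minimal sketch (the series names are assumptions):

```python
import pandas as pd

def position_bias_rate(orig: pd.Series, swapped: pd.Series) -> float:
    """Share of tasks where the model picks the same letter in both runs.

    A pure position-biased model (always 'A') scores ~1.0; a model with
    order-invariant option preferences scores ~0.0.
    """
    return float((orig == swapped).mean())
```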

Non-parametric price coefficients

(Figure: decile coefficients, all models)

Parametric vs. non-parametric fit

(Figure: parametric vs. non-parametric overlay)

By star tier (2-3 star vs. 4-5 star)

(Figure: price response split by star tier)

By room size (small vs. large)

(Figure: price response split by room size)
