A conjoint experiment measuring whether open-source LLMs exhibit price sensitivity and coherent economic preferences when choosing between hotel rooms.
LLMs have dramatically different economic "personalities." Same task, same prompt --- very different price-quality tradeoffs.
Qwen3 4B is an aggressive price optimizer with near-constant elasticity --- a doubling of price always hurts the same amount, whether from $50 to $100 or $2k to $4k. Textbook log-price consumer.
Gemma3 4B shows a threshold effect: it's insensitive to price below ~$300/night, then drops sharply. A $50 room and a $200 room are treated as essentially equivalent.
Llama 3.2 3B sits in between, with moderate sensitivity kicking in around $200.
AI agents are starting to book travel, compare products, and make purchases on behalf of users. Even with guardrails ("stay under $300, at least 3 stars"), the model's default preferences determine the marginal choice among qualifying options.
We benchmark LLMs on intelligence (MMLU, reasoning, code). We don't benchmark them on what they choose when there's no right answer --- just tradeoffs. That's exactly the situation a shopping agent faces.
3,600 binary hotel room comparisons, each varying four attributes:
| Attribute | Range | Distribution |
|---|---|---|
| Price | $20 -- $10,000/night | Log-uniform |
| Star rating | 2, 3, 4, 5 | Uniform |
| Room size | 150 -- 800 sq ft | Uniform (25 sq ft steps) |
| Bed type | single, double, queen, king | Uniform |
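For illustration, profiles matching these distributions can be drawn as follows. This is a sketch, not the repo's code; the actual logic lives in `generate_conjoint_tasks.py`.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_room(rng):
    """Draw one hotel room profile matching the design table above."""
    # Log-uniform price: uniform on log($20)..log($10,000), then exponentiate,
    # so each price doubling is equally likely across the whole range.
    price = float(np.exp(rng.uniform(np.log(20), np.log(10_000))))
    stars = int(rng.choice([2, 3, 4, 5]))
    sqft = int(rng.choice(np.arange(150, 801, 25)))  # 150-800 in 25 sq ft steps
    bed = str(rng.choice(["single", "double", "queen", "king"]))
    return {"price": round(price, 2), "stars": stars, "sqft": sqft, "bed": bed}
```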
Every task is scored twice per model (original + swapped option order) to cancel position bias, yielding 7,200 observations per model.
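Pooling the two runs requires flipping the choice label in the swapped run so that both orderings refer to the same underlying profile. A minimal sketch, with assumed file and column names (the repo's output schema may differ):

```python
import pandas as pd

# Hypothetical files and columns: task_id, choice in {"A", "B"}.
orig = pd.read_csv("results_original.csv").assign(swapped=0)
swap = pd.read_csv("results_swapped.csv").assign(swapped=1)
runs = pd.concat([orig, swap], ignore_index=True)

# In the swapped run the profiles were shown in reverse order, so flip the
# label: "chose_first" means "chose the original Option A profile" under
# either ordering, which cancels pure position bias on average.
runs["chose_first"] = ((runs["choice"] == "A") ^ (runs["swapped"] == 1)).astype(int)
```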
The utility function is estimated via binary logit. We use non-parametric price deciles --- instead of assuming a functional form (log or linear), we assign prices to decile bins and let each bin's coefficient speak for itself. This reveals the shape of the price response without imposing it.
We also estimate parametric specifications (log-price and linear-price) for comparison.
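A minimal sketch of the decile specification with `statsmodels` (column names are assumptions; see `estimation/nonparametric.py` for the actual implementation):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per observation with columns price_a, price_b,
# and chose_a in {0, 1}; the repo's schema may differ.
df = pd.read_csv("scored_tasks.csv")

# Pool both options' prices, cut into deciles, and build dummy columns.
prices = pd.concat([df["price_a"], df["price_b"]], ignore_index=True)
dums = pd.get_dummies(pd.qcut(prices, 10, labels=False)).astype(float)

# The regressor is the A-minus-B difference in decile membership; drop the
# first decile as the base category (the ten difference columns sum to zero).
diff = dums.iloc[: len(df)].values - dums.iloc[len(df):].values
X = sm.add_constant(diff[:, 1:])

# Binary logit of "chose A" on the decile differences. The full model would
# also include star, size, and bed difference terms, omitted for brevity.
fit = sm.Logit(df["chose_a"].values, X).fit(disp=False)
print(fit.params)  # one coefficient per price decile, relative to decile 1
```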
All models run locally via Ollama with deterministic settings (temperature=0, top_k=1, seed=42).
| Model | Params | Quantization | Developer |
|---|---|---|---|
| Qwen3 4B | 4B | Q8_0 | Alibaba |
| Gemma3 4B | 4B | Default | Google DeepMind |
| Llama 3.2 3B | 3B | Default | Meta |
| Mistral 7B | 7B | FP16 | Mistral AI |
| Qwen3 4B (Q4) | 4B | Q4 | Alibaba |
You are booking a hotel room in New York, New York for a one-night stay.
You must choose between the following two options.
Option A:
- Star rating: {a_stars} stars
- Room size: {a_sqft} square feet
- Bed type: {a_bed}
- En-suite bathroom: Yes
- Price per night: ${a_price}
Option B:
- Star rating: {b_stars} stars
- Room size: {b_sqft} square feet
- Bed type: {b_bed}
- En-suite bathroom: Yes
- Price per night: ${b_price}
Which option do you choose? Reply with only the letter A or B.
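Scoring a single task with the deterministic settings above might look like this using the `ollama` Python client. This is a sketch (the real loop lives in `run_conjoint_llm.py`, and the prompt path below is hypothetical):

```python
import ollama

# Hypothetical path to one filled-in copy of the template above.
prompt = open("task_0001_prompt.txt").read()

# temperature=0 / top_k=1 / seed=42 pin decoding so reruns are reproducible.
response = ollama.generate(
    model="gemma3:latest",
    prompt=prompt,
    options={"temperature": 0, "top_k": 1, "seed": 42},
)
answer = response["response"].strip().upper()[:1]  # expect "A" or "B"
```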
Prerequisites: Python (numpy, pandas, statsmodels, matplotlib), Ollama
# 1. Generate tasks (or use existing conjoint_tasks.csv)
python generate_conjoint_tasks.py
# 2. Score a model (original + swap)
python run_conjoint_llm.py --model gemma3:latest
python run_conjoint_llm.py --model gemma3:latest --swap
# 3. Register in config.py, then estimate and plot
python estimation/parametric.py
python estimation/nonparametric.py
python plots/decile_coefficients.py
python plots/parametric_vs_nonparametric.py
python plots/by_star_tier.py
python plots/by_room_size.py

All scripts auto-adapt to the number of models registered in config.py.
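Registering a model in `config.py` might look something like this (a hypothetical structure; check the actual file for the real registry format):

```python
# config.py (illustrative excerpt, not the repo's actual contents)
MODELS = {
    "gemma3:latest": {"label": "Gemma3 4B",    "params": "4B"},
    "qwen3:4b":      {"label": "Qwen3 4B",     "params": "4B"},
    "llama3.2:3b":   {"label": "Llama 3.2 3B", "params": "3B"},
}
```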
config.py # Model registry & shared helpers
generate_conjoint_tasks.py # Creates conjoint_tasks.csv
run_conjoint_llm.py # Scores tasks via Ollama
estimation/
parametric.py # Log-price & linear-price logit
nonparametric.py # Price decile dummies logit
plots/
decile_coefficients.py # Price response curves (all models)
parametric_vs_nonparametric.py # Parametric vs non-parametric overlay
by_star_tier.py # Split by 2-3 vs 4-5 star hotels
by_room_size.py # Split by small vs large rooms
hero.py # Polished dark-theme plot (3 models)
figures/ # Committed output plots
The plots in `figures/` include all evaluated models. Mistral 7B exhibits zero sensitivity to any attribute: it picks "Option A" 99.9% of the time regardless of content (pure position bias). The Q4 quantization of Qwen3 shows substantially attenuated price sensitivity compared to Q8, with the curve plateauing at higher prices.