
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

This project contains the code accompanying the paper The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs.

Link to Dataset Page

If you like our work, consider citing us!

@misc{sinha2025illusiondiminishingreturnsmeasuring,
      title={The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs}, 
      author={Akshit Sinha and Arvindh Arun and Shashwat Goel and Steffen Staab and Jonas Geiping},
      year={2025},
      eprint={2509.09677},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.09677}, 
}

Table of Contents

  • Setup
  • Configuration
  • Running Experiments
  • Experiments
  • Output and Logging
  • Development
  • Dependencies

Setup

Prerequisites

This project uses uv for dependency management; install uv (https://github.com/astral-sh/uv) before continuing.

Installation

  1. Clone the repository:
git clone <repository-url>
cd long-horizon-execution
  2. Install dependencies (uv will handle this automatically):
uv sync

Environment Variables

To use the OpenRouter API, set your API key:

export OPENROUTER_API_KEY=your_api_key_here
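The actual client wiring lives in llm_clients.py; as a rough sketch, an OpenRouter-backed client built on the openai package (listed under Dependencies) typically reads this variable like so. The model name and prompt below are placeholders:

import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the openai client
# works with a swapped base_url and the key exported above.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # any OpenRouter model identifier
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)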

Project Structure

├── main.py                    # Main entry point
├── experiment_runner.py       # Core experiment orchestration
├── llm_clients.py            # Unified LLM client implementations
├── generate_dataset_json.py   # Dataset generation utilities
├── utils.py                  # General utility functions
├── pyproject.toml            # Project dependencies and configuration
├── words_alpha.txt           # Word list for experiments
├── src/
│   ├── config.py             # Main configuration classes
│   ├── _config/              # Configuration modules
│   │   ├── experiment_config.py    # Experiment-specific configs
│   │   ├── model_config.py        # LLM model configurations
│   │   ├── openrouter_config.py   # OpenRouter API settings
│   │   └── wandb_config.py        # Weights & Biases integration
│   └── experiments/          # Experiment implementations
│       ├── base_experiment.py     # Base experiment class
│       └── dict_sum/              # Dictionary sum experiment
│           ├── exp.py             # Main experiment logic
│           └── dict_sum_util.py   # Utilities and evaluators

Configuration

The project uses a hierarchical configuration system with the following main components; illustrative sketches follow the lists below.

Model Configuration

  • provider: LLM provider ("openrouter", "vllm", etc.)
  • name: Model name/identifier
  • multi_turn: Enable multi-turn conversation mode
  • thinking_mode: Enable the model's native reasoning (thinking) mode, for supported models
  • cot: Enable chain-of-thought prompting, for models without a native thinking mode
  • max_model_len: Maximum model context length
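For orientation, here is a minimal sketch of what a dataclass-style config for these fields might look like. The real definitions live in src/_config/model_config.py; the field names mirror the list above, but the defaults here are assumptions:

from dataclasses import dataclass

@dataclass
class ModelConfig:
    provider: str = "openrouter"  # LLM provider: "openrouter", "vllm", etc.
    name: str = ""                # model name/identifier
    multi_turn: bool = True       # multi-turn conversation mode
    thinking_mode: bool = False   # native reasoning mode (supported models only)
    cot: bool = False             # chain-of-thought prompting
    max_model_len: int = 40960    # maximum model context length (assumed default)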

Experiment Configuration

  • exp: Experiment type (e.g., "dict_sum")
  • num_samples: Number of test samples
  • dict_size: Size of dictionaries (for dict_sum)
  • working_capacity: Number of inputs processed per turn
  • horizon_length: Total number of operations to execute (working_capacity * number of turns)
  • llm_temperature: Sampling temperature
  • llm_top_p: Top-p sampling parameter
  • llm_max_tokens: Maximum tokens per response
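As a quick worked example of the horizon_length relationship (values are illustrative):

working_capacity = 10                          # inputs processed per turn
num_turns = 5                                  # conversation turns to run
horizon_length = working_capacity * num_turns  # 50 operations in total

Note that the Complete Example below sets horizon_length equal to working_capacity, i.e., a single turn.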

Weights & Biases Configuration

  • mode: "online", "offline", or "disabled"
  • project: W&B project name

Running Experiments

Basic Usage

uv run main.py --cfg.exp dict_sum

Complete Example

Here's a comprehensive example showing most of the configuration options:

uv run main.py --cfg.exp dict_sum \
    --cfg.model_config.provider "openrouter" \
    --cfg.model_config.name "$MODEL" \
    --cfg.model_config.thinking_mode $THINKING_MODE \
    --cfg.model_config.cot $COT \
    --cfg.model_config.max_model_len 40960 \
    --cfg.experiments.dict_sum.num_samples 1 \
    --cfg.experiments.dict_sum.dict_size 100 \
    --cfg.experiments.dict_sum.working_capacity ${WORKING_CAPACITY} \
    --cfg.experiments.dict_sum.horizon_length ${WORKING_CAPACITY} \
    --cfg.experiments.dict_sum.llm_temperature 0.6 \
    --cfg.experiments.dict_sum.llm_top_p 0.95 \
    --cfg.experiments.dict_sum.llm_max_tokens 100000 \
    --cfg.experiments.dict_sum.max_input_value 99 \
    --cfg.experiments.dict_sum.min_input_value -99 \
    --cfg.wandb_settings.mode "online" \
    --cfg.wandb_settings.project "frontier-final" \
    --cfg.experiments.dict_sum.local_dataset_path "dict_sum_100.json"

Thinking Modes

  • thinking_mode=true: Enables the native reasoning mode of supported models (e.g., Claude 4)
  • cot=true: Enables chain-of-thought prompting for step-by-step problem solving, for models without a native thinking mode (e.g., DeepSeek V3)

Only one of these can be enabled at a time.

Disabling both will run the model in standard mode, where it attempts to execute each turn in one go without intermediate reasoning steps.
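A minimal sketch of this constraint as a validation check (an assumption about how the config code might enforce it, not the actual implementation):

def validate_reasoning_flags(thinking_mode: bool, cot: bool) -> None:
    # thinking_mode and cot are mutually exclusive; both False means standard mode.
    if thinking_mode and cot:
        raise ValueError("Enable at most one of thinking_mode and cot.")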

Environment Variables

You can use environment variables for dynamic configuration:

export MULTI_TURN=true
export MODEL="anthropic/claude-3.5-sonnet"
export THINKING_MODE=true
export COT=false
export WORKING_CAPACITY=10

uv run main.py --cfg.exp dict_sum \
    --cfg.model_config.multi_turn $MULTI_TURN \
    --cfg.model_config.name "$MODEL"
    # ... other parameters

Experiments

Dictionary Sum Experiment (dict_sum)

The main experiment evaluates a model's ability to perform arithmetic operations over dictionaries across multiple turns (a sketch of the task shape follows the parameter list below):

  • Task: Given a dictionary with key-value pairs, perform cumulative sum operations
  • Evaluation: Accuracy of final computed values
  • Parameters:
    • dict_size: Number of key-value pairs in each dictionary
    • horizon_length: Number of operations to perform
    • working_capacity: Number of operations processed per turn
    • min_input_value/max_input_value: Range of values in dictionaries
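The actual evaluators live in src/experiments/dict_sum/dict_sum_util.py; purely to illustrate the task shape, a hypothetical generator and running-sum reference might look like this (make_sample and cumulative_sum are illustrative names):

import random

def make_sample(dict_size: int, lo: int, hi: int) -> dict[str, int]:
    # A dictionary of dict_size word keys with values in [lo, hi];
    # real runs draw keys from words_alpha.txt.
    words = [f"word{i}" for i in range(dict_size)]
    return {w: random.randint(lo, hi) for w in words}

def cumulative_sum(d: dict[str, int], keys: list[str]) -> list[int]:
    # Ground truth: the running total of d[k] over a horizon of keys.
    totals, running = [], 0
    for k in keys:
        running += d[k]
        totals.append(running)
    return totals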

Experiment Modes

  1. Multi-turn mode (multi_turn=true): Operations split across multiple conversation turns; this mode is required
  2. Chain-of-thought (cot=true): Enables step-by-step reasoning
  3. Thinking mode (thinking_mode=true): Advanced reasoning mode for supported models

Output and Logging

Output Directory Structure

Results are saved to timestamped directories (a snippet for loading them programmatically follows the tree):

output/
└── dict_sum_{model_name}_{timestamp}_{mode_flags}/
    ├── results.json          # Experiment results
    ├── config.yaml           # Full configuration used
    └── logs/                 # Detailed logs
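To inspect a run programmatically, results.json can be loaded directly (the directory name below is illustrative; the schema depends on the experiment):

import json
from pathlib import Path

# Pick the run directory produced by your experiment.
run_dir = Path("output") / "dict_sum_example-model_20250101-120000_cot"
results = json.loads((run_dir / "results.json").read_text())
print(results)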

Logging

The application provides detailed logging including:

  • Configuration validation
  • Experiment progress
  • Model response times
  • Error handling
  • W&B integration status

Weights & Biases Integration

Results are automatically logged to W&B when configured:

  • Experiment metrics and results
  • Configuration parameters
  • Model performance statistics

Development

Adding New Experiments

  1. Create a new experiment directory under src/experiments/
  2. Implement the experiment class inheriting from BaseExperiment (a skeleton follows this list)
  3. Add configuration class in src/_config/experiment_config.py
  4. Register the experiment in the appropriate modules
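A hypothetical skeleton for steps 1–2 (the method names here are assumptions; see src/experiments/base_experiment.py for the real interface):

from src.experiments.base_experiment import BaseExperiment

class MyNewExperiment(BaseExperiment):
    # Hypothetical hooks; align these with the actual BaseExperiment interface.
    def build_prompt(self, sample):
        ...

    def evaluate(self, sample, model_output):
        ...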

Dataset Generation

Use generate_dataset_json.py to create custom datasets:

uv run generate_dataset_json.py

Testing

Run experiments with minimal samples for testing:

uv run main.py --cfg.exp dict_sum \
    --cfg.experiments.dict_sum.num_samples 1 \
    --cfg.wandb_settings.mode "disabled"

Dependencies

Key dependencies include:

  • jsonargparse: Configuration management
  • openai: OpenRouter and OpenAI API integration
  • vllm: Local model inference
  • wandb: Experiment tracking
  • torch: Deep learning framework
  • transformers: Hugging Face model support

For a complete list, see pyproject.toml.
