Skip to content

ACL 2026 camera-ready: reviewer feedback & action items#251

Draft
smirnovlad wants to merge 3 commits into
mainfrom
camera-ready/acl2026
Draft

ACL 2026 camera-ready: reviewer feedback & action items#251
smirnovlad wants to merge 3 commits into
mainfrom
camera-ready/acl2026

Conversation

@smirnovlad

Copy link
Copy Markdown
Collaborator

Context

ThinkBooster (Submission #191) was accepted at ACL 2026 (decision 24 Apr 2026, Program Chairs).

This PR is a tracking branch for the camera-ready work. The full verbatim decision, meta review, and three reviewer comments are committed under docs/paper/acl2026_reviews.md. Each review is also posted as a separate PR comment below so individual asks can be discussed and resolved in their own threads.

Camera-ready punch list

Synthesised from the meta review + three reviewers (paXg, yXVQ, 9UwS). Owner / target commit can be assigned per item as we go.

Required (called out by Area Chair)

  • Wall-clock latency analysis alongside TFLOPs/token efficiency metrics (paXg, meta)
  • Live demo accessibility — fix http://demo-thinkbooster.nlpresearch.group (paXg, meta)
  • Comparison with existing TTC frameworks — LLM Reasoners, search-and-learn, OpenR, OptiLLM (paXg, meta)
  • Critique-based scaling family — self-correction, self-verification refs:
    • [1] Training Language Models to Self-Correct via Reinforcement Learning. ICLR 2025
    • [2] Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards. NeurIPS 2025
    • (9UwS, meta)
  • Discuss limitations: white-box access requirements for fully hosted models (meta)
  • Discuss limitations: scope to math / coding / science tasks (meta, yXVQ)

Optional / nice-to-have

  • Commercial models evaluation — GPT-5, Claude via the OpenAI-compatible proxy (paXg)
  • PRM interchangeability — clarify swap story for non-math domains (paXg)
  • Reproducibility tightening — parameter specifications flagged as underspecified (yXVQ)
  • Generalisation discussion — long-context / tool-using / black-box models, multi-agent and chain-of-tool pipelines, dynamic compute budgets, debugger scaling (yXVQ)

How to use this PR

  • The doc commit (docs/paper/acl2026_reviews.md) is the source of truth.
  • Per-reviewer threads live as comments below — reply there to discuss specific asks.
  • As items are addressed, add follow-up commits to this branch (or open child PRs that this one tracks).
  • When the camera-ready submission is in, this PR can be merged or closed.

Verbatim copy of the OpenReview decision, meta review, and three
official reviews for ThinkBooster (Submission #191). Adds a synthesised
camera-ready punch list at the bottom that consolidates the action
items raised across the meta review and reviewers paXg / yXVQ / 9UwS.

Lives under docs/paper/ so the camera-ready work has a single source
of truth and can be tracked alongside code changes.
@smirnovlad

Copy link
Copy Markdown
Collaborator Author

Paper Decision

Decision by Program Chairs · 24 Apr 2026, 16:48 (modified: 24 Apr 2026, 22:14)
Visible to: Program Chairs, Authors

Decision: Accept


Meta Review of Submission 191 by Area Chair 9gJd

Meta Review by Area Chair 9gJd · 12 Apr 2026, 07:07 (modified: 24 Apr 2026, 22:14)
Visible to: Senior Area Chairs, Area Chairs, Authors, Program Chairs

Metareview

ThinkBooster is a unified framework for test-time compute (TTC) scaling of LLM reasoning, combining a modular Python library of strategies and scorers, a joint performance-compute benchmark, and an OpenAI-compatible endpoint gateway with a visual debugger for reasoning trajectory inspection.

Pros:

  • Addresses a real and practical need, bringing multiple TTC scaling strategies under the same unified framework
  • Solid engineering choices with OpenAI-compatible gateway that could enable adoption and a modular design with clean open-source artifacts

Cons:

  • No comparison to existing TTC frameworks (LLM Reasoners, search-and-learn, OpenR, OptiLLM) would help make the contribution of this paper stronger.
  • Agreeing with Reviewer 9UwS, critique-based scaling methods (e.g. self-correction, self-verification) are a notable missing family from the strategy taxonomy

Recommendations for camera ready: For the camera ready, the authors should: add wall-clock latency analysis alongside the TFLOPs/token efficiency metrics; ensure the live demo is accessible; discuss the white-box access requirements as a limitation for fully hosted models; and acknowledge the current scope limitation to math/coding/science tasks.

@smirnovlad

Copy link
Copy Markdown
Collaborator Author

Reviewer paXg — Overall: 6 (Marginally above acceptance threshold)

A Unified Python Toolkit for Test-Time Scaling of LLM Reasoning
Official Review by Reviewer paXg · 04 Apr 2026, 20:32 (modified: 24 Apr 2026, 22:14)
Visible to: Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer paXg, Authors

Summary

This paper proposes ThinkBooster, a unified framework for test-time compute (TTC) scaling of LLM reasoning. It addresses two key gaps in current TTC scaling research: (1) the lack of standardized evaluation that jointly considers both performance and compute efficiency, and (2) the absence of a unified implementation across diverse TTC scaling methods. ThinkBooster provides a modular Python library implementing common TTC scaling strategies and scorers, along with a benchmark that evaluates both task performance and compute efficiency (in TFLOPs and token cost).

Review

Overall Evaluation

ThinkBooster presents a well-engineered and practically useful unified framework for test-time compute scaling of LLM reasoning. However, the paper lacks sufficient comparison with existing TTC scaling frameworks and the evaluation is limited to open-source models without wall-clock latency analysis, and the live demo is also inaccessible during review.

Pros

Good system design and practical engineering contribution. ThinkBooster cleanly separates different TTC modules and supports both swift implementation with OpenAI compatible SDK, making it practical for both research and production use.

Performance–compute efficiency benchmarking. The framework provides a comprehensive and systematic evaluation of existing TTC scaling methods, jointly measuring task performance and compute efficiency.

Clean open-source artifacts. The demo video and code repository appear well-organized and of good quality, which lowers the barrier for adoption and implementation.

Cons

Insufficient comparison with existing frameworks. Several TTC scaling frameworks already exist e.g., LLM Reasoners (MCTS, BFS/DFS, beam search with multiple backends), HuggingFace's search-and-learn (Best-of-N, beam search, DVTS), OpenR (Best-of-N, beam search, MCTS with PRM), and OptiLLM (an OpenAI-compatible proxy with 20+ techniques). The paper lacks a systematic comparison with these alternatives in terms of strategy coverage, scorer diversity, and deployment design, making it difficult to assess ThinkBooster's unique positioning.

Only open-source models evaluated. All tested models (Qwen2.5-Math-7B, Qwen3-8B, GPT-OSS-120B) are open-source. Commercial models, which are common in production settings, are not evaluated.

Efficiency measured only in tokens/TFLOPs, lacking wall-clock time. While reporting token counts and TFLOPs is reasonable, it is difficult to assess practical feasibility from these numbers alone. R.g., a 10x or 100x compute ratio does not tell practitioners whether the actual latency is acceptable. Supplementing with wall-clock execution time under specific hardware configurations would make the efficiency analysis more informative.

Demo link is inaccessible. The demo URL provided in the footnotes (http://demo-thinkbooster.nlpresearch.group) is currently not working. For a demo track submission, being unable to access the live demo during review is a notable issue.

Reasons To Accept

ThinkBooster offers a well-designed, modular framework that unifies fragmented TTC scaling implementations behind a consistent API, filling a practical gap for both researchers and practitioners. The OpenAI-compatible endpoint gateway enables drop-in adoption with minimal friction. The joint performance–compute benchmark sets a good precedent for the field by systematically evaluating efficiency alongside accuracy.

Rating: 6: Marginally above acceptance threshold

Reasons To Reject

The paper lacks comparison with several existing TTC scaling frameworks (LLM Reasoners, search-and-learn, OpenR, OptiLLM), making it difficult to assess its unique contributions. The evaluation is limited to open-source models with no commercial model experiments, and efficiency is reported only in theoretical TFLOPs without wall-clock latency. The live demo is inaccessible during review, which is a significant concern for a demo track submission.

Questions And Additional Feedback

Questions

  1. Comparison with existing frameworks. Could the authors provide a more detailed comparison to better position ThinkBooster's unique contributions against these alternatives?
  2. Wall-clock latency measurement. The current efficiency analysis relies on theoretical TFLOPs and token counts. Could the authors supplement this with wall-clock execution time under specific hardware configurations?
  3. Commercial model evaluation. All experiments use open-source models. Given that the framework is designed as an OpenAI-compatible proxy, have the authors considered evaluating commercial models (e.g., GPT-5, Claude)?
  4. PRM interchangeability across tasks. Does the framework support easy swapping of PRMs for different task domains? The current experiments use a math-trained PRM (Qwen2.5-Math-PRM-7B) even for coding tasks. For harder or domain-specific tasks, a stronger or domain-matched PRM may be needed — how straightforward is this replacement within the framework?

Form fields

  • Needs Ethical Review: No
  • Reproducibility: 4 — They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
  • Software Or Live Demo: 4 — Useful: I would recommend the new software / live demo to other researchers or developers for their ongoing work.
  • Datasets: 3 — Potentially useful: Someone might find the new datasets useful for their work.
  • Overall Assessment: 6: Marginally above acceptance threshold

@smirnovlad

Copy link
Copy Markdown
Collaborator Author

Reviewer yXVQ — Overall: 6 (Marginally above acceptance threshold)

Review
Official Review by Reviewer yXVQ · 01 Apr 2026, 05:26 (modified: 24 Apr 2026, 22:14)
Visible to: Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer yXVQ, Authors

Summary

This paper presents THINKBOOSTER, a unified framework for test-time compute (TTC) scaling of LLM reasoning. It addresses the problem that existing TTC scaling strategies—like best-of-N, tree-of-thought, and self-consistency—are fragmented, evaluated inconsistently, and rarely consider compute-efficiency trade-offs. THINKBOOSTER provides a modular Python library with state-of-the-art scaling strategies and scorers, a benchmark for joint performance–compute evaluation, an OpenAI-compatible endpoint gateway for real-world deployment, and a visual debugger for inspecting reasoning trajectories. Its main contributions are enabling seamless test-time scaling, principled evaluation of performance–compute trade-offs, and practical integration into applications like coding assistants and mathematical problem solvers, demonstrating improved accuracy and efficiency on math, scientific, and programming benchmarks.

Review

Review of "THINKBOOSTER: A Unified Framework for Test‑Time Compute Scaling of LLM Reasoning"

Summary

The paper introduces THINKBOOSTER, a modular library and evaluation framework for test-time compute (TTC) scaling of large‑language‑model reasoning. It addresses limitations in existing TTC strategies (e.g., best‑of‑N, self‑consistency, tree‑of‑thought) by unifying them under a common API, providing principled performance–compute benchmarking, real‑world endpoints, and tools for debugging reasoning trajectories. The work demonstrates improved accuracy and more efficient compute use on math, science, and coding benchmarks.

Evaluation

Quality. The implementation appears engineered with practical utility in mind, offering a standard interface to plug in diverse scaling strategies and scorers. The benchmarking setup considers both reward and computational cost, which is important for realistic evaluation. However, the quality of empirical validation depends on the breadth and depth of benchmarks chosen (details not in summary). The paper defines relevant metrics and trade‑offs, e.g.:

(equation rendered in original review — not preserved in plaintext copy)

which formalizes how added compute translates into performance improvement.

Clarity. The presentation is generally clear, with modular descriptions of components (strategy, scorer, gateway, debugger). Architectural diagrams and code examples (if included) likely help clarify usage. One area that may benefit from more explicit description is how scalability and cost are measured consistently across different strategies.

Originality. The work's main novelty lies in systematizing disparate TTC strategies into a unified framework with common interfaces and joint performance–compute evaluation. While existing work explores many scaling strategies in isolation, THINKBOOSTER's contribution is in engineering unification and tooling, rather than new algorithmic scaling methods per se.

Significance. The framework has practical significance for practitioners and researchers needing to compare and deploy TTC scaling methods. By exposing a shared API and benchmarking tools, THINKBOOSTER can foster more comparable research and real‑world adoption of compute‑aware reasoning strategies. Its integration with production endpoints and a visual debugger also aids interpretability and deployment.

Pros

  • Unified API for diverse TTC strategies (best‑of‑N, self‑consistency, tree‑of‑thought).
  • Principled performance–compute evaluation, enabling trade‑off analysis rather than raw accuracy.
  • Modular design encouraging extensibility (swap strategy, scorer, or model backend).
  • Includes production integration (OpenAI‑compatible endpoint gateway) and visual debugging of reasoning trajectories.
  • Demonstrated practical gains on reasoning benchmarks in math, science, and programming domains.

Cons

  • Focus is more on systems engineering and benchmarking rather than proposing fundamentally new TTC algorithms.
  • Empirical evaluation quality depends heavily on selected benchmarks and cost models; generalization to other tasks may vary.
  • The cost model may not account for all real‑world constraints (e.g., memory, latency, parallelization).
  • Comparisons with the latest adaptive or learned scaling strategies beyond classic ones may be limited.
  • Users must still choose scoring functions and compute trade‑offs, which may require domain expertise.

Overall

THINKBOOSTER presents a practical and well‑engineered framework that consolidates test‑time compute scaling approaches under a coherent API and evaluation paradigm. It is clear, original in its system perspective, and significant for reproducible and comparable reasoning research. The work's strengths lie in its tooling and benchmarking infrastructure, which can support both research and deployment. However, its contribution is largely in unification and tooling rather than novel algorithms.

Recommendation: Accept — the contribution is valuable for the NLP community's shift toward compute‑aware, benchmarkable reasoning.

Reasons To Accept

The paper's strengths are its unified, modular framework for test-time compute (TTC) scaling, standardized benchmarking of performance–compute trade-offs, and practical deployment through an OpenAI-compatible endpoint and visual debugger. Presenting it benefits the NLP community by providing a reproducible, comparable, and developer-friendly system for evaluating and deploying adaptive LLM reasoning strategies, enabling more systematic research and practical adoption of compute-aware LLM reasoning methods.

Rating: 6: Marginally above acceptance threshold

Reasons To Reject

Weaknesses of the paper include its focus on systems integration and benchmarking rather than novel TTC algorithms, reliance on specific LLMs and benchmark datasets (math, science, coding), and white-box strategy requirements that may not generalize to all hosted LLMs. Presenting it may risk overstating generalizability and real-world impact, as performance–compute trade-offs could vary for other models, domains, or deployment environments.

Questions And Additional Feedback

  1. How well does THINKBOOSTER generalize to LLMs outside math, coding, and scientific QA, such as long-context or tool-using models?
  2. Are there plans to support fully black-box LLMs without white-box signals like logits or prefill options?
  3. How does the visual debugger scale for very long reasoning trajectories or multiple simultaneous requests?
  4. Can the framework handle dynamic or adaptive compute budgets in real-time applications with latency constraints?
  5. How would THINKBOOSTER integrate with multi-agent or chain-of-tool pipelines, and are there plans for such evaluations?

Form fields

  • Needs Ethical Review: Yes
  • Reproducibility: 3 — They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
  • Software Or Live Demo: 3 — Potentially useful: Someone might find the new software / live demo useful for their work.
  • Datasets: 3 — Potentially useful: Someone might find the new datasets useful for their work.
  • Overall Assessment: 6: Marginally above acceptance threshold

@smirnovlad

Copy link
Copy Markdown
Collaborator Author

Reviewer 9UwS — Overall: 6 (Marginally above acceptance threshold)

Official Review of ThinkBooster
Official Review by Reviewer 9UwS · 30 Mar 2026, 06:36 (modified: 24 Apr 2026, 22:14)
Visible to: Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer 9UwS, Authors

Summary

This paper presents a modular and practical framework for improving LLM reasoning by allocating more computation at inference time. The paper tackles the current fragmentation in test-time scaling methods by offering a unified system that supports multiple reasoning strategies, scoring mechanisms, and evaluation tools.

Review

See "Reasons To Accept" and "Reasons To Reject"

Reasons To Accept

Practical and easy to integrate: The OpenAI-compatible gateway is a strong design choice because it makes the framework easy to adopt in existing applications without major engineering changes.

Well-rounded framework (comprehensive for TTS methods): The paper does a good job of combining methods, deployment, and benchmarking in one system, which makes it more complete and useful than a narrowly scoped demo.

Strong emphasis on transparency: The visual debugger is especially appealing, as it gives users insight into reasoning trajectories and makes the system more interpretable and easier to analyze.

Rating: 6: Marginally above acceptance threshold

Reasons To Reject

Evaluation scope seems somewhat narrow: The current focus on domains like math and coding is promising, but it would be even stronger to see broader validation on other tasks.

Another line of scaling is "critique" based methods [1, 2]. It is better to cover this kind of TTS as well.

[1] Training Language Models to Self-Correct via Reinforcement Learning. ICLR 2025

[2] Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards. NeurIPS 2025

Questions And Additional Feedback

No

Form fields

  • Needs Ethical Review: No
  • Reproducibility: 4 — They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
  • Software Or Live Demo: 4 — Useful: I would recommend the new software / live demo to other researchers or developers for their ongoing work.
  • Datasets: 1 — No usable datasets submitted.
  • Overall Assessment: 6: Marginally above acceptance threshold

Schedules an HTTP probe of http://demo-thinkbooster.nlpresearch.group/
every 10 min (with manual workflow_dispatch). On failure (3 attempts
with 5 s back-off), it opens a labelled `demo-down` issue or appends a
"still down" comment if one is already open. On recovery it adds a
recovery comment and closes the issue.

Addresses the camera-ready ask raised by Reviewer paXg / the Area Chair
(Submission #191): the demo URL was unreachable during ACL 2026 review.
This gives us a public audit trail of demo uptime alongside the paper.

Pair with an external monitor (UptimeRobot or similar) to cover GitHub
Actions cron drift and Actions outages — covered separately.
Addresses Reviewer paXg / Area Chair ask for a comparison against
existing TTC scaling frameworks. Drafts:

- 12 differentiating feature axes (strategy/scoring/deployment/eval)
- 4 competitor columns: LLM Reasoners, search-and-learn, OpenR, OptiLLM
- First-pass cells with ?@<owner> markers for verification by:
  Vlad (LLM Reasoners), Quang (search-and-learn),
  Sergey (OpenR), Artem S (OptiLLM)
- LaTeX booktabs version drop-in for the paper
- Open questions for camera-ready decisions
@smirnovlad

Copy link
Copy Markdown
Collaborator Author

TTC framework comparison — repos under analysis

Reviewer paXg / Area Chair asked for a comparison table against existing TTC scaling libraries. I'm drafting one in docs/paper/ttc_framework_comparison.md (just pushed).

The four frameworks named by the reviewer (and the repos I'm reading to fill the table):

Framework Repository
LLM Reasoners https://github.com/maitrix-org/llm-reasoners
search-and-learn https://github.com/huggingface/search-and-learn
OpenR https://github.com/openreasoner/openr
OptiLLM https://github.com/codelion/optillm

Once the draft fill is in, the table assignments per the offline plan are:

  • LLM Reasoners → Vlad
  • search-and-learn → Quang
  • OpenR → Sergey
  • OptiLLM → Artem S

Posting this first so the source-of-truth repos are pinned before reviewers are asked to validate.

@smirnovlad

Copy link
Copy Markdown
Collaborator Author

LLM Reasoners (https://github.com/maitrix-org/llm-reasoners)

  • Stars / license / home: 2.3k stars, Apache-2.0, https://www.llm-reasoners.net/. Core abstractions: WorldModel + SearchConfig (reward) + SearchAlgorithm, composed by a Reasoner (reasoners/base.py).
  1. Scope: A research library for "advanced LLM reasoning" — primarily a search/planning toolkit that frames reasoning as world-model + reward + search, with example pipelines for math/QA/blocksworld; it is not positioned as a production TTC inference scaler.

  2. Strategies / search algorithms (in-tree, first-class in reasoners/algorithm/): 5 core search algos — MCTS, BeamSearch, DFS (used for ToT-DFS), Greedy, Random. ToT-BFS/DFS, RAP, CoT, Least-to-most, Self-Eval Decoding, Grace Decoding, PromptAgent, ReAct, ReasonerAgent are realized as example pipelines (examples/) on top of those 5 algos rather than as separate algorithm classes. README also mentions "Inference-time Scaling with Process Reward Models" as an example.

  3. Scorers / world models: Reward is a user-supplied SearchConfig.reward(state, action) returning a float — no first-class scorer registry. PRM support exists only as one example (examples/Inference-Scaling-SGL/math500/, using peiyi9979/math-shepherd-mistral-7b-prm via SGLang). No first-class abstractions for confidence/uncertainty scorers, supervised step scorers, or LLM-as-critic — they must be hand-coded inside SearchConfig.

  4. Online vs offline: Online — algorithms steer generation step-by-step by expanding actions and scoring partial states via the world model. There is no built-in "rerank N full candidates" offline pass (you would write that as a custom SearchConfig).

  5. Adaptive / uncertainty-driven scaling: No. Compute budget is fixed by hyperparameters (n_iters, beam_size, depth_limit); no built-in mechanism to allocate more rollouts at uncertain steps.

  6. Confidence-based steering (DeepConf-style): No. No reference to DeepConf, confidence gating, or token-level entropy steering anywhere in the repo.

  7. Aggregation knobs: Partial. BeamSearch exposes reward_aggregator ∈ {cumulative/accumulative, mean/average, last, or custom callable}; MCTS exposes cum_reward (default sum), calc_q (default np.mean), and MCTSAggregation.weight_policy ∈ {edge, edge_inverse_depth, uniform}. No min/max/product or sliding-window aggregators out of the box.

  8. Backend(s): HF Transformers (HFModel), llama.cpp, ExLlama, native LLaMA-1/2/3 weights, OpenAI, Anthropic, Gemini, SGLang — exposed in reasoners/lm/__init__.py. No first-class vllm backend in the core library (one example examples/DRPO/models/vllm_model.py only). Black-box-friendly via OpenAIModel / ClaudeModel / BardCompletionModel.

  9. REST gateway / OpenAI-compatible server: No. The library is a Python SDK; no built-in HTTP server. Examples consume SGLang/OpenAI servers but do not serve an OpenAI-compatible endpoint.

  10. Visual debugger / trajectory inspector: Yes. reasoners/visualization/ produces TreeLog artifacts and uploads them to a hosted web visualizer at main.d1puk3wdon4rk8.amplifyapp.com via an AWS API Gateway — interactive search-tree inspection.

  11. Joint compute–performance benchmarks: Token-only at best. README reports accuracy on GSM8K, Game-of-24, Blocksworld, ProsQA, Math-500, etc.; the Math-500 README reports accuracy + wall-clock seconds. No TFLOPs accounting, no joint compute–quality Pareto curves in the repo.

  12. Crash-resistant evaluation pipeline: Partial. Evaluator.evaluate(..., resume=N) skips the first N examples and pickles each algo output to algo_output/{i}.pkl — sufficient to manually restart after a crash, but no automatic checkpointing of in-progress search trees and no recovery of mid-example state.

  13. Last release / activity: One tagged release v1.0.0 (2024-05-02). Last commit on main: 2025-06-10 (README edit). Open issues: 32. Issue cadence has slowed — most 2025 issues are user bug reports rather than maintainer activity. Repo is best described as research-active but not actively maintained as a product.

  14. Notable / gap: Distinctive — clean three-component (world-model / reward / search) abstraction and the hosted tree visualizer are unusual and useful for debugging search trajectories. Gaps relative to a TTC-scaling toolkit like ThinkBooster: no vLLM-native backend, no OpenAI-compatible server, no DeepConf-style confidence steering, no adaptive per-step compute allocation, no joint compute–accuracy benchmarks (TFLOPs/tokens), no first-class PRM/critic/uncertainty-scorer registry (PRM is one example, not core), and only coarse aggregation knobs (no min/sliding-window). Several core examples (RAP/ToT/PRM) are wired to specific models and have open issues about breakage on newer LLMs (Gemma 3, Llama 3.1).

@smirnovlad

Copy link
Copy Markdown
Collaborator Author

search-and-learn (https://github.com/huggingface/search-and-learn)

1. Scope. Inference-time-compute scaling toolkit ("recipes to enhance LLM capabilities by scaling inference-time compute") focused on verifier-guided search for math reasoning (MATH-500), accompanying the HF blog post replicating Snell et al. 2024.

2. Strategies / search algorithms (3 total).

  • Best-of-N (best_of_n.py)
  • Beam search (beam_search.py)
  • Diverse Verifier Tree Search / DVTS (diverse_verifier_tree_search.py)
  • Dispatch is a hard-coded dict in scripts/test_time_compute.py; Config.approach is Literal["best_of_n", "beam_search", "dvts"]. No MCTS, lookahead-MCTS, self-consistency-as-a-strategy, or rollout-based variants.

3. Scorers / PRMs. Five first-class PRMs in src/sal/models/reward_models.py (load_prm):

  • RLHFlow/Llama3.1-8B-PRM-Deepseek-Data (default)
  • peiyi9979/math-shepherd-mistral-7b-prm (Math-Shepherd)
  • Skywork/Skywork-o1-Open-PRM-Qwen-2.5-{1.5B, 7B}
  • Qwen/Qwen2.5-Math-PRM-7B
  • All are supervised PRMs. No uncertainty/entropy scorers, no LLM-as-judge / critic, no logprob-based or self-consistency scorers as separate scorer types. Adding a new PRM requires editing load_prm (raises NotImplementedError).

4. Online vs offline. Offline rerank / batched search only. Pipeline = vLLM.generate → PRM.score → aggregate → pick argmax, run via dataset.map(approach_fn, batched=True). No streaming / token-level steering hook, no early-exit during decoding.

5. Adaptive / uncertainty-driven scaling. No. Compute is fixed by n, beam_width, num_iterations, lookahead. No conditional allocation on uncertain steps; no stopping criterion driven by score variance. (GitHub code search for confidence|uncertainty|entropy|adaptive in repo: 0 hits.)

6. Confidence-based steering (DeepConf-style). No. No confidence-gated branching or pruning mechanism.

7. Aggregation knobs. Yes, but minimal: agg_strategy: Literal["last", "min", "prod"] over per-step PRM scores (utils/score.py::aggregate_scores). Final answer aggregation across n adds majority / weighted / naive voting (utils/math.py). No learned aggregator, no configurable per-step weighting.

8. Backend(s). vLLM only (hard-coded from vllm import LLM in the entrypoint and every search file; pinned to vllm==0.6.3 in setup.py). PRMs loaded via HF transformers directly. No OpenAI-compatible client, no HF-Inference / TGI adapter, no black-box / API-only path.

9. REST gateway / OpenAI-compatible server. No. fastapi is listed in install_requires but no server code or routes exist anywhere in src/, scripts/, or recipes/. Appears to be a vestigial dependency.

10. Visual debugger / trajectory inspector. No. No notebook UI, dashboard, or trajectory-rendering tool. Outputs are JSONL + HF dataset branches.

11. Joint compute–performance benchmarks. Token-only. completion_tokens is recorded per problem; there is no FLOPs / TFLOPs accounting and no compute-vs-accuracy plotting code. The blog post reports tokens-vs-accuracy curves; the repo itself doesn't include a benchmark harness — final accuracy is computed via an external fork of Qwen2.5-Math (README: "stand-alone evaluation script directly in search-and-learn: stay tuned!").

12. Crash-resistant evaluation pipeline. Partial, coarse-grained only. No per-problem checkpointing or resume. Crash-resistance is achieved via Slurm array sharding (recipes/launch_array.slurm, default 20 chunks of 25 problems) where each shard is pushed to a separate HF Hub revision, and Config.__post_init__ does an exit() if the revision already exists on the Hub. Re-merging is a separate manual step (scripts/merge_chunks.py). No mid-run checkpoint, no resume-from-failure within a shard.

13. Last release / activity.

14. Anything notable.

  • Distinctive: clean implementation of DVTS (only public reference impl I'm aware of); ships per-model recipe YAMLs (Llama-3.2-1B/3B, Qwen2.5-1.5B, AceMath-7B); includes a PRM training sub-project (recipes/training/, TRL-based) plus ProcessBench eval — most TTC repos only do inference.
  • Notable gaps for an ACL TTC comparison: no online steering, no adaptive/confidence-driven compute, single backend (vLLM-only — not black-box-friendly), no server, no debugger, no FLOPs accounting, no in-repo evaluator (depends on external Qwen2.5-Math fork), no resume/checkpointing within a shard, only 3 algorithms and only PRM scorers, hard-coded PRM allowlist. Fundamentally a reproducibility recipe set for one paper, not an extensible TTC framework.

@smirnovlad

Copy link
Copy Markdown
Collaborator Author

OpenR (https://github.com/openreasoner/openr)

  • 1. Scope: Open-source framework for advanced LLM reasoning that combines test-time search, process-reward modeling, and online RL training (APPO/GRPO/TPPO) on math/reasoning tasks (GSM8K, MATH).

  • 2. Strategies / search algorithms (5 first-class): (1) CoT / Greedy, (2) Best-of-N, (3) Beam Search (step-level, PRM-guided), (4) Vanilla MCTS (PUCT with pb_c_base/pb_c_init), (5) rStar-MCTS (mutual reasoning, two-actor MCTS env). README also mentions "Critic-MCTS" but it is marked "Under Review" (PR Add evaluation protocol documentation #44), so 5 merged + 1 pending. No DVTS/lookahead/sequential-revision.

  • 3. Scorers / PRMs / RMs: PRMs are the only first-class scorer family. They ship their own trained PRM (Math-psa) and integrate Math-Shepherd-Mistral-7B-PRM, Skywork-o1-PRM-7B, Qwen2.5-Math-7B-PRM. They also have a Generative RM track (gen_rm/, "Direct GenRM"). They do train PRMs themselves (prm/code/finetune_qwen*.py, OmegaPRM data generation) — at inference time this is still a single supervised step-classifier scorer, so by the strict "supervised step scorer" criterion it counts as one family, not multiple. No uncertainty-based scorer, no LLM-as-judge / critic scorer at inference, no logprob/self-consistency-confidence scorer — only majority vote + PRM-min/last × max/vote (see vote_utils.py).

  • 4. Online vs offline: Both, but mostly offline rerank. Best-of-N is pure rerank. Beam/MCTS/rStar are online step-level steering (the PRM is queried per step inside SearchTree.beam_search / vanila_mcts). No streaming/token-level steering — the granularity is "step" (split by lm_step_tag).

  • 5. Adaptive / uncertainty-driven scaling: No. Budgets are fixed per-config (num_sequence, tree_max_width, tree_max_depth, beam_size, num_path). Nothing in methods.py allocates more compute on uncertain/low-confidence steps.

  • 6. Confidence-based steering (DeepConf-style): No. No entropy/logprob-based early-exit or branch-pruning logic; cumulative_logprob/logp_avg_by_len are computed by vLLM but only stored, never used for steering decisions.

  • 7. Aggregation knobs: Configurable but limited. reason/reranking/vote_utils.py exposes 5 aggregators: majority_vote, prm_min_max, prm_min_vote, prm_last_max, prm_last_vote (ORM variants commented out). So step aggregation = min or last only — no mean / max / product / sliding-window. Final aggregation across N samples = vote or max.

  • 8. Backend(s): vLLM and HuggingFace, both wrapped via FastChat workers (reason/llm_service/workers/{vllm_worker,model_worker}.py). Architecture is FastChat controller + LM workers + RM workers, addressed by controller_addr. Not natively black-box-friendly: requires self-hosted models for both policy and PRM; no OpenAI/Anthropic API client path in lm_call.py (only VLLMRemoteCaller and FastChatRemoteCaller).

  • 9. REST gateway / OpenAI-compatible server: Indirectly. They run FastChat's controller + vLLM worker (FastChat does provide an OpenAI-shape openai_api_server), but OpenR itself talks to that controller via FastChat's internal worker protocol (/worker_generate, /worker_generate_stream), not an OpenAI-shape /v1/chat/completions. So "OpenAI-compatible" only by inheritance from FastChat, not as a first-class OpenR feature. Unclear from README whether they document the OpenAI endpoint.

  • 10. Visual debugger / trajectory inspector: No in-repo visualizer (no Gradio/Streamlit code in repo; no viz/ or dashboard/ dir). A ModelScope demo is linked in the README but it is a hosted inference page, not a local trajectory inspector. Trees/trajectories are persisted as JSONL only (record.jsonl).

  • 11. Joint compute–performance benchmarks: Token-only. benchmark/tables.md reports MATH accuracy vs "Budget" expressed as 2^N samples; per-method outputs include total_completion_tokens (and tree_completion_tokens for tree methods). No TFLOPs / wall-clock / GPU-hours accounting.

  • 12. Crash-resistant evaluation pipeline: Partial resume. evaluate.py has --resume_dir that re-reads record.jsonl and skips already-answered questions. No mid-question checkpointing, no atomic write/lockfile, no retry-on-failure. Ray ActorPool fans out across workers but failures of a single actor are not explicitly handled.

  • 13. Last release / activity: No GitHub releases, no git tags. Last commit on main: 2025-01-17 (fix direction_answer action in rstar_env, refactor configs #81). Repo created 2024-10-11. 1.84k stars, 131 forks, 44 open issues, last issue activity 2025-12-25 (Optimize offline Best-of-N with single-call trajectory generation #98 "fix bug in omegaprm_v2"). Effectively dormant on main for ~16 months as of 2026-05-02; community still files issues.

  • 14. Notable: (a) Strong RL training side that most TTC-only toolkits lack — APPO/GRPO/TPPO trainers in train/mat/ for online policy training against a PRM. (b) Open-source PRM + dataset (Math-psa + MATH-APS on HF). (c) OmegaPRM-style automated process-supervision data generation in data/. (d) Tied to FastChat infra (heavy, tmux-based service launch); Ray used for parallel evaluation. (e) Math-domain focused: envs/MATH/ with latex2sympy answer checking; no clear path for general-purpose / non-math tasks.

@smirnovlad

Copy link
Copy Markdown
Collaborator Author

OptiLLM (https://github.com/codelion/optillm)

Repo: algorithmicsuperintelligence/optillm (formerly codelion/optillm), Apache-2.0, 3.4k stars, 266 forks, last commit 2026-03-19, last release v0.3.14 (2026-03-19), ~monthly releases.

1. Scope

An OpenAI-API-compatible inference proxy (Flask server exposing /v1/chat/completions, /v1/models, /health) that wraps any OpenAI-compatible backend and applies one of ~20 inference-time "approaches" or plugins selected via a model-name prefix (e.g. moa-gpt-4o-mini) or extra_body.

2. Strategies / techniques (~20+ "approaches" + ~16 plugins, ~36 total)

  • Genuine search / sampling algorithms (~7): mcts, rstar (R*), plansearch, mars (multi-agent reasoning), bon (best-of-N), pvg (prover-verifier game), self_consistency.
  • Decoding-time interventions (~5): cot_decoding, entropy_decoding, thinkdeeper (reasoning-effort budget), autothink (classifier + steering vectors), deepconf.
  • Prompting / scaffolding tricks (~7): cot_reflection, re2 (re-read), leap (few-shot induction), rto (round-trip), moa (mixture-of-agents), cepo (Cerebras planning + self-reflection), z3 (LLM-emits-then-Z3-solves).
  • Plugins (orchestration / tools, not search): spl, deepthink, longcepo, majority_voting, genselect, coc, mcp, router, memory, privacy, readurls, executecode, json, web_search, deep_research, proxy (load-balancer).
  • Honest tally: of the ~20 "approaches", roughly 7 are real search/sampling, 5 are decoding-level interventions, and the remaining ~8 are prompt-engineering scaffolds. Many of the 16 "plugins" are tool-use / infra (MCP, web search, code exec, routing) rather than TTC.

3. Scorers

  • No PRM / supervised step scorer / value model in the README. Verification = pvg prover-verifier game (LLM-as-judge style) and moa-style critique; selection = self-consistency / majority-vote / genselect.
  • Internal-signal "scoring" only in deepconf (token entropy + top-k logprob → group/trace confidence) and entropy_decoding (token-entropy-driven sampling). These are uncertainty proxies, not learned PRMs.

4. Online vs offline

Both, but mostly offline aggregation. Most approaches generate full samples and rerank/vote (bon, moa, self_consistency, mcts, plansearch, rstar). Genuinely online (mid-stream): cot_decoding, entropy_decoding, thinkdeeper, autothink, deepconf (early-terminates traces by confidence).

5. Adaptive / uncertainty-driven scaling

Partial. deepconf does trace-level early termination when confidence/consensus crosses a threshold (warmup → online filter → consensus stop), and autothink allocates a token budget by query-complexity classification. There is no per-step compute reallocation to uncertain steps — gating is per-trace, not per-step.

6. Confidence-based steering (DeepConf-style)

Yesoptillm/deepconf/ directly implements the Fu et al. "Deep Think with Confidence" paper (token entropy + top-k logprob, sliding-window group confidence, low/high variants, weighted majority vote). Local-models-only.

7. Aggregation knobs

Limited / not configurable as mean/median/min/max/last. Aggregation is per-approach: majority vote (majority_voting, self_consistency), weighted majority by confidence (deepconf), critique-then-aggregate (moa), LLM-pick (genselect). No uniform aggregator API.

8. Backends

OpenAI-compatible by design. Supports OpenAI, Azure OpenAI (incl. Managed Identity), Cerebras, plus LiteLLM passthrough (Anthropic, Gemini, etc.). Local: HuggingFace + PEFT/LoRA via optillm/inference.py, MLX on Apple Silicon, llama.cpp / Ollama as external servers. No vLLM / SGLang integration; in-house request batcher (batching.py, simple per-model queue with max_batch_size/max_wait_ms).

9. REST gateway / OpenAI-compatible server

Yes, that is the entire product. Flask app at /v1/chat/completions. Approach is selected via model="<approach>-<model>" prefix or extra_body={"optillm_approach": ...}. No native Python SDK — clients use openai SDK with base_url="http://localhost:8000/v1".

10. Visual debugger / trajectory inspector

No. Only conversation_logger.py writes JSONL of provider calls (daily-rotated). No UI/dashboard.

11. Joint compute–performance benchmarks

Token-only, no FLOPs. AutoThink table reports avg tokens vs accuracy on GPQA-Diamond / MMLU-Pro; DeepConf advertises 50–70% token reduction. No TFLOPs reported anywhere in README.

12. Crash-resistant evaluation pipeline

Partial. scripts/eval_aime_benchmark.py (and siblings) implement load_existing_results() + processed-index skip → resumes from a JSON results file, but no formal checkpoint/state machine; only per-task idempotence keyed by index.

13. Last release / activity

  • Last commit: 2026-03-19 (v0.3.14).
  • Release cadence: roughly monthly through 2025–early 2026 (v0.3.10 Nov-2025 → v0.3.14 Mar-2026).
  • 21 open issues; 3 open security/concurrency bugs filed 2026-04-11 (RCE in executecode, RCE in z3_solver via prompt injection, MCTS race condition) — unaddressed since.

14. Notable

  • Proxy-first architecture is the defining trait — competes with a server, not a library. Approach selection via model-name prefix is clever but conflates routing with TTC.
  • Plugin system is loaded from optillm/plugins/*_plugin.py at startup.
  • MoA implementation = N critiques + aggregator; popularised the "moa-gpt-4o-mini matches GPT-4 on Arena-Hard-Auto" result.
  • Bundles agent/tool plumbing (MCP client, web search, code execution, memory, privacy/PII redaction) that are orthogonal to TTC scaling.
  • Open RCE issues in executecode and z3_solver worth flagging if cited as a deployment baseline.

@smirnovlad

smirnovlad commented May 2, 2026

Copy link
Copy Markdown
Collaborator Author

Deep Research — additional TTC frameworks (ChatGPT)

ChatGPT Deep Research surfaced 5 additional candidates worth considering for the comparison table, plus rulings on the ones I asked it to verify.


Candidates to add

1. Tree-of-Thoughts-LLM — princeton-nlp/tree-of-thought-llm

Field Value
Scope Implements the Tree-of-Thoughts algorithm for LLM reasoning (searching over intermediate "thoughts" with BFS/DFS).
Strategies Breadth-first and depth-first tree search
Scorers LLM-as-critic (value estimation) or majority-vote among generated thoughts
Online / Offline Offline (batch search)
Adaptive No
DeepConf-style steering No
Backends OpenAI GPT by default (also compatible with HF models or vLLM)
OpenAI gateway Yes
Visual debugger None
Checkpointing None
Last commit Jan 16, 2025
Last release v0.1.0 on Jul 6, 2023
Stars / License 5.9k / MIT

Why compare: canonical and actively-maintained Tree-of-Thoughts library, algorithmically distinct (explicit tree search) and highly cited in the LLM reasoning community.


2. Tree-of-Thoughts — kyegomez/tree-of-thoughts

Field Value
Scope A plug-and-play Python package for Tree-of-Thoughts search.
Strategies Currently depth-first search (DFS) through thought trees (BFS planned)
Scorers LLM-based quality evaluator with a threshold (prunes low-quality branches)
Online / Offline Offline
Adaptive No
DeepConf-style steering No
Backends Designed for OpenAI (GPT-4o, GPT-4 etc.)
OpenAI gateway Yes
Visual debugger No
Checkpointing None
Last commit Jul 29, 2025
Last release v0.3.6 on Jul 29, 2023
Stars / License 4.6k / Apache-2.0

Why compare: widely-used TOT library (4.6k★) with DFS search; complements Princeton's implementation and adds a user-friendly interface and parallelism.


3. LanguageAgentTreeSearch — lapisrocks/LanguageAgentTreeSearch

Field Value
Scope Framework for LLM agents performing MCTS across planning and reasoning (ICML 2024).
Strategies Monte Carlo Tree Search with learned policy/value (ALPaCA-like approach)
Scorers Policy and value networks for state expansion (plus task-specific heuristics)
Online / Offline Offline
Adaptive Yes (MCTS adjusts as search progresses)
DeepConf-style steering No
Backends Uses OpenAI GPT-4 for rollouts and evaluation
OpenAI gateway Yes
Visual debugger No
Checkpointing None
Last commit Jul 30, 2024
Last release None
Stars / License 832 / MIT

Why compare: implements LLM planning as MCTS (LATS), offering a distinct, RL-backed search approach not covered by simpler BoN or TOT methods.


4. TreeQuest — SakanaAI/treequest

Field Value
Scope General-purpose answer-tree search library designed for LLMs.
Strategies Adaptive batch MCTS (AB-MCTS-A and -M) over answer trees
Scorers User-defined scoring function per node (pluggable)
Online / Offline Offline
Adaptive Yes (batch sampling adapts tree expansion)
DeepConf-style steering No
Backends Agnostic — users supply state generator and scorer; can wrap any LLM calls in those
OpenAI gateway Yes (via wrapped generator)
Visual debugger No
Checkpointing None
Last commit Feb 5, 2026
Last release None
Stars / License 534 / Apache-2.0

Why compare: provides a flexible MCTS engine for LLM outputs (AB-MCTS variants), representing batch-parallel search; algorithmically distinct and actively maintained (recent commits).


5. TextGrad — zou-group/textgrad

Field Value
Scope LLM-driven text optimization framework (Nature 2024). Uses LLM "gradients" to iteratively refine text answers.
Strategies Gradient-descent over text space (an iterative search using LLM feedback as gradient signal)
Scorers Analytic loss (e.g. answer correctness) to guide updates
Online / Offline Offline
Adaptive No (non-stochastic gradient steps)
DeepConf-style steering No
Backends OpenAI GPT-4 / GPT-4o by default (can also use other LLM interfaces)
OpenAI gateway Yes
Visual debugger No
Checkpointing None
Last release v0.1.6 on Dec 15, 2024
Stars / License 3.5k / MIT

Why compare: introduces a novel continuous refinement approach (LLM-based "auto-diff") for test-time improvement, distinct from discrete sampling; highly popular (3.5k★) but less common in camera-ready tables, so offers a contrasting methodology.


Reviewed other candidates

Repo Ruling Reason
RUC-NLPIR/FlashRAG EXCLUDE RAG pipeline toolkit, not search/TTC
princeton-nlp/tree-of-thought-llm INCLUDE (handled above)
princeton-nlp/tree-of-thoughts-llm INCLUDE duplicate of above
kyegomez/tree-of-thoughts INCLUDE (handled above)
MARIO-Math-Reasoning/Super_MARIO (AlphaMath) EXCLUDE Domain-specific MCTS for math reasoning (RL+MCTS), not a general inference-time framework
MARIO-Math-Reasoning/MARIO_EVAL EXCLUDE Evaluation toolkit for math LLMs, not a TTC search framework
OpenBMB/UltraFeedback EXCLUDE Preference dataset, no TTC algorithms
RLHFlow/RLHF-Reward-Modeling EXCLUDE Reward-model training (RLHF), not inference-time scaling
Skywork-AI/skywork-o1-Open DOES NOT EXIST No public repository found; likely only a model release
MARIO-Math-Reasoning/AlphaMath DOES NOT EXIST AS CODE AlphaMath refers to model; code is in Super_MARIO (math-specific)
Beier1224/AlphaLLM EXCLUDE Experimental MCTS code without license; incomplete as framework
OpenSearchAI/tree-search-llm DOES NOT EXIST No such repository found
Reasoner / GenericReasoner DOES NOT EXIST No separate project aside from LLM Reasoners
microsoft/reasoning DOES NOT EXIST No matching public repo
openai/simple-evals EXCLUDE Evaluation harness, no search algorithms
stanfordnlp/dspy EXCLUDE Prompt-optimization framework (compile-time, not TTC algorithms like BoN/ToT)
SylphAI-Inc/AdalFlow EXCLUDE Agent-oriented LLM SDK for prompt tuning, not multi-try inference search
pat-jj/IRCoT EXCLUDE One-off project (Iterative Reasoning CoT), no framework layer
IBM/Sterling / IBM/granite-reasoning DOES NOT EXIST No relevant open repos
cornell-zhang/llm-test-time-scaling DOES NOT EXIST Likely only an arXiv survey

Final ranking

  1. Tree-of-Thoughts-LLM (princeton-nlp) — well-maintained (last commit Jan 2025), widely used (5.9k★) TOT library; unique tree-search approach not covered by existing columns.
  2. Tree-of-Thoughts (kyegomez) — very popular (4.6k★) user-friendly TOT implementation; complements Princeton's with DFS focus and pluggable design.
  3. LanguageAgentTreeSearch (LATS) — MCTS-based planner (832★) with policy/value networks; distinct RL-inspired algorithm.
  4. TreeQuest (SakanaAI) — flexible AB-MCTS library (534★) for answer-tree search; adds batch-MCTS methods absent from others.
  5. TextGrad — high-profile gradient-based refinement (3.5k★); offers a continuous optimization angle unlike the sample-based methods above.

Each chosen candidate is actively maintained, algorithmically distinct from the existing four frameworks, and has community recognition. The top picks provide complementary search strategies (tree search, MCTS, gradient search) to round out the comparison table.

@smirnovlad

smirnovlad commented May 2, 2026

Copy link
Copy Markdown
Collaborator Author

ThinkBooster — feature inventory (current branch)

Authoritative inventory built from reading the actual repo on camera-ready/acl2026. Used to lock the row axes for the comparison table.

Strategies (11 implemented)

# Strategy File Mode
1 Baseline thinkbooster/strategies/strategy_baseline.py offline
2 Chain-of-Thought thinkbooster/strategies/strategy_chain_of_thought.py offline
3 Best-of-N (offline) thinkbooster/strategies/strategy_offline_best_of_n.py offline
4 Best-of-N (online) thinkbooster/strategies/strategy_online_best_of_n.py online
5 Beam Search thinkbooster/strategies/strategy_beam_search.py online
6 Self-Consistency thinkbooster/strategies/strategy_self_consistency.py offline
7 Extended Thinking thinkbooster/strategies/strategy_extended_thinking.py online
8 Uncertainty-CoT thinkbooster/strategies/strategy_uncertainty_cot.py online, adaptive
9 MUR / Adaptive Best-of-N thinkbooster/strategies/adaptive_scaling_best_of_n.py + scale_discriminator.py online, adaptive
10 Phi-decoding thinkbooster/strategies/phi.py online
11 DeepConf (online + offline) thinkbooster/strategies/deepconf/strategy.py both

Scorers (6 families, all in thinkbooster/scorers/)

  • PRMstep_scorer_prm.py (vLLM + HF backends)
  • Uncertaintystep_scorer_uncertainty.py (wraps lm-polygraph estimators)
  • Confidencestep_scorer_confidence.py (validity-score based)
  • LLM-as-criticstep_scorer_llm_critic.py (Value + Vote modes, vLLM/OpenAI backends)
  • Majority votingmajority_voting.py (MajorityVotingScorer, ChainMajorityVotingScorer)
  • ReProbe / supervised step scorer (UHead) — installed via setup.sh (llm-uncertainty-head); used through the uncertainty-scorer pipeline

Aggregation knobs

mean / max / min / median / std / product + scoring_window (sliding-window int or "all"). Configured in strategy YAML (config/strategy/*.yaml).

Adaptive / uncertainty-driven

  • MUR discriminator (scale_discriminator.py) → AdaptiveScalingBestOfN
  • Uncertainty-CoT (token- or sequence-level branching)
  • Phi-decoding (foresight + clustering)

DeepConf

thinkbooster/strategies/deepconf/strategy.py — both online (token-level early-stopping via ConfidenceEarlyStopping) and offline (post-hoc rerank).

Backends — API + local

Local inference: vLLM (>=0.12 <0.13), HuggingFace Transformers (>=4.56). White-box (logprobs, hidden states, prefill) available.
Remote API: OpenAI / OpenRouter / any OpenAI-shape endpoint (thinkbooster/generators/api.py). Black-box only (logprobs optional via API).
Both pinned in pyproject.toml; backend selected via config/model/*.yaml.

OpenAI-compatible REST gateway / endpoint

service_app/main.py (FastAPI). Three URL shapes:

  • /v1/chat/completions (strategy + scorer in body)
  • /v1/{strategy}/chat/completions
  • /v1/{strategy}/{scorer}/chat/completions

Visual debugger

http://localhost:8001/debugger — three modes: main interface, step inspector, trajectory tree. React frontend served via FastAPI static files.

Compute–performance benchmarks

Real TFLOPs accounting in thinkbooster/utils/flops.py (simple Kaplan + precise architecture-aware via ModelArchitecture). Reported alongside accuracy + tokens to W&B.

Black-box vs white-box

  • Black-box: Baseline, CoT, Self-Consistency, Best-of-N (both), DeepConf-offline (logprobs only)
  • White-box: Beam Search, Extended Thinking, Uncertainty-CoT, MUR, Phi-decoding, DeepConf-online

Crash-resistant evaluation

scripts/run_tts_eval.py --resume; per-question index-based idempotence; checkpoint every checkpoint_batch_size (default 32).

Hidden-states extraction

scripts/utils/hook_hs_extension.py — vLLM v1 worker_extension_cls; register_forward_hook-based; works with enforce_eager=True; TP-aware (rank-0 only).

lm-polygraph integration

Pinned >=0.6.0. Estimators: MeanTokenEntropy, Perplexity, MaximumSequenceProbability, UHead (optional via setup.sh).

Config system

Hydra. Groups: dataset/, model/, generation/, strategy/, scorer/, evaluation/, system/. Run dir: ${save_dir}/eval/${date}/${timestamp} with .hydra/config.yaml snapshot.

Modular architecture

Clean separation across packages: strategies/ (search algorithms), scorers/ (verifiers), generators/ (vLLM/HF/API backends), step_boundary_detectors/ (pluggable thinking/non-thinking parsers), evaluation/ (parsers + judges), service_app/ (REST + debugger), utils/ (FLOPs, answer extraction). Strategies and scorers are independently swappable; new ones plug in via the registry pattern.

Distinctive extras

  • Pluggable step-boundary detectors (thinking vs. non-thinking)
  • Trajectory replay via StrategyProgressHandler
  • W&B logging of accuracy + tokens + TFLOPs + cost
  • Pluggable answer extraction (boxed, patterns, regex per dataset)

Proposed row axes for the comparison table

Locking these rows based on the inventory above (ThinkBooster ✓ on every row by design):

  1. Strategy taxonomy breadth (≥6 algorithmic families) — 11 strategies across BoN / beam-ToT / self-consistency / extended-thinking / adaptive-uncertainty / DeepConf / phi-decoding
  2. Online + offline modes as first-class
  3. Adaptive / uncertainty-driven scaling (MUR, phi-decoding, Uncertainty-CoT)
  4. Confidence-based steering (DeepConf online + offline)
  5. ≥3 scorer families (PRM / uncertainty / LLM-critic / majority — 4 in ThinkBooster)
  6. Support for ReProbe / supervised step scorer (UHead)
  7. Configurable score aggregation incl. sliding window (scoring_window)
  8. Real TFLOPs accounting (not token-only) for joint perf–compute benchmarks
  9. OpenAI-compatible REST gateway / endpoint with strategy+scorer routing in URL path
  10. Visual debugger with trajectory tree view
  11. Crash-resistant per-question resume / checkpointing
  12. Dual-backend support — both local inference (vLLM / HF) and remote API (OpenAI / OpenRouter / OpenAI-shape)
  13. Black-box compatibility for ≥5 strategies (works without logits / prefill / hidden states)
  14. Hidden-states extraction compatible with vLLM enforce_eager=True
  15. Modular architecture — strategies / scorers / generators / step-detectors independently swappable via registry
  16. Coverage of all major TTC methods to date — BoN, self-consistency, beam/ToT, extended thinking, MUR, phi-decoding, DeepConf, uncertainty-CoT

@smirnovlad

smirnovlad commented May 2, 2026

Copy link
Copy Markdown
Collaborator Author

Comparison table — locked row axes (v1, 10 rows)

After ranking and merging the inventory, these are the 10 row axes for the TTC framework comparison table. Six algorithmic & evaluation, four systems/UX. Next step: fill in cells per competitor.

Tier-1 — Algorithmic & Evaluation (6 rows)

# Row
A Strategy taxonomy breadth — number of algorithmic families / strategies shipped
B Scorer family breadth — ≥4 families: PRM / uncertainty / LLM-critic / supervised step scorer (UHead)
C Supports all TTC methods up to year — the most recent year for which the framework supports all major published TTC methods (cell value is a year, e.g. 2023 / 2024 / 2025 / 2026)
D Built on lm-polygraph for uncertainty / confidence signals — first-class access to a broad family of uncertainty estimators (MeanTokenEntropy, Perplexity, MaximumSequenceProbability, UHead, …) that any strategy or scorer can consume
E Joint performance–compute benchmarks — TFLOPs + tokens, not token-only
F Bundled benchmark suite — ships with 8 math datasets (AIME 2025, GSM8K, MATH-500, Minerva Math, OlympiadBench, GaoKao 2023 EN, ProofNet, Game-of-24) and 3 coding datasets (HumanEval+, MBPP+, KernelBench), each with dataset configs, prompts, answer extraction, and judging pre-wired

Tier-2 — Systems / UX (4 rows)

# Row
G Backend flexibility — local (vLLM / HF) + remote API (OpenAI / OpenRouter), with black-box-compatible strategies for each
H OpenAI-compatible REST gateway — drop-in for any OpenAI SDK; URL path encodes strategy and scorer (/v1/{strategy}/{scorer}/chat/completions)
I Visual debugger — interactive trajectory tree, step inspector, and replay over cached or custom inputs
J Modular architecture — orthogonal step_generator / step_scorer / strategy components, swappable via registry

Dropped from earlier drafts

  • Adaptive / uncertainty-driven scaling — too strategy-specific; replaced with row C "Supports all TTC methods up to year"
  • Crash-resistant evaluation pipeline (resume + checkpoints) — engineering hygiene, not a research differentiator
  • Configurable score aggregation incl. sliding window — sub-mechanism, not a top-level axis
  • Confidence-based steering (DeepConf online + offline) — too strategy-specific; captured under row C
  • Single-pass hidden-states extraction with vLLM — too implementation-specific to be a public-facing differentiator
  • Online + offline modes (first-class with shared abstractions) — not a strong enough standalone differentiator

Next step

Fill cells for each competitor (LLM Reasoners, search-and-learn, OpenR, OptiLLM + Deep Research candidates) using the per-framework analyses already in this PR. Symbols: ✓ supported · ✗ not supported · ◐ partial / limited. Row C takes a literal year value; row F takes counts (e.g., "0 math / 0 coding") for competitors.

@smirnovlad

smirnovlad commented May 2, 2026

Copy link
Copy Markdown
Collaborator Author

Comparison table — first-pass cell fill (v2)

Filled from the per-framework analyses already on this PR. Symbols: ✓ supported · ✗ not supported · ◐ partial / limited.

Changes from v1: 6 competitors only (dropped Tree-of-Thoughts-LLM, Tree-of-Thoughts-kyegomez, LATS — overlapping or single-paper); columns reordered "most-similar-in-scope first" with ThinkBooster leftmost; row F split into F1 (math) + F2 (coding) with bare counts; row A counts strategies strictly; row J downgraded for TreeQuest; row G now lists exact backends.

Comparison table

Feature ThinkBooster OptiLLM LLM Reasoners OpenR search-and-learn TreeQuest TextGrad
A. Strategy taxonomy breadth ✓ (11) ◐ (7) ◐ (5) ◐ (5) ✗ (3) ✗ (1) ✗ (1)
B. Scorer family breadth ✓ (4) ◐ (2) ◐ (2) ✗ (1) ✗ (1) ◐ (1) ◐ (1)
C. Supports all TTC methods up to year 2026 2026 2024 2024 2024 2026 2024
D. Built on lm-polygraph
E. Joint perf–compute benchmarks (TFLOPs + tokens)
F1. Bundled math benchmarks 8 0 0 2 1 0 0
F2. Bundled coding benchmarks 3 0 0 0 0 0 0
G. Backends supported vLLM + HF + API API + HF + MLX HF + SGLang + API vLLM + HF (FastChat orchestrated) vLLM only agnostic (user-supplied) API
H. OpenAI-compatible REST gateway
I. Visual debugger
J. Modular architecture

"API" = OpenAI-compatible HTTP API (OpenAI / Anthropic / Gemini / OpenRouter / any OpenAI-shape endpoint).

Per-cell rationale

  • A. Strategy breadth (strict count). Counted only genuine search/scoring/decoding-intervention strategies. OptiLLM 7 = mcts, rstar, plansearch, mars, bon, pvg, self_consistency (excludes prompt-engineering scaffolds like cot_reflection, re2, leap, rto). LLM Reasoners 5 = MCTS, BeamSearch, DFS, Greedy, Random (RAP/ToT are example pipelines on top of these). OpenR 5 = CoT, BoN, Beam, MCTS, rStar-MCTS. search-and-learn 3 = BoN, Beam, DVTS.
  • B. Scorer family breadth. OptiLLM ◐ (uncertainty-proxy + LLM-critic, no PRM). LLM Reasoners ◐ (PRM via one example + self-eval). OpenR ✗ (PRM + GenRM are both reward-model family — counts as 1). search-and-learn ✗ (PRM-only, hardcoded list of 5 instances, 1 family). TreeQuest ◐ (1: user-defined scoring fn). TextGrad ◐ (1: analytic loss).
  • C. Year. "Up to which year does the framework natively cover all major published TTC methods." OptiLLM ties at 2026 (DeepConf, AutoThink, ThinkDeeper). TreeQuest also 2026 via AB-MCTS + active maintenance.
  • E. TFLOPs. Only ThinkBooster reports actual TFLOPs. Token-counting frameworks → ◐. No compute tracking at all → ✗.
  • F1 / F2. Counts of bundled benchmark configs (dataset + prompt + answer extraction + judging all prewired). Eval scripts that download a dataset don't count. ThinkBooster math: AIME 2025, GSM8K, MATH-500, Minerva Math, OlympiadBench, GaoKao 2023 EN, ProofNet, Game-of-24. ThinkBooster coding: HumanEval+, MBPP+, KernelBench. OpenR math: GSM8K + MATH (envs/MATH/ with latex2sympy answer checking). search-and-learn math: MATH-500 only.
  • G. Backends. Cells list the inference backends each framework natively supports. OpenR's FastChat layer technically routes vLLM/HF traffic, but there is no native OpenAI-API client. search-and-learn hardcodes from vllm import LLM. TreeQuest is fully backend-agnostic — the user supplies the generator function.
  • H. REST gateway (OpenAI-shape). Verified OpenR README — FastChat infra is mentioned, but OpenR ships no /v1/chat/completions server itself. Users would have to launch FastChat's openai_api_server separately, which OpenR neither documents nor wires up. → ✗.
  • I. Visual debugger. Must be local + interactive. LLM Reasoners ◐ (hosted static TreeLog renderer, not interactive replay). OpenR ✗ (only a hosted ModelScope demo). Sergey's edit: I suggest using ◐ for OpenR for better communication.
  • J. Modular architecture. Must have orthogonal abstractions for both generator and scorer. TreeQuest dropped from ✓ to ◐ — abstraction is state-generator + scorer + AB-MCTS, not step_generator / step_scorer / strategy. ThinkBooster's modularity is the only one that decomposes into our three orthogonal axes.

Open

Cells are mine; column owners (Vlad / Quang / Sergey / Artem S) should verify their assigned framework before camera-ready lock.

@ssenichev

ssenichev commented May 8, 2026

Copy link
Copy Markdown
Collaborator
  • Sergey's edit: I suggest using ◐ for OpenR for better communication.

I've reviewed OpenR and applied minor fixes for column I about demo and column G on the matter of backends.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants