ACL 2026 camera-ready: reviewer feedback & action items #251
Verbatim copy of the OpenReview decision, meta review, and three official reviews for ThinkBooster (Submission #191). Adds a synthesised camera-ready punch list at the bottom that consolidates the action items raised across the meta review and reviewers paXg / yXVQ / 9UwS. Lives under docs/paper/ so the camera-ready work has a single source of truth and can be tracked alongside code changes.
Paper Decision
Decision: Accept

Meta Review of Submission 191 by Area Chair
|
Reviewer paXg — Overall: 6 (Marginally above acceptance threshold)
Summary

This paper proposes ThinkBooster, a unified framework for test-time compute (TTC) scaling of LLM reasoning. It addresses two key gaps in current TTC scaling research: (1) the lack of standardized evaluation that jointly considers both performance and compute efficiency, and (2) the absence of a unified implementation across diverse TTC scaling methods. ThinkBooster provides a modular Python library implementing common TTC scaling strategies and scorers, along with a benchmark that evaluates both task performance and compute efficiency (in TFLOPs and token cost).

Review

Overall Evaluation

ThinkBooster presents a well-engineered and practically useful unified framework for test-time compute scaling of LLM reasoning. However, the paper lacks sufficient comparison with existing TTC scaling frameworks and the evaluation is limited to open-source models without wall-clock latency analysis, and the live demo is also inaccessible during review.

Pros

- Good system design and practical engineering contribution. ThinkBooster cleanly separates different TTC modules and supports both swift implementation with OpenAI compatible SDK, making it practical for both research and production use.
- Performance–compute efficiency benchmarking. The framework provides a comprehensive and systematic evaluation of existing TTC scaling methods, jointly measuring task performance and compute efficiency.
- Clean open-source artifacts. The demo video and code repository appear well-organized and of good quality, which lowers the barrier for adoption and implementation.

Cons

- Insufficient comparison with existing frameworks. Several TTC scaling frameworks already exist, e.g., LLM Reasoners (MCTS, BFS/DFS, beam search with multiple backends), HuggingFace's search-and-learn (Best-of-N, beam search, DVTS), OpenR (Best-of-N, beam search, MCTS with PRM), and OptiLLM (an OpenAI-compatible proxy with 20+ techniques). The paper lacks a systematic comparison with these alternatives in terms of strategy coverage, scorer diversity, and deployment design, making it difficult to assess ThinkBooster's unique positioning.
- Only open-source models evaluated. All tested models (Qwen2.5-Math-7B, Qwen3-8B, GPT-OSS-120B) are open-source. Commercial models, which are common in production settings, are not evaluated.
- Efficiency measured only in tokens/TFLOPs, lacking wall-clock time. While reporting token counts and TFLOPs is reasonable, it is difficult to assess practical feasibility from these numbers alone. E.g., a 10x or 100x compute ratio does not tell practitioners whether the actual latency is acceptable. Supplementing with wall-clock execution time under specific hardware configurations would make the efficiency analysis more informative.
- Demo link is inaccessible. The demo URL provided in the footnotes (

Reasons To Accept

ThinkBooster offers a well-designed, modular framework that unifies fragmented TTC scaling implementations behind a consistent API, filling a practical gap for both researchers and practitioners. The OpenAI-compatible endpoint gateway enables drop-in adoption with minimal friction. The joint performance–compute benchmark sets a good precedent for the field by systematically evaluating efficiency alongside accuracy.

Rating: 6: Marginally above acceptance threshold

Reasons To Reject

The paper lacks comparison with several existing TTC scaling frameworks (LLM Reasoners, search-and-learn, OpenR, OptiLLM), making it difficult to assess its unique contributions. The evaluation is limited to open-source models with no commercial model experiments, and efficiency is reported only in theoretical TFLOPs without wall-clock latency. The live demo is inaccessible during review, which is a significant concern for a demo track submission.

Questions And Additional Feedback

Questions
Form fields
|
Reviewer yXVQ — Overall: 6 (Marginally above acceptance threshold)
Summary

This paper presents THINKBOOSTER, a unified framework for test-time compute (TTC) scaling of LLM reasoning. It addresses the problem that existing TTC scaling strategies—like best-of-N, tree-of-thought, and self-consistency—are fragmented, evaluated inconsistently, and rarely consider compute-efficiency trade-offs. THINKBOOSTER provides a modular Python library with state-of-the-art scaling strategies and scorers, a benchmark for joint performance–compute evaluation, an OpenAI-compatible endpoint gateway for real-world deployment, and a visual debugger for inspecting reasoning trajectories. Its main contributions are enabling seamless test-time scaling, principled evaluation of performance–compute trade-offs, and practical integration into applications like coding assistants and mathematical problem solvers, demonstrating improved accuracy and efficiency on math, scientific, and programming benchmarks.

Review

Review of "THINKBOOSTER: A Unified Framework for Test‑Time Compute Scaling of LLM Reasoning"

Summary

The paper introduces THINKBOOSTER, a modular library and evaluation framework for test-time compute (TTC) scaling of large‑language‑model reasoning. It addresses limitations in existing TTC strategies (e.g., best‑of‑N, self‑consistency, tree‑of‑thought) by unifying them under a common API, providing principled performance–compute benchmarking, real‑world endpoints, and tools for debugging reasoning trajectories. The work demonstrates improved accuracy and more efficient compute use on math, science, and coding benchmarks.

Evaluation

Quality. The implementation appears engineered with practical utility in mind, offering a standard interface to plug in diverse scaling strategies and scorers. The benchmarking setup considers both reward and computational cost, which is important for realistic evaluation. However, the quality of empirical validation depends on the breadth and depth of benchmarks chosen (details not in summary).
The paper defines relevant metrics and trade‑offs, e.g.:
which formalizes how added compute translates into performance improvement.

Clarity. The presentation is generally clear, with modular descriptions of components (strategy, scorer, gateway, debugger). Architectural diagrams and code examples (if included) likely help clarify usage. One area that may benefit from more explicit description is how scalability and cost are measured consistently across different strategies.

Originality. The work's main novelty lies in systematizing disparate TTC strategies into a unified framework with common interfaces and joint performance–compute evaluation. While existing work explores many scaling strategies in isolation, THINKBOOSTER's contribution is in engineering unification and tooling, rather than new algorithmic scaling methods per se.

Significance. The framework has practical significance for practitioners and researchers needing to compare and deploy TTC scaling methods. By exposing a shared API and benchmarking tools, THINKBOOSTER can foster more comparable research and real‑world adoption of compute‑aware reasoning strategies. Its integration with production endpoints and a visual debugger also aids interpretability and deployment.

Pros
Cons
Overall

THINKBOOSTER presents a practical and well‑engineered framework that consolidates test‑time compute scaling approaches under a coherent API and evaluation paradigm. It is clear, original in its system perspective, and significant for reproducible and comparable reasoning research. The work's strengths lie in its tooling and benchmarking infrastructure, which can support both research and deployment. However, its contribution is largely in unification and tooling rather than novel algorithms.

Recommendation: Accept — the contribution is valuable for the NLP community's shift toward compute‑aware, benchmarkable reasoning.

Reasons To Accept

The paper's strengths are its unified, modular framework for test-time compute (TTC) scaling, standardized benchmarking of performance–compute trade-offs, and practical deployment through an OpenAI-compatible endpoint and visual debugger. Presenting it benefits the NLP community by providing a reproducible, comparable, and developer-friendly system for evaluating and deploying adaptive LLM reasoning strategies, enabling more systematic research and practical adoption of compute-aware LLM reasoning methods.

Rating: 6: Marginally above acceptance threshold

Reasons To Reject

Weaknesses of the paper include its focus on systems integration and benchmarking rather than novel TTC algorithms, reliance on specific LLMs and benchmark datasets (math, science, coding), and white-box strategy requirements that may not generalize to all hosted LLMs. Presenting it may risk overstating generalizability and real-world impact, as performance–compute trade-offs could vary for other models, domains, or deployment environments.

Questions And Additional Feedback
Form fields
|
Reviewer 9UwS — Overall: 6 (Marginally above acceptance threshold)
Summary

This paper presents a modular and practical framework for improving LLM reasoning by allocating more computation at inference time. The paper tackles the current fragmentation in test-time scaling methods by offering a unified system that supports multiple reasoning strategies, scoring mechanisms, and evaluation tools.

Review

See "Reasons To Accept" and "Reasons To Reject"

Reasons To Accept

- Practical and easy to integrate: The OpenAI-compatible gateway is a strong design choice because it makes the framework easy to adopt in existing applications without major engineering changes.
- Well-rounded framework (comprehensive for TTS methods): The paper does a good job of combining methods, deployment, and benchmarking in one system, which makes it more complete and useful than a narrowly scoped demo.
- Strong emphasis on transparency: The visual debugger is especially appealing, as it gives users insight into reasoning trajectories and makes the system more interpretable and easier to analyze.

Rating: 6: Marginally above acceptance threshold

Reasons To Reject

- Evaluation scope seems somewhat narrow: The current focus on domains like math and coding is promising, but it would be even stronger to see broader validation on other tasks.
- Another line of scaling is "critique" based methods [1, 2]. It is better to cover this kind of TTS as well.

[1] Training Language Models to Self-Correct via Reinforcement Learning. ICLR 2025
[2] Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards. NeurIPS 2025

Questions And Additional Feedback

No

Form fields
|
Schedules an HTTP probe of http://demo-thinkbooster.nlpresearch.group/ every 10 min (with manual workflow_dispatch). On failure (3 attempts with 5 s back-off), it opens a labelled `demo-down` issue or appends a "still down" comment if one is already open. On recovery it adds a recovery comment and closes the issue. Addresses the camera-ready ask raised by Reviewer paXg / the Area Chair (Submission #191): the demo URL was unreachable during ACL 2026 review. This gives us a public audit trail of demo uptime alongside the paper. Pair with an external monitor (UptimeRobot or similar) to cover GitHub Actions cron drift and Actions outages — covered separately.
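The retry and issue-handling behaviour described above can be sketched in Python. This is only an illustration of the logic, not the workflow's actual implementation: the function names `fetch_status`, `probe`, and `next_action` and the action labels are hypothetical, and treating any 2xx/3xx status as "up" is an assumption.

```python
import time
import urllib.error
import urllib.request

DEMO_URL = "http://demo-thinkbooster.nlpresearch.group/"


def fetch_status(url: str, timeout_s: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, or 0 on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def probe(url, attempts=3, backoff_s=5.0, fetch=fetch_status, sleep=time.sleep):
    """Return True if any of `attempts` tries succeeds, waiting between tries."""
    for attempt in range(1, attempts + 1):
        if 200 <= fetch(url) < 400:  # assumption: 2xx/3xx counts as "up"
            return True
        if attempt < attempts:
            sleep(backoff_s)
    return False


def next_action(is_up: bool, issue_open: bool) -> str:
    """Map probe result and tracker state to a (hypothetical) action label."""
    if not is_up:
        return "comment_still_down" if issue_open else "open_issue"
    return "comment_and_close" if issue_open else "noop"
```

Calling `next_action(probe(DEMO_URL), issue_open=False)` would yield `"open_issue"` when all three attempts fail. `fetch` and `sleep` are injectable so the logic can be exercised without network access or real delays.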
Addresses Reviewer paXg / Area Chair ask for a comparison against existing TTC scaling frameworks. Drafts:
- 12 differentiating feature axes (strategy/scoring/deployment/eval)
- 4 competitor columns: LLM Reasoners, search-and-learn, OpenR, OptiLLM
- First-pass cells with ?@<owner> markers for verification by: Vlad (LLM Reasoners), Quang (search-and-learn), Sergey (OpenR), Artem S (OptiLLM)
- LaTeX booktabs version drop-in for the paper
- Open questions for camera-ready decisions
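To track which first-pass cells still need sign-off, the `?@<owner>` markers can be collected with a short script. The sketch below assumes the marker is a literal `?@` followed by an owner handle without spaces. The handle syntax, the function name `pending_by_owner`, and the sample rows are illustrative assumptions, not the draft's actual contents.

```python
import re
from collections import defaultdict

# Assumption: a marker looks like "?@vlad" (no spaces in the handle).
MARKER = re.compile(r"\?@([A-Za-z0-9_-]+)")


def pending_by_owner(table_markdown: str) -> dict[str, list[int]]:
    """Return {owner: [1-based line numbers]} for every unverified cell marker."""
    pending = defaultdict(list)
    for lineno, line in enumerate(table_markdown.splitlines(), start=1):
        for owner in MARKER.findall(line):
            pending[owner].append(lineno)
    return dict(pending)


# Placeholder rows for illustration only, not the real comparison table.
draft = """| Axis | Framework A | Framework B |
|---|---|---|
| Row X | yes ?@vlad | no |
| Row Y | partial ?@vlad | yes ?@quang |"""

print(pending_by_owner(draft))
```

An empty result would indicate that every cell has been verified and the table can be locked.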
TTC framework comparison — repos under analysis

Reviewer paXg / Area Chair asked for a comparison table against existing TTC scaling libraries. I'm drafting one in

The four frameworks named by the reviewer (and the repos I'm reading to fill the table):
Once the draft fill is in, the table assignments per the offline plan are:
Posting this first so the source-of-truth repos are pinned before reviewers are asked to validate. |
LLM Reasoners (https://github.com/maitrix-org/llm-reasoners)
|
search-and-learn (https://github.com/huggingface/search-and-learn)

1. Scope. Inference-time-compute scaling toolkit ("recipes to enhance LLM capabilities by scaling inference-time compute") focused on verifier-guided search for math reasoning (MATH-500), accompanying the HF blog post replicating Snell et al. 2024.
2. Strategies / search algorithms (3 total).
3. Scorers / PRMs. Five first-class PRMs in
4. Online vs offline. Offline rerank / batched search only. Pipeline =
5. Adaptive / uncertainty-driven scaling. No. Compute is fixed by
6. Confidence-based steering (DeepConf-style). No. No confidence-gated branching or pruning mechanism.
7. Aggregation knobs. Yes, but minimal:
8. Backend(s). vLLM only (hard-coded
9. REST gateway / OpenAI-compatible server. No.
10. Visual debugger / trajectory inspector. No. No notebook UI, dashboard, or trajectory-rendering tool. Outputs are JSONL + HF dataset branches.
11. Joint compute–performance benchmarks. Token-only.
12. Crash-resistant evaluation pipeline. Partial, coarse-grained only. No per-problem checkpointing or resume. Crash-resistance is achieved via Slurm array sharding (
13. Last release / activity.
14. Anything notable.
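For reference when filling the crash-resistance axis, "per-problem checkpointing" is taken here to mean that each solved problem is persisted immediately and skipped on restart. The following is a minimal sketch under that assumption. The JSONL layout, the `id` field, and the name `run_with_checkpoint` are hypothetical and do not describe search-and-learn or ThinkBooster internals.

```python
import json
from pathlib import Path


def run_with_checkpoint(problems, solve, out_path) -> int:
    """Solve each problem once, appending results to a JSONL file.

    Problems whose `id` already appears in the file are skipped, so a
    restarted run resumes where the previous one stopped. Returns the
    number of problems solved in this call.
    """
    out_path = Path(out_path)
    done = set()
    needs_newline = False
    if out_path.exists():
        text = out_path.read_text()
        needs_newline = bool(text) and not text.endswith("\n")
        for line in text.splitlines():
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # ignore a record truncated by a crash

    solved = 0
    with out_path.open("a") as f:
        if needs_newline:
            f.write("\n")  # keep a truncated record from merging with the next
        for problem in problems:
            if problem["id"] in done:
                continue
            result = solve(problem)
            f.write(json.dumps({"id": problem["id"], "result": result}) + "\n")
            f.flush()  # persist each record before starting the next problem
            solved += 1
    return solved
```

Sharding by job array, in contrast, only bounds the amount of work lost to one shard. A crashed shard still restarts from its first problem.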
|
OpenR (https://github.com/openreasoner/openr)
|
OptiLLM (https://github.com/codelion/optillm)

Repo:

1. Scope

An OpenAI-API-compatible inference proxy (Flask server exposing

2. Strategies / techniques (~20+ "approaches" + ~16 plugins, ~36 total)
3. Scorers
4. Online vs offline

Both, but mostly offline aggregation. Most approaches generate full samples and rerank/vote (

5. Adaptive / uncertainty-driven scaling

Partial.

6. Confidence-based steering (DeepConf-style)

Yes —

7. Aggregation knobs

Limited / not configurable as mean/median/min/max/last. Aggregation is per-approach: majority vote (

8. Backends

OpenAI-compatible by design. Supports OpenAI, Azure OpenAI (incl. Managed Identity), Cerebras, plus LiteLLM passthrough (Anthropic, Gemini, etc.). Local: HuggingFace + PEFT/LoRA via

9. REST gateway / OpenAI-compatible server

Yes, that is the entire product. Flask app at

10. Visual debugger / trajectory inspector

No. Only

11. Joint compute–performance benchmarks

Token-only, no FLOPs. AutoThink table reports avg tokens vs accuracy on GPQA-Diamond / MMLU-Pro; DeepConf advertises 50–70% token reduction. No TFLOPs reported anywhere in README.

12. Crash-resistant evaluation pipeline

Partial.

13. Last release / activity
14. Notable
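For readers checking the "aggregation knobs" axis, the sketch below shows what a configurable mean/median/min/max/last knob over per-step scores could look like, and how the choice changes which candidate a best-of-N selection picks. It is an illustration only: `aggregate`, `best_of_n`, and the scores are hypothetical and are not taken from OptiLLM or ThinkBooster.

```python
from statistics import mean, median

# One reducer per knob value named in the comparison axis.
AGGREGATORS = {
    "mean": mean,
    "median": median,
    "min": min,
    "max": max,
    "last": lambda scores: scores[-1],
}


def aggregate(step_scores, how="last"):
    """Collapse per-step scores of one reasoning trajectory into one number."""
    if not step_scores:
        raise ValueError("step_scores must be non-empty")
    return AGGREGATORS[how](step_scores)


def best_of_n(candidates, how="min"):
    """Pick the answer whose aggregated trajectory score is highest.

    `candidates` is a list of (answer, step_scores) pairs.
    """
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]


# Made-up scores: A has one weak middle step, B is uniformly mediocre.
candidates = [("A", [0.9, 0.2, 0.8]), ("B", [0.6, 0.6, 0.6])]
print(best_of_n(candidates, how="min"))   # penalises the weakest step
print(best_of_n(candidates, how="last"))  # looks only at the final step
```

With these scores, `min` prefers B while `last` prefers A, which is why the table treats the knob as a distinguishing feature.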
|
Deep Research — additional TTC frameworks (ChatGPT)

ChatGPT Deep Research surfaced 5 additional candidates worth considering for the comparison table, plus rulings on the ones I asked it to verify.

Candidates to add

1. Tree-of-Thoughts-LLM —
|
| Field | Value |
|---|---|
| Scope | Implements the Tree-of-Thoughts algorithm for LLM reasoning (searching over intermediate "thoughts" with BFS/DFS). |
| Strategies | Breadth-first and depth-first tree search |
| Scorers | LLM-as-critic (value estimation) or majority-vote among generated thoughts |
| Online / Offline | Offline (batch search) |
| Adaptive | No |
| DeepConf-style steering | No |
| Backends | OpenAI GPT by default (also compatible with HF models or vLLM) |
| OpenAI gateway | Yes |
| Visual debugger | None |
| Checkpointing | None |
| Last commit | Jan 16, 2025 |
| Last release | v0.1.0 on Jul 6, 2023 |
| Stars / License | 5.9k / MIT |
Why compare: canonical and actively-maintained Tree-of-Thoughts library, algorithmically distinct (explicit tree search) and highly cited in the LLM reasoning community.
2. Tree-of-Thoughts — kyegomez/tree-of-thoughts
| Field | Value |
|---|---|
| Scope | A plug-and-play Python package for Tree-of-Thoughts search. |
| Strategies | Currently depth-first search (DFS) through thought trees (BFS planned) |
| Scorers | LLM-based quality evaluator with a threshold (prunes low-quality branches) |
| Online / Offline | Offline |
| Adaptive | No |
| DeepConf-style steering | No |
| Backends | Designed for OpenAI (GPT-4o, GPT-4 etc.) |
| OpenAI gateway | Yes |
| Visual debugger | No |
| Checkpointing | None |
| Last commit | Jul 29, 2025 |
| Last release | v0.3.6 on Jul 29, 2023 |
| Stars / License | 4.6k / Apache-2.0 |
Why compare: widely-used TOT library (4.6k★) with DFS search; complements Princeton's implementation and adds a user-friendly interface and parallelism.
3. LanguageAgentTreeSearch — lapisrocks/LanguageAgentTreeSearch
| Field | Value |
|---|---|
| Scope | Framework for LLM agents performing MCTS across planning and reasoning (ICML 2024). |
| Strategies | Monte Carlo Tree Search with learned policy/value (ALPaCA-like approach) |
| Scorers | Policy and value networks for state expansion (plus task-specific heuristics) |
| Online / Offline | Offline |
| Adaptive | Yes (MCTS adjusts as search progresses) |
| DeepConf-style steering | No |
| Backends | Uses OpenAI GPT-4 for rollouts and evaluation |
| OpenAI gateway | Yes |
| Visual debugger | No |
| Checkpointing | None |
| Last commit | Jul 30, 2024 |
| Last release | None |
| Stars / License | 832 / MIT |
Why compare: implements LLM planning as MCTS (LATS), offering a distinct, RL-backed search approach not covered by simpler BoN or TOT methods.
4. TreeQuest — SakanaAI/treequest
| Field | Value |
|---|---|
| Scope | General-purpose answer-tree search library designed for LLMs. |
| Strategies | Adaptive batch MCTS (AB-MCTS-A and -M) over answer trees |
| Scorers | User-defined scoring function per node (pluggable) |
| Online / Offline | Offline |
| Adaptive | Yes (batch sampling adapts tree expansion) |
| DeepConf-style steering | No |
| Backends | Agnostic — users supply state generator and scorer; can wrap any LLM calls in those |
| OpenAI gateway | Yes (via wrapped generator) |
| Visual debugger | No |
| Checkpointing | None |
| Last commit | Feb 5, 2026 |
| Last release | None |
| Stars / License | 534 / Apache-2.0 |
Why compare: provides a flexible MCTS engine for LLM outputs (AB-MCTS variants), representing batch-parallel search; algorithmically distinct and actively maintained (recent commits).
5. TextGrad — zou-group/textgrad
| Field | Value |
|---|---|
| Scope | LLM-driven text optimization framework (Nature 2024). Uses LLM "gradients" to iteratively refine text answers. |
| Strategies | Gradient-descent over text space (an iterative search using LLM feedback as gradient signal) |
| Scorers | Analytic loss (e.g. answer correctness) to guide updates |
| Online / Offline | Offline |
| Adaptive | No (non-stochastic gradient steps) |
| DeepConf-style steering | No |
| Backends | OpenAI GPT-4 / GPT-4o by default (can also use other LLM interfaces) |
| OpenAI gateway | Yes |
| Visual debugger | No |
| Checkpointing | None |
| Last release | v0.1.6 on Dec 15, 2024 |
| Stars / License | 3.5k / MIT |
Why compare: introduces a novel continuous refinement approach (LLM-based "auto-diff") for test-time improvement, distinct from discrete sampling; highly popular (3.5k★) but less common in camera-ready tables, so offers a contrasting methodology.
Reviewed other candidates
| Repo | Ruling | Reason |
|---|---|---|
| RUC-NLPIR/FlashRAG | EXCLUDE | RAG pipeline toolkit, not search/TTC |
| princeton-nlp/tree-of-thought-llm | INCLUDE | (handled above) |
| princeton-nlp/tree-of-thoughts-llm | INCLUDE | duplicate of above |
| kyegomez/tree-of-thoughts | INCLUDE | (handled above) |
| MARIO-Math-Reasoning/Super_MARIO (AlphaMath) | EXCLUDE | Domain-specific MCTS for math reasoning (RL+MCTS), not a general inference-time framework |
| MARIO-Math-Reasoning/MARIO_EVAL | EXCLUDE | Evaluation toolkit for math LLMs, not a TTC search framework |
| OpenBMB/UltraFeedback | EXCLUDE | Preference dataset, no TTC algorithms |
| RLHFlow/RLHF-Reward-Modeling | EXCLUDE | Reward-model training (RLHF), not inference-time scaling |
| Skywork-AI/skywork-o1-Open | DOES NOT EXIST | No public repository found; likely only a model release |
| MARIO-Math-Reasoning/AlphaMath | DOES NOT EXIST AS CODE | AlphaMath refers to model; code is in Super_MARIO (math-specific) |
| Beier1224/AlphaLLM | EXCLUDE | Experimental MCTS code without license; incomplete as framework |
| OpenSearchAI/tree-search-llm | DOES NOT EXIST | No such repository found |
| Reasoner / GenericReasoner | DOES NOT EXIST | No separate project aside from LLM Reasoners |
| microsoft/reasoning | DOES NOT EXIST | No matching public repo |
| openai/simple-evals | EXCLUDE | Evaluation harness, no search algorithms |
| stanfordnlp/dspy | EXCLUDE | Prompt-optimization framework (compile-time, not TTC algorithms like BoN/ToT) |
| SylphAI-Inc/AdalFlow | EXCLUDE | Agent-oriented LLM SDK for prompt tuning, not multi-try inference search |
| pat-jj/IRCoT | EXCLUDE | One-off project (Iterative Reasoning CoT), no framework layer |
| IBM/Sterling / IBM/granite-reasoning | DOES NOT EXIST | No relevant open repos |
| cornell-zhang/llm-test-time-scaling | DOES NOT EXIST | Likely only an arXiv survey |
Final ranking
- Tree-of-Thoughts-LLM (princeton-nlp) — well-maintained (last commit Jan 2025), widely used (5.9k★) TOT library; unique tree-search approach not covered by existing columns.
- Tree-of-Thoughts (kyegomez) — very popular (4.6k★) user-friendly TOT implementation; complements Princeton's with DFS focus and pluggable design.
- LanguageAgentTreeSearch (LATS) — MCTS-based planner (832★) with policy/value networks; distinct RL-inspired algorithm.
- TreeQuest (SakanaAI) — flexible AB-MCTS library (534★) for answer-tree search; adds batch-MCTS methods absent from others.
- TextGrad — high-profile gradient-based refinement (3.5k★); offers a continuous optimization angle unlike the sample-based methods above.
Each chosen candidate is actively maintained, algorithmically distinct from the existing four frameworks, and has community recognition. The top picks provide complementary search strategies (tree search, MCTS, gradient search) to round out the comparison table.
ThinkBooster — feature inventory (current branch)

Authoritative inventory built from reading the actual repo on

Strategies (11 implemented)
Scorers (6 families, all in
|
Comparison table — locked row axes (v1, 10 rows)

After ranking and merging the inventory, these are the 10 row axes for the TTC framework comparison table. Six algorithmic & evaluation, four systems/UX. Next step: fill in cells per competitor.

Tier-1 — Algorithmic & Evaluation (6 rows)
Tier-2 — Systems / UX (4 rows)
Dropped from earlier drafts
Next step

Fill cells for each competitor (LLM Reasoners, search-and-learn, OpenR, OptiLLM + Deep Research candidates) using the per-framework analyses already in this PR. Symbols: ✓ supported · ✗ not supported · ◐ partial / limited. Row C takes a literal year value; row F takes counts (e.g., "0 math / 0 coding") for competitors.
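Once cells are filled, the ✓ / ✗ / ◐ symbols have to be turned into LaTeX for the booktabs version. The sketch below shows one way to generate the `tabular` body from a plain Python structure. The symbol-to-macro mapping (`\checkmark` needs `amssymb`, and `$\sim$` for "partial" is an arbitrary stand-in), the function name `to_booktabs`, and the sample rows are assumptions for illustration, not the actual table.

```python
# Assumed mapping from cell symbols to LaTeX; literal values pass through.
LATEX_SYMBOL = {"✓": r"\checkmark", "✗": r"$\times$", "◐": r"$\sim$"}


def to_booktabs(columns, rows):
    """Render (row_name, cells) pairs as a booktabs tabular environment."""
    lines = [
        r"\begin{tabular}{l" + "c" * len(columns) + "}",
        r"\toprule",
        " & ".join(["Feature"] + list(columns)) + r" \\",
        r"\midrule",
    ]
    for name, cells in rows:
        if len(cells) != len(columns):
            raise ValueError(f"row {name!r} has {len(cells)} cells, expected {len(columns)}")
        rendered = [LATEX_SYMBOL.get(cell, cell) for cell in cells]
        lines.append(" & ".join([name] + rendered) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)


# Placeholder data only; real rows come from the verified comparison table.
print(to_booktabs(
    ["Framework A", "Framework B"],
    [("Row X", ["✓", "◐"]), ("Row Y", ["2024", "✗"])],
))
```

Literal cells such as a year or a count are emitted unchanged, which matches the rule that row C takes a year and row F takes counts.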
Comparison table — first-pass cell fill (v2)

Filled from the per-framework analyses already on this PR. Symbols: ✓ supported · ✗ not supported · ◐ partial / limited.

Changes from v1:
- 6 competitors only (dropped Tree-of-Thoughts-LLM, Tree-of-Thoughts-kyegomez, LATS — overlapping or single-paper)
- columns reordered "most-similar-in-scope first" with ThinkBooster leftmost
- row F split into F1 (math) + F2 (coding) with bare counts
- row A counts strategies strictly
- row J downgraded for TreeQuest
- row G now lists exact backends

Comparison table
"API" = OpenAI-compatible HTTP API (OpenAI / Anthropic / Gemini / OpenRouter / any OpenAI-shape endpoint).

Per-cell rationale
Open

Cells are mine; column owners (Vlad / Quang / Sergey / Artem S) should verify their assigned framework before camera-ready lock.
I've reviewed the OpenR column and applied minor fixes to row I (demo) and row G (backends).
Context
ThinkBooster (Submission #191) was accepted at ACL 2026 (decision 24 Apr 2026, Program Chairs).
This PR is a tracking branch for the camera-ready work. The full verbatim decision, meta review, and three reviewer comments are committed under docs/paper/acl2026_reviews.md. Each review is also posted as a separate PR comment below so individual asks can be discussed and resolved in their own threads.

Camera-ready punch list
Synthesised from the meta review + three reviewers (paXg, yXVQ, 9UwS). Owner / target commit can be assigned per item as we go.
Required (called out by Area Chair)
http://demo-thinkbooster.nlpresearch.group (paXg, meta)

Optional / nice-to-have
How to use this PR
docs/paper/acl2026_reviews.md) is the source of truth.