question regarding the efficiency comparison

Hi, thanks for the interesting work! I enjoyed reading the paper and the clean codebase.

I have a question regarding the efficiency comparison. Since DMLR is fundamentally a test-time optimization method, each example requires an iterative RL loop (max_rl_steps forward passes with output_attentions=True, plus a reward computation per step). From the code, the default configuration runs ~20 optimization steps per example before the final model.generate() call. This means the total compute per example is substantially higher than for both vanilla inference and standard CoT, even though the output is shorter.
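To make the concern concrete, here is a rough back-of-the-envelope sketch. It is not taken from the repo: the function name and all token counts are placeholders I made up, and it assumes each optimization step costs about one full forward pass over the prompt while each generated token costs one position with a KV cache. It ignores any backward pass, the reward computation, and the extra cost of output_attentions=True, so if anything it likely underestimates the optimization overhead.

```python
def token_positions_processed(prompt_tokens: int, output_tokens: int, rl_steps: int = 0) -> int:
    """Crude compute proxy: number of token positions pushed through the model.

    Assumes (hypothetically) that each optimization step re-processes the full
    prompt once, and that decoding uses a KV cache (one position per new token).
    """
    optimization = rl_steps * prompt_tokens  # test-time optimization loop
    prefill = prompt_tokens                  # prompt pass inside generate()
    decode = output_tokens                   # one position per generated token
    return optimization + prefill + decode


# Placeholder sizes chosen only for illustration, not measured values.
PROMPT = 1000       # e.g. image + text tokens
SHORT_ANSWER = 50   # short output (vanilla, or after optimization)
LONG_COT = 300      # longer chain-of-thought output

vanilla = token_positions_processed(PROMPT, SHORT_ANSWER)
cot = token_positions_processed(PROMPT, LONG_COT)
optimized = token_positions_processed(PROMPT, SHORT_ANSWER, rl_steps=20)

print(f"vanilla:   {vanilla}")
print(f"CoT:       {cot}")
print(f"optimized: {optimized} ({optimized / vanilla:.1f}x vanilla)")
```

Under these made-up numbers the optimization loop dominates the total, which is why I am unsure how a shorter output alone would translate into better efficiency.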

Could you clarify what exactly the "Efficiency" metric on the x-axis of Fig. 11 (in the arXiv preprint) measures?

Thanks again for the great work!
