Reproduction Divergence on FEVER, PDDL, and SciWorld Benchmarks (with Qwen2.5-14B-Instruct) #25

@ioir123ju

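With the empty (no-memory) baseline, my success rates on FEVER and PDDL are higher than those reported in the paper (67.00% vs. 57.1% and 36.67% vs. 23.53%). On SciWorld, both the empty baseline and g-memory are far lower than in the paper (25.56% vs. 54.49% and 42.22% vs. 67.4%). All runs use Qwen2.5-14B-Instruct.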

ALFWorld

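| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 5 | 134 | 100 | 74.63% | - | 40,789 | 4,143,280 | No-memory baseline |
| autogen | g-memory | 5 | 134 | 113 | 84.33% | 85.82% | 124,516 | 5,967,969 | Uses the official tw-pddl files; roughly consistent with the paper |
| autogen | context-share | 5 | 134 | 116 | 86.57% | - | 35,483 | 4,867,399 | |
| macnet | context-share | 5 | 134 | 93 | 69.40% | - | 442,919 | 38,204,698 | |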

FEVER

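| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 5 | 100 | 67 | 67.00% | 57.1% | 21,299 | 1,348,762 | No-memory baseline |
| autogen | g-memory | 5 | 100 | 67 | 67.00% | - | 137,534 | 4,046,300 | |
| macnet | g-memory | 5 | 100 | 67 | 67.00% | - | 244,821 | 13,250,957 | Baseline |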

PDDL

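| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 3 | 60 | 22 | 36.67% | 23.53% | 25,298 | 4,406,893 | No-memory baseline |
| autogen | g-memory | 3 | 60 | 18 | 30.00% | 27.77% | 90,730 | 7,634,967 | |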

SciWorld

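| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 3 | 90 | 23 | 25.56% | 54.49% | 105,902 | 7,800,120 | No-memory baseline; max_trials=50; backtick formatting fixed |
| autogen | g-memory | 3 | 90 | 38 | 42.22% | 67.4% | 205,742 | 12,790,542 | |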
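
For anyone who wants to re-check the arithmetic, here is a minimal sketch that recomputes the success rates from the raw counts and the gap to the paper for every row that has a paper figure. The names `RESULTS`, `success_rate`, and `gaps` are illustrative only and are not part of the repository; the numbers are copied from the tables in this issue.

```python
# Rows with a paper figure, copied from the tables in this issue:
# (mas_memory, successes, total_tasks, paper_success_rate_percent)
RESULTS = {
    "ALFWorld": [("g-memory", 113, 134, 85.82)],
    "FEVER": [("empty", 67, 100, 57.1)],
    "PDDL": [("empty", 22, 60, 23.53), ("g-memory", 18, 60, 27.77)],
    "SciWorld": [("empty", 23, 90, 54.49), ("g-memory", 38, 90, 67.4)],
}


def success_rate(successes, total):
    """Success rate in percent, rounded to two decimals as in the tables."""
    return round(100.0 * successes / total, 2)


def gaps(results):
    """Map (benchmark, memory) to (measured rate, measured minus paper)."""
    out = {}
    for bench, rows in results.items():
        for memory, successes, total, paper in rows:
            rate = success_rate(successes, total)
            out[(bench, memory)] = (rate, round(rate - paper, 2))
    return out


if __name__ == "__main__":
    for (bench, memory), (rate, delta) in gaps(RESULTS).items():
        print(f"{bench:9s} {memory:9s} {rate:6.2f}%  {delta:+.2f} pts vs. paper")
```

A positive gap means my run is above the paper and a negative gap means it is below. Only the ALFWorld g-memory row comes out close to the paper's figure.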
