Compared with the paper, the no-memory (empty) baselines on FEVER (67.00% vs. 57.1%) and PDDL (36.67% vs. 23.53%) score higher, while both SciWorld configurations score far lower: empty reaches 25.56% vs. 54.49%, and g-memory reaches 42.22% vs. 67.4%.
ALFWorld
| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 5 | 134 | 100 | 74.63% | - | 40,789 | 4,143,280 | No-memory baseline |
| autogen | g-memory | 5 | 134 | 113 | 84.33% | 85.82% | 124,516 | 5,967,969 | Uses the official tw-pddl files; essentially matches the paper |
| autogen | context-share | 5 | 134 | 116 | 86.57% | - | 35,483 | 4,867,399 | |
| macnet | context-share | 5 | 134 | 93 | 69.40% | - | 442,919 | 38,204,698 | |
FEVER
| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 5 | 100 | 67 | 67.00% | 57.1% | 21,299 | 1,348,762 | No-memory baseline |
| autogen | g-memory | 5 | 100 | 67 | 67.00% | - | 137,534 | 4,046,300 | |
| macnet | g-memory | 5 | 100 | 67 | 67.00% | - | 244,821 | 13,250,957 | Baseline |
PDDL
| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 3 | 60 | 22 | 36.67% | 23.53% | 25,298 | 4,406,893 | No-memory baseline |
| autogen | g-memory | 3 | 60 | 18 | 30.00% | 27.77% | 90,730 | 7,634,967 | |
SciWorld
| mas_type | mas_memory | insights_topk | Total tasks | Successes | Success rate | Paper result | completion_tokens | prompt_tokens | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| autogen | empty | 3 | 90 | 23 | 25.56% | 54.49% | 105,902 | 7,800,120 | No-memory baseline, max_trials=50, backtick formatting fixed |
| autogen | g-memory | 3 | 90 | 38 | 42.22% | 67.4% | 205,742 | 12,790,542 | |
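
The success rates and the gaps to the paper can be rechecked from the raw counts. The snippet below is a minimal sketch, not part of the experiment code: the `RESULTS` layout and the helper names `success_rate` and `gap_to_paper` are assumptions made for illustration, and only rows that have a paper result are included.

```python
# Hypothetical layout: (benchmark, mas_type, mas_memory) -> (successes, total, paper rate in %).
# Values are copied from the tables above; rows without a paper result are omitted.
RESULTS = {
    ("ALFWorld", "autogen", "g-memory"): (113, 134, 85.82),
    ("FEVER", "autogen", "empty"): (67, 100, 57.1),
    ("PDDL", "autogen", "empty"): (22, 60, 23.53),
    ("PDDL", "autogen", "g-memory"): (18, 60, 27.77),
    ("SciWorld", "autogen", "empty"): (23, 90, 54.49),
    ("SciWorld", "autogen", "g-memory"): (38, 90, 67.4),
}


def success_rate(successes, total):
    """Success rate as a percentage, rounded to two decimals."""
    return round(100.0 * successes / total, 2)


def gap_to_paper(successes, total, paper_rate):
    """Reproduced rate minus the paper's rate, in percentage points."""
    return round(success_rate(successes, total) - paper_rate, 2)


if __name__ == "__main__":
    for (bench, mas, mem), (ok, n, paper) in RESULTS.items():
        rate = success_rate(ok, n)
        gap = gap_to_paper(ok, n, paper)
        print(f"{bench:9s} {mas:8s} {mem:9s} {rate:6.2f}% (paper {paper}%, gap {gap:+.2f} pp)")
```

A positive gap means the reproduced run scored above the paper; a negative gap means it scored below.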