Hello,
I am trying to reproduce the results from your paper, and I encountered a few points that are somewhat unclear to me. I opened this issue to ask for clarification so that I can faithfully follow the experimental setup.
- Dataset splits (train / eval / test)
First, I would like to clarify the exact train, evaluation, and test splits used in the experiments.
According to Appendix C.1 of the paper, for the writing collaboration task (TLDR):
- Training set: TLDR[0:1000]
- Test set: TLDR[1000:1100]
However, when I checked magrpo_tldr_config.yaml, I noticed that:
- train_split is set to train[:1100]
- eval_split is set to test[:1100]
This seems slightly different from what is described in the paper.
Additionally, the trl-lib/tldr dataset already provides train, validation, and test splits.
In this context, does TLDR[1000:1100] refer to:
- indices [1000:1100] from the train split, or
- indices [1000:1100] from the test split?
I would really appreciate clarification on this point.
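To make the two readings concrete, here is a minimal sketch in plain Python. It assumes the config values follow the usual `name[start:stop]` slice notation; `parse_split` and `overlaps` are hypothetical helpers I wrote for illustration and are not part of the released code. It only does index arithmetic and does not load the dataset.

```python
import re

def parse_split(spec):
    """Parse a split string such as 'train[:1100]' into (name, start, stop).

    Hypothetical helper: assumes the 'name[start:stop]' slice notation.
    A missing start means 0; a missing stop (or no brackets) means None.
    """
    m = re.fullmatch(r"(\w+)(?:\[(\d*):(\d*)\])?", spec)
    if m is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    name, start, stop = m.groups()
    return name, int(start) if start else 0, int(stop) if stop else None

def overlaps(a, b):
    """Return True if two parsed splits share at least one example index."""
    (name_a, start_a, stop_a), (name_b, start_b, stop_b) = a, b
    if name_a != name_b:
        return False
    stop_a = float("inf") if stop_a is None else stop_a
    stop_b = float("inf") if stop_b is None else stop_b
    return max(start_a, start_b) < min(stop_a, stop_b)

# Values taken from magrpo_tldr_config.yaml as quoted above.
config_train = parse_split("train[:1100]")
config_eval = parse_split("test[:1100]")

# The two possible readings of "Test set: TLDR[1000:1100]" from Appendix C.1.
reading_a = parse_split("train[1000:1100]")  # indices from the train split
reading_b = parse_split("test[1000:1100]")   # indices from the test split

print(overlaps(config_train, reading_a))  # config training range vs. reading A
print(overlaps(config_train, reading_b))  # config training range vs. reading B
print(overlaps(config_eval, reading_b))   # config eval range vs. reading B
```

Under the first reading, the released train_split of train[:1100] would include the 100 test examples, which is part of why I would like to confirm which reading is intended.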
Similarly, for the Minecraft experiments, in house_build_magrpo_config.yaml, I see that:
- train_split is set to [:]
whereas the paper mentions using [0:8] for training.
I was wondering whether this discrepancy was an oversight during code release, or if there is another intended explanation.
- Critic types used in main experiments
For the Actor-Critic–based algorithms, could you clarify which critic type was used for each method in the main experimental results reported in the paper?
Specifically, were the critic types exactly those specified in the released configuration files, or were there any differences between the paper experiments and the public code?
- Feedback types in Minecraft experiments
For the Minecraft tasks, I assume that performance may vary depending on the feedback type.
Could you please let me know which feedback types were used when training StrBuild and HouseBuild, respectively, for the results reported in Table 1?
⸻
Sorry for bothering you again with so many questions.
I genuinely think your work is excellent, and I would really like to run the experiments myself to better understand and verify the results.
Thank you very much for your time and for sharing such great research!!