Hi, I'm confused about the pooling strategy you used here.
For training, you use avg pooling (BeLLM/README.md, line 52 in 9da9269).
But for evaluation, you don't specify any pooling flag (BeLLM/README.md, lines 99 to 105 in 9da9269):
2) evaluate on STS benchmark

```bash
BiLLM_START_INDEX=31 CUDA_VISIBLE_DEVICES=0 python eval_sts.py \
    --model_name_or_path NousResearch/Llama-2-7b-hf \
    --lora_name_or_path SeanLee97/bellm-llama-7b-nli \
    --apply_bfloat16 0
```
so it falls back to the default value `cls`, right? (BeLLM/eval_sts.py, line 57 in 9da9269)

```python
parser.add_argument("--pooling_strategy", type=str, default='cls')
```
As for the paper, you mentioned that you used the representative word as the pivot, so this should be the last non-padding token, right? So I'm wondering: which token should I use, or does it make no difference in a decoder-based model like Llama?
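For concreteness, here is a minimal sketch of the three pooling variants under discussion (`cls`, `avg`, and last non-padding token). This is my own illustration on a toy tensor, not code from the BeLLM repository, and it assumes right padding with a standard attention mask:

```python
import torch

def pool(hidden_states, attention_mask, strategy):
    """Pool per-token states [batch, seq, dim] into sentence embeddings [batch, dim].

    attention_mask: [batch, seq], 1 for real tokens, 0 for padding.
    strategy: 'cls' (first token), 'avg' (mean over real tokens),
              or 'last' (last non-padding token).
    """
    if strategy == "cls":
        # hidden state of the first token
        return hidden_states[:, 0]
    if strategy == "avg":
        # mean over non-padding positions only
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    if strategy == "last":
        # index of the last non-padding token per sequence (right padding assumed)
        last_idx = attention_mask.sum(1) - 1
        return hidden_states[torch.arange(hidden_states.size(0)), last_idx]
    raise ValueError(f"unknown strategy: {strategy}")

# toy batch: 2 sequences, 4 positions, hidden dim 3; sequence 0 has one pad token
h = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
m = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
print(pool(h, m, "last"))  # rows at positions 2 and 3 respectively
```

In a causal decoder like Llama, position 0 has attended to nothing before it, so `cls` here reads a state that never saw the rest of the sentence, whereas the last non-padding token has attended to the full input; that asymmetry is exactly why the choice matters more than in a bidirectional encoder.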