Hi. Thank you for the interesting work and for releasing the code. Can I ask some questions about fine-tuning on SROIE?
- As SROIE is a dataset of receipts, the number of valid tokens is pretty small compared to 512. Did you still use 512 as the sequence length? I think too many padding tokens may add noise, since the model is initialized from RoBERTa.
- Did you use word-level BIO tags? SROIE key-value pairs are annotated at the line level, not the word level. Let me give an example. Suppose the value for the key 'company' is "STARBUCKS Store #10208" and the OCR output is split into the strings "STARBUCKS Store" and "#10208".
Some people tag the first OCR string with B-Company and the second OCR string with I-Company, and then tokenize, so that the result looks like:
("STAR", "B-Company"), ("BUCKS", "B-Company"), ("Store", "B-Company"), ("#", "I-Company"), ("10208", "I-Company").
But I think BIO tags should be assigned at the word level, so the result should be:
("STAR", "B-Company"), ("BUCKS", "B-Company"), ("Store", "I-Company"), ("#", "I-Company"), ("10208", "I-Company").
In short, only the first word "STARBUCKS" is tagged as "B-Company" and the others are tagged as "I-Company".
It may be a minor issue, but SROIE contains complicated addresses and company names, and I just want to check the right way. Can you tell me which one you used?
- Or, did you use token-level tags?
("STAR", "B-Company"), ("BUCKS", "I-Company"), ("Store", "I-Company"), ("#", "I-Company"), ("10208", "I-Company").
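To make the three options concrete, here is a minimal sketch of how I understand them. The tokenizer is a toy stand-in (`toy_tokenize`, hypothetical) that only reproduces the splits from my example; it is not the real RoBERTa tokenizer, and the function names are my own, not taken from the repository.

```python
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]

# Hypothetical stand-in for a subword tokenizer: it only reproduces the
# splits used in the example above and is NOT the real RoBERTa tokenizer.
_TOY_SPLITS = {"STARBUCKS": ["STAR", "BUCKS"], "#10208": ["#", "10208"]}


def toy_tokenize(word: str) -> List[str]:
    return _TOY_SPLITS.get(word, [word])


def tag_line_level(segments: List[str], key: str,
                   tokenize: Callable[[str], List[str]] = toy_tokenize) -> Tagged:
    """B- for every token of the first OCR string, I- for later strings."""
    out: Tagged = []
    for seg_idx, segment in enumerate(segments):
        prefix = "B" if seg_idx == 0 else "I"
        for word in segment.split():
            out.extend((tok, f"{prefix}-{key}") for tok in tokenize(word))
    return out


def tag_word_level(segments: List[str], key: str,
                   tokenize: Callable[[str], List[str]] = toy_tokenize) -> Tagged:
    """B- for every token of the first word, I- for all later words."""
    words = [w for seg in segments for w in seg.split()]
    out: Tagged = []
    for word_idx, word in enumerate(words):
        prefix = "B" if word_idx == 0 else "I"
        out.extend((tok, f"{prefix}-{key}") for tok in tokenize(word))
    return out


def tag_token_level(segments: List[str], key: str,
                    tokenize: Callable[[str], List[str]] = toy_tokenize) -> Tagged:
    """B- for the very first token only, I- for everything else."""
    tokens = [t for seg in segments for w in seg.split() for t in tokenize(w)]
    return [(t, f"{'B' if i == 0 else 'I'}-{key}") for i, t in enumerate(tokens)]


segments = ["STARBUCKS Store", "#10208"]  # OCR strings for the 'company' value
for scheme in (tag_line_level, tag_word_level, tag_token_level):
    print(scheme.__name__, scheme(segments, "Company"))
```

The three functions differ only in which unit decides where "B-" ends: the OCR string, the word, or the token.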
Thank you for reading!
Best.