
When finetuning on SROIE #5

@TaekyungKi


Hi. Thank you for the interesting work and for releasing the code. Can I ask some questions about fine-tuning on SROIE?

  1. Since SROIE is a dataset of receipts, the number of valid tokens per document is quite small compared to 512. Did you still use 512 as the sequence length? I suspect that too many padding tokens may add noise, since the model is initialized from RoBERTa.

  2. Did you use word-level BIO tags? SROIE key-value pairs are annotated at the line level, not the word level. Let me give an example. Suppose the key is 'company', the value is "STARBUCKS Store #10208", and the OCR output is split into two strings, "STARBUCKS Store" and "#10208".
    Some people tag the first OCR string as B-Company and the second OCR string as I-Company, and then tokenize, so that the result looks like:
    ("STAR", "B-Company"), ("BUCKS", "B-Company"), ("Store", "B-Company"), ("#", "I-Company"), ("10208", "I-Company").
    But I think BIO should be tagged at the word level, so the result should be:
    ("STAR", "B-Company"), ("BUCKS", "B-Company"), ("Store", "I-Company"), ("#", "I-Company"), ("10208", "I-Company").
    In short, only the first word "STARBUCKS" is tagged as "B-Company" and the others are "I-Company".
    It may be a minor issue, but SROIE contains complicated addresses and company names, and I just want to check the right way. Can you tell me which one you used?

  • Or did you use token-level tags, where only the first sub-word token is tagged "B-Company"?
    ("STAR", "B-Company"), ("BUCKS", "I-Company"), ("Store", "I-Company"), ("#", "I-Company"), ("10208", "I-Company").
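
To make the three options concrete, here is a minimal Python sketch that reproduces the three tag sequences above. The function name `tag_tokens`, the scheme names, and the hard-coded sub-word split are my own illustrative assumptions, not taken from the repository's code or from a real tokenizer.

```python
from typing import List, Tuple

# Nested layout: OCR strings -> words -> sub-word tokens.
# The sub-word split is hard-coded to match the example in this issue;
# a real tokenizer may split the words differently.
segments: List[List[List[str]]] = [
    [["STAR", "BUCKS"], ["Store"]],  # OCR string "STARBUCKS Store"
    [["#", "10208"]],                # OCR string "#10208"
]


def tag_tokens(
    segments: List[List[List[str]]], label: str, scheme: str
) -> List[Tuple[str, str]]:
    """Assign BIO tags to sub-word tokens of one key-value entity.

    scheme:
      "segment": every token of the first OCR string gets B-
      "word":    every token of the first word gets B-
      "token":   only the very first token gets B-
    """
    if scheme not in ("segment", "word", "token"):
        raise ValueError(f"unknown scheme: {scheme}")

    tagged = []
    for seg_idx, words in enumerate(segments):
        for word_idx, tokens in enumerate(words):
            for tok_idx, token in enumerate(tokens):
                if scheme == "segment":
                    begins = seg_idx == 0
                elif scheme == "word":
                    begins = seg_idx == 0 and word_idx == 0
                else:  # "token"
                    begins = seg_idx == 0 and word_idx == 0 and tok_idx == 0
                prefix = "B" if begins else "I"
                tagged.append((token, f"{prefix}-{label}"))
    return tagged


for scheme in ("segment", "word", "token"):
    print(scheme, tag_tokens(segments, "Company", scheme))
```

The only difference between the schemes is how far the "B-" prefix extends: the whole first OCR string, the whole first word, or just the first token.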

Thank you for reading!
Best.
