Support duplicated tokens in Vocabulary #87

@ThomasKluiters

Feature Description

Currently, the dictionary cannot handle duplicate entries. It would be useful if duplicates were supported, possibly through a flag that explicitly allows the same token to be added more than once.

Use Case

When using code-switched tokenizers (like the `AggregateTokenizer` in NeMo), the same token can appear twice in the vocabulary. For example, "Is" in Dutch and "Is" in English. Generally, we observe better word error rates when using code-switched (aggregate) tokenizers than when using single tokenizers.
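To make the request concrete, here is a minimal sketch of the behaviour I have in mind. `ToyDictionary`, `allow_duplicates`, `add_entry`, `get_entry`, and `get_indices` are hypothetical names made up for this illustration and are not the existing Flashlight API. The sketch also assumes the aggregate vocabulary is built by appending each language's tokens in order, so the same string ends up at two indices.

```python
class ToyDictionary:
    """Hypothetical token <-> index map (illustration only, not Flashlight's Dictionary)."""

    def __init__(self, allow_duplicates=False):
        self.allow_duplicates = allow_duplicates
        self._index_to_token = []    # index -> token
        self._token_to_indices = {}  # token -> every index it was added at

    def add_entry(self, token):
        # Without the flag, a repeated token is rejected.
        if token in self._token_to_indices and not self.allow_duplicates:
            raise ValueError(f"duplicate entry: {token!r}")
        index = len(self._index_to_token)
        self._index_to_token.append(token)
        self._token_to_indices.setdefault(token, []).append(index)
        return index

    def get_entry(self, index):
        return self._index_to_token[index]

    def get_indices(self, token):
        # Return a copy so callers cannot mutate internal state.
        return list(self._token_to_indices.get(token, []))


# Assumed layout: Dutch tokens first, then English tokens.
dutch_tokens = ["Is", "het"]
english_tokens = ["Is", "it"]

vocab = ToyDictionary(allow_duplicates=True)
for token in dutch_tokens + english_tokens:
    vocab.add_entry(token)

is_indices = vocab.get_indices("Is")  # one index per language
```

With the flag off, the second "Is" would raise an error. With it on, index-to-token lookup stays unambiguous, while token-to-index lookup returns every matching index and leaves it to the caller to pick one.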

Additional Context

I would be happy to implement this feature if this is something the Flashlight team would be open to!

Labels: enhancement (New feature or request)
