Does it work for very long documents?

Hello there, I am trying to get this working with the "mt5" model type, since I want to use it on an Italian dataset. Unfortunately, all of my documents are longer than the maximum length supported by the model, so I tried specifying `truncation=True, max_length=512` inside the `split_into_paragraphs()` function, at the line `wc_temp = len(self.tokenizer.tokenize(temp, max_length=512, truncation=True))`. This is not working; I still get the following warning: _**Token indices sequence length is longer than the specified maximum sequence length for this model (6508 > 512). Running this sequence through the model will result in indexing errors.**_
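One workaround that might apply here is to split each document into consecutive windows of at most 512 tokens before anything reaches the model, rather than relying on truncation. The sketch below is only an illustration under assumptions: `split_into_windows` is a hypothetical helper that is not part of this repository, and plain whitespace splitting stands in for the real mt5 tokenizer so that the example runs on its own.

```python
from typing import Callable, List


def split_into_windows(
    text: str,
    tokenize: Callable[[str], List[str]] = str.split,  # assumption: stand-in for the real tokenizer
    max_length: int = 512,
) -> List[List[str]]:
    """Split a long document into consecutive token windows of at most max_length tokens.

    Hypothetical helper, not taken from the repository.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    tokens = tokenize(text)
    # Step through the token list in strides of max_length; the last window may be shorter.
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]


# Toy document with 6508 whitespace tokens, the length reported in the warning.
doc = " ".join(f"w{i}" for i in range(6508))
windows = split_into_windows(doc, max_length=512)

print(len(windows))                  # 13
print(max(len(w) for w in windows))  # 512
print(len(windows[-1]))              # 364
```

With a real subword tokenizer, something like the `self.tokenizer.tokenize` call above could presumably be passed as `tokenize`, and `max_length` would likely need to be a little below 512 to leave room for any special tokens the model adds.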

Have you already found a solution to this problem?

Thank you in advance!

Does it work for very long documents? #9
