Hello there, I am trying to get this working with the "mt5" model type, since I want to use it on an Italian dataset. Unfortunately, all the documents are longer than the maximum length supported by the model, so I tried passing `truncation=True, max_length=512` inside the `split_into_paragraphs()` function, at the line `wc_temp = len(self.tokenizer.tokenize(temp, max_length=512, truncation=True))`. This does not work; I still get: "Token indices sequence length is longer than the specified maximum sequence length for this model (6508 > 512). Running this sequence through the model will result in indexing errors."
Have you already found the solution to this problem?
Thank you in advance!
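
For reference, here is a minimal sketch of one possible workaround, not a confirmed fix. It assumes the tokenizer is a Hugging Face tokenizer, where `truncation` and `max_length` are documented for calling the tokenizer object itself; whether `tokenize()` honors them may depend on the tokenizer class and library version. The helper name `capped_token_count` and the `WhitespaceTokenizer` stand-in are hypothetical, and the stand-in exists only so the snippet runs without downloading a model.

```python
def capped_token_count(tokenizer, text, max_length=512):
    """Return the number of tokens in `text`, capped at `max_length`.

    Calls the tokenizer object directly instead of `.tokenize()`, so that
    `truncation` and `max_length` go through the encoding call path.
    """
    encoded = tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        add_special_tokens=False,  # count only the text's own tokens
    )
    return len(encoded["input_ids"])


class WhitespaceTokenizer:
    """Hypothetical stand-in that mimics the tokenizer call interface."""

    def __call__(self, text, truncation=False, max_length=None,
                 add_special_tokens=True):
        # One fake id per whitespace-separated word.
        ids = list(range(len(text.split())))
        if truncation and max_length is not None:
            ids = ids[:max_length]
        return {"input_ids": ids}


if __name__ == "__main__":
    toy = WhitespaceTokenizer()
    long_text = "parola " * 6508
    print(capped_token_count(toy, long_text))      # capped at max_length
    print(capped_token_count(toy, "uno due tre"))  # short text is unchanged

    # With a real model (assumption: the "google/mt5-small" checkpoint):
    # from transformers import AutoTokenizer
    # tok = AutoTokenizer.from_pretrained("google/mt5-small")
    # print(capped_token_count(tok, long_text))
```

Note that a capped count may not be what `split_into_paragraphs()` needs if it relies on the true length to decide where to split. In that case the message above is only a logged warning as long as the over-long sequence is never passed to the model.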