Support multi-field TSV parsing in collection loader #40

Open
mushkanrana73 wants to merge 1 commit into JuliaGenAI:main from mushkanrana73:fix-tsv-parsing

Conversation

@mushkanrana73
The current loader reads each TSV line as-is, without splitting it into fields, so the document ID and any additional fields (e.g., titles) end up verbatim in the document text.

This PR updates the logic to:

  • Split each line on the tab character (\t)
  • Ignore the document ID (the first field)
  • Concatenate the remaining fields, separated by spaces, into a single document string
  • Skip malformed lines

Example:
doc_id \t title \t body → "title body"
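
The steps above can be sketched roughly as follows. This is a Python illustration of the described logic only, not the code in this PR. The names `parse_tsv_line` and `load_collection` are hypothetical, and the definition of "malformed" (fewer than two fields, or no non-empty text after the ID) is an assumption, since the description does not spell it out.

```python
from typing import Iterable, List, Optional


def parse_tsv_line(line: str) -> Optional[str]:
    """Return the document text for one TSV line, or None if the line is malformed."""
    fields = line.rstrip("\r\n").split("\t")
    # Assumption: a usable line has a document ID plus at least one text field.
    if len(fields) < 2:
        return None
    # Drop the document ID (first field) and trim whitespace around the rest.
    text_fields = [f.strip() for f in fields[1:]]
    text = " ".join(f for f in text_fields if f)
    # Assumption: a line whose text fields are all empty also counts as malformed.
    return text or None


def load_collection(lines: Iterable[str]) -> List[str]:
    """Parse every line and keep only the documents that parsed cleanly."""
    docs = []
    for line in lines:
        doc = parse_tsv_line(line)
        if doc is not None:
            docs.append(doc)
    return docs
```

With these assumptions, `load_collection(["1\tA\tB", "bad", "2\tC"])` keeps two documents and silently drops the line that has no tab. Whether skipped lines should instead be logged or counted is a design choice the description leaves open.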

This improves compatibility with multi-field datasets and aligns with Python ColBERT behavior.
