About Data #4

@liaolea

Hello, thanks for your great work. I have a question about the SFT training data.

I noticed that a complete “think” sequence is split across different timestamps, and I’m curious about the criterion used for this segmentation. For example:

{"think": "After user query, confirmed driver’s right hand is at the bottom of the wheel. Vehicle", "timestamp": 10.5},
{"think": "still moving steadily. Isolation persists—no other signs of life or traffic in this desert journey.", "timestamp": 11.5}

Here, the sentence is split in the middle (“Vehicle” → “still moving steadily”). Could you clarify how these boundaries are determined? Is the segmentation based on fixed temporal windows, token length, streaming chunks, or some other strategy?
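To make the question concrete, here is a minimal sketch for rejoining the fragments and measuring each chunk, so the boundaries can be compared against the candidate criteria (time window, length, and so on). It assumes each record carries only the `think` and `timestamp` fields shown above. `join_think_chunks` and `chunk_stats` are hypothetical helpers written for illustration, not part of the repository.

```python
def join_think_chunks(records):
    """Concatenate `think` fragments in timestamp order (hypothetical helper)."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    return " ".join(r["think"].strip() for r in ordered)


def chunk_stats(records):
    """Report whitespace-delimited word count and time gap for each chunk."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    stats = []
    prev_t = None
    for r in ordered:
        gap = None if prev_t is None else r["timestamp"] - prev_t
        stats.append({
            "timestamp": r["timestamp"],
            "words": len(r["think"].split()),
            "gap": gap,
        })
        prev_t = r["timestamp"]
    return stats


# The two records quoted above (assumed to be the complete schema).
records = [
    {"think": "After user query, confirmed driver's right hand is at the "
              "bottom of the wheel. Vehicle", "timestamp": 10.5},
    {"think": "still moving steadily. Isolation persists\u2014no other signs of "
              "life or traffic in this desert journey.", "timestamp": 11.5},
]

print(join_think_chunks(records))
for row in chunk_stats(records):
    print(row)
```

Comparing the `words` and `gap` columns across many samples should show whether the split follows a fixed time step, a fixed length, or neither.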
