[Bug]: Checkpoint resume diverges from continuous training due to missing LR scheduler/grad scaler states #77

@t-muser

Description

The LR scheduler state and the grad scaler state are not stored with the rest of the training run data, so both are re-initialized when training resumes from a checkpoint. This leads to significantly different results than if the training had not been interrupted.

NB: There is also a typo: "optimizer_state_dit" should probably be "optimizer_state_dict".
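
A possible shape for the fix, as a rough sketch rather than a patch against the actual code: it assumes the scheduler and grad scaler follow the usual PyTorch `state_dict()` / `load_state_dict()` convention. The helper names `build_checkpoint` and `restore_checkpoint` and the keys `step`, `model_state_dict`, `lr_scheduler_state_dict` and `grad_scaler_state_dict` are made up for illustration; only `optimizer_state_dict` comes from this report.

```python
from typing import Any, Dict


def build_checkpoint(model, optimizer, scheduler=None, scaler=None, step=0) -> Dict[str, Any]:
    """Collect the state needed to resume training where it left off.

    Every argument only needs state_dict()/load_state_dict(), which is the
    convention PyTorch modules, optimizers, LR schedulers and GradScaler share.
    Key names other than "optimizer_state_dict" are hypothetical.
    """
    checkpoint = {
        "step": step,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),  # note: not "..._dit"
    }
    # The two pieces this issue reports as missing from the checkpoint.
    if scheduler is not None:
        checkpoint["lr_scheduler_state_dict"] = scheduler.state_dict()
    if scaler is not None:
        checkpoint["grad_scaler_state_dict"] = scaler.state_dict()
    return checkpoint


def restore_checkpoint(checkpoint, model, optimizer, scheduler=None, scaler=None) -> int:
    """Load everything stored by build_checkpoint and return the saved step."""
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    # Older checkpoints may lack these keys; in that case the scheduler and
    # scaler keep their freshly initialized state (the behaviour reported here).
    if scheduler is not None and "lr_scheduler_state_dict" in checkpoint:
        scheduler.load_state_dict(checkpoint["lr_scheduler_state_dict"])
    if scaler is not None and "grad_scaler_state_dict" in checkpoint:
        scaler.load_state_dict(checkpoint["grad_scaler_state_dict"])
    return checkpoint.get("step", 0)
```

The returned dictionary could then be written with `torch.save` and read back with `torch.load`, the same way the existing checkpoint presumably is. The `in checkpoint` guards are there so that checkpoints written before such a change would still load.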
