Training steps #30

@yiwei0730

Description

@yiwei0730

We first train the audio codec using 8 NVIDIA TESLA V100 16GB GPUs with a batch size of 200 audios per GPU for 440K steps. We follow the implementation and experimental setting of SoundStream [19] and adopt Adam optimizer with 2e-4 learning rate. Then we use the trained codec to extract the quantized latent vectors for each audio to train the diffusion model in NaturalSpeech 2.

The diffusion model in NaturalSpeech 2 is trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6K frames of latent vectors per GPU for 300K steps (our model is still underfitting and longer training will result in better performance). We optimize the models with the AdamW optimizer with 5e-4 learning rate, 32k warmup steps following the inverse square root learning schedule.
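The quoted setting (AdamW, 5e-4 peak learning rate, 32K warmup steps, inverse square root schedule) doesn't spell out the schedule formula. A minimal sketch, assuming the standard Transformer-style formulation (linear warmup to the peak rate, then decay proportional to 1/sqrt(step), as in fairseq's `inverse_sqrt` scheduler):

```python
import math

def inv_sqrt_lr(step: int, base_lr: float = 5e-4, warmup_steps: int = 32000) -> float:
    """Inverse square root schedule with linear warmup.

    Ramps linearly from 0 to `base_lr` over `warmup_steps`, then decays
    as base_lr * sqrt(warmup_steps / step). Values match the quoted
    setting; the exact formula is an assumption, not from the paper.
    """
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)

# Peak rate is reached exactly at the end of warmup:
print(inv_sqrt_lr(32000))   # 5e-4
# Halfway through warmup the rate is half the peak:
print(inv_sqrt_lr(16000))   # 2.5e-4
# At 4x the warmup steps the rate has decayed to half the peak:
print(inv_sqrt_lr(128000))  # 2.5e-4
```

In practice this function can be plugged into `torch.optim.lr_scheduler.LambdaLR` by dividing out `base_lr`, since `LambdaLR` expects a multiplicative factor rather than an absolute rate.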

According to the original paper, it seems the audio codec and the diffusion model are trained separately.
I'd like to ask whether you have tried training the two parts separately like this. I noticed that in the NS2-ttsv2 training code, everything codec-related appears to be commented out — is that because the codec's results were unsatisfactory?
