Self Checks
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
Python 3.10
Steps to Reproduce
I would like to sincerely thank the team for fully open-sourcing the Fish Audio S2 model, including the weights, fine-tuning code, and inference engine. This is a significant contribution to the open-source community: it greatly lowers the barrier to high-quality speech synthesis and makes it much easier for the rest of us to learn from and build on this work.
However, when attempting to reproduce the Emergent TTS Bench results reported in the Fish Audio S2 technical report, I am seeing a significant performance gap: my reproduction yields an Overall Win Rate of 0.2534, far below the numbers reported in the paper.
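For reference, this is how I computed the Overall Win Rate from the per-sample judge decisions. This is a minimal sketch of my own tally, not the benchmark's official scorer; in particular, counting a tie as half a win is my assumption:

```python
# Minimal sketch of my win-rate tally. Tie handling (tie = half a win)
# is my assumption, not something specified by the benchmark.
from collections import Counter

def overall_win_rate(decisions: list[str]) -> float:
    """decisions: one of 'win', 'tie', 'loss' per pairwise comparison."""
    counts = Counter(decisions)
    return (counts["win"] + 0.5 * counts["tie"]) / len(decisions)

# Example: 4 wins, 2 ties, 14 losses over 20 comparisons -> 0.25
print(overall_win_rate(["win"] * 4 + ["tie"] * 2 + ["loss"] * 14))
```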
For my test setup, I used the raw text from the evaluation set and did not condition on any prompt audio; a sketch of my generation loop follows the questions below. To help identify the discrepancy, could you please clarify the following details of your experimental setup?
Text Processing: What specific text processing pipeline was applied to the evaluation data?
Model Selection: Were the scores achieved using the public open-source weights or an internal model?
Inference Mode: Was the evaluation conducted using voice cloning (with prompt conditions) or unconditional generation?
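To make "raw text, no prompt audio" concrete, here is a sketch of the inference loop I used. The function and file names are hypothetical placeholders, not the actual fish-speech API; the stub stands in for my call into the inference engine:

```python
# Sketch of my evaluation loop. synthesize_unconditional() and the file
# paths are hypothetical placeholders, not the actual fish-speech API.
from pathlib import Path

def synthesize_unconditional(text: str) -> bytes:
    """Stand-in for the S2 inference call: raw text in, wav bytes out,
    with no reference/prompt audio passed as a condition."""
    raise NotImplementedError("replace with the actual inference engine call")

texts = Path("emergent_tts_bench.txt").read_text(encoding="utf-8").splitlines()
out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)
for i, text in enumerate(texts):
    # No text normalization: the benchmark text is passed through as-is.
    wav = synthesize_unconditional(text)
    (out_dir / f"{i:04d}.wav").write_bytes(wav)
```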
Thanks!
✔️ Expected Behavior
No response
❌ Actual Behavior
No response