Self Checks
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
Python 3.10
Steps to Reproduce
I would like to sincerely thank the team for fully open-sourcing the Fish Audio S2 model, including the weights, fine-tuning code, and inference engine. This is a significant contribution to the open-source community: it greatly lowers the barrier to high-quality speech synthesis and makes it much easier for the rest of us to learn from and build on this work.
However, when attempting to reproduce the Emergent TTS Bench results reported in the Fish Audio S2 technical report, I am seeing a significant performance gap: my reproduction yields an Overall Win Rate of 0.2534, far below the numbers reported in the paper.
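For reference, this is how I computed the Overall Win Rate from the per-sample judge decisions. This is a minimal sketch of my own tally, not the benchmark's official scorer; in particular, counting a tie as half a win is my assumption:

```python
# Minimal sketch of my win-rate tally. Tie handling (tie = half a win)
# is my assumption, not something specified by the benchmark.
from collections import Counter

def overall_win_rate(decisions: list[str]) -> float:
    """decisions: one of 'win', 'tie', 'loss' per pairwise comparison."""
    counts = Counter(decisions)
    return (counts["win"] + 0.5 * counts["tie"]) / len(decisions)

# Example: 4 wins, 2 ties, 14 losses over 20 comparisons -> 0.25
print(overall_win_rate(["win"] * 4 + ["tie"] * 2 + ["loss"] * 14))
```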
For my test setup, I used the raw text from the evaluation set and did not condition on any prompt audio; a sketch of my generation loop follows the questions below. To help identify the discrepancy, could you please clarify the following details of your experimental setup?
Text Processing: What specific text processing pipeline was applied to the evaluation data?
Model Selection: Were the scores achieved using the public open-source weights or an internal model?
Inference Mode: Was the evaluation conducted using voice cloning (with prompt conditions) or unconditional generation?
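To make "raw text, no prompt audio" concrete, here is a sketch of the inference loop I used. The function and file names are hypothetical placeholders, not the actual fish-speech API; the stub stands in for my call into the inference engine:

```python
# Sketch of my evaluation loop. synthesize_unconditional() and the file
# paths are hypothetical placeholders, not the actual fish-speech API.
from pathlib import Path

def synthesize_unconditional(text: str) -> bytes:
    """Stand-in for the S2 inference call: raw text in, wav bytes out,
    with no reference/prompt audio passed as a condition."""
    raise NotImplementedError("replace with the actual inference engine call")

texts = Path("emergent_tts_bench.txt").read_text(encoding="utf-8").splitlines()
out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)
for i, text in enumerate(texts):
    # No text normalization: the benchmark text is passed through as-is.
    wav = synthesize_unconditional(text)
    (out_dir / f"{i:04d}.wav").write_bytes(wav)
```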
Thanks!
✔️ Expected Behavior
No response
❌ Actual Behavior
No response