I don't know if this is a bug per se, but it's worth being aware of in case anybody else runs into it. If your reference audio is longer than 12s, there is logic in f5_tts/infer/utils_infer.py that trims it down to 12 seconds. The problem shows up when you also supply your own transcription for that audio: the transcription isn't trimmed to match what the shortened clip actually says, so the reference text and the reference audio no longer line up, which leads to some really odd behavior in the generated audio.
One possible fix: if the audio has to be trimmed to 12s for some reason, the code could force a call to the transcribe method on the trimmed clip instead of keeping the supplied text. To be on the safe side, though, make sure your sample audio is less than 12 seconds.
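In the meantime, a quick pre-flight check on your side catches the mismatch before it reaches inference. This is only a rough sketch, not part of F5-TTS: `wav_duration_seconds` and `check_reference_clip` are hypothetical helper names, it assumes the reference clip is an uncompressed PCM WAV file (it uses Python's standard `wave` module), and the 12-second limit is just the value I observed, so check it against your installed version.

```python
import wave

# Limit described in this post; an assumption, verify it against your version.
MAX_REF_SECONDS = 12.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path, ref_text, max_seconds=MAX_REF_SECONDS):
    """Raise if a manually transcribed clip is long enough to get trimmed.

    Returns the clip duration in seconds when the clip is safe to use.
    An empty ref_text is allowed through, since there is then no manual
    transcription that could fall out of sync with the trimmed audio.
    """
    duration = wav_duration_seconds(path)
    if ref_text.strip() and duration > max_seconds:
        raise ValueError(
            f"Reference clip is {duration:.1f}s, longer than {max_seconds:.0f}s; "
            "it would be trimmed and the supplied transcription would no "
            "longer match. Shorten the clip and its transcription together."
        )
    return duration
```

Call `check_reference_clip("my_ref.wav", "text spoken in the clip")` before handing the same file and text to inference. If it raises, cut the audio and the transcription down together rather than letting the trim happen silently.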