Generation parameters can be passed directly as keyword arguments to `model.generate(...)` or bundled in an `OmniVoiceGenerationConfig` dataclass. The tables below list every parameter, grouped by category.

```python
# 1) Direct keyword arguments
audio = model.generate(text="Hello world", num_step=32, guidance_scale=2.0)

# 2) Via OmniVoiceGenerationConfig dataclass
from omnivoice import OmniVoiceGenerationConfig

config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)
audio = model.generate(text="Hello world", generation_config=config)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_step` | `int` | `32` | Number of iterative unmasking steps. Higher values improve quality but slow down generation. Use 16 for faster inference. |
| `denoise` | `bool` | `True` | Prepend the `<` […] |
| `guidance_scale` | `float` | `2.0` | Classifier-free guidance scale. |
| `t_shift` | `float` | `0.1` | Time-step shift for the noise schedule. Smaller values emphasise earlier steps in decoding. |

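
To make `guidance_scale` more concrete, here is a minimal sketch of one common classifier-free guidance formulation, in which conditional and unconditional logits are blended. The function name `apply_guidance` and the exact formula are assumptions for illustration; the formula OmniVoice uses internally is not stated in this document and may differ (some implementations use `cond + scale * (cond - uncond)` instead).

```python
def apply_guidance(cond_logits, uncond_logits, guidance_scale=2.0):
    """Blend conditional and unconditional logits (hypothetical helper).

    Uses the common form: uncond + scale * (cond - uncond).
    A scale of 1.0 returns the conditional logits unchanged;
    larger scales push the result further away from the unconditional ones.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


cond = [2.0, 0.5, -1.0]    # logits predicted with the text condition
uncond = [1.0, 0.5, 0.0]   # logits predicted without the condition
guided = apply_guidance(cond, uncond, guidance_scale=2.0)
print(guided)
```

With a scale of 2.0, every difference between the conditional and unconditional predictions is doubled, which is why higher values tend to follow the input more strongly.
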

| Parameter | Type | Default | Description |
|---|---|---|---|
| `position_temperature` | `float` | `5.0` | Temperature for mask-position selection. 0 = greedy (deterministic). Higher values increase randomness. |
| `class_temperature` | `float` | `0.0` | Temperature for token sampling at each step. 0 = greedy (deterministic). Higher values increase randomness. |
| `layer_penalty_factor` | `float` | `5.0` | Penalty applied to deeper codebook layers, encouraging earlier (lower) layers to unmask first. |

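
The "0 = greedy, higher = more random" behaviour of the two temperature parameters can be illustrated with a generic temperature-sampling sketch. The helper `sample_index` is hypothetical and is not OmniVoice's sampling code; it only shows the usual convention of taking the argmax at temperature 0 and sampling from `softmax(logits / temperature)` otherwise.

```python
import math
import random


def sample_index(logits, temperature, rng=random):
    """Pick an index from a list of logits (hypothetical helper).

    temperature == 0 -> greedy: always return the argmax.
    temperature > 0  -> sample from softmax(logits / temperature);
                        larger temperatures flatten the distribution.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, 0.3]
greedy = sample_index(logits, temperature=0)                          # deterministic
noisy = sample_index(logits, temperature=5.0, rng=random.Random(0))   # seeded, random
```
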
`duration` and `speed` each accept either a single value applied to all items or a per-item list (useful in batch mode):

```python
# Fixed 10-second output
audio = model.generate(text="Hello, this is a test of duration control", duration=10.0)

# Faster speech (1.2x faster than estimated)
audio = model.generate(text="Hello, this is a test of duration control", speed=1.2)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `duration` | `float` or `list[float \| None]` | `None` | Fixed output duration in seconds. Overrides `speed` when set. |
| `speed` | `float` or `list[float \| None]` | `None` | Speed factor. Values > 1.0 produce shorter audio (faster); values < 1.0 produce longer audio (slower). Ignored when `duration` is set. Defaults to 1.0 when both are `None`. |

Priority: duration > speed.
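
The priority rule and the per-item list form can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helpers `broadcast` and `resolve_durations` are hypothetical, and dividing the estimated duration by `speed` is an assumed reading of "1.2x faster than estimated".

```python
def broadcast(value, batch_size):
    """Expand a scalar (or None) to one entry per batch item (hypothetical helper)."""
    if isinstance(value, (list, tuple)):
        if len(value) != batch_size:
            raise ValueError(f"expected {batch_size} values, got {len(value)}")
        return list(value)
    return [value] * batch_size


def resolve_durations(estimated, duration=None, speed=None):
    """Return the target duration in seconds for each batch item.

    Per item: an explicit duration wins; otherwise the estimated duration
    is divided by speed (assumption), with speed defaulting to 1.0.
    """
    n = len(estimated)
    resolved = []
    for est, dur, spd in zip(estimated, broadcast(duration, n), broadcast(speed, n)):
        if dur is not None:
            resolved.append(float(dur))  # duration > speed
        else:
            resolved.append(est / (spd if spd is not None else 1.0))
    return resolved


# Two items, each estimated at 6 seconds: the first is pinned to 10 s,
# the second has no fixed duration and therefore falls back to speed.
targets = resolve_durations([6.0, 6.0], duration=[10.0, None], speed=2.0)
```
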

| Parameter | Type | Default | Description |
|---|---|---|---|
| `preprocess_prompt` | `bool` | `True` | Apply preprocessing to the voice-clone prompt: remove long silences from the reference audio and add punctuation at the end of the reference text. |
| `postprocess_output` | `bool` | `True` | Apply post-processing to the generated audio (remove long silences). |

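
As a rough illustration of what "remove long silences" means, the sketch below caps every run of near-silent samples at a maximum length. The function `shorten_silences`, its amplitude threshold, and the sample-count limit are all assumptions made for this example; the actual pre- and post-processing in OmniVoice is not described here and most likely works with durations in seconds on real audio buffers.

```python
def shorten_silences(samples, threshold=0.01, max_silence=3):
    """Cap each run of near-silent samples at max_silence samples (hypothetical helper)."""
    out = []
    run = 0  # length of the current silent run
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run > max_silence:
                continue  # drop samples beyond the allowed silence length
        else:
            run = 0
        out.append(s)
    return out


audio = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.3]
trimmed = shorten_silences(audio, max_silence=2)
```

Short pauses are left untouched; only the portion of a pause that exceeds the limit is removed.
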
To support stable long-form speech generation with low VRAM consumption, the text is automatically split into smaller segments when the estimated duration of the generated speech exceeds `audio_chunk_threshold`. Each segment produces approximately `audio_chunk_duration` seconds of audio, so the model can accept arbitrarily long text and generate arbitrarily long speech with near-constant VRAM consumption.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_chunk_duration` | `float` | `15.0` | Target chunk duration (seconds) when splitting long text. |
| `audio_chunk_threshold` | `float` | `30.0` | Estimated audio duration (seconds) above which chunking is activated. |

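
The interaction between the two chunking parameters can be sketched as below. This is only an illustration: the character-rate estimator `estimate_duration`, the sentence-boundary splitting, and the function `split_for_generation` are assumptions, and the library's own duration estimate and splitting rules may differ.

```python
import re


def estimate_duration(text, chars_per_second=15.0):
    """Crude duration estimate from text length (assumed rate, not OmniVoice's)."""
    return len(text) / chars_per_second


def split_for_generation(text, audio_chunk_duration=15.0,
                         audio_chunk_threshold=30.0, chars_per_second=15.0):
    """Split text into segments of roughly audio_chunk_duration seconds each.

    Text whose estimated duration stays at or below audio_chunk_threshold
    is returned as a single segment.
    """
    if estimate_duration(text, chars_per_second) <= audio_chunk_threshold:
        return [text]
    # Split after sentence-ending punctuation, then pack sentences greedily.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_duration(candidate, chars_per_second) > audio_chunk_duration:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = " ".join(["This is a sentence."] * 40)
segments = split_for_generation(long_text)
```

Because each segment is generated separately and has a bounded length, peak memory depends on the chunk size rather than on the total length of the input.
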