Generation Parameters

Parameters can be passed as keyword arguments to model.generate(...) or via the OmniVoiceGenerationConfig dataclass. See below for the full list and which category each belongs to.

# 1) Direct keyword arguments
audio = model.generate(text="Hello world", num_step=32, guidance_scale=2.0)

# 2) Via OmniVoiceGenerationConfig dataclass
from omnivoice import OmniVoiceGenerationConfig

config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)
audio = model.generate(text="Hello world", generation_config=config)

Decoding

Parameter	Type	Default	Description
`num_step`	int	32	Number of iterative unmasking steps. Higher values improve quality but slow down generation. Use 16 for faster inference.
`denoise`	bool	True	Prepend the `<
`guidance_scale`	float	2.0	Classifier-free guidance scale.
`t_shift`	float	0.1	Time-step shift for the noise schedule. Smaller values emphasise earlier steps in decoding.

Sampling

Parameter	Type	Default	Description
`position_temperature`	float	5.0	Temperature for mask-position selection. 0 = greedy (deterministic). Higher values increase randomness.
`class_temperature`	float	0.0	Temperature for token sampling at each step. 0 = greedy (deterministic). Higher values increase randomness.
`layer_penalty_factor`	float	5.0	Penalty applied to deeper codebook layers, encouraging earlier (lower) layers to unmask first.

Duration & Speed

These accept a single value applied to all items, or a per-item list (useful in batch mode):

# Fixed 10-second output
audio = model.generate(text="Hello, this is a test of duration control", duration=10.0)

# Faster speech (1.2x faster than estimated)
audio = model.generate(text="Hello, this is a test of duration control", speed=1.2)

Parameter	Type	Default	Description
`duration`	float or list[float \| None]	None	Fixed output duration in seconds. Overrides `speed` when set.
`speed`	float or list[float \| None]	None	Speed factor. Values > 1.0 produce shorter audio (faster); values < 1.0 produce longer audio (slower). Ignored when `duration` is set. Defaults to 1.0 when both are None.

Priority: duration > speed.

Pre/Post Processing

Parameter	Type	Default	Description
`preprocess_prompt`	bool	True	Whether to apply preprocessing to the voice-clone prompt audio (remove long silences in reference audio, add punctuation in the end of reference text).
`postprocess_output`	bool	True	Apply post-processing to generated audio (remove long silences).

Long-Form Generation

To support stable long-form speech generation with low VRAM consumption, the text is automatically split into smaller segments when the estimated duration of the generated speech exceeds audio_chunk_duration, with each segment producing approximately audio_chunk_duration seconds of audio. This approach allows the model to accept arbitrarily long text and generate arbitrarily long speech with near-constant VRAM consumption.

Parameter	Type	Default	Description
`audio_chunk_duration`	float	15.0	Target chunk duration (seconds) when splitting long text.
`audio_chunk_threshold`	float	30.0	Estimated audio duration (seconds) above which chunking is activated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generation Parameters

Decoding

Sampling

Duration & Speed

Pre/Post Processing

Long-Form Generation

FilesExpand file tree

generation-parameters.md

Latest commit

History

generation-parameters.md

File metadata and controls

Generation Parameters

Decoding

Sampling

Duration & Speed

Pre/Post Processing

Long-Form Generation