Generation Parameters

Parameters can be passed as keyword arguments to model.generate(...) or via the OmniVoiceGenerationConfig dataclass. The sections below list every parameter, grouped by category.

# 1) Direct keyword arguments
audio = model.generate(text="Hello world", num_step=32, guidance_scale=2.0)

# 2) Via OmniVoiceGenerationConfig dataclass
from omnivoice import OmniVoiceGenerationConfig

config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)
audio = model.generate(text="Hello world", generation_config=config)

Decoding

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_step | int | 32 | Number of iterative unmasking steps. Higher values improve quality but slow down generation. Use 16 for faster inference. |
| denoise | bool | True | Prepend the `< |
| guidance_scale | float | 2.0 | Classifier-free guidance scale. |
| t_shift | float | 0.1 | Time-step shift for the noise schedule. Smaller values emphasise earlier steps in decoding. |
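
To build intuition for how a time-step shift can skew a schedule, the sketch below applies one commonly used shift mapping, t' = s·t / (1 + (s − 1)·t), to evenly spaced steps. This formula is an assumption chosen for illustration; the source does not state which formula OmniVoice uses, and `shifted_schedule` is a hypothetical helper, not part of the library.

```python
def shifted_schedule(num_step, t_shift):
    """Return num_step + 1 time points in [0, 1] after applying a shift.

    Assumed mapping (illustrative only): t' = s*t / (1 + (s - 1)*t).
    With s == 1 the schedule is uniform; with s < 1 the points crowd
    toward 0, so more steps are spent on the early part of the interval.
    """
    if num_step < 1:
        raise ValueError("num_step must be at least 1")
    if t_shift <= 0:
        raise ValueError("t_shift must be positive")
    points = []
    for i in range(num_step + 1):
        t = i / num_step  # evenly spaced time in [0, 1]
        points.append(t_shift * t / (1.0 + (t_shift - 1.0) * t))
    return points


uniform = shifted_schedule(4, 1.0)   # no shift: evenly spaced points
skewed = shifted_schedule(4, 0.1)    # small shift: points crowd toward 0
```

Under this assumed mapping, a small shift value places most of the time points near the start of the interval, which is one way to read "emphasise earlier steps".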

Sampling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| position_temperature | float | 5.0 | Temperature for mask-position selection. 0 = greedy (deterministic). Higher values increase randomness. |
| class_temperature | float | 0.0 | Temperature for token sampling at each step. 0 = greedy (deterministic). Higher values increase randomness. |
| layer_penalty_factor | float | 5.0 | Penalty applied to deeper codebook layers, encouraging earlier (lower) layers to unmask first. |
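
The "0 = greedy, higher = more random" behaviour of a temperature can be sketched with generic temperature sampling over a list of scores. This is a minimal illustration of the general technique, not OmniVoice's internal sampler; `sample_with_temperature` is a hypothetical helper.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=None):
    """Pick an index into `logits`; temperature 0 means greedy argmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy: always return the highest-scoring index (deterministic).
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Divide by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices normalises the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


scores = [0.1, 2.0, 0.5]
greedy_pick = sample_with_temperature(scores, 0.0)                      # always index 1
random_pick = sample_with_temperature(scores, 5.0, random.Random(0))    # any index may be drawn
```

A larger temperature flattens the distribution, so lower-scoring entries are drawn more often.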

Duration & Speed

Both duration and speed accept either a single value applied to all items or a per-item list (useful in batch mode):

# Fixed 10-second output
audio = model.generate(text="Hello, this is a test of duration control", duration=10.0)

# Faster speech (1.2x faster than estimated)
audio = model.generate(text="Hello, this is a test of duration control", speed=1.2)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| duration | float or list[float \| None] | None | Fixed output duration in seconds. Overrides speed when set. |
| speed | float or list[float \| None] | None | Speed factor. Values > 1.0 produce shorter audio (faster); values < 1.0 produce longer audio (slower). Ignored when duration is set. Defaults to 1.0 when both are None. |

Priority: duration > speed.
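
The priority rule and the per-item list form can be sketched as a small resolution function. This is an illustration under stated assumptions: `resolve_durations` is a hypothetical helper, and dividing the estimated duration by the speed factor is an assumed reading of "1.2x faster than estimated", not confirmed library behaviour.

```python
def resolve_durations(estimated, duration=None, speed=None):
    """Return the target duration (seconds) for each item in a batch.

    `estimated` holds the estimated duration per item. `duration` and
    `speed` may each be a single value, a per-item list, or None.
    """
    n = len(estimated)

    def broadcast(value):
        # A list is taken per item; a scalar (or None) applies to every item.
        if isinstance(value, (list, tuple)):
            if len(value) != n:
                raise ValueError("per-item list must match the batch size")
            return list(value)
        return [value] * n

    resolved = []
    for est, dur, spd in zip(estimated, broadcast(duration), broadcast(speed)):
        if dur is not None:
            # duration takes priority and speed is ignored for this item.
            resolved.append(float(dur))
        else:
            # speed falls back to 1.0 when neither parameter is set.
            factor = 1.0 if spd is None else spd
            resolved.append(est / factor)  # assumed: faster speech => shorter audio
    return resolved


# Item 0 gets a fixed 10 s; item 1 has no duration, so speed applies.
targets = resolve_durations([6.0, 6.0], duration=[10.0, None], speed=1.2)
```

Using None inside a per-item list lets a single batch mix fixed-duration items with speed-controlled ones.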

Pre/Post Processing

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| preprocess_prompt | bool | True | Apply preprocessing to the voice-clone prompt (remove long silences from the reference audio, add punctuation at the end of the reference text). |
| postprocess_output | bool | True | Apply post-processing to the generated audio (remove long silences). |

Long-Form Generation

To keep long-form generation stable with low VRAM consumption, the text is automatically split into smaller segments when the estimated duration of the generated speech exceeds audio_chunk_threshold. Each segment produces approximately audio_chunk_duration seconds of audio. This allows the model to accept arbitrarily long text and generate arbitrarily long speech with near-constant VRAM consumption.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| audio_chunk_duration | float | 15.0 | Target chunk duration (seconds) when splitting long text. |
| audio_chunk_threshold | float | 30.0 | Estimated audio duration (seconds) above which chunking is activated. |
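
The interaction between the two parameters can be sketched as a greedy sentence-level splitter. This is a rough illustration only: `estimate_seconds` (a fixed characters-per-second rate) and `split_for_long_form` are hypothetical helpers, and the library's actual duration estimator and splitting rules may differ.

```python
import re


def estimate_seconds(text, chars_per_second=15.0):
    """Hypothetical duration estimate based on a fixed speaking rate."""
    return len(text) / chars_per_second


def split_for_long_form(text, audio_chunk_duration=15.0, audio_chunk_threshold=30.0):
    """Split text into chunks of roughly audio_chunk_duration seconds each.

    Text whose estimated duration is at or below audio_chunk_threshold
    is returned as a single chunk.
    """
    text = text.strip()
    if estimate_seconds(text) <= audio_chunk_threshold:
        return [text]
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > audio_chunk_duration:
            # Adding this sentence would overshoot the target: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = " ".join(["This is a sentence."] * 60)
chunks = split_for_long_form(long_text)
```

Because each chunk is generated separately, peak memory depends on the chunk length instead of the total text length, which matches the near-constant VRAM behaviour described above.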