Eval/tts multilingual parakeet metrics#15826

Open
quapham wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
quapham:eval/tts-multilingual-parakeet-metrics

quapham commented Jun 24, 2026

Contributor

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Adds multilingual MagpieTTS evaluation support by enabling Japanese Katakana CER and multilingual Parakeet ASR target-language routing.

Collection: [TTS]

Changelog

  • Add Japanese Katakana CER for MagpieTTS evaluation.
    • Computes CER on the Katakana readings of the Japanese reference and the ASR hypothesis.
    • Restricts Katakana CER to Japanese datasets only.
    • Adds filewise and aggregate Katakana CER outputs.
  • Add multilingual Parakeet prompt ASR support for MagpieTTS evaluation.
    • Supports local .nemo ASR checkpoints for multilingual evaluation.
    • Maps eval language metadata to Parakeet target_lang.
    • Keeps Whisper / existing ASR behavior as fallback where applicable.

Usage

TESTSET_ROOT=/path/to/Magpietts_testset

python examples/tts/magpietts_inference.py \
  --hparams_files /path/to/hparams.yaml \
  --checkpoint_files /path/to/checkpoint.ckpt \
  --codecmodel_path /path/to/codec_model.nemo \
  --datasets_json_path "$TESTSET_ROOT/evalset.json" \
  --root "$TESTSET_ROOT" \
  --datasets ja_JP_jvs_jsut \
  --out_dir /path/to/eval_outputs/ja_JP_jvs_jsut \
  --run_evaluation \
  --use_local_transformer \
  --asr_model_name /path/to/multilingual_parakeet.nemo

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • [ x] Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ x] New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

quapham added 3 commits June 24, 2026 03:50
Adds a reading-based CER for Japanese, computed on the Katakana reading
(via pyopenjtalk g2p) of both reference and ASR hypothesis. Robust to
kanji/kana spelling differences that inflate raw character CER.

- text_to_katakana(): lazy-imports pyopenjtalk, returns '' if unavailable
  (graceful no-op for non-ja or environments without the dep).
- katakana_cer / gt_katakana / pred_katakana computed only when language=='ja',
  saved per-utterance in filewise metrics.
- katakana_cer_filewise_avg + katakana_cer_cumulative aggregated globally
  (only emitted for ja datasets), added to the results CSV header/rows.

Signed-off-by: quanpham <youngkwan199@gmail.com>
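
A minimal sketch of the reading-based CER described in this commit, not the PR's actual code. It assumes `pyopenjtalk.g2p(text, kana=True)` yields the Katakana reading. The helper names `edit_distance`, `katakana_cer`, and `aggregate_katakana_cer`, and the choice to return `None` when no reference reading is available, are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple


def text_to_katakana(text: str) -> str:
    """Return the Katakana reading of `text`, or '' if pyopenjtalk is missing."""
    try:
        import pyopenjtalk  # optional dependency, imported lazily
    except ImportError:
        return ""
    return pyopenjtalk.g2p(text, kana=True)


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, keeping only one previous row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def katakana_cer(ref_kana: str, hyp_kana: str) -> Optional[float]:
    """CER over readings; None when there is no reference reading (assumption)."""
    if not ref_kana:
        return None
    return edit_distance(ref_kana, hyp_kana) / len(ref_kana)


def aggregate_katakana_cer(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Filewise average (mean of per-file CER) and cumulative CER (pooled edits)."""
    scored = [(edit_distance(r, h), len(r)) for r, h in pairs if r]
    if not scored:
        return {}  # nothing to emit, e.g. for non-Japanese datasets
    return {
        "katakana_cer_filewise_avg": sum(e / n for e, n in scored) / len(scored),
        "katakana_cer_cumulative": sum(e for e, _ in scored) / sum(n for _, n in scored),
    }
```

Because both sides are converted to readings before scoring, a hypothesis that spells a word in kana where the reference uses kanji contributes no edits as long as the readings match.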
Add target language mapping for multilingual Parakeet prompt ASR checkpoints during MagpieTTS evaluation. Local .nemo ASR models can now be used for non-English evalsets, while Whisper remains the fallback when no NeMo ASR model is provided. Japanese Katakana CER remains guarded to Japanese datasets only.

Signed-off-by: quanpham <youngkwan199@gmail.com>
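
The routing described in this commit could look roughly like the sketch below. Everything here is hypothetical: the `LANGUAGE_TO_TARGET_LANG` table, its codes, and the function names are placeholders, since the real set of `target_lang` values is defined by the Parakeet checkpoint and is not shown in this PR page.

```python
from typing import Optional

# Hypothetical table mapping the evalset language to a Parakeet target_lang.
# The actual codes supported by a given checkpoint may differ.
LANGUAGE_TO_TARGET_LANG = {"en": "en", "ja": "ja", "hi": "hi", "zh": "zh"}


def select_asr_backend(asr_model_name: Optional[str]) -> str:
    """Use the NeMo ASR model when one is given, otherwise fall back to Whisper."""
    return "whisper" if asr_model_name is None else "nemo"


def resolve_target_lang(language: str) -> str:
    """Look up the prompt target_lang; fail loudly on unmapped languages."""
    try:
        return LANGUAGE_TO_TARGET_LANG[language]
    except KeyError:
        raise ValueError(f"No target_lang mapping for language {language!r}") from None
```

Raising on an unmapped language, rather than silently defaulting to English, is a design choice of this sketch so that a misconfigured evalset does not produce misleading CER numbers.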
Add --root to MagpieTTS inference so evalset manifest_path and audio_dir entries can remain relative. Also use the evalset language field for evaluation, preserving whisper_language as a legacy fallback, which is required for multilingual Parakeet target_lang selection.

Signed-off-by: quanpham <youngkwan199@gmail.com>
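
As an illustration of the behavior this commit describes, the sketch below resolves relative `manifest_path` and `audio_dir` entries against a root directory and reads the language with `whisper_language` as a legacy fallback. The function names and the final `"en"` default are assumptions made for the example, not taken from the PR.

```python
from pathlib import Path
from typing import Optional


def resolve_dataset_paths(info: dict, root: Optional[Path]) -> dict:
    """Return a copy of `info` with relative paths joined onto `root`."""
    resolved = dict(info)
    for key in ("manifest_path", "audio_dir"):
        path = Path(info[key])
        # Absolute paths are kept untouched; relative ones are anchored at root.
        if root is not None and not path.is_absolute():
            path = root / path
        resolved[key] = str(path)
    return resolved


def dataset_language(info: dict, default: str = "en") -> str:
    """Prefer `language`, then legacy `whisper_language`, then a default."""
    return info.get("language") or info.get("whisper_language") or default
```

With `--root "$TESTSET_ROOT"`, an evalset entry such as `"manifest_path": "ja/manifest.json"` would then resolve under the testset directory, so the JSON file stays portable across machines.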
copy-pr-bot Bot commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added the TTS label Jun 24, 2026
if gt_audio_text is not None:
gt_audio_text = gt_audio_text.replace(" ", "")
else:
pred_text = pred_text
gt_audio_text = gt_audio_text.replace(" ", "")
else:
pred_text = pred_text
gt_text = gt_text
# Remove Hindi-specific punctuation (danda, double danda)
input_text = re.sub(r'[।॥॰]', '', input_text)
# Remove Mandarin-specific punctuation
input_text = re.sub(r'[,。!?;:""''()【】《》〈〉「」『』、…·~—–\u3000]', '', input_text)
Comment on lines -80 to -99
# Validate that all evaluation datasets exist
for dataset_name, info in dataset_meta_info.items():
    manifest_path = Path(info["manifest_path"])
    audio_dir = Path(info["audio_dir"])

    if dataset_base_path:
        # Replace relative paths with absolute paths where appropriate
        if not manifest_path.is_absolute():
            manifest_path = dataset_base_path / manifest_path
            info["manifest_path"] = str(manifest_path)

        if not audio_dir.is_absolute():
            audio_dir = dataset_base_path / audio_dir
            info["audio_dir"] = str(audio_dir)

    if not manifest_path.exists():
        raise ValueError(f"Manifest does not exist for dataset {dataset_name}: {manifest_path}")

    if not audio_dir.exists():
        raise ValueError(f"Audio directory does not exist for dataset {dataset_name}: {audio_dir}")

Collaborator

Why did we delete and rewrite this code? The existing implementation looks more readable and has better error handling.

type=Path,
default=None,
help='Optional base path that paths in the "datasets_json_path" file are relative to',
'--root',

Collaborator

'datasets_base_path' is a more specific name than 'root'. We could rename it 'dataset_root_path' if we think that is clearer.

logging.info(f"Doing batched ASR transcription with batch size {asr_batch_size}...")

# Transcribe predicted audios
text_processor = get_text_processor(language)

Collaborator

Please implement the new text processing within the get_text_processor(language) function, as new implementations of TextProcessor. https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/tts/parts/utils/tts_dataset_utils.py#L881
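
One way to read this request is sketched below, under the loud assumption that a text processor can be modeled as a plain callable from `str` to `str`. The real base class in `tts_dataset_utils.py` may have a different interface, and `get_text_processor_sketch` is a hypothetical stand-in rather than the NeMo function. The Mandarin character class is a short illustrative subset, not the full set from the diff.

```python
import re
from typing import Callable, Dict

TextProcessorFn = Callable[[str], str]  # hypothetical processor type


def strip_hindi_punctuation(text: str) -> str:
    """Remove danda, double danda, and the Devanagari abbreviation sign."""
    return re.sub(r"[।॥॰]", "", text)


def strip_mandarin_punctuation(text: str) -> str:
    """Remove a few full-width marks and the ideographic space."""
    return re.sub(r"[\u3002\u3001\uff0c\uff01\uff1f\u3000]", "", text)


# Per-language registry, so the evaluation loop needs no language branches.
_PROCESSORS: Dict[str, TextProcessorFn] = {
    "hi": strip_hindi_punctuation,
    "zh": strip_mandarin_punctuation,
}


def get_text_processor_sketch(language: str) -> TextProcessorFn:
    """Return the processor for `language`, or identity if none is registered."""
    return _PROCESSORS.get(language, lambda text: text)
```

Keeping the language-specific rules behind one lookup means the transcription code only calls `text_processor(text)` and never inspects the language itself.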


gt_text = gt_texts_processed[ridx]

if language in ("zh", "zh-CN", "zh-TW"):

Collaborator

We should not be representing both languages (like "zh") and locales (like "zh-CN") interchangeably. I think referring to locales as 'language' is a misnomer and it is going to be very confusing. The language and locale should either be passed around as separate arguments/variables, or we should refactor the code and configs to only use locales.
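
A small sketch of the separation suggested here, assuming locale tags of the form `language-REGION` or `language_REGION`. The `Locale` class is hypothetical and handles only these two-part tags, not full BCP 47 parsing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locale:
    """Keeps the language and the region as separate, explicit fields."""

    language: str
    region: Optional[str] = None

    @classmethod
    def parse(cls, tag: str) -> "Locale":
        # Accept both "zh-CN" and "zh_CN"; a bare "zh" has no region.
        parts = tag.replace("_", "-").split("-")
        language = parts[0].lower()
        region = parts[1].upper() if len(parts) > 1 and parts[1] else None
        return cls(language, region)
```

With this in place, a check such as `language in ("zh", "zh-CN", "zh-TW")` could become `Locale.parse(tag).language == "zh"`, which states the intent without mixing the two concepts.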
