Text to Speech MagpieTTS Multilingual Demo

MagpieTTS is NVIDIA's state-of-the-art multilingual text-to-speech system built on the NeMo framework. It uses a two-stage pipeline: a language model generates discrete acoustic tokens from text, which a neural audio codec (NanoCodec) then decodes into high-fidelity audio. The model offers five English speakers - Sofia, Aria, Jason, Leo, and John Van Stan - each of whom can speak nine languages (En, Es, De, Fr, Vi, It, Zh, Hi, Ja). A transformer encoder-decoder predicts the discrete audio codec tokens autoregressively, employing multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement for high-quality audio generation, and it leverages techniques such as attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment.
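To make the classifier-free guidance step concrete, here is a minimal, self-contained sketch of the standard CFG blend applied to next-token logits, of the kind used when sampling acoustic codec tokens. This is an illustration only; the function name and the logit values are invented and are not MagpieTTS code.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    """Blend text-conditioned and unconditional logits with a guidance scale.

    scale = 1.0 reproduces the conditional logits; scale > 1 sharpens the
    distribution toward the text-conditioned prediction.
    """
    cond = np.asarray(cond_logits, dtype=float)
    uncond = np.asarray(uncond_logits, dtype=float)
    return uncond + scale * (cond - uncond)

# Toy logits over a 3-token codec vocabulary (values made up).
cond = np.array([2.0, 0.5, -1.0])
uncond = np.array([1.0, 1.0, 0.0])
guided = cfg_logits(cond, uncond, scale=2.0)
```

In practice the unconditional pass drops (or masks) the text conditioning, and the guided logits are then fed to the usual sampling step for each codebook.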

Key Features of the model

  • Multilingual Support — Synthesizes natural speech in English, French, Spanish, German, Vietnamese, Italian, Mandarin, Hindi, and Japanese.
  • Expressive Voices — Multiple voice options with emotional tones and gender variations, including 4 proprietary voices and 1 public voice.
  • Text Normalization — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages.
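The sketch below illustrates what text normalization does before synthesis: expanding digits and abbreviations into speakable words. The rules here are invented for demonstration; the actual pipeline (e.g. NeMo's WFST-based normalizers) handles whole numbers, dates, currency, and much more, per language.

```python
import re

# Toy rule tables -- illustrative only, not the model's normalizer.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Expand abbreviations and spell out digits one by one."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # A real normalizer verbalizes whole numbers ("42" -> "forty two"),
    # not individual digits as done here.
    text = re.sub(r"\d", lambda m: f" {DIGITS[int(m.group())]} ", text)
    return " ".join(text.split())

print(normalize("Call Dr. Lee at 9"))  # → Call Doctor Lee at nine
```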

Note about the demo:

  • Text normalization takes time to load, so if you want faster generation select "Do not apply TN."
  • Text normalization works for En, Es, De, Fr, It, Zh, Vi, Hi, Ja. Support for Vi, Hi, and Ja text normalization was added in the V26.02 checkpoint.
  • Text normalization is required for numbers to be processed and spoken.
  • When the model runs on ZeroGPU hardware, expect slower generation because the model checkpoint is loaded every time.
  • As the model's speakers are all native English speakers, expect accented speech in the other languages.
  • The current model can generate long-form speech (more than 20 seconds) in English. However, longer generations can time out due to the Hugging Face timeout limit.
  • Loan words are not supported at this time. For example, English characters in Mandarin will lead to unexpected results.
  • For the enterprise offering, see the MagpieTTS NIM, which includes additional native voices in the supported languages, emotional speech capabilities, and an inference pipeline optimized for batching and latency.

Target Language

Select the target language in which the speech will be synthesized.

Target Speaker

Select the target speaker whose voice you would like the speech to be synthesized in.

Apply Text Normalization

Select whether to apply text normalization to the input text.
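The three controls above map naturally onto an inference call. The function name `synthesize` and its signature below are hypothetical, invented for illustration; they are not the demo's or NeMo's actual API. The language codes and speaker names come from the model description above.

```python
# Supported values taken from the model description; the API itself is
# a hypothetical sketch.
LANGUAGES = {"en", "es", "de", "fr", "vi", "it", "zh", "hi", "ja"}
SPEAKERS = {"Sofia", "Aria", "Jason", "Leo", "John Van Stan"}

def synthesize(text, language="en", speaker="Sofia", apply_tn=True):
    """Validate the demo's three controls and build an inference request."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker}")
    # A real implementation would run text normalization (if requested),
    # encode the text, sample codec tokens, and decode audio here.
    return {"text": text, "language": language,
            "speaker": speaker, "apply_tn": apply_tn}

request = synthesize("Hola, ¿cómo estás?", language="es", speaker="Aria")
```

Remember that numbers in the input are only verbalized when `apply_tn` is left on, per the notes above.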