Text to Speech MagpieTTS Multilingual Demo

MagpieTTS is NVIDIA's state-of-the-art multilingual text-to-speech system built on the NeMo framework. It uses a two-stage pipeline: a language model generates discrete acoustic tokens from text, which a neural audio codec (NanoCodec) then decodes into high-fidelity audio. The model offers five English speakers - Sofia, Aria, Jason, Leo, and John Van Stan - each of whom can speak nine languages (En, Es, De, Fr, Vi, It, Zh, Hi, Ja). A transformer encoder-decoder predicts the discrete audio codec tokens autoregressively, employing multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement for high-quality audio generation, and it leverages techniques such as attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment.
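To make the classifier-free guidance step concrete, here is a minimal, self-contained sketch of the standard CFG blend applied to next-token logits, of the kind used when sampling acoustic codec tokens. This is an illustration only; the function name and the logit values are invented and are not MagpieTTS code.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    """Blend text-conditioned and unconditional logits with a guidance scale.

    scale = 1.0 reproduces the conditional logits; scale > 1 sharpens the
    distribution toward the text-conditioned prediction.
    """
    cond = np.asarray(cond_logits, dtype=float)
    uncond = np.asarray(uncond_logits, dtype=float)
    return uncond + scale * (cond - uncond)

# Toy logits over a 3-token codec vocabulary (values made up).
cond = np.array([2.0, 0.5, -1.0])
uncond = np.array([1.0, 1.0, 0.0])
guided = cfg_logits(cond, uncond, scale=2.0)
```

In practice the unconditional pass drops (or masks) the text conditioning, and the guided logits are then fed to the usual sampling step for each codebook.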

Key Features of the model

  • Multilingual Support — Synthesizes natural speech in English, French, Spanish, German, Vietnamese, Italian, Mandarin, Hindi, and Japanese.
  • Expressive Voices — Multiple voice options with emotional tones and gender variations, including 4 proprietary voices and 1 public voice.
  • Text Normalization — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages.
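The sketch below illustrates what text normalization does before synthesis: expanding digits and abbreviations into speakable words. The rules here are invented for demonstration; the actual pipeline (e.g. NeMo's WFST-based normalizers) handles whole numbers, dates, currency, and much more, per language.

```python
import re

# Toy rule tables -- illustrative only, not the model's normalizer.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Expand abbreviations and spell out digits one by one."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # A real normalizer verbalizes whole numbers ("42" -> "forty two"),
    # not individual digits as done here.
    text = re.sub(r"\d", lambda m: f" {DIGITS[int(m.group())]} ", text)
    return " ".join(text.split())

print(normalize("Call Dr. Lee at 9"))  # → Call Doctor Lee at nine
```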

Note about the demo:

  • Text normalization takes time to load, so if you want faster generation select "Do not apply TN."
  • Text normalization works for En, Es, De, Fr, It, Zh, Vi, Hi, Ja. Support for Vi, Hi, and Ja text normalization was added in the V26.02 checkpoint.
  • Text normalization is required for numbers to be processed and spoken.
  • When the model runs on ZeroGPU hardware, expect slower generation because the model checkpoint is loaded every time.
  • As the model's speakers are all native English speakers, expect accented speech in the other languages.
  • The current model can generate long-form speech (more than 20 seconds) in English. However, longer generations can time out due to the Hugging Face timeout limit.
  • Loan words are not supported at this time. For example, English characters in Mandarin will lead to unexpected results.
  • For the enterprise offering, see the MagpieTTS NIM, which includes additional native voices in the supported languages, emotional speech capabilities, and an inference pipeline optimized for batching and latency.

Target Language

Select the target language in which the speech will be synthesized.

Target Speaker

Select the target speaker whose voice you would like the speech to be synthesized in.

Apply Text Normalization

Select whether to apply text normalization to the input text.
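The three controls above map naturally onto an inference call. The function name `synthesize` and its signature below are hypothetical, invented for illustration; they are not the demo's or NeMo's actual API. The language codes and speaker names come from the model description above.

```python
# Supported values taken from the model description; the API itself is
# a hypothetical sketch.
LANGUAGES = {"en", "es", "de", "fr", "vi", "it", "zh", "hi", "ja"}
SPEAKERS = {"Sofia", "Aria", "Jason", "Leo", "John Van Stan"}

def synthesize(text, language="en", speaker="Sofia", apply_tn=True):
    """Validate the demo's three controls and build an inference request."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker}")
    # A real implementation would run text normalization (if requested),
    # encode the text, sample codec tokens, and decode audio here.
    return {"text": text, "language": language,
            "speaker": speaker, "apply_tn": apply_tn}

request = synthesize("Hola, ¿cómo estás?", language="es", speaker="Aria")
```

Remember that numbers in the input are only verbalized when `apply_tn` is left on, per the notes above.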