Voices on Amazon’s Alexa, Google Assistant, and other AI assistants are far ahead of old-school GPS devices, but they still lack the rhythm, intonation, and other qualities that make speech sound human. NVIDIA has unveiled new research and tools that can capture these natural vocal qualities by letting you train an AI system with your own voice, the company announced at the Interspeech 2021 conference.

To improve its AI voice synthesis, NVIDIA’s text-to-speech research team developed a model called RAD-TTS, a winning entry in an NAB broadcast convention competition to develop the most realistic avatar. The system allows an individual to train a text-to-speech model with their own voice, including its rhythm, tone, timbre, and more.

Another feature of RAD-TTS is voice conversion, which lets a user deliver one speaker’s words in another person’s voice. The interface gives fine, frame-level control over the pitch, duration, and energy of a synthesized voice.

Using this technology, NVIDIA’s researchers created more conversational voice narration for the company’s own I Am AI video series, using synthesized rather than human voices. The goal was for the narration to match the tone and style of the videos, something that has not been done well in many AI-narrated videos to date. The results are still a bit robotic, but better than any AI narration I’ve ever heard.

“With this interface, our video producer could record himself reading the video script, and then use the AI model to convert his speech into the narrator’s voice. Using this baseline narration, the producer could then direct the AI like a voice actor, fine-tuning the synthesized speech to emphasize specific words and adjusting the pacing of the narration to better express the video’s tone,” NVIDIA wrote.

NVIDIA is distributing some of this research (optimized to run efficiently on NVIDIA GPUs, of course) to anyone who wants to try it, as open source through the NVIDIA NeMo Python toolkit for GPU-accelerated conversational AI, available on the company’s NGC hub of containers and other software.
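For a sense of what the NeMo toolkit looks like in practice, here is a minimal text-to-speech sketch. It assumes two of the pretrained checkpoints NeMo publishes on NGC, FastPitch as the spectrogram generator and HiFi-GAN as the vocoder; the checkpoint names and the 22,050 Hz output rate are assumptions that can vary across NeMo versions, and this is not the RAD-TTS model described above.

```python
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Load pretrained checkpoints from NGC (names assumed from NeMo's published
# model list; they may differ between toolkit versions).
spec_gen = FastPitchModel.from_pretrained("tts_en_fastpitch").eval()
vocoder = HifiGanModel.from_pretrained("tts_hifigan").eval()

# Text -> token IDs -> mel spectrogram -> waveform.
tokens = spec_gen.parse("Hello, this is a synthesized voice.")
spectrogram = spec_gen.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Write the result to disk (the English FastPitch checkpoint is a
# 22,050 Hz model).
sf.write("speech.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```

The two-stage split (a spectrogram generator followed by a vocoder) is the standard structure of NeMo’s TTS pipelines, which makes it straightforward to swap either stage for a fine-tuned model.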

“Several models are trained with tens of thousands of hours of audio data on NVIDIA DGX systems. Developers can fine-tune any model for their use cases, speeding up training using mixed-precision computing on NVIDIA Tensor Core GPUs,” the company wrote.
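As a rough illustration of the mixed-precision setup mentioned in that quote, the sketch below builds a PyTorch Lightning trainer with FP16 precision (NeMo training runs on Lightning) and attaches it to a pretrained model for fine-tuning. The dataset wiring and the call to trainer.fit() are omitted because they depend on your data; the checkpoint name is the same assumption as above, and complete recipes ship with NeMo’s examples.

```python
import pytorch_lightning as pl
from nemo.collections.tts.models import FastPitchModel

# FP16 mixed precision: on NVIDIA Tensor Core GPUs the matrix math runs in
# half precision while master weights are kept in FP32.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    max_epochs=100,
)

# Start from a pretrained NGC checkpoint and attach the trainer; from here a
# real recipe would configure the training data and call trainer.fit(model).
model = FastPitchModel.from_pretrained("tts_en_fastpitch")
model.set_trainer(trainer)
```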

Editor’s Note: This post originally appeared on Engadget.

