Azure AI Speech launches new zero-shot TTS models for Personal Voice

Qinying Liao · Jan 31 2024

At the Ignite conference on Nov 15, 2023, we announced the public preview of Personal Voice, which is specifically designed to enable customers to build apps that allow their users to easily create and use their own AI voices (see the blog).

Today we're thrilled to announce that Azure AI Speech Service has upgraded its Personal Voice feature with new zero-shot TTS (text-to-speech) models. Compared to the initial model, these new models improve the naturalness of synthesized voices and better resemble the speech characteristics of the voice in the prompt.

In this blog, we'll explore how new zero-shot TTS models enable users to create a more natural sounding voice that captures their unique speech characteristics. We'll also provide a step-by-step guide on how to integrate the personal voice capability into your apps using the Personal Voice API with different zero-shot TTS models.

Zero-shot model upgrades

The Personal Voice capability in Azure AI Speech Service allows customers to create personalized synthetic voices for their users based on their unique speech characteristics. With Personal Voice, users can have AI replicate their voice in a few seconds by providing just a short speech sample as the audio prompt, and then use it to generate speech in any of the 100 languages supported. This feature can be used for various use cases, such as personalizing the voice experience for a chatbot, or dubbing video content in different languages with the actor’s native voice.

Zero-shot TTS or foundation TTS models have evolved rapidly in the past year. Industry and academia have proposed various approaches to advance the technology, including Microsoft’s state-of-the-art research models such as VALL-E (X), FoundationTTS, and NaturalSpeech. These models are typically trained on large amounts of speech data to cover different text content and voice characteristics, such as timbre, speech styles and accents. With that training, a model gains the zero-shot text-to-speech ability to clone a voice from very little data from the target speaker, through modules such as auto-regressive transformers or diffusion.

Every model has its strengths and weaknesses, and each customer's needs are unique, so we offer a variety of base models for Personal Voice customers to choose from based on their scenarios. Our latest addition, the “DragonLatestNeural” model, delivers more realistic prosody, higher fidelity, and personalized voices that mimic the nuances of the human speaker in the prompt across various speech characteristics. This model is currently optimized for content generation scenarios where expressiveness matters most and latency is less of a concern. Our updated “PhoenixLatestNeural” model also improves the similarity of the voice to the human speaker while maintaining low latency and higher pronunciation accuracy, making it ideal for real-time scenarios. Both models have been significantly improved and were trained on 10x more data than the previous “PhoenixV2Neural” model.

Here are a few voice samples with different speaking styles, generated from the latest Dragon model:

Audio prompt (human voice)	Style	Generated speech and the script
	Voice assistant	Good morning! Today's weather is sunny with a high of 75 degrees. You have two meetings scheduled and a reminder to call your mom. How can I assist you further?
	News	In today's news, a major breakthrough in renewable energy has been achieved by researchers at the GreenTech Innovation Lab. The team, led by Dr. Emily Huang, announced the development of a new solar panel technology that promises to double the efficiency of current models. This significant advancement could lead to a substantial reduction in solar energy costs.
	Conversation	Hey, everyone, it's Lisa. So, about dinner tonight, I'm trying to decide whether to cook or maybe order something. Um, I'm thinking pasta with garlic bread sounds good, but then I saw this new Thai place nearby, and their menu looks really tempting.
	Whisper	Sure, here is the note. What else can I do for you?
	Excited	Guess what? I just won the lottery – we're going on a dream vacation!
	Shout	Watch out! The ball is heading right towards you!

Below are samples of two voices speaking different languages with zero-shot TTS:

	Female	Male
	Prompt	Prompt
朋友们，你们真是太给力了。昨天我发了一条视频，希望有人来关注我，结果真的有很多朋友关注我了。我内心非常激动，而且还有人私信我说喜欢我，这让我很感动，感受到了大伙们的友善，谢谢你们!
Okay, um, so, like, I was, uh, trying to, you know, explain this thing to, um, my friend the other day, and, well, I just couldn't, like, find the right, uh, words? And then, you know, I thought maybe, um, I was just, kind of, overthinking it or, um, something. It's just, uh, sometimes hard to, you know, put thoughts into, um, words, you know? How was that? Anything else on your mind?
Once upon a time, there was a little rabbit named Benny. Benny loved to hop around in the fields and nibble on carrots. One day, while he was out exploring, he stumbled upon a beautiful garden filled with all sorts of delicious vegetables.
C'era una volta un piccolo coniglio di nome Benny. A Benny piaceva saltellare nei campi e rosicchiare le carote. Un giorno, mentre esplorava, incappò in un bellissimo giardino pieno di verdure deliziose. Benny non riuscì a resistere e iniziò a mangiare la fresca lattuga e i pomodori succosi.

Customer case

GRUP MEDIAPRO, a global media company and the leader of the European audiovisual market, has partnered with Microsoft to respond to the profound transformations in the media sector brought about by AI. In a recent announcement at the ISE fair on Jan 30th, the company unveiled its Artificial Intelligence and Synthetic Media Laboratory, which has been developed in partnership with Microsoft, as part of its commitment to innovation (read the news here).

The lab leverages the latest advances in AI, including zero-shot TTS, to support research and solution development in fields such as personalization of audiovisual and digital content, voice cloning, and video processing. GRUP MEDIAPRO and Microsoft approach this collaboration with a people-centered focus, while upholding legal commitments and ethical principles in the development, deployment, and use of Artificial Intelligence solutions.

You can listen to the story in the CEO's own personal voice, which was created using the Dragon zero-shot TTS model, in Chinese and Arabic. Tatxo Benet, the CEO, leads the way in showcasing the capabilities of this new technology, which has enabled him to reach a global audience in languages he doesn’t speak himself.

How to use it

As part of Microsoft's commitment to responsible AI, Personal Voice is designed with the intention of protecting the rights of individuals and society, fostering transparent human-computer interaction, and counteracting the proliferation of harmful deepfakes and misleading content. For this reason, Personal Voice is a Limited Access feature available by registration only, and only for certain use cases. To access the API and use the feature in your business applications, register your use case here and apply for access.

Once you have access, you can start to build your personal voice project. Use the Projects_Create operation of the custom voice API to create a personal voice project. Check out more instructions here.

Then you can follow these steps to create a personal voice using zero-shot TTS with the Dragon model.

First, you'll need to provide the user’s consent for creating a voice profile.

With the Personal Voice feature, it's required that every voice be created with explicit consent from the user. A recorded statement from the user is required acknowledging that the customer (Azure AI Speech resource owner) will create and use their voice. You can find the template of the verbal statement in different languages here.

Follow the samples here to add a consent from a file or from a URL.

Once consent has been added, you can record and upload the user's voice sample to create a speaker profile ID.

To use Personal Voice in your application, you need to get a speaker profile ID. The speaker profile ID is used to generate synthesized audio with the text input provided. You create a speaker profile ID based on the speaker's verbal consent statement and an audio prompt.

Follow the code samples here to create a speaker profile ID from a file or from a URL.

After the speaker profile ID is created, you can use the zero-shot TTS feature with the selected base model to synthesize speech that matches the speaker's natural speaking style, intonation, and accent.

To get a list of supported base model voice names, use the BaseModels_List operation of the custom voice API. Note that DragonLatestNeural and PhoenixLatestNeural are evolving models, so their performance may vary as they are updated. PhoenixV2Neural is stable and receives no further updates, which ensures consistent performance. Select the base model that best meets your needs, and use the speaker profile ID to synthesize speech in any of the 100 languages supported.
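As a rough illustration of that selection step, here is a minimal Python sketch that maps a scenario to one of the three base model names mentioned in this post. The function name, its parameters, and the decision rules are hypothetical and only paraphrase the guidance above; the optional `available` argument stands in for whatever list of names you obtained from BaseModels_List.

```python
from typing import Iterable, Optional


def choose_base_model(real_time: bool, stable: bool = False,
                      available: Optional[Iterable[str]] = None) -> str:
    """Pick a base model name for a scenario (illustrative helper, not part of any SDK).

    real_time: True when low latency matters more than expressiveness.
    stable:    True when consistent output across service updates is required.
    available: optional names previously returned by BaseModels_List.
    """
    if stable:
        # Described in the post as stable, with no further updates.
        name = "PhoenixV2Neural"
    elif real_time:
        # Described as low latency and suited to real-time scenarios.
        name = "PhoenixLatestNeural"
    else:
        # Described as optimized for expressive content generation.
        name = "DragonLatestNeural"

    # If the caller supplied the list of supported names, check against it.
    if available is not None and name not in set(available):
        raise ValueError(f"{name} is not among the available base models")
    return name


if __name__ == "__main__":
    print(choose_base_model(real_time=False))
    print(choose_base_model(real_time=True))
    print(choose_base_model(real_time=True, stable=True))
```

Treat the rules as a starting point only: the right choice depends on how your own content sounds with each model.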

A locale tag isn't required in the SSML when using zero-shot TTS. Personal Voice employs automatic language detection at the sentence level. Below is an SSML example using DragonLatestNeural to generate speech for your personal voice in different languages. More details are provided here.

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
    <voice name='DragonLatestNeural'>
        <mstts:ttsembedding speakerProfileId='your speaker profile ID here'>
            I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun. 我很高兴听到你觉得我很了不起，我让你的旅行计划更轻松、更有趣。Je suis heureux d'apprendre que vous me trouvez incroyable et que j'ai rendu la planification de votre voyage plus facile et plus amusante.
        </mstts:ttsembedding>
    </voice>
</speak>
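If your app assembles this SSML at run time, the text and the speaker profile ID need to be XML-escaped so that characters such as & or < do not break the document. The following Python sketch shows one way to do that with the standard library. The function name and its defaults are hypothetical; the element and attribute names are copied from the example above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_personal_voice_ssml(text: str, speaker_profile_id: str,
                              base_model: str = "DragonLatestNeural") -> str:
    """Build SSML in the same shape as the example above (illustrative helper)."""
    ssml = (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>"
        # quoteattr() escapes the value and wraps it in quotes.
        f"<voice name={quoteattr(base_model)}>"
        f"<mstts:ttsembedding speakerProfileId={quoteattr(speaker_profile_id)}>"
        # escape() handles &, < and > inside the spoken text.
        f"{escape(text)}"
        "</mstts:ttsembedding></voice></speak>"
    )
    # Parse once so a malformed document fails here rather than at synthesis time.
    ET.fromstring(ssml)
    return ssml


if __name__ == "__main__":
    print(build_personal_voice_ssml(
        "Q&A starts now. 问答环节现在开始。",
        "your speaker profile ID here",
    ))
```

The resulting string can then be passed to whichever SSML synthesis call your app already uses, for example `speak_ssml_async` in the Speech SDK for Python.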

All customers must comply with the Guidelines for responsible deployment of synthetic voice technology and the code of conduct when using the service.

Get started

With the newly released zero-shot TTS models, the Personal Voice feature of Azure AI Speech now delivers higher quality and generates speech that captures the nuances of the user’s natural speech. Personal Voice is a Limited Access feature available by registration only, and only for certain use cases. To get started, register your use case here and apply for access.

In addition to creating personal voices for your users, you can create a brand voice for your business with Custom Neural Voice’s professional voice feature. Azure AI Speech also offers over 400 neural voices covering more than 140 languages and locales. With these pre-built text-to-speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users.
