ElevenLabs is the best text-to-speech AI system

ElevenLabs at time of publication is the best text-to-speech AI system you can use. I write “at time of publication” because things are moving fast, especially in the AI world right now, but this has been true for at least the last several months.
Cerebral Palsy Alliance - My Voice Library
Recently Cerebral Palsy Alliance (CPA) tasked Kablamo with the job of upgrading My Voice Library (MVL) to support additional languages starting with Italian.
Over 50% of children with cerebral palsy have dysarthria – difficulties with speech – creating barriers to participating in everyday life. My Voice Library is a world-first data-driven application that uses gamification to create a voice and facial library designed for the purpose of enabling real-time communication solutions for children with dysarthria.
Internationalisation efforts for the most part are a solved problem, albeit one that produces a lot of work, as you create a lookup table for every value. Outside of supporting right-to-left languages and dealing with text overflows however this is something that’s fairly predictable and easy to estimate.
During the initial development of MVL we had planned to have people on the development team record their voices for the voice-overs that are used extensively within it. This quickly ran into problems: most people did not have good audio recording equipment or good environments to make recordings, and none of us are professionally trained voice actors.
As a result the decision was quickly made to use a professional studio to provide voice recordings for each of the 350 or so recordings we needed for each of the characters Jules and Sam in MVL.
The results above speak for themselves and really lifted the quality of the solution to a higher level, as well as being surprisingly affordable considering the amount of time it took.
I still believe that a human performing this work right now produces the best result, although I have no idea how long this will remain the case.
The Rise of AI
However that was done in early 2021, and the world of AI has since moved on. We have all observed the rise of the transformer model. As such, when quoting on adding Italian into MVL, one of the suggestions raised was to use an AI system to attempt to produce the voices. This was for a few reasons:
- The expected quantity of voices was going to be at least 2x what was previously needed, which would increase the voice recording costs.
- In theory we could save the voice for later use should we want to expand the solution, whereas professional voice actors might change jobs, breaking continuity.
- It would save us the headache of finding and vetting an Italian-speaking studio, sending the scripts off, and waiting for the recordings to come back, although this remained the expected fall-back plan.
- AI voices have evolved considerably in the last few years, so even if lacking in one or two areas the long term bet on using them might be worth it.
The decision was made early on to “spike” out looking for a text-to-speech solution. Things we were looking for included the ability to add some emotion to the voices, avoid anything sounding robotic or unnatural, and have a solution that was easily scriptable.
As such we tried the following solutions: AWS Polly, Google Cloud Text-to-Speech, Azure Text-to-Speech, Murf.ai and ElevenLabs. While there are more solutions out there, we had a strict time box in order to ensure we had time to go to market with our fallback of using people. For each solution some sample English text was run through the system. This allowed us to evaluate how easy it was to integrate with, and what the general quality was, as the team could evaluate it properly since English is their native language.
We then took some samples of Italian text we would need, and where possible switched the model to one appropriate for the language and generated more samples. These samples were then anonymised and delivered to the project participants to evaluate.
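The blind comparison itself is easy to script. The helper below is a hypothetical sketch (not our actual tooling) of how sample files can be shuffled and given neutral names so reviewers never see which provider produced which file:

```python
import random


def anonymise_samples(files, seed=None):
    """Shuffle sample files and assign neutral names, returning the blinded
    names plus an answer key mapping each blinded name back to its source."""
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    answer_key = {f"sample_{i + 1}.mp3": original
                  for i, original in enumerate(shuffled)}
    return sorted(answer_key), answer_key


# Reviewers receive only sample_1.mp3 ... sample_3.mp3; the key stays private.
names, key = anonymise_samples(
    ["elevenlabs.mp3", "polly.mp3", "azure_tts.mp3"], seed=42)
```

The filenames here are placeholders; the point is only that the mapping back to providers is held separately from what the evaluators hear.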
In every single case ElevenLabs came out in front, either with a clear win or a close draw. The voices used were described as “natural”, “with emotion”, “sounds good”, compared to some of the other solutions described as “robotic” and “terrible”.
In addition to this positive result, ElevenLabs was rated the easiest to use by the development team. While they were prepared to work with an annoying API given an excellent result, in the case of any draw ElevenLabs' ease of use was considered an advantage, since the only thing required was an API key after creating an account.
Python Implementation
Scripting against ElevenLabs turned out to be very straightforward and we had an implementation working against it fairly quickly. API implementations exist for both Python and Node; we elected to use Python out of personal preference.
The documentation is not perfect; however, the AI assistant provided works fairly well and was able to plug the gaps in the documentation with working code examples.
As a result we were able to craft the Python code below, which connects to ElevenLabs, generates the voice and saves it locally to disk. Compared to some other solutions, which used an intermediate storage solution and async processing, this was a huge advantage.
from elevenlabs.client import ElevenLabs
from elevenlabs import VoiceSettings
import os


class Voice:
    def __init__(self, name, lab_name, voice_id, voice_settings):
        self.name = name
        self.lab_name = lab_name
        self.voice_id = voice_id
        self.voice_settings = voice_settings


voices = [
    Voice(
        name='Maria',
        lab_name='Maria',
        voice_id='ELEVENLABS_VOICE_ID',
        voice_settings=VoiceSettings(
            speed=1.1,
            stability=0.7
        ),
    )
]


def text_to_speech_and_save(client, voice: Voice, text, output_filename, model_id='eleven_multilingual_v2'):
    """Convert text to speech and save the audio data to a file."""
    audio_data = client.text_to_speech.convert(
        voice_id=voice.voice_id,
        output_format="mp3_44100_128",
        text=text,
        model_id=model_id,
        voice_settings=voice.voice_settings,
    )
    # save to disk
    with open(output_filename, "wb") as f:
        for chunk in audio_data:
            f.write(chunk)
    return output_filename


if __name__ == '__main__':
    api_key = os.getenv('ELEVEN_LABS_KEY')
    client = ElevenLabs(
        api_key=api_key,
    )
    text_to_speech_and_save(client, voices[0], 'Puoi gonfiare le guance e tenerle così per 15 secondi?', 'output.mp3')
Lessons
Not everything was straightforward, however. Each problem we hit, and how we resolved it, is included below.
ElevenLabs Personas
The first lesson learnt was to create good personas, which you then use to generate the voice in ElevenLabs. I don't know the internals of the model, but I did notice that longer descriptions of what you are trying to achieve produced better results.
Below are some of the early personas created to evaluate the models back at the start of the project.
Studio-quality recording. Italian woman 25 years old, rich warm voice. Energetic, professional and fun. Strong Italian accent. Rhythmic and musical in pace. She should be excited and engaged when asking questions.
Studio-quality recording. Italian man 30 years old, rich warm voice. Friendly, professional and fun. Strong Italian accent. Rhythmic and stoic in pace. He should be excited and engaged when asking questions like a knowledgeable professor.
Cache all the Things
Somewhat unsurprisingly, ElevenLabs is a business and would like to make money. They make money every time you request something to be generated through their system. As a result, ensure you cache results where you know you don’t need to regenerate them.
While the cost to generate is low, you can quickly find yourself bumped from a lower tier to a higher plan by ignoring this rule. This is especially annoying when you need those new samples 5 days before the account month rolls over.
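One way to enforce the caching rule, sketched below under the assumption that the voice name, model and text fully determine the audio you want, is to derive the cache filename from those inputs and only call the metered API when the file is missing:

```python
import hashlib
import os


def cached_filename(voice_name, text, model_id, output_dir="cache"):
    """Derive a stable filename from everything that affects the audio,
    so changing the text or model naturally invalidates the cache."""
    key = hashlib.sha256(f"{voice_name}|{model_id}|{text}".encode("utf-8"))
    return os.path.join(output_dir, f"{voice_name}_{key.hexdigest()[:16]}.mp3")


def generate_if_missing(generate, voice_name, text, model_id="eleven_multilingual_v2"):
    """Only hit the paid generation endpoint when no cached file exists."""
    path = cached_filename(voice_name, text, model_id)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        generate(text, path)  # e.g. a wrapper around text_to_speech_and_save
    return path
```

The `generate` callable and directory layout are illustrative assumptions, but the shape is the important part: the expensive call sits behind an existence check.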
Phonetics and Sounds
The voices have a great level of difficulty with isolated sounds such as /p/ when used individually. To get around this you can try phonetic spellings such as “puh” or “ppp”, however this is not ideal, and in some cases will never produce the sound you want.
The other option is to use a longer sentence, where given more context the models may produce the result you need. In our case this was not an option, so we had to be flexible and swap out the sounds for words which produced the sound we needed.
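In code this amounts to a small substitution table. The table below is purely illustrative — the words are hypothetical stand-ins, not the actual substitutions chosen for MVL:

```python
# Illustrative only: isolated sounds the model struggles with, mapped to
# short Italian words that open cleanly with the same sound.
SOUND_SUBSTITUTIONS = {
    "/p/": "palla",   # "ball"
    "/b/": "barca",   # "boat"
    "/m/": "mamma",   # "mum"
}


def substitute_sound(token):
    """Swap an isolated phoneme for a word that produces the sound we need,
    leaving anything we don't recognise untouched."""
    return SOUND_SUBSTITUTIONS.get(token, token)
```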
Singing and Scales
Generated voices have no ability to sing (this problem also exists for humans). For example getting the model to produce,
Puoi cantare la scala Do-Re-Mi: Do-Re-Mi-Fa-Sol-La-Si-Do?
(in English, “Can you sing the Do-Re-Mi scale: Do-Re-Mi-Fa-Sol-La-Si-Do?”) did not work as well as one would have expected. I suspect even I could possibly do a passing version of this if I had to. Despite multiple attempts to emulate this, the best we could get from the model is included below.
While good enough for our purposes, it is still less than ideal, and it sits at the tippy top of the backlog of items to investigate as the models continue to improve.
Versioning
At time of writing the ElevenLabs v2 model is the most stable. We did try the ElevenLabs v3 model, but it was so unstable as to add gunshots and other random background sounds into samples, so we very quickly reverted to v2.
This was unfortunate as its ability to add text to indicate tone and liveliness would have been a massive advantage. However it brings us nicely to…
Emotion
The only way to add emotion to the v2 models is to add punctuation and lower the stability values. So for example “Fantastic!” would be generated using “FANTASTIC!!!”.
As a result the output can be inconsistent, with you needing to regenerate some voices multiple times in order to get the tone and meaning you want.
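A small helper makes that trick repeatable. The sketch below is an assumption about how you might pair the shouted text with a lowered stability value; in real code the stability would feed into `VoiceSettings` when generating:

```python
from dataclasses import dataclass


@dataclass
class EmphasisedLine:
    text: str
    stability: float  # pass into VoiceSettings(stability=...) when generating


def emphasise(text, stability=0.3):
    """Uppercase the line and pile on exclamation marks, pairing it with a
    low stability value - the only real emotion levers the v2 model offers."""
    return EmphasisedLine(text=text.rstrip("!.").upper() + "!!!", stability=stability)
```

The default of 0.3 is a hypothetical starting point, not a recommendation from ElevenLabs.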
As such, ensure you have a way to rebuild the recorded voices quickly! In our case we treated the disk as a cache: if the expected file already existed, generation was skipped, so we could simply delete a file to regenerate it.
Testing
You will have to listen to every sample at least once!
You will also have to listen to most of them multiple times!
This is probably the biggest thing we underestimated. Not purely from a time perspective, but from a morale one, as it can be very fatiguing listening to a sample multiple times in order to get a good one.
The models are generally very good, especially if you increase the stability, but trying to inject emotion into them, the quantity of sounds we were generating, and the use of a language other than English mean you will bump into a lot of edge cases with the models.
Side note: hearing “Ripetizione 1 di 2” and “Ripetizione 2 di 2” (“Repetition 1 of 2”, “Repetition 2 of 2”) over and over resulted in a quick feature request from myself to add multiple renditions of repeated voices, to help prevent my eventual descent into insanity.
Cloning
One other thing we explored was whether we could clone the existing Jules and Sam voices used in the English version of MVL. We had no intention of using this, but given some time it seemed worth exploring, as it may be something required in the future.
Since we had studio-quality recordings, adding them to be trained against was simple. Similar to our initial tests, we took an existing recording of Jules, generated 2 additional versions of the same, mixed them up and sent them out to the team to determine if they could identify which one was real and which were generated.
Not one of the 5 people tested was able to identify the cloned voices. (It’s #2, for those curious.) The AI system seemed to clone the cadence of the voice exceptionally well. While we are aware that’s possibly because it was repeating the same text as the original recording, this would allow us in the future to expand the recordings from before and avoid the ear fatigue of listening to things like “Recording 1 of 2”.
Results
Thankfully, despite some of the shortcomings mentioned above, the results are good enough for CPA and MVL, and both CPA and ourselves were delighted at how well they integrated.
The solution is now more or less finished and delivered (still bug fixes to go), and ready to hopefully improve the lives of children living with cerebral palsy and dysarthria. At Kablamo we have always loved working on software that exists for the betterment of people and not just the drive for money, with CPA being right at the top of that list.