Advanced Feature
Integrating your own TTS provider means you're responsible for speech generation issues.
You'll need to work with your provider to ensure your requests will be fulfilled reliably and quickly
and will sound the way you want. If you encounter generation errors, see Debugging below.
Setting up your account
To use your own TTS provider, you'll need to add your API key for that provider to your Ultravox account. You can do that in the Ultravox console or using the Set TTS API keys endpoint.
Generic option
You can skip this step if you're using the "generic" ExternalVoice integration.
Named Providers
When setting an ExternalVoice on your agent or call, there are a few different providers available. The named providers such as ElevenLabs and Cartesia have customized integrations that make their voices work as smoothly as possible with Ultravox. This typically means Ultravox uses their streaming API and takes advantage of audio timing information from the provider to synchronize transcripts. While you'll still need to work with the provider to ensure your agent's requests will be fulfilled reliably and quickly, you can be confident that Ultravox knows how to interact with your provider. If you'd like to use some other TTS provider, you may be able to get by with our Generic TTS integration option.
Cartesia
Cartesia provides many high-quality voices. We use their WebSocket API to stream text in and audio out in parallel. Cartesia provides word-level timing information interspersed with audio, helping to keep transcripts in sync with audio.
Eleven Labs
Eleven Labs is our most commonly used provider (and as of May 2025 backs most of our internal voices). We use their WebSocket API to stream text in and audio out in parallel. Eleven Labs provides character-level timing information alongside audio, ensuring transcripts are kept in sync and conversation history accurately reflects what was spoken in the event of an interruption.
Slurring
Over the past couple of months, Eleven Labs has seemed much more likely to slur words or otherwise hallucinate audio.
Several of their customers (including Ultravox) have reported this and it is being worked on. In the meantime,
prompting your agent to avoid special characters like asterisks may be helpful. Alternatively, you could try
their more robust (but slower) multilingual model.
LMNT
LMNT's "aurora" model lags the other providers in terms of quality, though their experimental "blizzard" model shows potential. LMNT has by far the simplest streaming integration (and the only SDK worth using). They also offer unlimited concurrency and no rate limits even on their $10/month plan. Like Eleven Labs and Cartesia, LMNT allows for streaming text in and audio out in parallel and provides audio timing information to help Ultravox align transcripts with speech.
Generic TTS Options
The "generic" TTS route gives you much more flexibility to define requests to your provider. Any provider that accepts JSON POST requests and returns either WAV or raw PCM audio (including within JSON bodies) ought to work. Since generic integrations don't stream text in (but can stream audio out), Ultravox has to buffer input text before sending it to your provider, which means slightly higher agent response times and possible audio discontinuities at sentence boundaries.
Additionally, since generic integrations don't provide audio timing information, transcript timing must be approximated. Once Ultravox has a full generation (and therefore the true audio duration), it assumes each character requires the same duration and approximates transcripts based on that. While the first generation is still streaming, Ultravox relies on an estimated words-per-minute speaking rate to approximate transcripts.
Deepgram
Deepgram's latest Aura-2 model claims to be in line with other providers in terms of quality. However, it isn't yet supported in their streaming API, so Ultravox has no special integration with it. That said, you can use our "generic" ExternalVoice to give Deepgram a try now using their REST API.
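
As a rough sketch of what such a generic ExternalVoice might look like, the snippet below builds one as a Python dict and prints it as JSON. Everything about the shape is an assumption to check against the API reference: the `url`, `headers`, `body`, and `responseSampleRate` field names, the `{text}` placeholder, and the `aura-2-thalia-en` model name are illustrative and do not come from this page.

```python
import json

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder, not a real key
SAMPLE_RATE = 48000  # must agree between the request URL and the voice config

# Assumption: a generic ExternalVoice takes a url, headers, and a body template
# in which "{text}" stands in for the text to be spoken.
external_voice = {
    "generic": {
        # Deepgram REST speak endpoint, asking for raw 16-bit PCM.
        "url": (
            "https://api.deepgram.com/v1/speak"
            f"?model=aura-2-thalia-en&encoding=linear16"
            f"&sample_rate={SAMPLE_RATE}&container=none"
        ),
        "headers": {
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "application/json",
        },
        "body": {"text": "{text}"},
        # Assumption: raw PCM carries no header, so the rate is stated explicitly.
        "responseSampleRate": SAMPLE_RATE,
    }
}

print(json.dumps({"externalVoice": external_voice}, indent=2))
```

Keeping the sample rate in one constant avoids the easy mistake of requesting one rate from the provider while declaring another on the voice.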
Inworld
Inworld's TTS API returns JSON, so it requires an extra jsonAudioFieldPath field. To use the streaming endpoint, you'll also need to override the responseMimeType field so we know to treat the response as JSON Lines.
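
To illustrate what these two fields describe, the sketch below parses a JSON Lines body and pulls the audio out of each line at a given field path. It is not Ultravox's implementation: the dotted path syntax, the `result.audioContent` field name, and the base64 encoding are assumptions made for the example, so check your provider's actual response format.

```python
import base64
import json
from typing import Any, Iterable, Iterator


def get_by_path(obj: Any, path: str) -> Any:
    """Walk a dotted path such as "result.audioContent" into parsed JSON."""
    for key in path.split("."):
        obj = obj[key]
    return obj


def audio_chunks(lines: Iterable[str], audio_field_path: str) -> Iterator[bytes]:
    """Yield decoded audio bytes from a JSON Lines body, one chunk per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumption: the audio is base64-encoded text inside the JSON.
        yield base64.b64decode(get_by_path(record, audio_field_path))


# Hypothetical two-line streaming response, built here so the example is runnable.
sample_body = "\n".join(
    json.dumps({"result": {"audioContent": base64.b64encode(chunk).decode("ascii")}})
    for chunk in (b"\x00\x01\x02\x03", b"\x04\x05")
)
pcm = b"".join(audio_chunks(sample_body.splitlines(), "result.audioContent"))
```

A non-streaming JSON response is the same idea with a single record instead of one record per line.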
OpenAI
OpenAI also has a TTS API you can use with our generic ExternalVoice option.
Orpheus
Orpheus is an open-source TTS model with a Llama 3 backbone. Along with several similar models, Orpheus likely represents the next generation of realism for AI voices. They've partnered with Baseten to provide a simple self-hosting option you can set up for yourself. You can use a generic ExternalVoice with your self-hosted Orpheus instance.
Resemble
Resemble also has a TTS API you can use with our generic ExternalVoice option.
Rime
Rime provides a spell() tool to help nail the pronunciation of unique IDs, email addresses, etc.
Sarvam
Sarvam's TTS API returns JSON, so it requires an extra jsonAudioFieldPath field in your generic ExternalVoice.
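
The transcript approximation described under Generic TTS Options can be sketched in a few lines. This illustrates the two stated rules (equal duration per character once the true audio duration is known, and a words-per-minute estimate before that); it is not Ultravox's actual code, and the function names and the 150 words-per-minute default are assumptions.

```python
def spoken_prefix_by_chars(text: str, audio_duration_s: float, elapsed_s: float) -> str:
    """Approximate what has been spoken, assuming every character takes equal time."""
    if audio_duration_s <= 0 or not text:
        return ""
    # Clamp so playback positions outside the audio never over- or under-run the text.
    fraction = min(max(elapsed_s / audio_duration_s, 0.0), 1.0)
    return text[: int(len(text) * fraction)]


def spoken_prefix_by_wpm(text: str, elapsed_s: float, words_per_minute: float = 150.0) -> str:
    """Approximate what has been spoken from an estimated speaking rate."""
    words = text.split()
    spoken = int(max(elapsed_s, 0.0) * words_per_minute / 60.0)
    return " ".join(words[:spoken])


text = "How can I help you today?"
# Full generation known to last 2 s, interrupted at 1 s: half the characters.
print(spoken_prefix_by_chars(text, audio_duration_s=2.0, elapsed_s=1.0))  # How can I he
# Duration not yet known, so fall back to a 120 wpm estimate: two words per second.
print(spoken_prefix_by_wpm(text, elapsed_s=1.0, words_per_minute=120.0))  # How can
```

Neither estimate can match provider-supplied timing, which is why transcripts after an interruption are less exact with generic voices than with the named providers.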
Debugging
If you start a call with an external voice and don't hear anything from the agent, your external voice is probably misconfigured. You can figure out what's wrong using the call event API. Events are also visible when viewing the call in the Ultravox console. Here are some common issues and their resolutions:

| Example error text | Provider | Resolution |
|---|---|---|
| Requested output format pcm_44100 (PCM at 44100hz) is only allowed for Pro tier and above. | ElevenLabs | Your ElevenLabs subscription limits your generation sample rate. Find the maximum sample rate allowed for your subscription on their pricing page (you'll need to click "Show API details") and then set maxSampleRate on your voice to match. |
| A model with requested ID does not exist | ElevenLabs | Your model name is wrong. See their model page for the correct IDs. |
| A voice with voice_id 2bNrEsM0omyhLiEyOwqY does not exist. | ElevenLabs | The voiceId you provided doesn't correspond to a voice in your ElevenLabs library. Make sure your ElevenLabs API key is what you expect and then add the voice to your library in ElevenLabs. |
| The API key you used is missing the permission text_to_speech to execute this operation. | ElevenLabs | Check your key and/or upgrade your account with ElevenLabs. |
| This request exceeds your quota of 10000. You have 14 credits remaining, while 46 credits are required for this request. | ElevenLabs | Check your key and/or upgrade your account with ElevenLabs. |
| Error initializing streaming TTS connection | ElevenLabs/Cartesia/LMNT | The provider rejected our attempt to create a streaming connection. This occurs most commonly with ElevenLabs and usually means your API key is incorrect. |
| HTTP error: 500 Response:{"error": "Internal server error"} Request:{"text": "How can I help you?"} | Generic | This is the sort of error you'll get for generic external voices. You should be able to use the complete request and response to reproduce and debug the error with your provider. |
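
For generic voices, one way to reproduce an error is to replay the reported request against your provider directly. The sketch below rebuilds the POST from the request text in the last example error using only the Python standard library. The URL, the API key, and the bearer-style Authorization header are placeholders and assumptions; substitute whatever your provider actually expects.

```python
import json
import urllib.request


def build_replay_request(url: str, api_key: str, request_body: str) -> urllib.request.Request:
    """Rebuild the POST that failed so it can be sent straight to the provider."""
    payload = json.loads(request_body)  # fails fast if the copied text is not valid JSON
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-style auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Request text copied from the example error; URL and key are placeholders.
req = build_replay_request(
    "https://tts.example.com/v1/speak",
    "YOUR_PROVIDER_API_KEY",
    '{"text": "How can I help you?"}',
)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

If the replayed request fails the same way outside Ultravox, the problem lies with the provider or the request definition rather than with the call itself.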