Ultravox Realtime allows you to bring your own TTS provider to have your agent sound however you like.
Advanced Feature: Integrating your own TTS provider means you’re responsible for speech generation issues. You’ll need to work with your provider to ensure your requests will be fulfilled reliably and quickly and will sound the way you want. If you encounter generation errors, see Debugging below.

Setting up your account

To use your own TTS provider, you’ll need to add your API key for that provider to your Ultravox account. You can do that in the Ultravox console or using the Set TTS API keys endpoint.
Generic option: You can skip this step if you’re using the “generic” ExternalVoice integration.

Providers

When setting an ExternalVoice on your agent or call, there are a few different providers available. The named providers such as ElevenLabs and Cartesia have customized integrations that make their voices work as smoothly as possible with Ultravox. This typically means Ultravox uses their streaming API and takes advantage of audio timing information from the provider to synchronize transcripts. While you’ll still need to work with the provider to ensure your agent’s requests will be fulfilled reliably and quickly, you can be confident that Ultravox knows how to interact with your provider.

If you’d like to use some other TTS provider, you may be able to get by with our “generic” integration option. This route gives you much more flexibility to define requests to your provider: any provider that accepts JSON POST requests and returns either WAV or raw PCM audio (including within JSON bodies) ought to work.

There are trade-offs, though. Since generic integrations don’t stream text in (though they can stream audio out), Ultravox has to buffer input text before sending it to your provider, which means slightly higher agent response times and possible audio discontinuities at sentence boundaries. Additionally, since generic integrations don’t provide audio timing information, transcript timing must be approximated. Once Ultravox has a full generation (and therefore the true audio duration), it assumes each character requires the same duration and approximates transcripts based on that. While the first generation is still streaming, Ultravox relies on an estimated words-per-minute speaking rate to approximate transcripts.
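The character-based approximation described above can be sketched as follows. This is a minimal illustration, not Ultravox’s actual implementation, and the 150 words-per-minute default is an assumed figure:

```python
def char_timings(text: str, audio_duration_s: float) -> list[tuple[str, float]]:
    """Assign each character an equal slice of the known audio duration."""
    if not text:
        return []
    per_char = audio_duration_s / len(text)
    return [(ch, i * per_char) for i, ch in enumerate(text)]

def estimated_duration_s(text: str, words_per_minute: float = 150.0) -> float:
    """Fallback estimate used before the true audio duration is known."""
    return max(len(text.split()), 1) / words_per_minute * 60.0

# "abcd" over 2 seconds -> each character starts 0.5s apart.
print(char_timings("abcd", 2.0))  # [('a', 0.0), ('b', 0.5), ('c', 1.0), ('d', 1.5)]
```

Once the full generation arrives, the estimate is replaced by the true duration, so transcripts tighten up after the first response.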

ElevenLabs

ElevenLabs is our most commonly used provider (and as of May 2025 backs most of our internal voices). We use their WebSocket API to stream text in and audio out in parallel. ElevenLabs provides character-level timing information alongside audio, ensuring transcripts are kept in sync and conversation history accurately reflects what was spoken in the event of an interruption.
Slurring: ElevenLabs has become noticeably more likely to slur words or otherwise hallucinate audio over the past couple of months. Several of their customers (including Ultravox) have reported this and it is being worked on. In the meantime, prompting your agent to avoid special characters like asterisks may help. Alternatively, you could try their more robust (but slower) multilingual model.
"externalVoice": {
  "elevenLabs": {
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5"
  }
}

Cartesia

Cartesia also provides many high-quality voices. We use their WebSocket API to stream text in and audio out in parallel. Cartesia provides word-level timing information interspersed with audio, helping to keep transcripts in sync with audio.
"externalVoice": {
  "cartesia": {
    "voiceId": "af346552-54bf-4c2b-a4d4-9d2820f51b6c",
    "model": "sonic-2"
  }
}

PlayHT

PlayHT’s latest dialog model provides high quality voices, albeit with a bit higher latency. Like the generic option, PlayHT does not allow for input streaming and doesn’t provide audio timing information. Ultravox handles this as described for generic integrations above, though more experience with PlayHT has allowed for more accurate approximations.
"externalVoice": {
  "playHt": {
    "userId": "YOUR_PLAY_HT_USER_ID",
    "voiceId": "s3://voice-cloning-zero-shot/d9ff78ba-d016-47f6-b0ef-dd630f59414e/female-cs/manifest.json",
    "model": "PlayDialog"
  }
}

LMNT

LMNT’s “aurora” model lags the other providers in terms of quality, though their experimental “blizzard” model shows potential. LMNT has by far the simplest streaming integration (and the only SDK worth using). They also offer unlimited concurrency and no rate limits, even on their $10/month plan. Like ElevenLabs and Cartesia, LMNT allows for streaming text in and audio out in parallel and provides audio timing information to help Ultravox align transcripts with speech.
"externalVoice": {
  "lmnt": {
    "voiceId": "lily",
    "model": "blizzard"
  }
}

Deepgram

Deepgram’s latest Aura-2 model claims to be in line with other providers in terms of quality. However, it isn’t yet supported in their streaming API, so Ultravox has no special integration with it. That said, you can use our “generic” ExternalVoice to give Deepgram a try now using their REST API.
"externalVoice": {
  "generic": {
    "url": "https://api.deepgram.com/v1/speak?model=aura-2-asteria-en&encoding=linear16&sample_rate=48000&container=none",
    "headers": {
      "Authorization": "Token YOUR_DEEPGRAM_API_KEY",
      "Content-Type": "application/json"
    },
    "body": {
      "text": "{text}"
    },
    "responseSampleRate": 48000
  }
}

Google

Google’s TTS API returns JSON, so it requires an extra jsonAudioFieldPath in your generic ExternalVoice.
"externalVoice": {
  "generic": {
    "url": "https://texttospeech.googleapis.com/v1/text:synthesize",
    "headers": {
      "Authorization": "Bearer YOUR_GOOGLE_SERVICE_ACCOUNT_CREDENTIALS",
      "Content-Type": "application/json"
    },
    "body": {
      "input": {"text": "{text}"},
      "voice": {
        "languageCode": "en-US",
        "name": "en-US-Chirp3-HD-Charon"
      },
      "audioConfig": {
        "audioEncoding": "LINEAR16",
        "speakingRate": 1.0,
        "sampleRateHertz": 48000
      }
    },
    "responseSampleRate": 48000,
    "jsonAudioFieldPath": "audioContent"
  }
}
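As a rough sketch of what jsonAudioFieldPath does (an illustration, not Ultravox’s actual code; the dotted-path handling is an assumption), it names the field in the provider’s JSON response that holds the base64-encoded audio:

```python
import base64
import json

def extract_audio(response_body: bytes, audio_field_path: str) -> bytes:
    """Walk a field path into a JSON response body and base64-decode
    the audio found there. Illustrative only."""
    value = json.loads(response_body)
    for key in audio_field_path.split("."):
        value = value[key]
    return base64.b64decode(value)

# Google returns a body like {"audioContent": "<base64 audio>"}:
body = json.dumps({"audioContent": base64.b64encode(b"\x00\x01\x02").decode()}).encode()
print(extract_audio(body, "audioContent"))  # b'\x00\x01\x02'
```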

OpenAI

OpenAI also has a TTS API you can use with our generic ExternalVoice option.
"externalVoice": {
  "generic": {
    "url": "https://api.openai.com/v1/audio/speech",
    "headers": {
      "Authorization": "Bearer YOUR_OPENAI_API_KEY",
    "Content-Type": "application/json"
    },
    "body": {
      "input": "{text}",
      "model": "gpt-4o-mini-tts",
      "voice": "shimmer",
      "response_format": "pcm",
      "speed": 1.0
    },
    "responseSampleRate": 24000
  }
}

Resemble

Resemble also has a TTS API you can use with our generic ExternalVoice option.
"externalVoice": {
  "generic": {
    "url": "https://p.cluster.resemble.ai/stream",
    "headers": {
      "Authorization": "Token YOUR_RESEMBLE_API_KEY",
      "Content-Type": "application/json",
      "Accept": "audio/wav"
    },
    "body": {
      "data": "{text}",
      "project_uuid": "YOUR_RESEMBLE_PROJECT_ID",
      "voice_uuid": "0842fdf9",
      "precision": "PCM_16",
      "sample_rate": 44100
    },
    "responseSampleRate": 44100
  }
}

Orpheus

Orpheus is an open-source TTS model with a Llama 3 backbone. Along with several similar models, Orpheus likely represents the next generation of realism for AI voices. They’ve partnered with Baseten to provide a simple self-hosting option you can set up for yourself. You can use a generic ExternalVoice with your self-hosted Orpheus instance:
"externalVoice": {
  "generic": {
    "url": "YOUR_BASETEN_DEPLOYMENT_URL",
    "headers": {
      "Authorization": "Api-Key YOUR_BASETEN_API_KEY",
      "Content-Type": "application/json"
    },
    "body": {
      "prompt": "{text}",
      "request_id": "SOME_UUID",
      "max_tokens": 4096,
      "voice": "tara",
      "stop_token_ids": [128258, 128009]
    },
    "responseSampleRate": 24000,
    "responseMimeType": "audio/l16"
  }
}

Debugging

If you start a call with an external voice and don’t hear anything from the agent, your external voice is probably misconfigured. You can figure out what’s wrong using the call event API. Events are also visible when viewing the call in the Ultravox console. Here are some common issues and their resolutions:
Example error text: Requested output format pcm_44100 (PCM at 44100hz) is only allowed for Pro tier and above.
Provider: ElevenLabs
Resolution: Your ElevenLabs subscription limits your generation sample rate. Find the maximum sample rate allowed for your subscription on their pricing page (you’ll need to click “Show API details”) and then set maxSampleRate on your voice to match.

Example error text: A model with requested ID does not exist
Provider: ElevenLabs
Resolution: Your model name is wrong. See their model page for the correct IDs.

Example error text: A voice with voice_id 2bNrEsM0omyhLiEyOwqY does not exist.
Provider: ElevenLabs
Resolution: The voiceId you provided doesn’t correspond to a voice in your ElevenLabs library. Make sure your ElevenLabs API key is what you expect and then add the voice to your library in ElevenLabs.

Example error text: The API key you used is missing the permission text_to_speech to execute this operation.
Provider: ElevenLabs
Resolution: Check your key and/or upgrade your account with ElevenLabs.

Example error text: This request exceeds your quota of 10000. You have 14 credits remaining, while 46 credits are required for this request.
Provider: ElevenLabs
Resolution: Check your key and/or upgrade your account with ElevenLabs.

Example error text: Error initializing streaming TTS connection
Provider: ElevenLabs/Cartesia/LMNT
Resolution: The provider rejected our attempt to create a streaming connection. This occurs most commonly with ElevenLabs and usually means your API key is incorrect.

Example error text: HTTP error: 500 Response:{"error": "Internal server error"} Request:{"text": "How can I help you?"}
Provider: Generic
Resolution: This is the sort of error you’ll get for generic external voices. You should be able to use the complete request and response to reproduce and debug the error with your provider.
If you can’t figure out your issue from the call events or the common issues above, you can also try our Discord.
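Once you’ve retrieved a call’s events, a quick filter can surface the error text to compare against the list above. This is a sketch only; the "severity" and "text" field names are assumptions for illustration, not the documented event schema:

```python
def tts_error_texts(events: list[dict]) -> list[str]:
    """Collect the text of warning/error events from a call's event list.
    The "severity" and "text" field names are assumptions, not the real schema."""
    return [e.get("text", "") for e in events
            if e.get("severity") in ("warning", "error")]

# Made-up events for demonstration:
events = [
    {"severity": "info", "text": "call started"},
    {"severity": "error", "text": "A model with requested ID does not exist"},
]
print(tts_error_texts(events))  # ['A model with requested ID does not exist']
```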