Ultravox Realtime allows you to bring your own TTS provider to have your agent sound however you like.

Advanced Feature

Integrating your own TTS provider means you’re responsible for speech generation issues. You’ll need to work with your provider to ensure your requests are fulfilled reliably and quickly, and that the results sound the way you want.

Setting up your account

To use your own TTS provider, you’ll need to add your API key for that provider to your Ultravox account. You can do that in the Ultravox console or using the Set TTS API keys endpoint.

Generic option

You can skip this step if you’re using the “generic” ExternalVoice integration.

Providers

When setting an ExternalVoice on your agent or call, there are a few different providers available. The named providers such as ElevenLabs and Cartesia have customized integrations that make their voices work as smoothly as possible with Ultravox. This typically means Ultravox uses their streaming API and takes advantage of audio timing information from the provider to synchronize transcripts. While you’ll still need to work with the provider to ensure your agent’s requests are fulfilled reliably and quickly, you can be confident that Ultravox knows how to interact with your provider.

If you’d like to use some other TTS provider, you may be able to get by with our “generic” integration option. This route gives you much more flexibility to define requests to your provider. Any provider that accepts JSON POST requests and returns either WAV or raw PCM audio ought to be workable. While very flexible, this means even more of the onus is on you to make it work!
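In practice, a generic integration boils down to a URL, any headers your provider needs, a JSON body template in which Ultravox replaces the {text} placeholder with the text to speak, and the sample rate of the returned audio. Here’s a minimal sketch with placeholder values (the body fields are whatever your provider expects; see the concrete examples later in this section):

"externalVoice": {
  "generic": {
    "url": "YOUR_PROVIDER_TTS_URL",
    "headers": {
      "Authorization": "Bearer YOUR_PROVIDER_API_KEY",
      "Content-Type": "application/json"
    },
    "body": {
      "text": "{text}"
    },
    "responseSampleRate": 24000
  }
}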

Since generic integrations don’t stream text in (but can stream audio out), Ultravox has to buffer input text before sending it to your provider, which means slightly higher agent response times and possible audio discontinuities at sentence boundaries.

Additionally, since generic integrations don’t provide audio timing information, transcript timing must be approximated. Once Ultravox has a full generation (and therefore the true audio duration), it assumes each character requires the same duration and approximates transcript timing accordingly. While the first generation is still streaming, Ultravox relies on an estimated words-per-minute speaking rate to approximate transcripts.
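As a simplified illustration of the character-based approximation: if a completed generation turns out to be 5 seconds of audio for a 100-character sentence, each character is assigned 50ms, so a word ending at the 40th character would be marked as finishing 2 seconds into the audio.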

ElevenLabs

ElevenLabs is our most commonly used provider (and as of May 2025 backs most of our internal voices). We use their WebSocket API to stream text in and audio out in parallel. ElevenLabs provides character-level timing information alongside audio, ensuring transcripts are kept in sync and conversation history accurately reflects what was spoken in the event of an interruption.

Slurring

Over the past couple of months, ElevenLabs has seemed much more likely to slur words or otherwise hallucinate audio. Several of their customers (including Ultravox) have reported this, and it is being worked on. In the meantime, prompting your agent to avoid special characters like asterisks may help. Alternatively, you could try their more robust (but slower) multilingual model.

"externalVoice": {
  "elevenLabs": {
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5"
  }
}
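
To try the multilingual model mentioned above, swap out the model field. (This assumes eleven_multilingual_v2 is still ElevenLabs’ current multilingual model ID; check their documentation to confirm.)

"externalVoice": {
  "elevenLabs": {
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_multilingual_v2"
  }
}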

Cartesia

Cartesia also provides many high-quality voices. We use their WebSocket API to stream text in and audio out in parallel. Cartesia provides word-level timing information interspersed with the audio, helping to keep transcripts in sync with speech.

"externalVoice": {
  "cartesia": {
    "voiceId": "af346552-54bf-4c2b-a4d4-9d2820f51b6c",
    "model": "sonic-2"
  }
}

PlayHT

PlayHT’s latest dialog model provides high-quality voices, albeit with somewhat higher latency. Like the generic option, PlayHT does not allow for input streaming and doesn’t provide audio timing information. Ultravox handles this as described for generic integrations above, though more experience with PlayHT has allowed for more accurate approximations.

"externalVoice": {
  "playHt": {
    "userId": "YOUR_PLAY_HT_USER_ID",
    "voiceId": "s3://voice-cloning-zero-shot/d9ff78ba-d016-47f6-b0ef-dd630f59414e/female-cs/manifest.json",
    "model": "PlayDialog"
  }
}

LMNT

LMNT’s “aurora” model lags the other providers in terms of quality, though their experimental “blizzard” model shows potential. LMNT has by far the simplest streaming integration (and the only SDK worth using). They also offer unlimited concurrency and no rate limits, even on their $10/month plan. Like ElevenLabs and Cartesia, LMNT allows for streaming text in and audio out in parallel and provides audio timing information to help Ultravox align transcripts with speech.

"externalVoice": {
  "lmnt": {
    "voiceId": "lily",
    "model": "blizzard"
  }
}
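
The example above uses the experimental blizzard model; to stay on the stable model instead, set "model" to "aurora".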

Deepgram

Deepgram’s latest Aura-2 model claims quality in line with the other providers. However, it isn’t yet supported in their streaming API, so Ultravox has no special integration with it. That said, you can use our “generic” ExternalVoice to give Deepgram a try now using their REST API.

"externalVoice": {
  "generic": {
    "url": "https://api.deepgram.com/v1/speak?model=aura-2-asteria-en&encoding=linear16&sample_rate=48000&container=none",
    "headers": {
      "Authorization": "Token YOUR_DEEPGRAM_API_KEY",
      "Content-Type": "application/json"
    },
    "body": {
      "text": "{text}"
    },
    "responseSampleRate": 48000
  }
}
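
Note that the encoding and sample_rate query parameters control the audio Deepgram returns. With encoding=linear16 and container=none, the response is headerless raw PCM, so responseSampleRate must match the sample_rate in the URL for Ultravox to interpret the audio correctly.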

OpenAI

OpenAI also has a TTS API you can use with our generic ExternalVoice option.

"externalVoice": {
  "generic": {
    "url": "https://api.openai.com/v1/audio/speech",
    "headers": {
      "Authorization": "Bearer YOUR_OPENAI_API_KEY",
      "Content-Type": "application/json",
    },
    "body": {
      "input": "{text}",
      "model": "gpt-4o-mini-tts",
      "voice": "shimmer",
      "response_format": "pcm",
      "speed": 1.0
    },
    "responseSampleRate": 24000
  }
}
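
With response_format set to pcm, OpenAI’s speech endpoint returns raw 16-bit samples at 24kHz (per OpenAI’s docs), which is why responseSampleRate is 24000 here.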

Resemble

Resemble also has a TTS API you can use with our generic ExternalVoice option.

"externalVoice": {
  "generic": {
    "url": "https://p.cluster.resemble.ai/stream",
    "headers": {
      "Authorization": "Token YOUR_RESEMBLE_API_KEY",
      "Content-Type": "application/json",
      "Accept": "audio/wav"
    },
    "body": {
      "data": "{text}",
      "project_uuid": "YOUR_RESEMBLE_PROJECT_ID",
      "voice_uuid": "0842fdf9",
      "precision": "PCM_16",
      "sample_rate": 44100
    },
    "responseSampleRate": 44100
  }
}
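
Here the Accept: audio/wav header asks Resemble’s streaming endpoint for WAV output, and the sample_rate in the body should agree with responseSampleRate so Ultravox plays the audio back at the right rate.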

Orpheus

Orpheus is an open-source TTS model with a Llama 3 backbone. Along with several similar models, Orpheus likely represents the next generation of realism for AI voices. The Orpheus team has partnered with Baseten to provide a simple self-hosting option you can set up yourself.

You can use a generic ExternalVoice with your self-hosted Orpheus instance:

"externalVoice": {
  "generic": {
    "url": "YOUR_BASETEN_DEPLOYMENT_URL",
    "headers": {
      "Authorization": "Api-Key YOUR_BASETEN_API_KEY",
      "Content-Type": "application/json"
    },
    "body": {
      "prompt": "{text}",
      "request_id": "SOME_UUID",
      "max_tokens": 4096,
      "voice": "tara",
      "stop_token_ids": [128258, 128009]
    },
    "responseSampleRate": 24000
  }
}