⚠️ SIP Billing Starts November 10, 2025 - See Ultravox Pricing for details.
Retrieves all available voices
Example request:

curl --request GET \
  --url https://api.ultravox.ai/api/voices \
  --header 'X-API-Key: <api-key>'

Example response:

{
  "results": [
    {
      "voiceId": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
      "name": "<string>",
      "previewUrl": "<string>",
      "ownership": "public",
      "billingStyle": "VOICE_BILLING_STYLE_INCLUDED",
      "provider": "<string>",
      "definition": {
        "elevenLabs": {
          "voiceId": "<string>",
          "model": "<string>",
          "speed": 123,
          "useSpeakerBoost": true,
          "style": 123,
          "similarityBoost": 123,
          "stability": 123,
          "pronunciationDictionaries": [
            {
              "dictionaryId": "<string>",
              "versionId": "<string>"
            }
          ],
          "optimizeStreamingLatency": 123,
          "maxSampleRate": 123
        },
        "cartesia": {
          "voiceId": "<string>",
          "model": "<string>",
          "speed": 123,
          "emotion": "<string>",
          "emotions": [
            "<string>"
          ],
          "generationConfig": {
            "volume": 123,
            "speed": 123,
            "emotion": "<string>"
          }
        },
        "lmnt": {
          "voiceId": "<string>",
          "model": "<string>",
          "speed": 123,
          "conversational": true
        },
        "google": {
          "voiceId": "<string>",
          "speakingRate": 123
        },
        "generic": {
          "url": "<string>",
          "headers": {},
          "body": {},
          "responseSampleRate": 123,
          "responseWordsPerMinute": 123,
          "responseMimeType": "<string>",
          "jsonAudioFieldPath": "<string>",
          "jsonByteEncoding": "JSON_BYTE_ENCODING_UNSPECIFIED"
        }
      },
      "description": "<string>",
      "primaryLanguage": "<string>"
    }
  ],
  "next": "http://api.example.org/accounts/?cursor=cD00ODY%3D",
  "previous": "http://api.example.org/accounts/?cursor=cj0xJnA9NDg3",
  "total": 123
}

Headers

X-API-Key
API key.
Query Parameters

Billing style
The billing style used to filter results.
VOICE_BILLING_STYLE_INCLUDED - Voices with no additional charges beyond the cost of the call
VOICE_BILLING_STYLE_EXTERNAL - Voices with costs billed directly by the TTS provider
Available options: VOICE_BILLING_STYLE_INCLUDED, VOICE_BILLING_STYLE_EXTERNAL

Cursor
The pagination cursor value.

Ownership
The ownership used to filter results.
private - Only private voices
public - Only public voices
Available options: private, public

Page size
Number of results to return per page.

Primary language
The desired primary language for voice results using BCP47. Voices with different regions/scripts/variants but the same language tag may also be included but will be further down the results. If not provided, all languages are included.

Provider
The providers used to filter results.
lmnt - LMNT
cartesia - Cartesia
google - Google
eleven_labs - Eleven Labs
inworld - Inworld
Available options: lmnt, cartesia, google, eleven_labs, inworld

Search
The search string used to filter results.
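For example, a filtered listing combines several of these parameters in one request. Note that the query parameter names used below (ownership, provider, billingStyle) are assumptions inferred from the response field names; this extract does not spell out the exact parameter names, so verify them against the full reference before relying on them.

# Hypothetical query parameter names (ownership, provider, billingStyle); verify
# against the full API reference before use.
curl --request GET \
  --url 'https://api.ultravox.ai/api/voices?ownership=public&provider=cartesia&billingStyle=VOICE_BILLING_STYLE_INCLUDED' \
  --header 'X-API-Key: <api-key>'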
Response

results[].ownership
Available options: public, private

results[].billingStyle
How billing works for this voice.
VOICE_BILLING_STYLE_INCLUDED - The cost of this voice is included in the call cost. There are no additional charges for it.
VOICE_BILLING_STYLE_EXTERNAL - This voice requires an API key for its provider, who will bill for usage separately.
Available options: VOICE_BILLING_STYLE_INCLUDED, VOICE_BILLING_STYLE_EXTERNAL

results[].definition
A voice not known to Ultravox Realtime that can nonetheless be used for a call. Such voices are significantly less validated than normal voices and you'll be responsible for your own TTS-related errors. Exactly one field must be set.

results[].definition.elevenLabs
A voice served by ElevenLabs.
voiceId - The ID of the voice in ElevenLabs.
model - The ElevenLabs model to use.
speed - The speaking rate. Must be between 0.7 and 1.2. Defaults to 1. See https://elevenlabs.io/docs/api-reference/text-to-speech/convert#request.body.voice_settings.speed
maxSampleRate - The maximum sample rate Ultravox will try to use. ElevenLabs limits your allowed sample rate based on your tier. See https://elevenlabs.io/pricing#pricing-table (and click "Show API details").
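As an illustration, an elevenLabs definition consistent with the schema above might look like the following sketch. The voice and model IDs are placeholders, the speed stays within the documented 0.7 to 1.2 range, and maxSampleRate is an arbitrary illustrative value.

{
  "elevenLabs": {
    "voiceId": "<your-elevenlabs-voice-id>",
    "model": "<elevenlabs-model>",
    "speed": 1.0,
    "maxSampleRate": 24000
  }
}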
results[].definition.cartesia
A voice served by Cartesia.
voiceId - The ID of the voice in Cartesia.
model - The Cartesia model to use.
speed - (Deprecated) The speaking rate. Must be between -1 and 1. Defaults to 0.
emotion - (Deprecated) Use generationConfig.emotion instead.
emotions - (Deprecated) Use generationConfig.emotion instead.
generationConfig - Configure the various attributes of the generated speech.
generationConfig.volume - Adjust the volume of the generated speech between 0.5x and 2.0x the original volume (default is 1.0x). Valid values are between [0.5, 2.0] inclusive.
generationConfig.speed - Adjust the speed of the generated speech between 0.6x and 2.0x the original speed (default is 1.0x). Valid values are between [0.6, 1.5] inclusive.
generationConfig.emotion - The primary emotions are neutral, calm, angry, content, sad, scared. For more options, see Prompting Sonic-3.
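A sketch of a cartesia definition that uses generationConfig rather than the deprecated top-level speed/emotion fields; the voice ID and model are placeholders, and the values stay within the documented ranges.

{
  "cartesia": {
    "voiceId": "<your-cartesia-voice-id>",
    "model": "<cartesia-model>",
    "generationConfig": {
      "volume": 1.0,
      "speed": 1.1,
      "emotion": "calm"
    }
  }
}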
results[].definition.lmnt
A voice served by LMNT.
voiceId - The ID of the voice in LMNT.
model - The LMNT model to use.
speed - The speaking rate. Must be between 0.25 and 2. Defaults to 1. See https://docs.lmnt.com/api-reference/speech/synthesize-speech-bytes#body-speed
results[].definition.google
A voice served by Google, using bidirectional streaming. (For non-streaming or output-only streaming, use generic.)
voiceId - The ID (name) of the voice in Google, e.g. "en-US-Chirp3-HD-Charon".
speakingRate - The speaking rate. Must be between 0.25 and 2. Defaults to 1. See https://cloud.google.com/python/docs/reference/texttospeech/latest/google.cloud.texttospeech_v1.types.StreamingAudioConfig
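For example, a google definition built around the documented example voice could look like this sketch (speakingRate just needs to fall within 0.25 to 2):

{
  "google": {
    "voiceId": "en-US-Chirp3-HD-Charon",
    "speakingRate": 1.0
  }
}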
results[].definition.generic
A voice served by a generic REST-based TTS API.
url - The endpoint to which requests are sent.
body - The request body to send. One of its fields should include the placeholder {text}, which will be replaced with the text to synthesize.
responseSampleRate - The sample rate of the audio returned by the API.
responseWordsPerMinute - An estimate of the speaking rate of the returned audio in words per minute. This is used for transcript timing while audio is streamed in the response. (Once the response is complete, Ultravox Realtime uses the real audio duration to adjust the timing.) Defaults to 150 and is unused for non-streaming responses.
responseMimeType - The real MIME type of the content returned by the API. If unset, the Content-Type response header is used. This is useful for APIs whose response bodies don't strictly adhere to what the API claims via header. For example, if your API claims to return audio/wav but omits the WAV header (thus really returning raw PCM), set this to audio/l16. Similarly, if your API claims to return JSON but actually streams JSON Lines, set this to application/jsonl.
jsonAudioFieldPath - For JSON responses, the path to the field containing base64-encoded audio data. The data must be PCM audio, optionally with a WAV header.
jsonByteEncoding - For JSON responses, how audio bytes are encoded into the jsonAudioFieldPath string. Defaults to base64. Also supports hex. Available options: JSON_BYTE_ENCODING_UNSPECIFIED, JSON_BYTE_ENCODING_BASE64, JSON_BYTE_ENCODING_HEX
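To make the {text} placeholder and the JSON audio fields concrete, here is a hedged sketch of a generic definition. The endpoint URL, header, and body field names are hypothetical and depend entirely on your own TTS service; only the Ultravox-side field names come from the schema above.

{
  "generic": {
    "url": "https://tts.example.com/v1/synthesize",
    "headers": {
      "Authorization": "Bearer <your-tts-api-key>"
    },
    "body": {
      "voice": "my-custom-voice",
      "text": "{text}"
    },
    "responseSampleRate": 24000,
    "responseWordsPerMinute": 150,
    "responseMimeType": "application/json",
    "jsonAudioFieldPath": "audio",
    "jsonByteEncoding": "JSON_BYTE_ENCODING_BASE64"
  }
}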
results[].primaryLanguage
BCP47 language code for the primary language supported by this voice.

next
Example: "http://api.example.org/accounts/?cursor=cD00ODY%3D"

previous
Example: "http://api.example.org/accounts/?cursor=cj0xJnA9NDg3"

total
Example: 123
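To page through a large voice list, request the URL returned in next (or previous to go back), keeping the same API key header. This assumes the standard cursor pagination suggested by those fields; the URL below is the placeholder from the example response, so substitute the value your own response returns.

# Fetch the next page by following the "next" URL from the previous response.
curl --request GET \
  --url 'http://api.example.org/accounts/?cursor=cD00ODY%3D' \
  --header 'X-API-Key: <api-key>'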