List Calls
Returns details for all calls.
curl --request GET \
--url https://api.ultravox.ai/api/calls \
--header 'X-API-Key: <api-key>'
{
"next": "http://api.example.org/accounts/?cursor=cD00ODY%3D",
"previous": "http://api.example.org/accounts/?cursor=cj0xJnA9NDg3",
"results": [
{
"callId": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
"clientVersion": "<string>",
"created": "2023-11-07T05:31:56Z",
"joined": "2023-11-07T05:31:56Z",
"ended": "2023-11-07T05:31:56Z",
"endReason": "unjoined",
"firstSpeaker": "FIRST_SPEAKER_AGENT",
"firstSpeakerSettings": {
"user": {
"fallback": {
"delay": "<string>",
"text": "<string>",
"prompt": "<string>"
}
},
"agent": {
"uninterruptible": true,
"text": "<string>",
"prompt": "<string>",
"delay": "<string>"
}
},
"inactivityMessages": [
{
"duration": "<string>",
"message": "<string>",
"endBehavior": "END_BEHAVIOR_UNSPECIFIED"
}
],
"initialOutputMedium": "MESSAGE_MEDIUM_VOICE",
"joinTimeout": "30s",
"joinUrl": "<string>",
"languageHint": "<string>",
"maxDuration": "3600s",
"medium": {
"webRtc": {},
"twilio": {},
"serverWebSocket": {
"inputSampleRate": 123,
"outputSampleRate": 123,
"clientBufferSizeMs": 123
},
"telnyx": {},
"plivo": {},
"exotel": {},
"sip": {
"incoming": {},
"outgoing": {
"to": "<string>",
"from": "<string>",
"username": "<string>",
"password": "<string>"
}
}
},
"model": "fixie-ai/ultravox",
"recordingEnabled": false,
"systemPrompt": "<string>",
"temperature": 0,
"timeExceededMessage": "<string>",
"voice": "<string>",
"externalVoice": {
"elevenLabs": {
"voiceId": "<string>",
"model": "<string>",
"speed": 123,
"useSpeakerBoost": true,
"style": 123,
"similarityBoost": 123,
"stability": 123,
"pronunciationDictionaries": [
{
"dictionaryId": "<string>",
"versionId": "<string>"
}
],
"optimizeStreamingLatency": 123
},
"cartesia": {
"voiceId": "<string>",
"model": "<string>",
"speed": 123,
"emotion": "<string>",
"emotions": [
"<string>"
]
},
"playHt": {
"userId": "<string>",
"voiceId": "<string>",
"model": "<string>",
"speed": 123,
"quality": "<string>",
"temperature": 123,
"emotion": 123,
"voiceGuidance": 123,
"styleGuidance": 123,
"textGuidance": 123,
"voiceConditioningSeconds": 123
},
"lmnt": {
"voiceId": "<string>",
"model": "<string>",
"speed": 123,
"conversational": true
}
},
"transcriptOptional": true,
"errorCount": 0,
"vadSettings": {
"turnEndpointDelay": "<string>",
"minimumTurnDuration": "<string>",
"minimumInterruptionDuration": "<string>",
"frameActivationThreshold": 123
},
"shortSummary": "<string>",
"summary": "<string>",
"experimentalSettings": "<any>",
"metadata": {},
"initialState": {},
"requestContext": "<any>",
"dataConnectionConfig": {
"websocketUrl": "<string>",
"audioConfig": {
"sampleRate": 123,
"channelMode": "CHANNEL_MODE_UNSPECIFIED"
}
}
}
],
"total": 123
}
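The next and previous fields are cursor URLs for paging through results. A minimal Python sketch of following the next cursor until it is exhausted; the fetch function is injected here (backed by canned pages) so the sketch is self-contained, but in practice it would be an HTTP GET with the X-API-Key header:

```python
def list_all_calls(fetch, first_url):
    """Collect results from every page by following the 'next' cursor URL.

    `fetch` maps a URL to a parsed response dict shaped like the example
    above: {"next": ..., "previous": ..., "results": [...], "total": ...}.
    """
    calls = []
    url = first_url
    while url is not None:
        page = fetch(url)
        calls.extend(page["results"])
        url = page.get("next")  # None on the last page
    return calls

# Demo with two canned pages standing in for real API responses.
pages = {
    "https://api.ultravox.ai/api/calls": {
        "next": "https://api.ultravox.ai/api/calls?cursor=cD00ODY%3D",
        "previous": None,
        "results": [{"callId": "call-1"}, {"callId": "call-2"}],
        "total": 3,
    },
    "https://api.ultravox.ai/api/calls?cursor=cD00ODY%3D": {
        "next": None,
        "previous": "https://api.ultravox.ai/api/calls",
        "results": [{"callId": "call-3"}],
        "total": 3,
    },
}
all_calls = list_all_calls(pages.__getitem__, "https://api.ultravox.ai/api/calls")
print([c["callId"] for c in all_calls])  # ['call-1', 'call-2', 'call-3']
```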
Authorizations
API key, passed in the X-API-Key request header.
Query Parameters
The pagination cursor value.
Maximum duration of calls
Minimum duration of calls
Start date (inclusive) for filtering calls by creation date
Filter calls by metadata. Use metadata.key=value to filter by specific key-value pairs.
Number of results to return per page.
The search string used to filter results
Which field to use when ordering the results.
End date (inclusive) for filtering calls by creation date
Filter calls by the associated voice ID
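The metadata.key=value convention described above composes like any other query parameter. A small sketch of building such a filter URL with the standard library; the metadata key customerId and its value are made-up examples:

```python
from urllib.parse import urlencode

# Filter calls whose metadata contains customerId=cust-42. The key name
# "customerId" is illustrative; use whatever keys you attached to your calls.
query = urlencode({"metadata.customerId": "cust-42"})
url = f"https://api.ultravox.ai/api/calls?{query}"
print(url)  # https://api.ultravox.ai/api/calls?metadata.customerId=cust-42
```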
Response
clientVersion - The version of the client that joined this call.
endReason - The reason the call ended.
unjoined - Client never joined
hangup - Client hung up
agent_hangup - Agent hung up
timeout - Call timed out
connection_error - Connection error
system_error - System error
Available options: unjoined, hangup, agent_hangup, timeout, connection_error, system_error
firstSpeaker - Who was supposed to talk first when the call started. Typically set to FIRST_SPEAKER_USER for outgoing calls and left as the default (FIRST_SPEAKER_AGENT) otherwise.
Available options: FIRST_SPEAKER_AGENT, FIRST_SPEAKER_USER
firstSpeakerSettings - Settings for the initial message to get the call started.
firstSpeakerSettings.user - If set, the user should speak first.
firstSpeakerSettings.user.fallback - If set, the agent will start the conversation itself if the user doesn't start speaking within the given delay.
firstSpeakerSettings.user.fallback.delay - How long the agent should wait before starting the conversation itself.
firstSpeakerSettings.user.fallback.text - A specific greeting the agent should say.
firstSpeakerSettings.user.fallback.prompt - A prompt for the agent to generate a greeting.
firstSpeakerSettings.agent - If set, the agent should speak first.
firstSpeakerSettings.agent.uninterruptible - Whether the user should be prevented from interrupting the agent's first message. Defaults to false (meaning the agent is interruptible as usual).
firstSpeakerSettings.agent.text - A specific greeting the agent should say.
firstSpeakerSettings.agent.prompt - A prompt for the agent to generate a greeting.
firstSpeakerSettings.agent.delay - If set, the agent will wait this long before starting its greeting. This may be useful for ensuring the user is ready.
initialOutputMedium - The medium used initially by the agent. May later be changed by the client.
Available options: MESSAGE_MEDIUM_VOICE, MESSAGE_MEDIUM_TEXT
errorCount - The number of errors in this call.
shortSummary - A short summary of the call.
summary - A summary of the call.
experimentalSettings - Experimental settings for the call.
metadata - Optional metadata key-value pairs to associate with the call. All values must be strings.
initialState - The initial state of the call which is readable/writable by tools.
inactivityMessages - Messages spoken by the agent when the user is inactive for the specified duration. Durations are cumulative, so a message m > 1 with duration 30s will be spoken 30 seconds after message m-1.
Each entry is a message the agent should say after some duration. The duration's meaning varies depending on the context.
inactivityMessages[].duration - The duration after which the message should be spoken.
inactivityMessages[].message - The message to speak.
inactivityMessages[].endBehavior - The behavior to exhibit when the message is finished being spoken.
Available options: END_BEHAVIOR_UNSPECIFIED, END_BEHAVIOR_HANG_UP_SOFT, END_BEHAVIOR_HANG_UP_STRICT
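Because durations are cumulative, each message's duration is measured from the previous message, so the absolute fire times are a running sum. A small sketch (using plain integer seconds for clarity rather than the API's "30s"-style duration strings, and made-up message text):

```python
from itertools import accumulate

# Each duration is relative to the previous message; the absolute time at
# which each message fires (from the start of user inactivity) is the
# running sum of the durations.
messages = [
    {"duration": 30, "message": "Are you still there?"},
    {"duration": 15, "message": "If there's nothing else, I'll hang up."},
]
fire_times = list(accumulate(m["duration"] for m in messages))
print(fire_times)  # [30, 45]: the second message fires 15s after the first
```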
languageHint - BCP47 language code that may be used to guide speech recognition.
Maximum length: 16
medium - Details about a call's protocol. By default, calls occur over WebRTC using the Ultravox client SDK. Setting a different call medium will prepare the server for a call using a different protocol. At most one call medium may be set.
medium.webRtc - The call will use WebRTC with the Ultravox client SDK. This is the default.
medium.twilio - The call will use Twilio's "Media Streams" protocol. Once you have a join URL from starting a call, include it in your TwiML like so: <Connect><Stream url="${your-join-url}" /></Connect>. This works for both inbound and outbound calls.
medium.serverWebSocket - The call will use a plain websocket connection. This is unlikely to yield an acceptable user experience if used from a browser or mobile client, but may be suitable for a server-to-server connection. This option provides a simple way to connect your own server to an Ultravox inference instance.
medium.serverWebSocket.inputSampleRate - The sample rate for input (user) audio. Required.
medium.serverWebSocket.outputSampleRate - The desired sample rate for output (agent) audio. If unset, defaults to inputSampleRate.
medium.serverWebSocket.clientBufferSizeMs - The size of the client-side audio buffer in milliseconds. Smaller buffers allow for faster interruptions but may cause audio underflow if network latency fluctuates too greatly. For the best of both worlds, set this to some large value (e.g. 30000) and implement support for playback_clear_buffer messages. Defaults to 60.
medium.telnyx - The call will use Telnyx's media streaming protocol. Once you have a join URL from starting a call, include it in your TeXML like so: <Connect><Stream url="${your-join-url}" bidirectionalMode="rtp" /></Connect>. This works for both inbound and outbound calls.
medium.plivo - The call will use Plivo's AudioStreams protocol. Once you have a join URL from starting a call, include it in your Plivo XML like so: <Stream keepCallAlive="true" bidirectional="true" contentType="audio/x-l16;rate=16000">${your-join-url}</Stream>. This works for both inbound and outbound calls.
medium.exotel - The call will use Exotel's "Voicebot" protocol. Once you have a join URL from starting a call, provide it to Exotel as the wss target URL for your Voicebot (either directly or, more likely, dynamically from your own server).
medium.sip - The call will be connected using Session Initiation Protocol (SIP). Note that SIP incurs additional charges and must be enabled for your account.
medium.sip.incoming - Details for an incoming SIP call.
medium.sip.outgoing - Details for an outgoing SIP call. Ultravox will initiate this call (and there will be no joinUrl).
medium.sip.outgoing.to - The SIP URI to connect to. (Phone numbers are not allowed.)
medium.sip.outgoing.from - The SIP URI to connect from. This is the "from" field in the SIP INVITE.
medium.sip.outgoing.username - The SIP username to use for authentication.
medium.sip.outgoing.password - The password for the specified username.
temperature - Required range: 0 <= x <= 1
externalVoice - A voice not known to Ultravox Realtime that can nonetheless be used for a call. Such voices are significantly less validated than normal voices and you'll be responsible for your own TTS-related errors. Exactly one field must be set.
externalVoice.elevenLabs - A voice served by ElevenLabs.
externalVoice.elevenLabs.voiceId - The ID of the voice in ElevenLabs.
externalVoice.elevenLabs.model - The ElevenLabs model to use.
externalVoice.elevenLabs.speed - The speaking rate. Must be between 0.7 and 1.2. Defaults to 1. See https://elevenlabs.io/docs/api-reference/text-to-speech/convert#request.body.voice_settings.speed
externalVoice.elevenLabs.pronunciationDictionaries - Each entry is a reference to a pronunciation dictionary within ElevenLabs.
externalVoice.cartesia - A voice served by Cartesia.
externalVoice.cartesia.voiceId - The ID of the voice in Cartesia.
externalVoice.cartesia.model - The Cartesia model to use.
externalVoice.cartesia.speed - The speaking rate. Must be between -1 and 1. Defaults to 0. See https://docs.cartesia.ai/api-reference/tts/tts#send.Generation%20Request.voice.Ttsrequest%20ID%20Specifier.__experimental_controls.speed
externalVoice.playHt - A voice served by PlayHT.
externalVoice.playHt.userId - The "user id" for the PlayHT API. This must be the user who owns the Play API key associated with your Ultravox account.
externalVoice.playHt.voiceId - The ID of the voice in PlayHT. Typically an s3 location.
externalVoice.playHt.model - The PlayHT model (aka "engine") to use.
externalVoice.playHt.speed - The speaking rate. Must be between 0 and 5. Defaults to 1.
externalVoice.lmnt - A voice served by LMNT.
externalVoice.lmnt.voiceId - The ID of the voice in LMNT.
externalVoice.lmnt.model - The LMNT model to use.
externalVoice.lmnt.speed - The speaking rate. Must be between 0.25 and 2. Defaults to 1. See https://docs.lmnt.com/api-reference/speech/synthesize-speech-bytes#body-speed
transcriptOptional - Indicates whether a transcript is optional for the call.
vadSettings - VAD settings for the call.
vadSettings.turnEndpointDelay - The minimum amount of time the agent will wait to respond after the user seems to be done speaking. Increasing this value will make the agent less eager to respond, which may increase perceived response latency but will also make the agent less likely to jump in before the user is really done speaking. Built-in VAD currently operates on 32ms frames, so only multiples of 32ms are meaningful. (Anything from 1ms to 31ms will produce the same result.) Defaults to "0.384s" (384ms) as a starting point, but there's nothing special about this value aside from it corresponding to 12 VAD frames.
vadSettings.minimumTurnDuration - The minimum duration of user speech required to be considered a user turn. Increasing this value will cause the agent to ignore short user audio. This may be useful in particularly noisy environments, but it comes at the cost of possibly ignoring very short user responses such as "yes" or "no". Defaults to "0s", meaning the agent considers all user audio inputs (that make it through built-in noise cancellation).
vadSettings.minimumInterruptionDuration - The minimum duration of user speech required to interrupt the agent. This works the same way as minimumTurnDuration, but allows for a higher threshold for interrupting the agent. (This value will be ignored if it is less than minimumTurnDuration.) Defaults to "0.09s" (90ms) as a starting point, but there's nothing special about this value.
vadSettings.frameActivationThreshold - The threshold for the VAD to consider a frame as speech. This is a value between 0.1 and 1. The minimum value, 0.1, is the default.
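Since turnEndpointDelay is only meaningful in multiples of the 32ms VAD frame, a client may want to snap a desired delay to a whole number of frames before setting it. A minimal sketch (the helper name is made up for illustration):

```python
FRAME_MS = 32  # built-in VAD frame size, per the description above

def snap_to_vad_frames(delay_ms: int) -> int:
    """Round a desired delay to the nearest whole number of VAD frames
    (at least one frame, since sub-frame values all behave the same)."""
    frames = max(1, round(delay_ms / FRAME_MS))
    return frames * FRAME_MS

print(snap_to_vad_frames(400))  # 384: 12 frames, the documented default
print(snap_to_vad_frames(90))   # 96: rounds up to 3 frames
```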
dataConnectionConfig - Settings for exchanging data messages with an additional participant.
dataConnectionConfig.websocketUrl - The websocket URL to which the session will connect to stream data messages.
dataConnectionConfig.audioConfig - Audio configuration for the data connection. If not set, no audio will be sent.
dataConnectionConfig.audioConfig.sampleRate - The sample rate of the audio stream. If not set, will default to 16000.
dataConnectionConfig.audioConfig.channelMode - The audio channel mode to use. CHANNEL_MODE_MIXED will combine user and agent audio into a single mono output, while CHANNEL_MODE_SEPARATED will result in stereo audio where user and agent are separated. The latter is the default.
Available options: CHANNEL_MODE_UNSPECIFIED, CHANNEL_MODE_MIXED, CHANNEL_MODE_SEPARATED
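With CHANNEL_MODE_SEPARATED, a receiver gets one stereo stream and must split it to process user and agent audio independently. A minimal sketch assuming 16-bit little-endian PCM with interleaved sample pairs; which channel carries the user versus the agent is an assumption here, not something the description above specifies:

```python
def split_stereo(pcm: bytes) -> tuple[bytes, bytes]:
    """Split interleaved 16-bit stereo PCM into two mono byte streams."""
    left = bytearray()
    right = bytearray()
    for i in range(0, len(pcm), 4):     # 4 bytes = one left/right sample pair
        left += pcm[i:i + 2]
        right += pcm[i + 2:i + 4]
    return bytes(left), bytes(right)

# Two sample pairs: L=0x0001, R=0x0002, then L=0x0003, R=0x0004.
user, agent = split_stereo(b"\x01\x00\x02\x00\x03\x00\x04\x00")
print(user, agent)  # b'\x01\x00\x03\x00' b'\x02\x00\x04\x00'
```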