Understanding VAD

Voice Activity Detection (VAD) determines when a user is speaking and when they’ve finished their turn. Understanding how Ultravox’s VAD system works will help you make informed decisions about when and how to adjust settings.

How Ultravox VAD Works

Ultravox Realtime uses a multi-layered approach to VAD which results in low latency and an overall experience that feels fluid and natural.

Traditional VAD Model

The foundation is a classic VAD model that analyzes audio frames (32ms each) to predict “speechiness” - whether each frame contains human speech. This model is intentionally aggressive, meaning it has a low threshold for detecting potential speech.

Neural VAD

Our neural VAD model makes intelligent predictions about conversation state and turn-taking patterns. It understands context like:

Whether the user is likely finished speaking based on conversation flow
Typical pause patterns in natural dialogue
The difference between a thoughtful pause and the end of a turn

Noise Cancellation & Audio Processing

Additional models handle:

Background noise filtering
Echo cancellation
Audio quality enhancement
False positive reduction

VAD Parameters Explained

Ultravox exposes several VAD parameters that control this behavior. These parameters are exposed via the vadSettings object that can be set when creating new calls or call stages. The following settings are available:

`turnEndpointDelay`

turnEndpointDelay

string

default:"0.384s"

The minimum time the agent waits before responding after the user appears to stop speaking.Only multiples of 32ms are meaningful (anything from 1ms - 31ms produces the same result).Trade-offs:

Shorter delays → Leads to faster responses, but more likely to interrupt users mid-thought.
Longer delays → Makes agent less eager to respond (i.e. interrupt the user), but perceived as slower or less responsive.

Adjusting turnEndpointDelay

vadSettings: {
  turnEndpointDelay: "0.384s"  // 12 VAD frames at 32ms each
}

`minimumTurnDuration`

minimumTurnDuration

string

default:"0s"

The minimum duration of user speech required to be considered a valid turn.Trade-offs:

Shorter durations → Captures very short user audio.
Longer durations → Ignores very short user audio segments that might be noise but could ignore meaningful, short user responses like “yes” or “no”.

Adjusting minimumTurnDuration

vadSettings: {
  minimumTurnDuration: "0s"  // Consider all user audio
}

`minimumInterruptionDuration`

minimumInterruptionDuration

string

default:"0.09s"

The minimum duration of user speech required to interrupt the agent when it’s speaking. Similar to minimumTurnDuration but provides a higher threshold for interrupting the agent. Ignored if the value is less than minimumTurnDuration.Trade-offs:

Shorter durations → More sensitive to interruptions, may trigger on background noise.
Longer durations → Less sensitive to noise, but may miss legitimate interruption attempts.

Adjusting minimumInterruptionDuration

vadSettings: {
  minimumInterruptionDuration: "0.09s"
}

`frameActivationThreshold`

frameActivationThreshold

number

default:"0.1"

The threshold for considering an individual audio frame as containing speech (0.1 to 1.0).Trade-offs:

Lower thresholds → More sensitive to quiet speech, but more false positives.
Higher thresholds → Less sensitive to noise, but may miss quiet or distant speakers.

Adjusting frameActivationThreshold

vadSettings: {
  frameActivationThreshold: 0.1  // Very sensitive to potential speech
}

Best Practices for VAD Tuning

Start with DefaultsThe default VAD settings work well for most applications. Only adjust them if you have specific, tested issues that can’t be resolved through environmental or hardware improvements.

The Safest Parameter to Adjust

If you venture into adjusting VAD settings, turnEndpointDelay is the safest parameter to modify. As noted in the API reference, “there’s nothing special about this value” - it’s simply a starting point that works well in most scenarios.

Making Changes Safely

Change one parameter at a time → Isolate the effects of each adjustment.
Test thoroughly → Use real users and realistic environments.
Monitor the trade-offs → Every improvement in one area may cause issues in another.

Environmental Solutions First

Before adjusting VAD parameters, consider:

Audio quality → Better microphones reduce VAD complexity.
Network quality → Poor connections can affect VAD performance.

Remember: VAD tuning is a series of trade-offs. The system is designed to be intelligent and adaptive. Trust the defaults unless you have a compelling, well-tested reason to change them.

Getting Started

Agents & Calls

Telephony

Web, Apps, Websockets

Tools

Voices

Webhooks

Noise & VAD

Understanding VAD

How Ultravox VAD Works

Traditional VAD Model

Neural VAD

Noise Cancellation & Audio Processing

VAD Parameters Explained

`turnEndpointDelay`

`minimumTurnDuration`

`minimumInterruptionDuration`

`frameActivationThreshold`

Best Practices for VAD Tuning

The Safest Parameter to Adjust

Making Changes Safely

Environmental Solutions First

Getting Started

Agents & Calls

Telephony

Web, Apps, Websockets

Tools

Voices

Webhooks

Noise & VAD

​How Ultravox VAD Works

​Traditional VAD Model

​Neural VAD

​Noise Cancellation & Audio Processing

​VAD Parameters Explained

​turnEndpointDelay

​minimumTurnDuration

​minimumInterruptionDuration

​frameActivationThreshold

​Best Practices for VAD Tuning

​The Safest Parameter to Adjust

​Making Changes Safely

​Environmental Solutions First

How Ultravox VAD Works

Traditional VAD Model

Neural VAD

Noise Cancellation & Audio Processing

VAD Parameters Explained

`turnEndpointDelay`

`minimumTurnDuration`

`minimumInterruptionDuration`

`frameActivationThreshold`

Best Practices for VAD Tuning

The Safest Parameter to Adjust

Making Changes Safely

Environmental Solutions First