Grok Voice Agent API
Build interactive voice conversations with Grok models using WebSocket. The Grok Voice Agent API accepts audio and text inputs and creates text and audio responses in real-time.
WebSocket Endpoint: wss://api.x.ai/v1/realtime
Authentication
You can authenticate WebSocket connections using the xAI API key or an ephemeral token.
Important: Use ephemeral tokens for client-side authentication. If you use the API key directly in client-side code, it may be exposed.
Fetching Ephemeral Tokens
Set up a server endpoint to fetch ephemeral tokens from xAI:
Endpoint: POST https://api.x.ai/v1/realtime/client_secrets
Voice Options
The Grok Voice Agent API supports 5 different voice options:
| Voice | Type | Tone | Description |
|---|---|---|---|
| Ara | Female | Warm, friendly | Default voice, balanced and conversational |
| Rex | Male | Confident, clear | Professional and articulate |
| Sal | Neutral | Smooth, balanced | Versatile voice |
| Eve | Female | Energetic, upbeat | Engaging and enthusiastic |
| Leo | Male | Authoritative, strong | Decisive and commanding |
Audio Format
Supported Audio Formats
| Format | Encoding | Sample Rate |
|---|---|---|
audio/pcm | Linear16, Little-endian | Configurable (8000-48000 Hz) |
audio/pcmu | G.711 μ-law | 8000 Hz |
audio/pcma | G.711 A-law | 8000 Hz |
Default Audio Settings
- Sample Rate: 24kHz
- Channels: Mono
- Encoding: Base64
Client Events
| Event | Description |
|---|---|
session.update | Update session configuration (voice, audio format, instructions) |
input_audio_buffer.append | Append base64-encoded audio chunks |
conversation.item.commit | Create user message from audio buffer |
conversation.item.create | Create user message with text |
response.create | Request assistant response (manual VAD mode) |
Server Events
| Event | Description |
|---|---|
session.updated | Session configuration acknowledged |
conversation.created | Conversation session created |
input_audio_buffer.speech_started | VAD detected speech start |
input_audio_buffer.speech_stopped | VAD detected speech end |
response.output_audio.delta | Audio stream chunk |
response.output_audio_transcript.delta | Transcript chunk |
response.done | Response completed |
Using Tools
The Voice Agent supports:
- Collections Search (
file_search) - Search document collections - Web Search (
web_search) - Search the web - X Search (
x_search) - Search X posts - Custom Functions - Define function tools with JSON schemas
For complete API details, see the Voice API documentation.