35 changes: 35 additions & 0 deletions sdk/voicelive/azure-ai-voicelive/CHANGELOG.md
@@ -1,5 +1,40 @@
# Release History

## 1.3.0b1 (Unreleased)

### Features Added

- **Azure Realtime Native Voice Support**: Added `AzureRealtimeNativeVoice` and
`AzureRealtimeNativeVoiceName`, and expanded `voice` fields to accept Azure realtime native voices.
- **WebRTC Call Negotiation Support**: Added `ClientEventRtcCallSdpCreate`, `ServerEventRtcCallSdpCreated`,
`ServerEventRtcCallError`, and `RtcCallErrorDetails` for SDP-based WebRTC call setup.
- **Input Text Streaming Support**: Added `ClientEventInputTextDelta` and `ClientEventInputTextDone`
for incrementally streaming text input into existing conversation items.
- **Hosted Agent Invocation Input**: Added `invoke_input` to `ResponseCreateParams` and
`ServerEventResponseInvocationDelta` for hosted agent invocation passthrough data.
- **Audio Playback Lifecycle Events**: Added `ServerEventOutputAudioBufferStarted` and
`ServerEventOutputAudioBufferStopped` to track model audio playback start and stop.
- **Echo Cancellation Configuration**: Added `EchoCancellationReferenceSource` and new
`reference_source` / `channels` options on `AudioEchoCancellation` to support both the default
server loopback reference path and client-provided stereo echo reference input.
- **Smart End-of-Turn Detection**: Added `SmartEndOfTurnDetection` as an audio-based end-of-turn
detection option.
- **Parallel Tool Call Control**: Added `parallel_tool_calls` to session models so callers can
control whether tool calls may run in parallel.
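The playback lifecycle events listed above could be consumed with a small state tracker. The sketch below is illustrative only: the two event classes are local stand-ins that reuse the SDK class names (in real code they would presumably be imported from `azure.ai.voicelive.models`), and `PlaybackTracker` is a hypothetical helper, not part of the SDK.

```python
class ServerEventOutputAudioBufferStarted:
    """Local stand-in for the SDK model of the same name."""


class ServerEventOutputAudioBufferStopped:
    """Local stand-in for the SDK model of the same name."""


class PlaybackTracker:
    """Hypothetical helper that records whether model audio is playing."""

    def __init__(self) -> None:
        self.is_playing = False

    def handle(self, event: object) -> bool:
        # Only the two lifecycle events change state; others are ignored.
        if isinstance(event, ServerEventOutputAudioBufferStarted):
            self.is_playing = True
        elif isinstance(event, ServerEventOutputAudioBufferStopped):
            self.is_playing = False
        return self.is_playing
```

A tracker like this may help decide when a user interruption should clear pending output audio, since it reflects playback state rather than generation state.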

### Breaking Changes

- **Image Input Field Rename**: Renamed `RequestImageContentPart.url` to `image_url`. Update
image input construction to use `image_url=` instead of `url=`.
- **Default API Version Update**: Changed the SDK default API version from `2026-04-10` to
`2026-06-01-preview`. Pass `api_version="2026-04-10"` explicitly to keep the previous default
behavior.
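Both breaking changes can be absorbed with small adapters while call sites are migrated. The following is a minimal sketch, not SDK code: `migrate_image_part_kwargs` and `pin_previous_api_version` are hypothetical helpers that operate on plain keyword-argument dictionaries, and it is an assumption that the resulting `api_version` keyword is passed to the call that opens the connection.

```python
from typing import Any, Dict

# Previous SDK default, taken from the changelog entry above.
PREVIOUS_DEFAULT_API_VERSION = "2026-04-10"


def migrate_image_part_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with the old `url` key renamed to `image_url`."""
    migrated = dict(kwargs)
    if "url" in migrated:
        if "image_url" in migrated:
            raise ValueError("both 'url' and 'image_url' were supplied")
        migrated["image_url"] = migrated.pop("url")
    return migrated


def pin_previous_api_version(connect_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy that keeps the previous default unless a version is set."""
    pinned = dict(connect_kwargs)
    pinned.setdefault("api_version", PREVIOUS_DEFAULT_API_VERSION)
    return pinned
```

Returning copies rather than mutating the input keeps the helpers safe to apply to shared configuration dictionaries.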

### Bug Fixes

- **Deserialization Improvements**: Improved XML model deserialization and common scalar header
deserialization paths for better compatibility and lower overhead.

## 1.2.0 (2026-05-22)

### Features Added
61 changes: 41 additions & 20 deletions sdk/voicelive/azure-ai-voicelive/README.md
@@ -5,7 +5,7 @@ This package provides a **real-time, speech-to-speech** client for Azure AI Voic
It opens a WebSocket session to stream microphone audio to the service and receive
typed server events (including audio) for responsive, interruptible conversations.

> **Status:** General Availability (GA). This is a stable release suitable for production use.
> **Status:** Preview (`1.3.0b1`). This beta release includes the latest SDK and sample updates and may change before the next stable release.

> **Important:** As of version 1.0.0, this SDK is **async-only**. The synchronous API has been removed to focus exclusively on async patterns. All examples and samples use `async`/`await` syntax.

@@ -16,34 +16,35 @@ Getting started

### Prerequisites

- **Python 3.9+**
- **Python 3.10+**
- An **Azure subscription**
- A **VoiceLive** resource and endpoint
- A working **microphone** and **speakers/headphones** if you run the voice samples

### Install

Install the stable GA version:
Install the latest preview version:

```bash
# Base install (core client only)
python -m pip install azure-ai-voicelive
python -m pip install --pre azure-ai-voicelive

# For asynchronous streaming (uses aiohttp)
python -m pip install "azure-ai-voicelive[aiohttp]"
python -m pip install --pre "azure-ai-voicelive[aiohttp]"

# For voice samples (includes audio processing)
# First install PyAudio dependencies for your platform:
# Linux: sudo apt-get install -y portaudio19-dev libasound2-dev
# macOS: brew install portaudio
python -m pip install azure-ai-voicelive[aiohttp] pyaudio python-dotenv
python -m pip install --pre "azure-ai-voicelive[aiohttp]" azure-identity pyaudio python-dotenv
```
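The `--pre` flag matters because pip does not select pre-release versions by default, and `1.3.0b1` is a beta. As a rough illustration, the sketch below classifies version strings with a deliberately simplified pattern; it covers only the `a`/`b`/`rc` suffix forms and is not a full PEP 440 parser. `is_prerelease` is a hypothetical helper name.

```python
import re

# Simplified: digits separated by dots, followed by a/b/rc and a number.
_PRERELEASE = re.compile(r"^\d+(\.\d+)*(a|b|rc)\d+$")


def is_prerelease(version: str) -> bool:
    """Return True for versions such as '1.3.0b1', False for '1.2.0'."""
    return bool(_PRERELEASE.match(version))
```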

The SDK provides async-only WebSocket connections using `aiohttp` for optimal performance and reliability.

### Authenticate

You can authenticate with an **API key** or an **Azure Active Directory (AAD) token**.
You can authenticate with an **API key** or a Microsoft Entra ID token.
The samples default to `DefaultAzureCredential`; for local development, `az login` is usually the simplest path.
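One way to support both options in a single script is to decide on the authentication mode from the environment. This is a minimal sketch under stated assumptions: the environment variable name is hypothetical, and `choose_auth_mode` is an illustrative helper rather than an SDK function. The returned mode would then map to `AzureKeyCredential` or `DefaultAzureCredential` as shown in the sections that follow.

```python
from typing import Mapping

# Hypothetical variable name used only for this sketch.
API_KEY_ENV_VAR = "AZURE_VOICELIVE_API_KEY"


def choose_auth_mode(env: Mapping[str, str]) -> str:
    """Return "api_key" if a non-empty key is configured, else "entra_id"."""
    key = env.get(API_KEY_ENV_VAR, "").strip()
    return "api_key" if key else "entra_id"


if __name__ == "__main__":
    import os

    print(choose_auth_mode(os.environ))
```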

#### API Key Authentication (Quick Start)

@@ -66,7 +67,7 @@ async def main():
async with connect(
endpoint="your-endpoint",
credential=AzureKeyCredential("your-api-key"),
model="gpt-4o-realtime-preview"
model="gpt-realtime"
) as connection:
# Your async code here
pass
@@ -76,7 +77,7 @@ asyncio.run(main())

#### AAD Token Authentication

For production applications, AAD authentication is recommended:
For production applications, Entra ID authentication is recommended:

```python
import asyncio
@@ -85,14 +86,17 @@ from azure.ai.voicelive import connect

async def main():
credential = DefaultAzureCredential()

async with connect(
endpoint="your-endpoint",
credential=credential,
model="gpt-4o-realtime-preview"
) as connection:
# Your async code here
pass

try:
async with connect(
endpoint="your-endpoint",
credential=credential,
model="gpt-realtime"
) as connection:
# Your async code here
pass
finally:
await credential.close()

asyncio.run(main())
```
@@ -107,13 +111,16 @@ Key concepts
- **SessionResource** – Update session parameters (voice, formats, VAD) with async methods
- **RequestSession** – Strongly-typed session configuration
- **ServerVad** – Configure voice activity detection
- **SmartEndOfTurnDetection** – Configure audio-based end-of-turn detection
- **AzureStandardVoice** – Configure voice settings
- **parallel_tool_calls** – Control whether tool calls may run in parallel for a session
- **Audio Handling**:
- **InputAudioBufferResource** – Manage audio input to the service with async methods
- **OutputAudioBufferResource** – Control audio output from the service with async methods
- **Conversation Management**:
- **ResponseResource** – Create or cancel model responses with async methods
- **ConversationResource** – Manage conversation items with async methods
- **ClientEventInputTextDelta / ClientEventInputTextDone** – Stream text input incrementally into an item
- **Error Handling**:
- **ConnectionError** – Base exception for WebSocket connection errors
- **ConnectionClosed** – Raised when WebSocket connection is closed
@@ -142,7 +149,7 @@ The Basic Voice Assistant sample demonstrates full-featured voice interaction wi
python samples/basic_voice_assistant_async.py

# With custom parameters
python samples/basic_voice_assistant_async.py --model gpt-4o-realtime-preview --voice alloy --instructions "You're a helpful assistant"
python samples/basic_voice_assistant_async.py --model gpt-realtime --voice alloy --instructions "You're a helpful assistant"
```

### Minimal example
@@ -152,12 +159,18 @@ import asyncio
from azure.core.credentials import AzureKeyCredential
from azure.ai.voicelive.aio import connect
from azure.ai.voicelive.models import (
RequestSession, Modality, InputAudioFormat, OutputAudioFormat, ServerVad, ServerEventType
AudioEchoCancellation,
RequestSession,
Modality,
InputAudioFormat,
OutputAudioFormat,
ServerVad,
ServerEventType,
)

API_KEY = "your-api-key"
ENDPOINT = "wss://your-endpoint.com/openai/realtime"
MODEL = "gpt-4o-realtime-preview"
MODEL = "gpt-realtime"

async def main():
async with connect(
@@ -170,6 +183,7 @@ async def main():
instructions="You are a helpful assistant.",
input_audio_format=InputAudioFormat.PCM16,
output_audio_format=OutputAudioFormat.PCM16,
input_audio_echo_cancellation=AudioEchoCancellation(),
turn_detection=ServerVad(
threshold=0.5,
prefix_padding_ms=300,
@@ -187,6 +201,13 @@ async def main():
asyncio.run(main())
```

`AudioEchoCancellation` now supports both the default server loopback reference path and a
client-provided stereo echo reference. Use `reference_source="client"` with `channels=2` only when
your application sends stereo PCM16 input with the microphone on channel 0 and the echo reference
signal on channel 1.
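The stereo layout described above (microphone on channel 0, echo reference on channel 1) can be produced by interleaving two mono PCM16 buffers. The sketch below uses only the standard library; `interleave_mic_and_reference` is a hypothetical helper, and it assumes both buffers hold the same number of 16-bit samples captured at the same sample rate.

```python
def interleave_mic_and_reference(mic_pcm16: bytes, ref_pcm16: bytes) -> bytes:
    """Interleave two mono PCM16 buffers into stereo (mic=ch0, ref=ch1)."""
    if len(mic_pcm16) != len(ref_pcm16):
        raise ValueError("mic and reference buffers must have the same length")
    if len(mic_pcm16) % 2:
        raise ValueError("PCM16 buffers must contain whole 16-bit samples")

    # Each stereo frame is 4 bytes: 2 bytes of mic, then 2 bytes of reference.
    stereo = bytearray(len(mic_pcm16) * 2)
    stereo[0::4] = mic_pcm16[0::2]  # mic sample, first byte
    stereo[1::4] = mic_pcm16[1::2]  # mic sample, second byte
    stereo[2::4] = ref_pcm16[0::2]  # reference sample, first byte
    stereo[3::4] = ref_pcm16[1::2]  # reference sample, second byte
    return bytes(stereo)
```

Because the function only moves two-byte units without interpreting them, it does not depend on the byte order of the samples.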

For image inputs, `RequestImageContentPart` uses the `image_url` field name.

Available Voice Options
-----------------------

4 changes: 2 additions & 2 deletions sdk/voicelive/azure-ai-voicelive/_metadata.json
@@ -1,6 +1,6 @@
{
"apiVersion": "2026-04-10",
"apiVersion": "2026-06-01-preview",
"apiVersions": {
"VoiceLive": "2026-04-10"
"VoiceLive": "2026-06-01-preview"
}
}
42 changes: 26 additions & 16 deletions sdk/voicelive/azure-ai-voicelive/apiview-properties.json
@@ -18,6 +18,7 @@
"azure.ai.voicelive.models.AzureAvatarVoiceSyncVoice": "VoiceLive.AzureAvatarVoiceSyncVoice",
"azure.ai.voicelive.models.AzureCustomVoice": "VoiceLive.AzureCustomVoice",
"azure.ai.voicelive.models.AzurePersonalVoice": "VoiceLive.AzurePersonalVoice",
"azure.ai.voicelive.models.AzureRealtimeNativeVoice": "VoiceLive.AzureRealtimeNativeVoice",
"azure.ai.voicelive.models.EouDetection": "VoiceLive.EouDetection",
"azure.ai.voicelive.models.AzureSemanticDetection": "VoiceLive.AzureSemanticDetection",
"azure.ai.voicelive.models.AzureSemanticDetectionEn": "VoiceLive.AzureSemanticDetectionEn",
@@ -45,6 +46,7 @@
"azure.ai.voicelive.models.ClientEventOutputAudioBufferClear": "VoiceLive.ClientEventOutputAudioBufferClear",
"azure.ai.voicelive.models.ClientEventResponseCancel": "VoiceLive.ClientEventResponseCancel",
"azure.ai.voicelive.models.ClientEventResponseCreate": "VoiceLive.ClientEventResponseCreate",
"azure.ai.voicelive.models.ClientEventRtcCallSdpCreate": "VoiceLive.ClientEventRtcCallSdpCreate",
"azure.ai.voicelive.models.ClientEventSessionAvatarConnect": "VoiceLive.ClientEventSessionAvatarConnect",
"azure.ai.voicelive.models.ClientEventSessionUpdate": "VoiceLive.ClientEventSessionUpdate",
"azure.ai.voicelive.models.ContentPart": "VoiceLive.ContentPart",
@@ -92,6 +94,7 @@
"azure.ai.voicelive.models.ResponseSession": "VoiceLive.ResponseSession",
"azure.ai.voicelive.models.ResponseTextContentPart": "VoiceLive.ResponseTextContentPart",
"azure.ai.voicelive.models.ResponseWebSearchCallItem": "VoiceLive.ResponseWebSearchCallItem",
"azure.ai.voicelive.models.RtcCallErrorDetails": "VoiceLive.RtcCallErrorDetails",
"azure.ai.voicelive.models.Scene": "VoiceLive.Scene",
"azure.ai.voicelive.models.ServerEvent": "VoiceLive.ServerEvent",
"azure.ai.voicelive.models.ServerEventConversationItemCreated": "VoiceLive.ServerEventConversationItemCreated",
@@ -111,6 +114,8 @@
"azure.ai.voicelive.models.ServerEventMcpListToolsFailed": "VoiceLive.ServerEventMcpListToolsFailed",
"azure.ai.voicelive.models.ServerEventMcpListToolsInProgress": "VoiceLive.ServerEventMcpListToolsInProgress",
"azure.ai.voicelive.models.ServerEventOutputAudioBufferCleared": "VoiceLive.ServerEventOutputAudioBufferCleared",
"azure.ai.voicelive.models.ServerEventOutputAudioBufferStarted": "VoiceLive.ServerEventOutputAudioBufferStarted",
"azure.ai.voicelive.models.ServerEventOutputAudioBufferStopped": "VoiceLive.ServerEventOutputAudioBufferStopped",
"azure.ai.voicelive.models.ServerEventResponseAnimationBlendshapeDelta": "VoiceLive.ServerEventResponseAnimationBlendshapeDelta",
"azure.ai.voicelive.models.ServerEventResponseAnimationBlendshapeDone": "VoiceLive.ServerEventResponseAnimationBlendshapeDone",
"azure.ai.voicelive.models.ServerEventResponseAnimationVisemeDelta": "VoiceLive.ServerEventResponseAnimationVisemeDelta",
@@ -131,6 +136,7 @@
"azure.ai.voicelive.models.ServerEventResponseFileSearchCallSearching": "VoiceLive.ServerEventResponseFileSearchCallSearching",
"azure.ai.voicelive.models.ServerEventResponseFunctionCallArgumentsDelta": "VoiceLive.ServerEventResponseFunctionCallArgumentsDelta",
"azure.ai.voicelive.models.ServerEventResponseFunctionCallArgumentsDone": "VoiceLive.ServerEventResponseFunctionCallArgumentsDone",
"azure.ai.voicelive.models.ServerEventResponseInvocationDelta": "VoiceLive.ServerEventResponseInvocationDelta",
"azure.ai.voicelive.models.ServerEventResponseMcpCallArgumentsDelta": "VoiceLive.ServerEventResponseMcpCallArgumentsDelta",
"azure.ai.voicelive.models.ServerEventResponseMcpCallArgumentsDone": "VoiceLive.ServerEventResponseMcpCallArgumentsDone",
"azure.ai.voicelive.models.ServerEventResponseMcpCallCompleted": "VoiceLive.ServerEventResponseMcpCallCompleted",
@@ -144,6 +150,8 @@
"azure.ai.voicelive.models.ServerEventResponseWebSearchCallCompleted": "VoiceLive.ServerEventResponseWebSearchCallCompleted",
"azure.ai.voicelive.models.ServerEventResponseWebSearchCallInProgress": "VoiceLive.ServerEventResponseWebSearchCallInProgress",
"azure.ai.voicelive.models.ServerEventResponseWebSearchCallSearching": "VoiceLive.ServerEventResponseWebSearchCallSearching",
"azure.ai.voicelive.models.ServerEventRtcCallError": "VoiceLive.ServerEventRtcCallError",
"azure.ai.voicelive.models.ServerEventRtcCallSdpCreated": "VoiceLive.ServerEventRtcCallSdpCreated",
"azure.ai.voicelive.models.ServerEventSessionAvatarConnecting": "VoiceLive.ServerEventSessionAvatarConnecting",
"azure.ai.voicelive.models.ServerEventSessionAvatarSwitchToIdle": "VoiceLive.ServerEventSessionAvatarSwitchToIdle",
"azure.ai.voicelive.models.ServerEventSessionAvatarSwitchToSpeaking": "VoiceLive.ServerEventSessionAvatarSwitchToSpeaking",
@@ -165,35 +173,37 @@
"azure.ai.voicelive.models.VideoParams": "VoiceLive.VideoParams",
"azure.ai.voicelive.models.VideoResolution": "VoiceLive.VideoResolution",
"azure.ai.voicelive.models.VoiceLiveErrorDetails": "VoiceLive.VoiceLiveErrorDetails",
"azure.ai.voicelive.models.ClientEventType": "VoiceLive.ClientEventType",
"azure.ai.voicelive.models.ItemType": "VoiceLive.ItemType",
"azure.ai.voicelive.models.ItemParamStatus": "VoiceLive.ItemParamStatus",
"azure.ai.voicelive.models.MessageRole": "VoiceLive.MessageRole",
"azure.ai.voicelive.models.ContentPartType": "VoiceLive.ContentPartType",
"azure.ai.voicelive.models.Modality": "VoiceLive.Modality",
"azure.ai.voicelive.models.AnimationOutputType": "VoiceLive.AnimationOutputType",
"azure.ai.voicelive.models.OpenAIVoiceName": "VoiceLive.OAIVoice",
"azure.ai.voicelive.models.AzureVoiceType": "VoiceLive.AzureVoiceType",
"azure.ai.voicelive.models.PersonalVoiceModels": "VoiceLive.PersonalVoiceModels",
"azure.ai.voicelive.models.OutputAudioFormat": "VoiceLive.OutputAudioFormat",
"azure.ai.voicelive.models.AzureRealtimeNativeVoiceName": "VoiceLive.AzureRealtimeNativeVoiceName",
"azure.ai.voicelive.models.EouThresholdLevel": "VoiceLive.EouThresholdLevel",
"azure.ai.voicelive.models.TurnDetectionType": "VoiceLive.TurnDetectionType",
"azure.ai.voicelive.models.EchoCancellationReferenceSource": "VoiceLive.EchoCancellationReferenceSource",
"azure.ai.voicelive.models.AvatarConfigTypes": "VoiceLive.AvatarConfigTypes",
"azure.ai.voicelive.models.PhotoAvatarBaseModes": "VoiceLive.PhotoAvatarBaseModes",
"azure.ai.voicelive.models.AvatarOutputProtocol": "VoiceLive.AvatarOutputProtocol",
"azure.ai.voicelive.models.ToolType": "VoiceLive.ToolType",
"azure.ai.voicelive.models.MCPApprovalType": "VoiceLive.MCPApprovalType",
"azure.ai.voicelive.models.ReasoningEffort": "VoiceLive.ReasoningEffort",
"azure.ai.voicelive.models.InterimResponseConfigType": "VoiceLive.InterimResponseConfigType",
"azure.ai.voicelive.models.InterimResponseTrigger": "VoiceLive.InterimResponseTrigger",
"azure.ai.voicelive.models.AnimationOutputType": "VoiceLive.AnimationOutputType",
"azure.ai.voicelive.models.Modality": "VoiceLive.Modality",
"azure.ai.voicelive.models.InputAudioFormat": "VoiceLive.InputAudioFormat",
"azure.ai.voicelive.models.TurnDetectionType": "VoiceLive.TurnDetectionType",
"azure.ai.voicelive.models.EouThresholdLevel": "VoiceLive.EouThresholdLevel",
"azure.ai.voicelive.models.AvatarConfigTypes": "VoiceLive.AvatarConfigTypes",
"azure.ai.voicelive.models.PhotoAvatarBaseModes": "VoiceLive.PhotoAvatarBaseModes",
"azure.ai.voicelive.models.AvatarOutputProtocol": "VoiceLive.AvatarOutputProtocol",
"azure.ai.voicelive.models.OutputAudioFormat": "VoiceLive.OutputAudioFormat",
"azure.ai.voicelive.models.AudioTimestampType": "VoiceLive.AudioTimestampType",
"azure.ai.voicelive.models.ToolChoiceLiteral": "VoiceLive.ToolChoiceLiteral",
"azure.ai.voicelive.models.ReasoningEffort": "VoiceLive.ReasoningEffort",
"azure.ai.voicelive.models.SessionIncludeOption": "VoiceLive.SessionIncludeOption",
"azure.ai.voicelive.models.ClientEventType": "VoiceLive.ClientEventType",
"azure.ai.voicelive.models.ItemType": "VoiceLive.ItemType",
"azure.ai.voicelive.models.ItemParamStatus": "VoiceLive.ItemParamStatus",
"azure.ai.voicelive.models.MessageRole": "VoiceLive.MessageRole",
"azure.ai.voicelive.models.ContentPartType": "VoiceLive.ContentPartType",
"azure.ai.voicelive.models.ResponseStatus": "VoiceLive.ResponseStatus",
"azure.ai.voicelive.models.ResponseItemStatus": "VoiceLive.ResponseItemStatus",
"azure.ai.voicelive.models.RequestImageContentPartDetail": "VoiceLive.RequestImageContentPartDetail",
"azure.ai.voicelive.models.ResponseItemStatus": "VoiceLive.ResponseItemStatus",
"azure.ai.voicelive.models.ServerEventType": "VoiceLive.ServerEventType"
},
"CrossLanguageVersion": "4f7c08a38aa5"
"CrossLanguageVersion": "d4391398f022"
}
Original file line number Diff line number Diff line change
@@ -10,6 +10,8 @@

if TYPE_CHECKING:
from . import models as _models
Voice = Union[str, "_models.OpenAIVoiceName", "_models.OpenAIVoice", "_models.AzureVoice"]
InterimResponseConfig = Union["_models.StaticInterimResponseConfig", "_models.LlmInterimResponseConfig"]
Voice = Union[
str, "_models.OpenAIVoiceName", "_models.OpenAIVoice", "_models.AzureVoice", "_models.AzureRealtimeNativeVoice"
]
ToolChoice = Union[str, "_models.ToolChoiceLiteral", "_models.ToolChoiceSelection"]
InterimResponseConfig = Union["_models.StaticInterimResponseConfig", "_models.LlmInterimResponseConfig"]
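The aliases above rely on string forward references so that the models module is needed only by type checkers. The standalone sketch below shows the same pattern in isolation; `voice_models_stub` is a hypothetical module name that is never imported at runtime, the member list is shortened for illustration, and `accepts_plain_string` is an illustrative helper.

```python
from typing import TYPE_CHECKING, Union, get_args

if TYPE_CHECKING:
    # Hypothetical module; only type checkers ever resolve this import.
    import voice_models_stub as _models

# String members are kept as forward references and are not evaluated here.
Voice = Union[str, "_models.OpenAIVoice", "_models.AzureRealtimeNativeVoice"]


def accepts_plain_string() -> bool:
    """Report whether a bare `str` is one of the accepted voice types."""
    return str in get_args(Voice)
```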