File: BASIC_VOICE_ASSISTANT.md

# Basic Voice Assistant Sample

This sample demonstrates the fundamental capabilities of the Azure VoiceLive SDK by building a basic voice assistant that holds a natural conversation and handles interruptions gracefully.

## Features

- **Real-time Voice Conversation**: Full duplex audio communication with the AI assistant
- **Server VAD**: Automatic turn detection using server-side voice activity detection
- **Interruption Handling**: Graceful handling of user interruptions during assistant responses
- **Cross-platform Audio**: PCM16, 24kHz, mono audio format with platform-appropriate handling
- **Error Recovery**: Robust connection management with exponential backoff retry logic
- **Environment Configuration**: Easy setup using .env files

## Prerequisites

- Python 3.8 or later
- Azure subscription with access to Azure AI VoiceLive
- Microphone and speakers/headphones
- Required Python packages (see installation below)

## Installation

1. **Install required packages**:
   ```bash
   pip install azure-ai-voicelive pyaudio python-dotenv
   ```

2. **Set up environment variables**:
   
   Copy the `.env.template` file to `.env` and fill in your values:
   ```bash
   cp ../.env.template .env
   ```
   
   Or set environment variables directly:
   ```bash
   export AZURE_VOICELIVE_KEY="your-api-key"
   export AZURE_VOICELIVE_ENDPOINT="wss://api.voicelive.com/v1"
   export VOICELIVE_MODEL="gpt-4o-realtime-preview"
   export VOICELIVE_VOICE="en-US-AvaNeural"
   ```
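
   In code, the same values can be loaded with `python-dotenv`; a minimal sketch using the variable names above:
   ```python
   import os

   from dotenv import load_dotenv

   # Load variables from a local .env file into the process environment.
   load_dotenv()

   api_key = os.environ["AZURE_VOICELIVE_KEY"]
   endpoint = os.environ.get("AZURE_VOICELIVE_ENDPOINT", "wss://api.voicelive.com/v1")
   model = os.environ.get("VOICELIVE_MODEL", "gpt-4o-realtime-preview")
   voice = os.environ.get("VOICELIVE_VOICE", "en-US-AvaNeural")
   ```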

## Usage

### Basic Usage

Run the voice assistant with default settings:

```bash
python basic_voice_assistant.py
```

### Advanced Usage

Customize the voice assistant with command-line options:

```bash
python basic_voice_assistant.py \
    --model gpt-4o-realtime-preview \
    --voice en-US-JennyNeural \
    --instructions "You are a helpful coding assistant. Keep responses technical but friendly."
```

### Available Options

- `--api-key`: Azure VoiceLive API key (or use AZURE_VOICELIVE_KEY env var)
- `--endpoint`: VoiceLive endpoint URL
- `--model`: Model to use (default: gpt-4o-realtime-preview)
- `--voice`: Voice for the assistant (alloy, echo, fable, onyx, nova, shimmer, en-US-AvaNeural, etc.)
- `--instructions`: Custom system instructions for the AI
- `--use-token-credential`: Use Azure token authentication instead of an API key (see the credential sketch below)
- `--verbose`: Enable detailed logging
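
The `--use-token-credential` switch corresponds to choosing between an API key and Azure AD token authentication. A sketch of that choice using `azure.identity`; the `build_credential` helper is illustrative, not part of the sample:

```python
from typing import Optional

from azure.core.credentials import AzureKeyCredential
from azure.identity.aio import DefaultAzureCredential

def build_credential(api_key: Optional[str], use_token_credential: bool):
    """Return a token credential or an API-key credential."""
    if use_token_credential:
        # Tries environment variables, managed identity, Azure CLI login, etc.
        return DefaultAzureCredential()
    if not api_key:
        raise ValueError("Provide --api-key or set AZURE_VOICELIVE_KEY")
    return AzureKeyCredential(api_key)
```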

## How It Works

### 1. Session Setup
The assistant establishes a WebSocket connection and configures the session using strongly-typed objects (a connection sketch follows the list):
- Bidirectional audio streaming (PCM16, 24kHz, mono)
- Server-side voice activity detection with configurable thresholds
- Strongly-typed session configuration (RequestSession)
- Azure voice configuration (AzureStandardVoice) or OpenAI voices
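
A sketch of the connection step, assuming the async `connect` helper from `azure.ai.voicelive.aio` that this sample uses (parameter names may vary across SDK versions):

```python
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async def run(endpoint: str, api_key: str, session_config) -> None:
    # connect() opens the WebSocket; the typed RequestSession (see the
    # example near the end of this README) is applied via session.update().
    async with connect(
        endpoint=endpoint,
        credential=AzureKeyCredential(api_key),
        model="gpt-4o-realtime-preview",
    ) as connection:
        await connection.session.update(session=session_config)
```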

### 2. Audio Processing
Real-time audio capture and playback using PyAudio (capture sketch after the list):
- Microphone input is captured in 1024-sample chunks
- Audio is base64-encoded and streamed to VoiceLive
- Response audio is decoded and played back immediately
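
A minimal capture sketch; `pump_microphone` is a hypothetical helper, and the `input_audio_buffer.append` call shape is assumed to mirror this sample's buffer-append step:

```python
import base64

CHUNK_SAMPLES = 1024  # one microphone read, per the description above

async def pump_microphone(connection, stream) -> None:
    """Stream base64-encoded PCM16 chunks into the service's input buffer."""
    while True:
        # A production loop would read off-thread so this doesn't block
        # the event loop; kept inline here for brevity.
        chunk = stream.read(CHUNK_SAMPLES, exception_on_overflow=False)
        encoded = base64.b64encode(chunk).decode("ascii")
        # Assumed call shape, mirroring the sample's buffer-append step:
        await connection.input_audio_buffer.append(audio=encoded)
```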

### 3. Event Handling
Strongly-typed event handling for real-time communication (dispatch sketch after the list):
- `session.updated`: Session ready, start audio capture
- `input_audio_buffer.speech_started`: User speaking, stop playback
- `input_audio_buffer.speech_stopped`: User finished, process input  
- `response.audio.delta`: Stream response audio to speakers
- `response.audio.done`: Assistant finished speaking
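
A minimal dispatch loop over these events; the `ServerEventType` names are assumed from the SDK's typed-event design, and the three handlers are placeholders:

```python
from azure.ai.voicelive.models import ServerEventType

def start_audio_capture() -> None: ...  # placeholder: begin mic streaming
def stop_playback() -> None: ...        # placeholder: halt speaker output
def play_audio(b64_chunk: str) -> None: ...  # placeholder: decode and enqueue

async def handle_events(connection) -> None:
    # The connection yields typed server events; dispatch on event.type.
    async for event in connection:
        if event.type == ServerEventType.SESSION_UPDATED:
            start_audio_capture()
        elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED:
            stop_playback()
        elif event.type == ServerEventType.RESPONSE_AUDIO_DELTA:
            play_audio(event.delta)
        elif event.type == ServerEventType.RESPONSE_AUDIO_DONE:
            pass  # assistant finished speaking
```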

### 4. Interruption Support
Graceful handling of user interruptions (sketched below):
- Automatic detection when user starts speaking
- Immediate cancellation of ongoing assistant responses
- Seamless resumption of conversation flow
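
A sketch of that cancellation path, assuming a `response.cancel()` call on the connection and a hypothetical `player` wrapping the output stream:

```python
async def on_speech_started(connection, player) -> None:
    # Stop local playback immediately so stale assistant audio doesn't
    # play over the user, then cancel the in-flight response server-side.
    player.stop()
    await connection.response.cancel()
```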

## Audio Requirements

- **Input**: Microphone capable of 24kHz sampling
- **Output**: Speakers or headphones
- **Format**: PCM16, 24kHz, mono
- **Latency**: Optimized for real-time conversation (< 300ms end-to-end)
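
These requirements map directly onto PyAudio stream parameters:

```python
import pyaudio

pa = pyaudio.PyAudio()

# PCM16, 24kHz, mono, matching the format listed above.
mic = pa.open(format=pyaudio.paInt16, channels=1, rate=24000,
              input=True, frames_per_buffer=1024)
speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=24000,
                  output=True)
```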

## Troubleshooting

### Audio Issues
- **No microphone detected**: Check microphone permissions and connections
- **No audio output**: Verify speaker/headphone setup and volume levels
- **Audio quality issues**: Ensure quiet environment and quality microphone

### Connection Issues
- **Authentication errors**: Verify API key and endpoint configuration
- **Network timeouts**: Check internet connection and firewall settings
- **Connection drops**: The SDK reconnects internally using exponential backoff retry logic

### Dependencies
- **PyAudio installation**: On some systems, PyAudio requires additional setup:
  ```bash
  # Ubuntu/Debian
  sudo apt-get install portaudio19-dev
  pip install pyaudio
  
  # macOS
  brew install portaudio
  pip install pyaudio
  
  # Windows
  pip install pyaudio
  ```

## Key Learning Outcomes

After running this sample, developers will understand:

1. **SDK Integration**: How to integrate VoiceLive SDK into applications
2. **Strongly-Typed Configuration**: Using typed objects for session setup instead of raw JSON
3. **Event-Driven Architecture**: Handling real-time events with strong typing
4. **Audio Processing**: Platform-specific audio handling through SDK abstractions
5. **Connection Management**: Letting the SDK handle connection retry logic internally
6. **Real-Time Applications**: Building responsive voice-enabled applications

### Strongly-Typed Configuration Example

The sample demonstrates proper use of typed objects:

```python
# Typed models ship in the SDK's models module (import path assumed
# to match this sample's SDK version).
from azure.ai.voicelive.models import (
    AudioFormat,
    AzureStandardVoice,
    Modality,
    RequestSession,
    ServerVad,
)

# Strongly-typed voice configuration
voice_config = AzureStandardVoice(
    name="en-US-AvaNeural",
    type="azure-standard"
)

# Strongly-typed turn detection
turn_detection_config = ServerVad(
    threshold=0.5,
    prefix_padding_ms=300,
    silence_duration_ms=500
)

# Strongly-typed session configuration
session_config = RequestSession(
    modalities=[Modality.TEXT, Modality.AUDIO],
    instructions=instructions,
    voice=voice_config,
    input_audio_format=AudioFormat.PCM16,
    output_audio_format=AudioFormat.PCM16,
    turn_detection=turn_detection_config
)

await connection.session.update(session=session_config)
```

## Next Steps

This foundational sample can be extended with:

- **Custom voice selection**: Different voice options and characteristics
- **Audio controls**: Volume adjustment and quality settings
- **Session persistence**: Conversation history and context management
- **Multi-language support**: Language detection and switching
- **Custom instructions**: Dynamic instruction updates during conversation

## Related Samples

- `audio_streaming_sample.py`: Low-level audio streaming patterns
- `typed_event_handling_sample.py`: Advanced event handling techniques
- `session_management_sample.py`: Session configuration and management