I’ve built a real-time AI voice assistant using Twilio Media Streams, OpenAI GPT, and ElevenLabs, with audio handled in Python using Quart + Hypercorn. The app connects, transcribes the caller's speech, generates a response, and streams it back via WebSocket; yet when we call in to test it, we hear nothing. Zero audio.
What I'm Trying to Do: Enable two-way voice calls where:
- The caller speaks → Whisper transcribes
- GPT-4 replies → ElevenLabs generates audio
- The response is streamed back to the caller using Twilio’s media stream
I want a natural-sounding, live conversation between customer and AI receptionist.
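In code terms, one conversational turn should look roughly like this (every function name below is a hypothetical stub standing in for the real Whisper / GPT-4 / ElevenLabs call, not an actual SDK API):

```python
import asyncio

# Hypothetical stubs: each stands in for the real service call.
async def transcribe(mulaw_audio: bytes) -> str:
    return "caller speech"                 # stub for Whisper STT

async def generate_reply(text: str) -> str:
    return "reply to: " + text             # stub for GPT-4

async def synthesize(text: str) -> bytes:
    return b"\xff" * 160                   # stub for ElevenLabs TTS

async def handle_turn(mulaw_audio: bytes) -> bytes:
    # One turn: caller audio in, assistant audio out, ready to be
    # mu-law-encoded and streamed back over the same WebSocket.
    text = await transcribe(mulaw_audio)
    reply = await generate_reply(text)
    return await synthesize(reply)

audio_out = asyncio.run(handle_turn(b"\xff" * 160))
```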
What Does Work
- Twilio connects the call and initiates media stream
- /media WebSocket receives media and events correctly
- Whisper accurately transcribes caller's speech
- GPT generates a correct response
- ElevenLabs returns a valid MP3 file
- ffmpeg converts that MP3 to µ-law @8000Hz .raw file
- The stream_audio() function sends 160-byte frames over WebSocket (20ms µ-law chunks)
- Logs confirm the audio stream is being sent frame-by-frame
- Call stays connected (with pause), status logs fire
- Whisper proves Twilio is hearing audio
- But... the caller hears nothing. No greeting. No response. Just eerie silence.
What I’ve Tried (Extensive Debug History)
Confirmed µ-law conversion via:
ffmpeg -y -i file.mp3 -f mulaw -acodec pcm_mulaw -ar 8000 -ac 1 file.raw
Used 160-byte chunks for 20ms @ 8kHz
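The chunk size checks out arithmetically (pure arithmetic, nothing Twilio-specific):

```python
# Sanity-check the frame size: mu-law (G.711) is 1 byte per sample,
# so a 20 ms frame at 8000 Hz mono is exactly 160 bytes.
SAMPLE_RATE = 8000   # Hz, Twilio Media Streams' native rate
FRAME_MS = 20        # frame duration sent per WebSocket message

frame_bytes = SAMPLE_RATE * FRAME_MS // 1000  # 1 byte per sample
print(frame_bytes)  # 160
```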
Injected 1 full second of µ-law silence to prime the buffer
Added 1s asyncio.sleep() before and after greeting audio
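For reference, the priming silence can be generated directly; 0xFF is the µ-law (G.711) code for zero amplitude:

```python
import base64

# One second of mu-law silence at 8 kHz is just 8000 bytes of 0xFF.
SILENCE = b"\xff" * 8000

# Split into the same 160-byte (20 ms) frames used for real audio:
frames = [SILENCE[i:i + 160] for i in range(0, len(SILENCE), 160)]

# Each frame is base64-encoded before going into media.payload:
payload = base64.b64encode(frames[0]).decode("ascii")
```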
Used correct track: "inbound" for audio frames
Set Content-Type to audio/mulaw via a <Parameter> in the TwiML used to start the media stream:
<Start>
  <Stream url="wss://chat.example.net/media" track="both_tracks">
    <Parameter name="Content-Type" value="audio/mulaw" />
  </Stream>
</Start>
Verified that the .mp3 and .raw audio sound perfect when played back locally
Tried <Record> TwiML to capture what the caller might be hearing (still silent)
Confirmed all Whisper and GPT responses are correct and timely
WebSocket /media endpoint works with wscat and logs all events correctly
No firewall, NGINX issues, or SSL problems (wss:// confirmed working)
Relevant Python (streaming audio to Twilio):
import asyncio
import base64
import json
import subprocess

async def stream_audio(ws, stream_sid: str, audio_path: str):
    # Convert the ElevenLabs MP3 to raw mu-law @ 8 kHz mono
    raw_path = audio_path.rsplit('.', 1)[0] + ".raw"
    subprocess.run([
        "ffmpeg", "-y", "-i", audio_path,
        "-f", "mulaw", "-acodec", "pcm_mulaw",
        "-ar", "8000", "-ac", "1", raw_path,
    ], check=True)
    with open(raw_path, "rb") as f:
        while chunk := f.read(160):  # 160 bytes = 20 ms @ 8 kHz mu-law
            msg = {
                "event": "media",
                "streamSid": stream_sid,
                "media": {
                    "track": "inbound",
                    "payload": base64.b64encode(chunk).decode("utf-8"),
                },
            }
            await ws.send(json.dumps(msg))
            await asyncio.sleep(0.02)  # pace frames in real time
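A further check worth wiring in after the last media frame (message shape taken from Twilio's Media Streams docs on bidirectional streams; treat this as a sketch): send a "mark" message. Twilio echoes the same mark back only after the queued audio has actually been played, so a mark that never comes back means the audio was never queued for playback at all.

```python
import json

def mark_message(stream_sid: str, name: str = "greeting-done") -> str:
    # Twilio plays any queued media first, then echoes this mark back
    # as an inbound {"event": "mark"} message once playback finishes.
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })
```

Usage after streaming: `await ws.send(mark_message(stream_sid))`, then watch the /media logs for the echoed mark.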
We also have full @app.websocket('/media') logic, TwiML handler, and async startup via Hypercorn.
Suspicions:
- Twilio is receiving the audio—but silently discarding it
- We are matching the documentation exactly, yet it’s not being heard
- Maybe Twilio expects additional headers or handshake data?
- Possibly some encoding mismatch despite using µ-law @ 8000Hz
- Or timing/track issues even with correct pacing and silence priming
What I’m Looking For:
- If you’ve ever gotten a Python app (not Node.js) working with:
  - Twilio Media Streams (two-way)
  - Whisper / GPT for conversation
  - ElevenLabs for TTS responses
  ...and the caller actually hears the audio, please share anything you learned!
- Or: if you’ve hit this same "everything works but no sound" wall, what fixed it for you?
Environment
- Python 3.10 (Quart + Hypercorn)
- AWS Lightsail (Ubuntu 22)
- Twilio Programmable Voice
- Whisper (OpenAI)
- ElevenLabs API
- FFmpeg 5.x