
I’ve built a real-time AI voice assistant using Twilio Media Streams, OpenAI GPT, and ElevenLabs, with audio handled in Python using Quart + Hypercorn. The app connects, transcribes the caller's voice, generates a response, and streams it back via WebSocket, yet when we call in to test it, we hear nothing. Zero audio.

What I'm Trying to Do: Enable two-way voice calls where:

  • The caller speaks → Whisper transcribes
  • GPT-4 replies → ElevenLabs generates audio
  • The response is streamed back to the caller using Twilio’s media stream

I want a natural-sounding, live conversation between customer and AI receptionist.

What Does Work

  • Twilio connects the call and initiates media stream
  • /media WebSocket receives media and events correctly
  • Whisper accurately transcribes caller's speech
  • GPT generates a correct response
  • ElevenLabs returns a valid MP3 file
  • ffmpeg converts that MP3 to µ-law @8000Hz .raw file
  • The stream_audio() function sends 160-byte frames over WebSocket (20ms µ-law chunks)
  • Logs confirm the audio stream is being sent frame-by-frame
  • Call stays connected (with pause), status logs fire
  • Whisper proves Twilio is hearing audio
  • But... the caller hears nothing. No greeting. No response. Just eerie silence.

What I’ve Tried (Extensive Debug History)

  • Confirmed µ-law conversion via:

    ffmpeg -y -i file.mp3 -f mulaw -acodec pcm_mulaw -ar 8000 -ac 1 file.raw

  • Used 160-byte chunks for 20ms @ 8kHz

  • Injected 1 full second of µ-law silence to prime the buffer

  • Added 1s asyncio.sleep() before and after greeting audio

  • Used correct track: "inbound" for audio frames

  • Set Content-Type to audio/mulaw inside the TwiML used for the outbound media stream:

    <Start>
      <Stream url="wss://chat.example.net/media" track="both_tracks">
        <Parameter name="Content-Type" value="audio/mulaw" />
      </Stream>
    </Start>
    
    
  • Verified that .mp3 and .raw audio sound perfect when played back locally

  • Tried the Record TwiML verb to capture what the caller might be hearing (still silent)

  • Confirmed all Whisper and GPT responses are correct and timely

  • WebSocket /media endpoint works with wscat and logs all events correctly

  • No firewall, NGINX issues, or SSL problems (wss:// confirmed working)
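On the silence-priming point above: in µ-law (G.711), a zero-amplitude sample encodes to byte 0xFF, so one second of priming audio is just 8000 of those bytes split into 20 ms frames. A minimal sketch of what that priming could look like (the function name is mine, not from the app):

```python
MULAW_SILENCE = 0xFF  # mu-law (G.711) encoding of a zero-amplitude sample
FRAME_SIZE = 160      # 20 ms at 8000 Hz, 1 byte per sample

def silence_frames(seconds: float = 1.0):
    """Yield 20 ms frames of mu-law silence totalling `seconds` of audio."""
    total = int(8000 * seconds)
    buf = bytes([MULAW_SILENCE]) * total
    for i in range(0, total, FRAME_SIZE):
        yield buf[i:i + FRAME_SIZE]
```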

Relevant Python (streaming audio to Twilio):

import asyncio
import base64
import json
import subprocess

async def stream_audio(ws, stream_sid: str, audio_path: str):
    raw_path = audio_path.rsplit('.', 1)[0] + ".raw"
    # Convert the ElevenLabs MP3 to raw mu-law @ 8000 Hz mono for Twilio
    subprocess.run([
        "ffmpeg", "-y", "-i", audio_path,
        "-f", "mulaw", "-acodec", "pcm_mulaw",
        "-ar", "8000", "-ac", "1", raw_path
    ], check=True)
    with open(raw_path, "rb") as f:
        while chunk := f.read(160):
            msg = {
                "event": "media",
                "streamSid": stream_sid,
                "media": {
                    "track": "inbound",
                    "payload": base64.b64encode(chunk).decode("utf-8")
                }
            }
            await ws.send(json.dumps(msg))
            await asyncio.sleep(0.02)
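For what it's worth, the 160-byte figure checks out arithmetically: µ-law is 8 bits per sample, so 20 ms at 8000 Hz is exactly 8000 × 0.020 × 1 = 160 bytes. A trivial sanity check, independent of Twilio:

```python
SAMPLE_RATE = 8000      # Hz, standard telephony rate
FRAME_MS = 20           # duration of one Twilio media frame
BYTES_PER_SAMPLE = 1    # mu-law is 8 bits per sample

frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE
print(frame_bytes)  # 160
```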

We also have full @app.websocket('/media') logic, TwiML handler, and async startup via Hypercorn.
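Since the full /media handler isn't shown, here is a framework-agnostic sketch of how the incoming start and media events could be parsed; `parse_twilio_event` is a hypothetical helper name, and the message shapes follow Twilio's Media Streams WebSocket format as I understand it:

```python
import base64
import json

def parse_twilio_event(raw: str) -> tuple[str, dict]:
    """Parse one WebSocket message from Twilio; return (event, details)."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "start":
        # This streamSid must be echoed back in every outgoing media frame.
        return event, {"streamSid": msg["start"]["streamSid"]}
    if event == "media":
        # Inbound caller audio: base64-encoded mu-law bytes.
        return event, {"audio": base64.b64decode(msg["media"]["payload"])}
    return event, {}
```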

Suspicions:

  • Twilio is receiving the audio—but silently discarding it
  • We are matching the documentation exactly, yet it’s not being heard
  • Maybe Twilio expects additional headers or handshake data?
  • Possibly some encoding mismatch despite using µ-law @ 8000Hz
  • Or timing/track issues even with correct pacing and silence priming

What I’m Looking For:

If you’ve ever gotten a Python app (not Node.js) working with:

  • Twilio Media Streams (two-way)
  • Whisper / GPT for conversation
  • ElevenLabs for TTS responses

…and the caller actually hears the audio, please share anything you learned!

Or: if you’ve hit this same "everything works but no sound" wall—what fixed it for you?

Environment

  • Python 3.10 (Quart + Hypercorn)
  • AWS Lightsail (Ubuntu 22)
  • Twilio Programmable Voice
  • Whisper (OpenAI)
  • ElevenLabs API
  • FFmpeg v5.x

1 Answer

I'm working on something similar, and even though I couldn't test this myself because of the time it takes to replicate, I'm comparing what you did versus what I did.

Looking at your code, I suspect the main issue is with the track parameter. When you're sending audio back to the caller, you're using:

"track": "inbound"  # ❌ This is incorrect for sending audio TO the callerT
"track": "outbound"  # ✅ Use this when sending audio TO the caller

Remember:

  • "inbound" = audio FROM the caller TO your app

  • "outbound" = audio FROM your app TO the caller

Since you want the caller to hear the AI response, you need "outbound".
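Concretely, the only change to your stream_audio message would be that track field. As a helper (untested against a live call, just mirroring the message shape from your code):

```python
import base64
import json

def outbound_media_message(stream_sid: str, chunk: bytes) -> str:
    """Build one media frame addressed to the caller (note track='outbound')."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {
            "track": "outbound",  # was "inbound" in the question's code
            "payload": base64.b64encode(chunk).decode("utf-8"),
        },
    })
```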

