
I’ve built a real-time AI voice assistant using Twilio Media Streams, OpenAI GPT, and ElevenLabs, with audio handled in Python using Quart + Hypercorn. The app connects, transcribes the caller's voice, generates a response, and streams it back via WebSocket, yet when we call in to test it, we hear nothing. Zero audio.

What I'm Trying to Do: Enable two-way voice calls where:

  • The caller speaks → Whisper transcribes
  • GPT-4 replies → ElevenLabs generates audio
  • The response is streamed back to the caller using Twilio’s media stream

I want a natural-sounding, live conversation between customer and AI receptionist.

What Does Work

  • Twilio connects the call and initiates media stream
  • /media WebSocket receives media and events correctly
  • Whisper accurately transcribes caller's speech
  • GPT generates a correct response
  • ElevenLabs returns a valid MP3 file
  • ffmpeg converts that MP3 to µ-law @8000Hz .raw file
  • The stream_audio() function sends 160-byte frames over WebSocket (20ms µ-law chunks)
  • Logs confirm the audio stream is being sent frame-by-frame
  • Call stays connected (with pause), status logs fire
  • Whisper proves Twilio is hearing audio
  • But... the caller hears nothing. No greeting. No response. Just eerie silence.

What I’ve Tried (Extensive Debug History)

  • Confirmed µ-law conversion via:

    ffmpeg -y -i file.mp3 -f mulaw -acodec pcm_mulaw -ar 8000 -ac 1 file.raw

  • Used 160-byte chunks for 20ms @ 8kHz

  • Injected 1 full second of µ-law silence to prime the buffer

  • Added 1s asyncio.sleep() before and after greeting audio

  • Used correct track: "inbound" for audio frames

  • Set Content-Type to audio/mulaw inside the TwiML used for the outbound media stream:

    <Start>
      <Stream url="wss://chat.example.net/media" track="both_tracks">
        <Parameter name="Content-Type" value="audio/mulaw" />
      </Stream>
    </Start>
    
    
  • Verified that .mp3 and .raw audio sound perfect when played back locally

  • Tried the Record TwiML verb to capture what the caller might be hearing (still silent)

  • Confirmed all Whisper and GPT responses are correct and timely

  • WebSocket /media endpoint works with wscat and logs all events correctly

  • No firewall, NGINX issues, or SSL problems (wss:// confirmed working)
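On the silence-priming point above: in µ-law (G.711), a zero-amplitude sample encodes to byte 0xFF, so one second of priming audio is just 8000 of those bytes split into 20 ms frames. A minimal sketch of what that priming could look like (the function name is mine, not from the app):

```python
MULAW_SILENCE = 0xFF  # mu-law (G.711) encoding of a zero-amplitude sample
FRAME_SIZE = 160      # 20 ms at 8000 Hz, 1 byte per sample

def silence_frames(seconds: float = 1.0):
    """Yield 20 ms frames of mu-law silence totalling `seconds` of audio."""
    total = int(8000 * seconds)
    buf = bytes([MULAW_SILENCE]) * total
    for i in range(0, total, FRAME_SIZE):
        yield buf[i:i + FRAME_SIZE]
```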

Relevant Python (streaming audio to Twilio):

import asyncio
import base64
import json
import subprocess

async def stream_audio(ws, stream_sid: str, audio_path: str):
    raw_path = audio_path.rsplit('.', 1)[0] + ".raw"
    # Convert the ElevenLabs MP3 to raw mu-law @ 8000 Hz mono for Twilio
    subprocess.run([
        "ffmpeg", "-y", "-i", audio_path,
        "-f", "mulaw", "-acodec", "pcm_mulaw",
        "-ar", "8000", "-ac", "1", raw_path
    ], check=True)
    with open(raw_path, "rb") as f:
        while chunk := f.read(160):
            msg = {
                "event": "media",
                "streamSid": stream_sid,
                "media": {
                    "track": "inbound",
                    "payload": base64.b64encode(chunk).decode("utf-8")
                }
            }
            await ws.send(json.dumps(msg))
            await asyncio.sleep(0.02)
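For what it's worth, the 160-byte figure checks out arithmetically: µ-law is 8 bits per sample, so 20 ms at 8000 Hz is exactly 8000 × 0.020 × 1 = 160 bytes. A trivial sanity check, independent of Twilio:

```python
SAMPLE_RATE = 8000      # Hz, standard telephony rate
FRAME_MS = 20           # duration of one Twilio media frame
BYTES_PER_SAMPLE = 1    # mu-law is 8 bits per sample

frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE
print(frame_bytes)  # 160
```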

We also have full @app.websocket('/media') logic, TwiML handler, and async startup via Hypercorn.
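Since the full /media handler isn't shown, here is a framework-agnostic sketch of how the incoming start and media events could be parsed; `parse_twilio_event` is a hypothetical helper name, and the message shapes follow Twilio's Media Streams WebSocket format as I understand it:

```python
import base64
import json

def parse_twilio_event(raw: str) -> tuple[str, dict]:
    """Parse one WebSocket message from Twilio; return (event, details)."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "start":
        # This streamSid must be echoed back in every outgoing media frame.
        return event, {"streamSid": msg["start"]["streamSid"]}
    if event == "media":
        # Inbound caller audio: base64-encoded mu-law bytes.
        return event, {"audio": base64.b64decode(msg["media"]["payload"])}
    return event, {}
```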

Suspicions:

  • Twilio is receiving the audio—but silently discarding it
  • We are matching the documentation exactly, yet it’s not being heard
  • Maybe Twilio expects additional headers or handshake data?
  • Possibly some encoding mismatch despite using µ-law @ 8000Hz
  • Or timing/track issues even with correct pacing and silence priming

What I’m Looking For:

If you’ve ever gotten a Python app (not Node.js) working with:

  • Twilio Media Streams (two-way)
  • Whisper / GPT for conversation
  • ElevenLabs for TTS responses

…and the caller actually hears the audio, please share anything you learned!

Or: if you’ve hit this same "everything works but no sound" wall—what fixed it for you?

Environment

  • Python 3.10 (Quart + Hypercorn)
  • AWS Lightsail (Ubuntu 22)
  • Twilio Programmable Voice
  • Whisper (OpenAI)
  • ElevenLabs API
  • FFmpeg v5.x

1 Answer

I'm working on something similar, and even though I couldn't test this myself because of the time it takes to replicate, I'm comparing what you did versus what I did.

Looking at your code, I suspect the main issue is with the track parameter. When you're sending audio back to the caller, you're using:

"track": "inbound"  # ❌ This is incorrect for sending audio TO the callerT
"track": "outbound"  # ✅ Use this when sending audio TO the caller

Remember:

  • "inbound" = audio FROM the caller TO your app

  • "outbound" = audio FROM your app TO the caller

Since you want the caller to hear the AI response, you need "outbound".
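Concretely, the only change to your stream_audio message would be that track field. As a helper (untested against a live call, just mirroring the message shape from your code):

```python
import base64
import json

def outbound_media_message(stream_sid: str, chunk: bytes) -> str:
    """Build one media frame addressed to the caller (note track='outbound')."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {
            "track": "outbound",  # was "inbound" in the question's code
            "payload": base64.b64encode(chunk).decode("utf-8"),
        },
    })
```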

