I'm building a phone call application using Twilio Media Streams.
The workflow is as follows:
Twilio Media Stream → Google STT (Streaming) → LLM → TTS
I'm using the sample code from the following GitHub repository: https://github.com/twilio/media-streams/tree/master/python/realtime-transcriptions
I've modified the on_transcription_response function as shown below:
def on_transcription_response(response):
    if not response.results:
        return
    result = response.results[0]
    if not result.alternatives:
        return
    transcription = result.alternatives[0].transcript
    print("Transcription: " + transcription + " is_final: " + str(result.is_final))
The issue is that result.is_final is never True, so I never get a finished transcript to send to the LLM.
I tried adding an is_silence function that skips silent audio chunks, but is_final is still always False.
import audioop

def is_silence(buffer, threshold=500):
    pcm = audioop.ulaw2lin(buffer, 2)  # Convert to 16-bit PCM
    rms = audioop.rms(pcm, 2)  # Calculate root mean square amplitude
    return rms < threshold

def add_request(self, buffer):
    if is_silence(buffer):
        print("Skipping silence based on amplitude")
        return
    self._queue.put(bytes(buffer), block=False)
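One caveat with the is_silence check above: audioop was removed from the standard library in Python 3.13. Below is a minimal sketch of the same amplitude check without it. Instead of dropping silent chunks, it only counts how long the silence has lasted, so every chunk could still be forwarded to the recognizer. The names ulaw_to_linear, rms, and SilenceTracker and the 800 ms pause are hypothetical, and the sketch assumes 8 kHz mu-law audio. Whether forwarding the silence is enough to make is_final fire is untested here.

```python
import math

BIAS = 0x84  # G.711 mu-law bias


def ulaw_to_linear(byte):
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF
    magnitude = (((u & 0x0F) << 3) + BIAS) << ((u >> 4) & 0x07)
    magnitude -= BIAS
    return -magnitude if u & 0x80 else magnitude


def rms(chunk):
    """Root mean square amplitude of a mu-law chunk (bytes)."""
    if not chunk:
        return 0.0
    total = sum(ulaw_to_linear(b) ** 2 for b in chunk)
    return math.sqrt(total / len(chunk))


class SilenceTracker:
    """Tracks how long the caller has been silent after speaking."""

    def __init__(self, threshold=500, pause_ms=800, sample_rate=8000):
        self.threshold = threshold      # same RMS threshold as is_silence
        self.pause_ms = pause_ms        # assumed pause length, tune as needed
        self.sample_rate = sample_rate  # one mu-law byte per sample
        self.heard_speech = False
        self.silent_ms = 0.0

    def feed(self, chunk):
        """Return True once a pause of pause_ms follows some speech."""
        if rms(chunk) >= self.threshold:
            self.heard_speech = True
            self.silent_ms = 0.0
            return False
        self.silent_ms += len(chunk) * 1000 / self.sample_rate
        if self.heard_speech and self.silent_ms >= self.pause_ms:
            # Reset so the next utterance is tracked independently.
            self.heard_speech = False
            self.silent_ms = 0.0
            return True
        return False
```

With this, add_request could queue every chunk unconditionally and call feed(chunk) alongside; a True return is only a local hint that the caller paused, which might serve as a cue to hand the latest interim transcript to the LLM.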
Additionally, I need to recognize speech continuously with language_code="yue-Hant-HK", since the caller may speak at any time during the call. I don't want recognition to stop after a single utterance; the STT should stay active and detect complete sentences as they occur.
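For reference, a rough sketch of the streaming configuration this requirement implies, assuming the google-cloud-speech v1 Python client. STREAM_SETTINGS and build_streaming_config are hypothetical names, and the values are assumptions about the setup, not the sample repository's exact code.

```python
# Assumed settings; the keys mirror fields of the v1 RecognitionConfig and
# StreamingRecognitionConfig messages.
STREAM_SETTINGS = {
    "sample_rate_hertz": 8000,       # Twilio Media Streams send 8 kHz mu-law audio
    "language_code": "yue-Hant-HK",
    "interim_results": True,         # also emit partial results (is_final=False)
    "single_utterance": False,       # keep listening after the first utterance
}


def build_streaming_config(settings=STREAM_SETTINGS):
    """Build a StreamingRecognitionConfig from the settings (hypothetical helper)."""
    # Imported lazily so this module loads without google-cloud-speech installed.
    from google.cloud import speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.MULAW,
        sample_rate_hertz=settings["sample_rate_hertz"],
        language_code=settings["language_code"],
    )
    return speech.StreamingRecognitionConfig(
        config=config,
        interim_results=settings["interim_results"],
        single_utterance=settings["single_utterance"],
    )
```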
Any suggestions on how to handle this with Google STT streaming while keeping is_final working properly?
cheers