I’m building a system where a bot joins a Google Meet call and produces a live transcript.
Right now, I inject JavaScript into the Meet tab (through a browser automation bot) and scrape the captions from the DOM. This works, but the transcription quality is very poor:

  • Many words are wrong/missing

  • Google Meet system messages (join/leave/prompts) appear inside the transcript

  • Sometimes only partial captions appear

  • The accuracy is far below what Google Meet itself shows to users

As far as I can tell, Google Meet does not provide an official API that exposes live captions, speaker labels, or meeting audio to a bot, and I have not found a way to capture the tab audio directly from injected JavaScript when no human is present to grant permissions.

What I want to know

Is there a reliable, free, open-source method to capture high-quality audio or transcripts from Google Meet when a bot joins the call?

Details about my environment

  • The bot is running on an Ubuntu VM (Civo cloud)

  • I can run a headful Chrome instance (via Puppeteer or Selenium)

  • I’m okay with recording system/tab audio if possible

  • I want to avoid paid APIs (e.g., Vexa, paid STT APIs)

  • Goal is to feed the audio into a local STT engine (Whisper, WhisperX, etc.)
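To make the last point concrete, here is a minimal sketch of the transcription side I have in mind, assuming the capture step already yields a PCM WAV file. The names `split_wav` and `transcribe_chunks` are hypothetical helpers, the 30-second chunk length is an arbitrary choice, and the Whisper calls assume the `openai-whisper` package is installed with ffmpeg on the PATH.

```python
import wave
from pathlib import Path


def split_wav(src: str, out_dir: str, chunk_seconds: int = 30) -> list[str]:
    """Split a PCM WAV file into fixed-length chunks and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: list[str] = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = reader.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            path = out / f"chunk_{index:04d}.wav"
            with wave.open(str(path), "wb") as writer:
                # Same channels, sample width, and rate as the source;
                # the frame count in the header is corrected on close.
                writer.setparams(params)
                writer.writeframes(frames)
            paths.append(str(path))
            index += 1
    return paths


def transcribe_chunks(paths: list[str], model_name: str = "base") -> list[str]:
    """Transcribe each chunk locally with Whisper (imported lazily)."""
    import whisper  # assumption: `pip install openai-whisper`

    model = whisper.load_model(model_name)
    return [model.transcribe(p)["text"].strip() for p in paths]
```

Chunking on fixed boundaries can cut words in half, so overlapping chunks or a voice-activity split would probably be better in practice; this only shows the shape of the pipeline.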

What I’ve already tried

  1. DOM scraping of captions → poor quality, noisy, system messages mixed with speech

  2. Exploring Chrome getDisplayMedia → I could not auto-grant the permission from a bot; the call fails because of the user-gesture requirement

  3. Investigating WebRTC internals → it seems impossible to intercept other participants’ audio tracks from JS

  4. Searching for Meet API → none exists for transcripts/audio
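For attempt 1, this is a minimal sketch of the kind of post-filtering that can be applied to scraped caption lines. The regex patterns are hypothetical examples of system messages, not an exhaustive or verified list of what Meet emits, and the prefix-merging step assumes that a partial caption is later re-emitted in a longer form. It removes some noise but does nothing for words that are wrong or missing in the first place.

```python
import re

# Hypothetical patterns for Meet system messages; adjust to what is
# actually observed in the scraped DOM text.
SYSTEM_PATTERNS = [
    re.compile(r"\b(joined|left) the (call|meeting)\b", re.IGNORECASE),
    re.compile(r"^(you|your) .*\b(muted|unmuted|presenting)\b", re.IGNORECASE),
    re.compile(r"^captions? (are|is) (on|off)", re.IGNORECASE),
]


def is_system_message(line: str) -> bool:
    """Return True for empty lines and lines matching a system pattern."""
    text = line.strip()
    return not text or any(p.search(text) for p in SYSTEM_PATTERNS)


def clean_captions(lines: list[str]) -> list[str]:
    """Drop system messages and merge partial captions into the longest form."""
    cleaned: list[str] = []
    for line in lines:
        text = line.strip()
        if is_system_message(text):
            continue
        if cleaned and text.startswith(cleaned[-1]):
            # The new line extends the previous partial caption: replace it.
            cleaned[-1] = text
        elif cleaned and cleaned[-1].startswith(text):
            # The new line is a shorter repeat of the previous one: skip it.
            continue
        else:
            cleaned.append(text)
    return cleaned
```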

My questions

  1. Is there a technically feasible way to capture Google Meet tab/system audio on a Linux VM using a bot?

     • e.g., using a PulseAudio monitor source, null sinks, Chrome flags, or tabCapture

  2. Has anyone successfully implemented a Google Meet bot → audio capture → local transcription (Whisper) pipeline?

  3. Are there any reliable open-source approaches, or is the only stable method to record system audio at the OS level and bypass Meet entirely?

  4. Any known limitations with Chrome/Puppeteer + Meet that I should be aware of?

My goal

I’m not trying to break security. I just want a bot that can hear the meeting audio the way a human attendee does, transcribe it locally, and avoid the low-quality DOM caption scraping.

What is the best technical approach to achieve this?

varaha is a new contributor to this site.
