Conversation

Contributor

@ramen ramen commented Dec 29, 2024

This change adds a new ASR engine that uses whisper.cpp via the pywhispercpp library, enabling GPU-accelerated transcription on Apple Silicon. Fixes #125

To use whisper.cpp, set the ASR_ENGINE=whisper_cpp environment variable when starting the service or Docker container. Note that GPU acceleration is only available outside of Docker, since it requires access to Apple's libraries.

Example:
ASR_ENGINE=whisper_cpp poetry run python3.10 app/run.py --build-reascripts

There are a few differences with the whisper.cpp engine:

  • Word-level transcripts work differently: there is currently no way to get a segment/word hierarchy. Instead, there is a mode that essentially returns one segment per word. That doesn't mesh well with a number of ReaSpeech's features, so those features are disabled for this engine.
  • Resource consumption is lower. I was able to process 1.5 hours of audio using the small model on my M1 Mac mini with 8GB RAM, and with GPU acceleration, it only took about 6 minutes. This would previously have caused an out of memory condition on this hardware.
  • Timestamps seem, anecdotally, less accurate than faster-whisper's.
  • There is no Voice Activity Detection (VAD), though it might be possible to add using a library called "webrtcvad".
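To sketch what VAD via webrtcvad could look like: webrtcvad only accepts 10, 20, or 30 ms frames of 16-bit mono PCM at 8/16/32/48 kHz, so the raw audio has to be sliced into fixed-size frames first. This is a hypothetical sketch, not code from this PR; frame_generator and speech_frames are made-up helper names.

```python
def frame_generator(pcm_bytes, sample_rate=16000, frame_ms=30):
    # webrtcvad only accepts 10/20/30 ms frames of 16-bit mono PCM
    # at 8/16/32/48 kHz; slice the raw buffer accordingly.
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for i in range(0, len(pcm_bytes) - bytes_per_frame + 1, bytes_per_frame):
        yield pcm_bytes[i:i + bytes_per_frame]

def speech_frames(pcm_bytes, sample_rate=16000, aggressiveness=2):
    # webrtcvad is a third-party package (pip install webrtcvad);
    # imported lazily so frame_generator stays dependency-free.
    import webrtcvad
    vad = webrtcvad.Vad(aggressiveness)  # 0 (permissive) .. 3 (aggressive)
    for frame in frame_generator(pcm_bytes, sample_rate):
        if vad.is_speech(frame, sample_rate):
            yield frame
```

Non-speech frames could then be dropped (or used to pre-split the audio) before handing it to whisper.cpp.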

This change includes an improvement to run.py that prevents Ctrl-C from being sent to subprocesses and enables a graceful shutdown when running ReaSpeech outside of Docker.
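Roughly, the subprocess handling amounts to starting each child in its own process group, so that a terminal Ctrl-C only reaches run.py, which can then wind things down itself. This is an illustrative sketch, not the actual run.py code; launch_service and run_until_interrupted are made-up names.

```python
import subprocess

def launch_service(cmd):
    # Start the child in its own session/process group so a Ctrl-C in
    # the terminal (which signals the whole foreground process group)
    # is delivered only to run.py, not to the children.
    return subprocess.Popen(cmd, start_new_session=True)

def run_until_interrupted(procs):
    # On Ctrl-C, shut the children down deliberately instead of
    # letting the terminal kill them mid-write.
    try:
        for p in procs:
            p.wait()
    except KeyboardInterrupt:
        for p in procs:
            p.terminate()
        for p in procs:
            p.wait()
```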

@ramen ramen requested review from mikeylove and smrl December 29, 2024 17:33
Contributor Author

ramen commented Dec 29, 2024

whisper-cpp-apple-gpu

Contributor

@mikeylove left a comment

this is awesome! after a bit of initial confusion, my test file now transcribes in less than a second (on my M4 Pro Mini). one small diff i noticed between small.en and small is that the former had a trailing segment that said [BLANK AUDIO].

re: that initial confusion, the "loading model" phase was being reported on the reaspeech side as "transcribing." before i realized this, i was puzzled why it seemed to be 1) reporting very small but regular progress updates, 2) taking forever and 3) being nearly instant on subsequent transcriptions.

not sure how to think about the "split on words" option. the output from running with this turned on is (obviously) pretty bulky. i'm having a hard time coming up with a hypothetical situation where words-as-top-level would be useful. i'm also quite aware that this is very possibly only a limit of my own imagination. 😂

my individual comments here are mostly to identify issues/updates to pursue outside of this pr.

Comment on lines 61 to 62
if not segment.text:
continue
Contributor

curious about what the purpose of a segment without text is 🤔

Contributor Author

This was due to the first segment being a "beginning of text" sort of token with no actual content. I think that I have a better way to solve this now.

Comment on lines +90 to +93
if output == "json":
json.dump(result, file)
else:
return 'Please select an output method!'
Contributor

in the other engines we offer a bunch of output formats...but do we ever use anything but json in those either? the whole backend interface is json-based. maybe this was useful when using the web interface?

Contributor Author

We don't - this design came from whisper-asr-webservice, and was meant to support both the web interface and the API, which included export functionality. We don't use this part of the API, since we build our exports in Lua. We could copy over (or make reusable) the export code from the faster-whisper engine, but since we don't use it, that seemed like a waste of effort. It's a bit of a funky design at the moment.

Comment on lines +134 to +135
- `ASR_ENGINE`: The ASR engine to use. Options are `faster_whisper` (default),
`openai_whisper`, and `whisper_cpp`.
Contributor

not in the scope of this pr but wondering if we should provide some basic information (and link to project) about these engines. my selection process was "works on my old macbook outside of docker" and that was always openai_whisper because i could never resolve the library conflicts causing faster_whisper to crash.

hard to say what the right way to describe this is to someone interested in development lol

Contributor Author

Yeah, I agree this should be documented somewhere. I think picking the right engine is not a problem that most users should have to solve, but in this particular case (for Mac users), it's the difference between GPU acceleration and not. That seems important enough that users should understand how and why to do it.

Contributor

reacts with "nod" emoji

Comment on lines 18 to 23
# Start all services
poetry run python3.10 app/run.py
# Start all services except for Redis
poetry run python3.10 app/run.py --no-start-redis
Contributor

should this section call out why one might want to run the --no-start-redis version or is it obvious enough? feels like maybe the user who needs the option would understand the difference but another less developer-minded might not. 🤔🤔🤔🤔

Contributor Author

Yes, good point. I'll update it. The motivation was that in some cases, Redis is installed using an OS package (Debian package, Homebrew, etc.), and the service is started and managed by the OS infrastructure. I'd really like to make Redis optional - feature #95

Contributor Author

ramen commented Jan 2, 2025

this is awesome! after a bit of initial confusion, my test file now transcribes in less than a second (on my M4 Pro Mini). one small diff i noticed between small.en and small is that the former had a trailing segment that said [BLANK AUDIO].

I have noticed more of this type of thing as well - I've also seen things like "[Laughs]" which I don't recall seeing with any other engine. In some cases, these are special tokens, and I have a way to filter those out, which I'll incorporate into this PR.
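For reference, filtering those out could look something like this (a hypothetical sketch, not the PR's actual fix; content_segments is a made-up name, and the real pywhispercpp segments carry timestamps and more than just text):

```python
def content_segments(segments):
    # Drop segments that carry no speech content: empty text (e.g. a
    # leading beginning-of-text marker) and bracketed marker output
    # such as "[BLANK_AUDIO]" or "[Laughs]".
    kept = []
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue
        if text.startswith("[") and text.endswith("]"):
            continue
        kept.append(seg)
    return kept
```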

re: that initial confusion, the "loading model" phase was being reported on the reaspeech side as "transcribing." before i realized this, i was puzzled why it seemed to be 1) reporting very small but regular progress updates, 2) taking forever and 3) being nearly instant on subsequent transcriptions.

Yes, I think the way that this library (pywhispercpp) reports progress is odd, and basically reports the model loading progress but not the transcription. At least there's some visual feedback about the work it's doing.

not sure how to think about the "split on words" option. the output from running with this turned on is (obviously) pretty bulky. i'm having a hard time coming up with a hypothetical situation where words-as-top-level would be useful. i'm also quite aware that this is very possibly only a limit of my own imagination. 😂

So, good news! I figured out how to get segments with words, and I can remove this option. It turns out that pywhispercpp's high-level Model class hides some of whisper.cpp's functionality, and it's actually possible to merge tokens into words by detecting word boundaries, which are indicated by the first character of the token being a space. Stay tuned...
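The boundary-based merge described above could look roughly like this (tokens simplified to plain strings for illustration; the real pywhispercpp tokens also carry timestamps, which would be merged alongside the text):

```python
def merge_tokens_into_words(tokens):
    # whisper.cpp marks a word boundary by beginning the token's text
    # with a space, so open a new word whenever we see one.
    words = []
    for tok in tokens:
        if tok.startswith(" ") or not words:
            words.append(tok.lstrip())
        else:
            words[-1] += tok
    return words
```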

@ramen ramen merged commit 8febe95 into main Jan 2, 2025
2 checks passed
@ramen ramen deleted the whisper-cpp branch January 2, 2025 19:26


Development

Successfully merging this pull request may close these issues.

[feature]: GPU acceleration for Apple Silicon

3 participants