Add whisper.cpp backend, enable GPU support on Apple Silicon #132
This reverts commit 8695c6f.
mikeylove
left a comment
this is awesome! after a bit of initial confusion, my test file now transcribes in less than a second (on my M4 Pro Mini). one small diff i noticed between small.en and small is that the former had a trailing segment that said [BLANK AUDIO].
re: that initial confusion, the "loading model" phase was being reported on the reaspeech side as "transcribing." before i realized this, i was puzzled why it seemed to be 1) reporting very small but regular progress updates, 2) taking forever and 3) being nearly instant on subsequent transcriptions.
not sure how to think about the "split on words" option. the output from running with this turned on is (obviously) pretty bulky. i'm having a hard time coming up with a hypothetical situation where words-as-top-level would be useful. i'm also quite aware that this is very possibly only a limit of my own imagination. 😂
my individual comments here are mostly to identify issues/updates to pursue outside of this pr.
app/whisper_cpp/core.py (outdated):

```python
if not segment.text:
    continue
```
curious about what the purpose of a segment without text is 🤔
This was due to the first segment being a "beginning of text" sort of token with no actual content. I think that I have a better way to solve this now.
```python
if output == "json":
    json.dump(result, file)
else:
    return 'Please select an output method!'
```
in the other engines we offer a bunch of output formats...but do we ever use anything but json in those either? the whole backend interface is json-based. maybe this was useful using the web interface?
We don't - this design came from whisper-asr-webservice, and it supported both the web interface and the API, which included export functionality. We don't use this part of the API, since we build our exports in Lua. We could copy over (or make reusable) the export code from the faster-whisper engine, but since we don't use it, that seemed like a waste of effort. It is a bit of a funky design at the moment, though.
```markdown
- `ASR_ENGINE`: The ASR engine to use. Options are `faster_whisper` (default),
  `openai_whisper`, and `whisper_cpp`.
```
not in the scope of this pr but wondering if we should provide some basic information (and link to project) about these engines. my selection process was "works on my old macbook outside of docker" and that was always openai_whisper because i could never resolve the library conflicts causing faster_whisper to crash.
hard to say what the right way to describe this is to someone interested in development lol
Yeah, I agree this should be documented somewhere. I think picking the right engine is not a problem that most users should have to solve, but in this particular case (for Mac users), it's the difference between GPU acceleration and not. That seems important enough that users should understand how and why to do it.
reacts with "nod" emoji
```sh
# Start all services
poetry run python3.10 app/run.py

# Start all services except for Redis
poetry run python3.10 app/run.py --no-start-redis
```
should this section call out why one might want to run the --no-start-redis version or is it obvious enough? feels like maybe the user who needs the option would understand the difference but another less developer-minded might not. 🤔🤔🤔🤔
Yes, good point. I'll update it. The motivation was that in some cases, Redis is installed using an OS package (Debian package, Homebrew, etc.), and the service is started and managed by the OS infrastructure. I'd really like to make Redis optional - feature #95
I have noticed more of this type of thing as well - I've also seen things like "[Laughs]" which I don't recall seeing with any other engine. In some cases, these are special tokens, and I have a way to filter those out, which I'll incorporate into this PR.
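For illustration, one simple way to detect whole-segment annotations like `[BLANK AUDIO]` or `[Laughs]` is a regex over the segment text. This is only a sketch of the idea (the function name and regex are hypothetical, not the actual filtering used in the PR, which works on special tokens):

```python
import re

# Matches segments that consist solely of a bracketed annotation,
# e.g. "[BLANK AUDIO]" or " [Laughs] " (hypothetical helper for illustration).
ANNOTATION_RE = re.compile(r"^\s*\[[^\]]+\]\s*$")

def is_annotation(text):
    """Return True if the segment text is only a bracketed annotation."""
    return bool(ANNOTATION_RE.match(text))
```

Filtering on special tokens, as described above, is more precise than text matching, since it can't accidentally drop real speech that happens to be bracketed.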
Yes, I think the way that this library (pywhispercpp) reports progress is odd, and basically reports the model loading progress but not the transcription. At least there's some visual feedback about the work it's doing.
So, good news! I figured out how to get segments with words, and I can remove this option. It turns out that pywhispercpp's high-level Model class hides some of whisper.cpp's functionality, and it's actually possible to merge tokens into words by detecting word boundaries, which are indicated by the first character of the token being a space. Stay tuned...
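The boundary-detection idea above can be sketched as follows. This is a simplified illustration assuming tokens are plain strings (the real code works with pywhispercpp/whisper.cpp token objects, and the function name is hypothetical):

```python
def merge_tokens_into_words(tokens):
    """Merge subword tokens into words.

    A token whose first character is a space starts a new word;
    any other token is appended to the current word.
    """
    words = []
    for token in tokens:
        if token.startswith(" ") or not words:
            # Word boundary: begin a new word, dropping the leading space.
            words.append(token.lstrip())
        else:
            # Continuation of the current word.
            words[-1] += token
    return words

print(merge_tokens_into_words([" Hel", "lo", " wor", "ld", "!"]))
# → ['Hello', 'world!']
```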

This change adds a new ASR engine that uses whisper.cpp via the pywhispercpp library. This enables GPU-accelerated transcription on Apple Silicon. Fixes #125
To use whisper.cpp, set the ASR_ENGINE=whisper_cpp environment variable when starting the service or Docker container. Note that GPU acceleration is only available outside of Docker, since it requires access to Apple's libraries.
Example:

```sh
ASR_ENGINE=whisper_cpp poetry run python3.10 app/run.py --build-reascripts
```

There are a few differences with the whisper.cpp engine:
This change includes an improvement to run.py that prevents Ctrl-C from being sent to subprocesses and enables a graceful shutdown when running ReaSpeech outside of Docker.
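A common way to achieve this behavior (a sketch of the general technique, not necessarily the exact `run.py` change; `sleep 60` stands in for a real service like Redis) is to launch each child in its own process group so the terminal's Ctrl-C is delivered only to the parent, which then shuts children down deliberately:

```python
import signal
import subprocess

# start_new_session=True puts the child in a new session/process group,
# so the terminal's SIGINT (Ctrl-C) no longer reaches it directly.
# "sleep 60" is a placeholder child process for illustration.
proc = subprocess.Popen(["sleep", "60"], start_new_session=True)

def shutdown(signum, frame):
    # The parent catches Ctrl-C and terminates children gracefully.
    proc.terminate()
    proc.wait()
    raise SystemExit(0)

signal.signal(signal.SIGINT, shutdown)
```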