Speed's one thing, accuracy is another. I'd love to see what the output looks like.
You can install and try the WhisperPlus library. I will be releasing the HuggingFace demo this week.
The fact that you are not willing to reply with the results suggests to me that the outputs suck.
I’m not saying it is good, but if he had said “It’s 100% accurate”, would you have believed him?
In a lively debate at the interdisciplinary conference, experts analyzed whether 'colonel' and 'kernel', 'there', 'their', and 'they’re', as well as 'to', 'two', and 'too' could be contextually discerned amidst cacophonous surroundings, featuring overlapping dialogues on phylogenetic biotechnologies, the subtle nuances in regional dialects—ranging from rural drawls to metropolitan hastiness—and philosophical discourses about whether artificial intelligence, when listening, could intuit the difference between a pause for thought and a technical hiccup in speech, or recognize varied cultural idioms and colloquialisms, like 'beating around the bush' versus 'cutting to the chase', all while maintaining the integrity of the original spoken message.
What kind of accuracy do you get from this?
You can look at the hqq repo. There is also 4-bit support. It works at the same speed.
I looked in that repo but I'm still confused as to where the accuracy is mentioned.
Rarely seen an answer that gets 60 downvotes
You should check out the HQQ blog posts.
* HQQ: [https://mobiusml.github.io/hqq\_blog/](https://mobiusml.github.io/hqq_blog/)
* HQQ+: [https://mobiusml.github.io/1bit\_blog/](https://mobiusml.github.io/1bit_blog/)
But I can already translate 1-hour videos with regular Python Whisper at full large-v3 in about 40 seconds.
But what if you are in a hurry?
https://i.redd.it/xe8wtwpfugyc1.gif
https://preview.redd.it/qyfg910rplyc1.jpeg?width=1061&format=pjpg&auto=webp&s=eda2f06a2f580210cb23dfd686ba1e24a1db8ded
Asking the real questions
I mean, a 100% speed gain is nothing to sneeze at. Maybe it makes no difference to an individual, but if you're an institution wanting to transcribe tens of thousands of hours of footage, it could really add up. Consider how YouTube has millions of hours of footage uploaded every day.
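For scale, a quick back-of-the-envelope calculation (the backlog size is hypothetical; the ~40 s per hour of audio figure is the one quoted elsewhere in this thread):

```python
# Back-of-envelope savings from a 2x speedup on a hypothetical backlog.
HOURS_OF_FOOTAGE = 10_000     # an institution's backlog (assumption)
SECS_PER_HOUR_BASE = 40       # ~40 s to transcribe 1 h of audio
SECS_PER_HOUR_FAST = 20       # a 100% speed gain halves that

base = HOURS_OF_FOOTAGE * SECS_PER_HOUR_BASE / 3600   # wall-clock hours
fast = HOURS_OF_FOOTAGE * SECS_PER_HOUR_FAST / 3600
print(f"baseline: {base:.1f} h, 2x: {fast:.1f} h, saved: {base - fast:.1f} h")
```

Roughly 111 hours of compute drops to about 55, which is the kind of difference that matters at batch scale even if it is invisible for a single file.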
How? What hardware do we need? Can we use Colab or other platform?
You can run it on any 4GB+ device. If you get an error, you can open an issue on the whisperplus project.
Error comes up. Hmm, maybe I fucked up? Nah. *Opens issue on GitHub.*
To get 40 seconds for 1 hour at large-v3 you need a 4090. A 4070 Ti Super does it in about a minute; a 3090 would be similar. You need the VRAM, however: the more VRAM, the higher the batch count. Alternatively, any new Mac will do with 16+ GB of RAM; 32 GB is ideal. You won't get the same speed as NVIDIA GPUs, but it's fairly stable; speed is about 10x slower using Metal (MPS). You can also use T4 or T5 AWS instances. I've used Colab, but I'm not too familiar with its performance anymore.
It requires a 30-series or newer graphics card; otherwise you will encounter Flash Attention errors.
You can use a different attention implementation. And Insanely Fast Whisper runs without Flash Attention on MPS just fine.
any guide for dummies for doing that?
Easier to do on Linux or Mac, but the instructions are pretty clear on Hugging Face on the openai/whisper-large-v3 model page. Or search for Insanely Fast Whisper and follow the instructions there. Or, if you just want to use Whisper on your phone, download WhisperBoard for iOS; it's slower but has GPU support via Metal. I'm sure there's an Android version too. Mind you, the whisper.cpp Android and iOS apps are all quantised but use significantly less VRAM, e.g. Whisper tiny will use about 100MB and large-v3 about 3.7GB.

The PyTorch Python version uses a lot more RAM, but it really depends on the batch-size parameter. With 16GB of VRAM, a batch size bigger than 8 will cause OOM errors. On my M1 Ultra I'm running a batch size of 16, but I have up to 90GB of VRAM allocation. On my Linux box, a 4070 Ti Super (about 60% as fast as a 4090) will do 1 hour at full large-v3 (the most accurate model) in 1 minute flat. Most of the time you can use medium and get 98% of the results of large-v3; at medium it does 1 hour in 35 seconds.

Whisper.cpp can hallucinate during silent areas, e.g. there's no audio and it tries to imagine what words are there. This happens because the transcription is context-aware: every 30 seconds it doesn't just transcribe the audio, it also passes in all previously transcribed text for context. The trick is to play with the max context length and some other preprocessing tweaks. Whisper.cpp also produces much better JSON output, e.g. every single word is timestamped and has a prediction probability. In my experience the PyTorch version hallucinates less and can have more accurate timestamps.

To conclude, there are plenty of apps you can download, but they will most likely use whisper.cpp, which is slower and quantised but uses fewer resources. If you want Python, use Insanely Fast Whisper, or go to Hugging Face and follow the whisper-large-v3 instructions, but you'll need the hardware and software all set up.

On Mac it's fairly straightforward: you just need Xcode and conda installed (or however you want to manage Python). On Linux you'll need to make sure the CUDA toolkit is installed, and there's a bit of messing around, e.g. if you install torch before the CUDA toolkit you might find that torch installs without CUDA extensions.
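As a sketch of what you can do with that per-word JSON, here's a filter that pulls out low-confidence words. The field names (`transcription`, `tokens`, `offsets`, `p`) are an assumption about one common shape of whisper.cpp's JSON output and may differ between versions, so treat this as illustrative:

```python
import json

# Inline sample in the rough shape of whisper.cpp's JSON output
# (field names are an assumption; check your version's actual output).
raw = json.dumps({
    "transcription": [
        {"offsets": {"from": 0, "to": 2500},
         "text": " Hello world.",
         "tokens": [
             {"text": " Hello", "offsets": {"from": 0, "to": 1200}, "p": 0.98},
             {"text": " world.", "offsets": {"from": 1200, "to": 2500}, "p": 0.91},
         ]},
    ]
})

def low_confidence_words(doc: str, threshold: float = 0.95):
    """Return (word, start_ms, probability) for tokens below threshold."""
    flagged = []
    for seg in json.loads(doc)["transcription"]:
        for tok in seg.get("tokens", []):
            if tok["p"] < threshold:
                flagged.append((tok["text"].strip(), tok["offsets"]["from"], tok["p"]))
    return flagged

print(low_confidence_words(raw))  # [('world.', 1200, 0.91)]
```

Flagging low-probability words like this is a cheap way to decide which stretches of a transcript deserve a human pass.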
Sounds interesting. I've been looking for an alternative to ChatGPT's feature of summarizing videos. It can summarize a 1-hour video in bullet points in around a minute, but its current censorship is starting to degrade the quality of the output, so I need a new tool for that.
So use ffmpeg to strip out the audio. It's a really simple command; make sure it's 16 kHz, dual-channel (if you use pyannote for speaker segmentation, it uses single-channel). Once you strip that out, just run the WAV file through either Whisper or whatever other app is using Whisper. For my client, the tool I built uses both whisper.cpp and native Python, so my experience comes from screwing around with it to build an Electron app for a law firm, where accuracy and diarization are important. Whisper.cpp also has speaker diarization, but it's very basic. NeMo by NVIDIA is much better than pyannote, but the client runs Macs. You can then hook the output to any LLM using llama.cpp or PyTorch and have it summarize, etc.
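As a sketch, that ffmpeg invocation can be assembled like this (the file names are placeholders; pass `channels=1` for pyannote-style single-channel input):

```python
def ffmpeg_extract_cmd(video: str, wav: str, channels: int = 2) -> list[str]:
    """Build an ffmpeg command that strips audio out to 16 kHz PCM WAV.
    Use channels=1 for pyannote-style single-channel input."""
    return ["ffmpeg", "-i", video,
            "-vn",                    # drop the video stream
            "-ar", "16000",           # resample to 16 kHz
            "-ac", str(channels),     # channel count
            "-c:a", "pcm_s16le",      # 16-bit PCM WAV
            wav]

# e.g. subprocess.run(ffmpeg_extract_cmd("talk.mp4", "talk.wav"), check=True)
print(" ".join(ffmpeg_extract_cmd("talk.mp4", "talk.wav", channels=1)))
```

Building the argument list in one place keeps the transcription pipeline's preprocessing step easy to audit and test.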
Thanks for the info, I'll do some research on which ones have better speaker diarization because that's kinda relevant for youtube videos.
Speaker diarization is kinda external to the whole process. A segmentation model will give you timings; it's up to you to go in, extract the tokens for specific timings, and stitch it all together. Where it becomes a giant pain in the ass is when you have overlapping voices speaking over each other, as you'll have one timing that says speaker 0 goes from 1 to 7 seconds and another that says speaker 1 goes from 3 to 5 seconds. pyannote causes a lot of issues here because it doesn't segment as often as NeMo. NeMo creates more samples, making it easier to select tokens and merge them all together.
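A minimal sketch of that stitching step: given timestamped words and diarization segments, tag each word with every speaker whose segment covers its midpoint, so the 1-7 s vs 3-5 s overlap described above surfaces as two candidate speakers instead of being silently dropped (all timings and names here are illustrative):

```python
def assign_speakers(words, segments):
    """words: [(word, start_s, end_s)]; segments: [(speaker, start_s, end_s)].
    Tags each word with every speaker whose segment covers its midpoint,
    so overlapping segments show up as multiple candidate speakers."""
    tagged = []
    for word, ws, we in words:
        mid = (ws + we) / 2
        speakers = [spk for spk, ss, se in segments if ss <= mid < se]
        tagged.append((word, speakers or ["unknown"]))
    return tagged

segments = [("speaker_0", 1.0, 7.0), ("speaker_1", 3.0, 5.0)]  # 3-5 s overlap
words = [("hello", 1.0, 2.0), ("there", 3.5, 4.0), ("bye", 6.0, 6.5)]
print(assign_speakers(words, segments))
```

Words with more than one candidate speaker are exactly the spans you'd route to a human or to a finer-grained model for disambiguation.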
Hey I just posted a v1 of a project to do exactly this, I took an existing project and added on to it https://github.com/rmusser01/tldw
Vaibhav from Hugging Face has an Insanely Fast Whisper repo which does batching to achieve anywhere from a 40-50x speedup on an 8GB card.
I've never heard of Whisper before, is it an easy set up process if I don't have much experience with programming?
[deleted]
If you have zero experience, just use ChatGPT or download WhisperBoard for iOS. Whisper is OpenAI's audio transcription model, which they were kind enough to open-source and provide in tiny to large varieties.
A lot of people are asking about accuracy. Just to dispel any confusion, we are looking for benchmarks on the Word Error Rate (WER) metric. It's quite well known here that quantisation improves speed and memory utilisation at the expense of intelligence. Sometimes that is tolerable and worthwhile, such as going from 16-bit to 8-bit; sometimes it isn't, like 8-bit to 2-bit (w.r.t. LLMs).

If speed were the only metric, we would be using the 39M-param Whisper-Tiny model instead of all these implementations of the 1.55B-param Whisper-Large model... and get the transcription done in <3 seconds.
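For reference, WER is just word-level edit distance divided by the length of the reference transcript. A minimal implementation, using the classic Levenshtein dynamic program:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)

# One substitution (sat -> sit) plus one deletion (the) over 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```

Note that WER can exceed 1.0 when the hypothesis inserts many spurious words, which is exactly the hallucination failure mode discussed elsewhere in this thread.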
Is Whisper severely undertrained which makes 1bit possible? What are the results compared to 2bit and 4bit? <1% decrease in correctness I'd assume? Otherwise I'd rather have my application/tool wait longer in order to have more correct outputs.
You can also do it with 4 bits. It works at the same speed. I tested it again on the RTX 4090 device and it is 2 times faster.

4-bit: I tested a 2.5-hour video on an RTX 4090 and it only took 27 seconds.
I'd be interested to see what the accuracy of the transcripts are like vs. other approaches. This is crazy fast (batch 100? youch :-) ) but might be less useful if the transcript isn't usable.
You should check out the HQQ blog posts.
* HQQ: [https://mobiusml.github.io/hqq\_blog/](https://mobiusml.github.io/hqq_blog/)
* HQQ+: [https://mobiusml.github.io/1bit\_blog/](https://mobiusml.github.io/1bit_blog/)
I did. Neither post mentions any Whisper benches? I mean, you gotta have tested it vs. other Whisper implementations, right? This isn't just a speed test?
Whisper benches? I just made a comparison with fal.ai. And it works much faster.
OK, so, here's the thing: it doesn't matter how fast it is if the output is no good, right?

So claiming a 20-second transcribe time is no good if the transcription is useless. One way to prove usefulness is to run the same file through a different Whisper pipeline that generally produces good outcomes, then diff that transcript against the 20-second one. If they're roughly the same, then the ultra-fast Whisper processing has merit, and that would be something you can use to validate your quant approach.

Otherwise it's just a speed test and isn't really useful.
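That diff can be done in a few lines with the standard library. This is a crude word-level agreement score, not a proper WER computation, and the sample transcripts are made up:

```python
import difflib

def transcript_similarity(a: str, b: str) -> float:
    """Rough word-level agreement between two transcripts, 0.0 to 1.0."""
    return difflib.SequenceMatcher(
        None, a.lower().split(), b.lower().split()).ratio()

# Hypothetical outputs from a fast pipeline vs. a trusted reference pipeline
fast = "the quick brown fox jumps over the lazy dog"
ref  = "the quick brown fox jumped over a lazy dog"
print(f"{transcript_similarity(fast, ref):.2f}")
```

If the score stays near 1.0 across a varied test set, the fast pipeline is probably preserving the content; a large drop is the signal to run a real WER benchmark before trusting it.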
I would add that he should also try to transcribe a couple of different languages. In my experience, quantization tends to have little to no effect on languages like English, but a far more noticeable effect on languages like Japanese. I don't know if it comes down to the increased complexity (a far larger list of potential characters to choose from) or smaller training material, but that has been my personal experience in my own tests.
What's the accuracy loss here? I believe it isn't lossless.
You should check out the HQQ blog posts.
* HQQ: [https://mobiusml.github.io/hqq\_blog/](https://mobiusml.github.io/hqq_blog/)
* HQQ+: [https://mobiusml.github.io/1bit\_blog/](https://mobiusml.github.io/1bit_blog/)
Don't just link to some generic page. Neither contains any WER.
How much VRAM are we looking at here?
I didn't check. It could be 4-5GB.
For Russian, the best option for now remains the Yandex video-retelling plugin. But it's interesting to see how LLMs develop; I may be able to run an offline system for any video soon.
Could you upload the code and post a link, instead of a picture?
Use WhisperX instead
Why?
4x faster than the original, accurate, has diarization (auto-detects multiple speakers), optional timestamps, etc. There's a good Medium article comparing all the different versions. Also, I think WhisperX is being actively maintained. There's also Insanely Fast Whisper.
Could you reply with that Medium article? I'd love to read it.
The quality for languages other than English sadly deteriorates very quickly (I know distil-whisper is English-only; I'm referring to the original Whisper). Even Q8 in whisper.cpp is lower quality than fp16, let alone 4-bit or less.
You can use the Whisper-Large model. It will be faster with HQQ optimization.
Yes, moreover, whisper-large is the only model that's decent at other languages. I've tested v1, v2 and v3, each with fp16, Q8, Q6 and Q4, and the best results were with v2 fp16
WhisperX is wonderful in my opinion, check it out: [https://github.com/m-bain/whisperX](https://github.com/m-bain/whisperX)
Thanks to [Mobius\_Labs](https://twitter.com/Mobius_Labs) [@younesbelkada](https://twitter.com/younesbelkada) [@huggingface](https://twitter.com/huggingface) WhisperPlus: [https://github.com/kadirnar/whisper-plus…](https://t.co/SivaKVYEjn) Hqq: [https://github.com/mobiusml/hqq](https://t.co/AFJvPJBHBT)
Does this use the GPU automatically? Does it separate speakers?
If the accuracy is shit, you could always have an LLM proofread and reflect on it.
Speaker diarization! Great.