feat!: major performance & accuracy improvements in speech-to-text module#1132
Conversation
…ware-mansion/react-native-executorch into @is/speech-to-text-ultimate
```diff
   WHISPER_SMALL_EN,
   TranscriptionResult,
   SpeechToTextProps,
+  WHISPER_SMALL_EN_COREML,
```
Why is this added after TranscriptionResult and SpeechToTextProps? ;p
| "react": "19.2.5", | ||
| "react-native": "0.83.4", | ||
| "react-native-audio-api": "0.12.0", | ||
| "react-native-audio-api": "0.11.5", |
There was a problem hiding this comment.
hey, why is that? We virtually never want to downgrade packages in demo apps.
audio-api 0.12.0 causes build failures on iOS, and I think it's the same issue @benITo47 hit when testing the 1.2.0 binaries some time ago.
Could you please say where you're seeing these failures? I don't get any on the iOS simulator.
Yeah, I even have 26.4.
```cpp
namespace rnexecutorch::models::speech_to_text {

/**
 * Basically a different representation of token,
```

Suggested change:

```diff
- * Basically a different representation of token,
+ * Different representation of token,
```
```cpp
for (size_t i = 1; i < sequenceIds.size(); ++i) {
  std::span<uint64_t> single(sequenceIds.data() + i, 1);
  logitsTensor = this->decode(single, encoderFeatures, startPos);
  ++startPos;
```

Suggested change:

```diff
-for (size_t i = 1; i < sequenceIds.size(); ++i) {
+for (size_t i = 1; i < sequenceIds.size(); ++i, ++startPos) {
   std::span<uint64_t> single(sequenceIds.data() + i, 1);
   logitsTensor = this->decode(single, encoderFeatures, startPos);
-  ++startPos;
```
```cpp
return {.committed = move_to_vector(committed),
        .nonCommitted = move_to_vector(nonCommitted)};
// Return the results
```

Suggested change:

```diff
-// Return the results
```
```cpp
// Because of step 1, we know that if the last EOS exist in eos_,
// then it must be the last entry.
if (eos_.empty() || eos_.back().position != lastEosIndex) {
  // Register last EOS entry
```

Suggested change:

```diff
-  // Register last EOS entry
```
```cpp
std::vector<Segment> transcriptions = asr_->transcribe(input, options);

// Flatten segments into a single word sequence.
// This is basically our 'nonCommitted' part for now.
```

Suggested change:

```diff
-// This is basically our 'nonCommitted' part for now.
+// This is our 'nonCommitted' part for now.
```
```cpp
return std::vector<Word>(std::make_move_iterator(container.begin()),
                         std::make_move_iterator(container.end()));
```
```cpp
OnlineASR::OnlineASR(const ASR *asr) : asr_(asr) {
  // Reserve an expected amount of memory for audio buffer.
```

Suggested change:

```diff
-  // Reserve an expected amount of memory for audio buffer.
```
```cpp
// Last-tick committed delta + whatever never made it past the commit
// threshold.
std::vector<Word> residual = std::move(result.committed);
```

Suggested change:

```diff
-std::vector<Word> residual = std::move(result.committed);
+std::vector<Word> residual{std::move(result.committed)};
```
```diff
@@ -1325,14 +1338,17 @@
   STYLE_TRANSFER_UDNIE,
```
Ok, so from 0.9 we will effectively drop support for the original models at our URLs (neither xnnpack nor coreml), right?
I don't get it - the original models are XNNPACK ones, so they will still be available.
WHISPER_TINY_EN_QUANTIZED is quantized xnnpack, and WHISPER_TINY_EN is, I guess, full-precision xnnpack. Since WHISPER_TINY_EN_QUANTIZED is no longer there, we dropped something; what exactly?
Well, I just think the quantized models are pointless: they weigh only a little less than the standard float32 models, they do not bring any significant inference speed-up over the baseline, and no one really downloads them on HF. I believe their existence just introduces unnecessary noise to the module.
I see, I'm ok with removing some of those. Now the only question is what we should remove: quantized or non-quantized. If they are just a bit smaller and just a bit faster, they are still better than the original ones, aren't they?
Well, the float32 baseline models are well tested and surely at least as accurate as the quantized ones (and probably more accurate). If the performance difference is minimal (or frankly non-existent), then I don't like the idea of risking accuracy drops for some types of inputs.
Sure thing, that explanation is absolutely fine for me. I mostly asked because I wanted to make sure we were on the same page :))
Also, if this PR adds a breaking change, please describe it directly below.

Description
This PR introduces several changes to the speech-to-text module based on Whisper models:
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Run demo app to test the live streaming mode.
Screenshots
Related issues
#1124
Checklist
Additional notes
I am still trying to figure out a way to export Whisper efficiently to the Vulkan backend after some initial failures, so that Android devices are covered as well.