feat!: major performance & accuracy improvements in speech-to-text module#1132

Open
IgorSwat wants to merge 6 commits into main from @is/speech-to-text-ultimate

Conversation

@IgorSwat (Contributor) commented May 8, 2026

Description

This PR introduces several changes to the speech-to-text module based on Whisper models:

  • CoreML integration - models re-exported to the CoreML backend, bringing a significant performance upgrade on iOS devices.
  • New streaming algorithm - eliminates duplicates in the streaming output, resulting in a major quality improvement in the live streaming mode.
  • Changes in demo apps - removed the faulty 'voice mode' screen from the LLM demo app; refactored the speech-to-text screen in the 'speech' app by adding the new CoreML models to the selection bar and changing the default model for iOS devices.
  • Minor code improvements in the speech-to-text module.
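The duplicate-elimination idea behind the new streaming mode can be illustrated with a minimal sketch. Note this is purely illustrative and not the PR's actual algorithm: the `Word` struct, the `commitAgreedPrefix` function, and the local-agreement commit policy (commit only the prefix on which consecutive hypotheses agree) are all assumptions made for the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical word type; the real module presumably keeps richer metadata.
struct Word {
  std::string text;
};

// Commit the longest common prefix of the previous and current hypotheses;
// everything past it stays non-committed and may still change on the next tick.
std::pair<std::vector<Word>, std::vector<Word>>
commitAgreedPrefix(const std::vector<Word> &prev,
                   const std::vector<Word> &curr) {
  size_t i = 0;
  while (i < prev.size() && i < curr.size() && prev[i].text == curr[i].text)
    ++i;
  return {std::vector<Word>(curr.begin(), curr.begin() + i),
          std::vector<Word>(curr.begin() + i, curr.end())};
}
```

Under such a policy a word is emitted to the committed output exactly once, which is one common way to keep repeated transcription passes from producing duplicates.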

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Run the demo app to test the live streaming mode.

Screenshots

Related issues

#1124

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

I am still trying to figure out a way to export Whisper efficiently to Vulkan backend after some initial failures, to cover Android devices as well.

@IgorSwat IgorSwat requested review from benITo47, chmjkb and msluszniak May 8, 2026 08:26
@IgorSwat IgorSwat added the labels model (Issues related to exporting, improving, fixing ML models) and improvement (PRs or issues focused on improvements in the current codebase) May 8, 2026
WHISPER_SMALL_EN,
TranscriptionResult,
SpeechToTextProps,
WHISPER_SMALL_EN_COREML,
Member:

Why is this added after TranscriptionResult and SpeechToTextProps? ;p

Comment thread: apps/speech/package.json
"react": "19.2.5",
"react-native": "0.83.4",
"react-native-audio-api": "0.12.0",
"react-native-audio-api": "0.11.5",
Member:

hey, why is that? We virtually never want to downgrade packages in demo apps.

Contributor (Author):

audio-api 0.12.0 causes build failures on iOS, and I think it's the same issue @benITo47 had when testing the 1.2.0 binaries some time ago.

Member:

Could you please say when you get these failures? I don't have any on the iOS simulator.

Contributor (Author):

[image]

Tested on physical device.

Member @msluszniak, May 8, 2026:

@mdydek could you look at this one? Maybe you have an intuition about this error? @IgorSwat, do you have iOS 26.2 on your physical device?

Contributor (Author):

Yeah, I even have 26.4.

namespace rnexecutorch::models::speech_to_text {

/**
* Basically a different representation of token,
Member:

Suggested change
- * Basically a different representation of token,
+ * Different representation of token,

Comment on lines +273 to +276
for (size_t i = 1; i < sequenceIds.size(); ++i) {
std::span<uint64_t> single(sequenceIds.data() + i, 1);
logitsTensor = this->decode(single, encoderFeatures, startPos);
++startPos;
Member:

Suggested change
- for (size_t i = 1; i < sequenceIds.size(); ++i) {
-   std::span<uint64_t> single(sequenceIds.data() + i, 1);
-   logitsTensor = this->decode(single, encoderFeatures, startPos);
-   ++startPos;
+ for (size_t i = 1; i < sequenceIds.size(); ++i, ++startPos) {
+   std::span<uint64_t> single(sequenceIds.data() + i, 1);
+   logitsTensor = this->decode(single, encoderFeatures, startPos);


return {.committed = move_to_vector(committed),
.nonCommitted = move_to_vector(nonCommitted)};
// Return the results
Member:

Suggested change
- // Return the results

// Because of step 1, we know that if the last EOS exist in eos_,
// then it must be the last entry.
if (eos_.empty() || eos_.back().position != lastEosIndex) {
// Register last EOS entry
Member:

Suggested change
- // Register last EOS entry

std::vector<Segment> transcriptions = asr_->transcribe(input, options);

// Flatten segments into a single word sequence.
// This is basically our 'nonCommitted' part for now.
Member:

Suggested change
- // This is basically our 'nonCommitted' part for now.
+ // This is our 'nonCommitted' part for now.

return std::vector<Word>(std::make_move_iterator(container.begin()),
std::make_move_iterator(container.end()));
OnlineASR::OnlineASR(const ASR *asr) : asr_(asr) {
// Reserve an expected amount of memory for audio buffer.
Member:

Suggested change
- // Reserve an expected amount of memory for audio buffer.


// Last-tick committed delta + whatever never made it past the commit
// threshold.
std::vector<Word> residual = std::move(result.committed);
Member:

Suggested change
- std::vector<Word> residual = std::move(result.committed);
+ std::vector<Word> residual{std::move(result.committed)};
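Worth noting: both spellings in this suggestion compile to the same thing. A braced initializer containing a `std::vector<Word>` cannot form an `initializer_list<Word>`, so list-initialization falls back to the move constructor, just like copy-initialization with `std::move` does. The sketch below demonstrates this with a hypothetical `Word` stand-in:

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for the module's Word type.
struct Word {
  int id;
};

// Both functions select std::vector's move constructor; the difference
// between `= std::move(...)` and `{std::move(...)}` is purely stylistic here.
std::vector<Word> takeParen(std::vector<Word> &&src) {
  std::vector<Word> residual = std::move(src);
  return residual;
}

std::vector<Word> takeBrace(std::vector<Word> &&src) {
  std::vector<Word> residual{std::move(src)};
  return residual;
}
```

Brace-initialization would behave differently only if `Word` were constructible from a `std::vector<Word>`, in which case the `initializer_list` constructor would win; that is the usual caution with braced vector initialization.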

Comment on lines 1317 to 1325
@@ -1325,14 +1338,17 @@
STYLE_TRANSFER_UDNIE,
Member:

Ok, so from 0.9 we will effectively drop support for the original models from our URL (neither xnnpack nor coreml), right?

Contributor (Author):

I don't get it - the original models are the XNNPACK ones, so they will still be available.

Member:

WHISPER_TINY_EN_QUANTIZED is quantized xnnpack; WHISPER_TINY_EN is, I guess, full-precision xnnpack. Since there is no WHISPER_TINY_EN_QUANTIZED we dropped something; what exactly?

Contributor (Author):

Well, I just think the quantized models are pointless: they weigh only a little less than the standard float32 models, they do not bring any significant inference speed-up over the baseline, and no one really downloads them on HF. I believe their existence just introduces unnecessary noise to the module.

Member @msluszniak, May 8, 2026:

I see, I'm ok with removing some of those; now the only question is what we should remove, quantized or non-quantized. If they are just a bit smaller and just a bit faster, they are still better than the original ones, aren't they?

Contributor (Author):

Well, the float32 baseline models are well tested and surely at least as accurate as the quantized ones (and probably more accurate). If the performance difference is minimal (or frankly nonexistent), then I don't like the idea of risking accuracy drops for some types of inputs.

Member:

Sure thing, that explanation is absolutely fine for me, I mostly asked because I wanted to be on the same page :))

@IgorSwat IgorSwat changed the title from "feat: major performance & accuracy improvements in speech-to-text module" to "feat!: major performance & accuracy improvements in speech-to-text module" May 8, 2026
@msluszniak (Member):

Also, if this PR adds a breaking change, please describe it directly below the "Introduces a breaking change?" section in the PR body.
