webgpu: use naive reduction#28174

Open
xhcao wants to merge 2 commits into microsoft:main from xhcao:use-naive-reduction

Conversation


@xhcao xhcao commented Apr 22, 2026

Use the naive reduction when the output size of a ReduceMean is far greater than its reduce size. The shared reduction method may need to transpose the input, which is expensive.


xhcao commented Apr 22, 2026

When I run florence-2-base-vision-encoder-fp16-0.5-fp32 with input [batch_size:1, height:768, width:768], 10 ReduceMean nodes using the shared reduction method take a significant amount of time:
ReduceMean|Transpose 6.51 "inputs[0] = {1,128,36864} outputs[0] = {1,36864,128} "
ReduceMean 0.30 "inputs[0] = {1,36864,128} outputs[0] = {1,1,36864} "

Transposing the input is expensive. If we use the naive method instead:
ReduceMean 0.11 "inputs[0] = {1,128,36864} outputs[0] = {1,1,36864} "

The model's total time drops from ~340 ms to ~270 ms.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Apr 22, 2026
@guschmue guschmue requested a review from Copilot April 23, 2026 16:13

Copilot AI left a comment


Pull request overview

Updates the WebGPU reduction kernel dispatch heuristic to prefer the naive reduction implementation for certain ReduceMean shapes, aiming to avoid transpose overhead that can dominate runtime when the reduced dimension is small but the output is very large.

Changes:

  • Adjusts the use_naive_reduction selection logic in ReduceKernel::ComputeInternal.
  • Adds a ReduceMean-specific threshold to route some cases to ReduceNaiveProgram.


Comment on lines +375 to +376
bool use_naive_reduction = name_ == "ArgMin" || name_ == "ArgMax" || (reduce_size < 32 && output_size > 1024) ||
(name_ == "ReduceMean" && reduce_size <= 128 && output_size > 20000) ||

Copilot AI Apr 23, 2026


The new ReduceMean heuristic forces the naive path purely based on (reduce_size/output_size) thresholds, but it will also trigger when the reduce axes are already innermost (so the shared path would not need the expensive transpose mentioned in the PR description). To avoid potential performance regressions in those cases, consider computing are_axes_innermost before this decision and gating this ReduceMean special-case on !are_axes_innermost (or otherwise documenting why naive is preferable even without transpose). Also consider hoisting 128/20000 into named constants with a brief rationale so the tuning knobs are easier to maintain.

Suggested change
bool use_naive_reduction = name_ == "ArgMin" || name_ == "ArgMax" || (reduce_size < 32 && output_size > 1024) ||
(name_ == "ReduceMean" && reduce_size <= 128 && output_size > 20000) ||
// Heuristic tuning knobs for preferring the naive ReduceMean path when the shared path
// would otherwise pay extra overhead (for example, transposing non-innermost reduce axes).
constexpr size_t kReduceMeanNaiveMaxReduceSize = 128;
constexpr size_t kReduceMeanNaiveMinOutputSize = 20000;
// The reduce axes are innermost only if no reduce axis lies outside a kept axis.
bool reduce_axes_are_innermost = true;
bool seen_non_reduce_axis = false;
for (size_t i = input_tensor->Shape().NumDimensions(); i > 0; --i) {
  if (reduce_axes[i - 1] == 1) {
    if (seen_non_reduce_axis) {
      reduce_axes_are_innermost = false;
      break;
    }
  } else {
    seen_non_reduce_axis = true;
  }
}
bool use_naive_reduction = name_ == "ArgMin" || name_ == "ArgMax" || (reduce_size < 32 && output_size > 1024) ||
                           (name_ == "ReduceMean" && !reduce_axes_are_innermost &&
                            reduce_size <= kReduceMeanNaiveMaxReduceSize &&
                            output_size > kReduceMeanNaiveMinOutputSize) ||

Contributor Author


From my data, even ignoring the transpose operation, the shared ReduceMean kernel is still slower than the naive kernel.
However, I cannot guarantee that other platforms show the same result when the reduce axes are innermost, so I addressed the Copilot comments but did not change the code path for the case where the reduce axes are innermost.


xhcao commented Apr 24, 2026

@jchen10 PTAL

