Conversation
Use naive reduction when the output size of ReduceMean is far greater than the reduce size. The shared reduction method may need to transpose the input, which is costly.
When I run florence-2-base-vision-encoder-fp16-0.5-fp32 with input [batch_size:1, height:768, width:768], 10 ReduceMean nodes using the shared reduction method take a long time; transposing the input is expensive. With the naive method, the model's total time drops from ~340ms to ~270ms.
Pull request overview
Updates the WebGPU reduction kernel dispatch heuristic to prefer the naive reduction implementation for certain ReduceMean shapes, aiming to avoid transpose overhead that can dominate runtime when the reduced dimension is small but the output is very large.
Changes:
- Adjusts the `use_naive_reduction` selection logic in `ReduceKernel::ComputeInternal`.
- Adds a `ReduceMean`-specific threshold to route some cases to `ReduceNaiveProgram`.
```cpp
bool use_naive_reduction = name_ == "ArgMin" || name_ == "ArgMax" || (reduce_size < 32 && output_size > 1024) ||
                           (name_ == "ReduceMean" && reduce_size <= 128 && output_size > 20000) ||
```
The new ReduceMean heuristic forces the naive path purely based on (reduce_size/output_size) thresholds, but it will also trigger when the reduce axes are already innermost (so the shared path would not need the expensive transpose mentioned in the PR description). To avoid potential performance regressions in those cases, consider computing are_axes_innermost before this decision and gating this ReduceMean special-case on !are_axes_innermost (or otherwise documenting why naive is preferable even without transpose). Also consider hoisting 128/20000 into named constants with a brief rationale so the tuning knobs are easier to maintain.
Suggested change:

```cpp
// Heuristic tuning knobs for preferring the naive ReduceMean path when the shared path
// would otherwise pay extra overhead (for example, transposing non-innermost reduce axes).
constexpr size_t kReduceMeanNaiveMaxReduceSize = 128;
constexpr size_t kReduceMeanNaiveMinOutputSize = 20000;
bool reduce_axes_are_innermost = true;
bool seen_reduce_axis = false;
for (size_t i = input_tensor->Shape().NumDimensions(); i > 0; --i) {
  if (reduce_axes[i - 1] == 1) {
    seen_reduce_axis = true;
  } else if (seen_reduce_axis) {
    reduce_axes_are_innermost = false;
    break;
  }
}
bool use_naive_reduction = name_ == "ArgMin" || name_ == "ArgMax" || (reduce_size < 32 && output_size > 1024) ||
                           (name_ == "ReduceMean" && !reduce_axes_are_innermost &&
                            reduce_size <= kReduceMeanNaiveMaxReduceSize &&
                            output_size > kReduceMeanNaiveMinOutputSize) ||
```
From my data, even ignoring the transpose operation, the shared ReduceMean kernel is still slower than the naive kernel.
But I cannot be sure that other platforms show the same results when the reduce axes are innermost, so I addressed the Copilot comments but did not change the code path for the innermost case.
@jchen10 PTAL