Conversation
Use naive reduction when the output size of ReduceMean is far greater than the reduce size. The shared reduction method may need to transpose the input, which is costly.
When I run florence-2-base-vision-encoder-fp16-0.5-fp32 with input [batch_size:1, height:768, width:768], 10 ReduceMean nodes using the shared reduction method take a long time; transposing the input is expensive. With the naive method, the model's total time drops from ~340ms to ~270ms.
Pull request overview
Updates the WebGPU reduction kernel dispatch heuristic to prefer the naive reduction implementation for certain ReduceMean shapes, aiming to avoid transpose overhead that can dominate runtime when the reduced dimension is small but the output is very large.
Changes:
- Adjusts the `use_naive_reduction` selection logic in `ReduceKernel::ComputeInternal`.
- Adds a `ReduceMean`-specific threshold to route some cases to `ReduceNaiveProgram`.
```cpp
bool use_naive_reduction = name_ == "ArgMin" || name_ == "ArgMax" || (reduce_size < 32 && output_size > 1024) ||
                           (name_ == "ReduceMean" && reduce_size <= 128 && output_size > 20000) ||
```
The new ReduceMean heuristic forces the naive path purely based on (reduce_size/output_size) thresholds, but it will also trigger when the reduce axes are already innermost (so the shared path would not need the expensive transpose mentioned in the PR description). To avoid potential performance regressions in those cases, consider computing are_axes_innermost before this decision and gating this ReduceMean special-case on !are_axes_innermost (or otherwise documenting why naive is preferable even without transpose). Also consider hoisting 128/20000 into named constants with a brief rationale so the tuning knobs are easier to maintain.
Suggested change:

```cpp
// Heuristic tuning knobs for preferring the naive ReduceMean path when the shared path
// would otherwise pay extra overhead (for example, transposing non-innermost reduce axes).
constexpr size_t kReduceMeanNaiveMaxReduceSize = 128;
constexpr size_t kReduceMeanNaiveMinOutputSize = 20000;
bool reduce_axes_are_innermost = true;
bool seen_reduce_axis = false;
for (size_t i = input_tensor->Shape().NumDimensions(); i > 0; --i) {
  if (reduce_axes[i - 1] == 1) {
    seen_reduce_axis = true;
  } else if (seen_reduce_axis) {
    reduce_axes_are_innermost = false;
    break;
  }
}
bool use_naive_reduction = name_ == "ArgMin" || name_ == "ArgMax" || (reduce_size < 32 && output_size > 1024) ||
                           (name_ == "ReduceMean" && !reduce_axes_are_innermost &&
                            reduce_size <= kReduceMeanNaiveMaxReduceSize &&
                            output_size > kReduceMeanNaiveMinOutputSize) ||
```
From my data, even ignoring the transpose operation, the shared ReduceMean kernel is still slower than the naive kernel.
But I cannot be sure that other platforms show the same results when the reduce axes are innermost, so I addressed the Copilot comments but did not change the code path for the innermost case.
@jchen10 PTAL