Skip to content

Commit 3f74b3c

Browse files
sushraja-msftgithub-actions[bot]yuslepukhin
authored
Update worker thread pool to use time based wait. (#27916)
# Make thread pool spin duration configurable via session option ## Problem The ORT Eigen thread pool's `SpinPause` loop uses a fixed iteration count (`1 << 20` = ~1M iterations) before blocking. The actual wall-clock spin duration varies dramatically by CPU architecture: | Pause Instruction | Architecture | Spin Duration (1M iterations) | |---|---|---| | `_mm_pause` | Pre-Skylake | ~3ms | | `_mm_pause` | Skylake+ @ 3 GHz | ~47ms | | `_tpause` | 3 GHz base | ~333ms | | `_tpause` | 2 GHz base | ~500ms | For client/on-device workloads (e.g., Whisper in Edge), this causes high CPU utilization visible in profilers and Task Manager, even though the CPU is in a low-power spin state. So 1M iterations at 3 GHz: - **Pre-Skylake:** 1M × 10 / 3G ≈ **3.3ms** - **Skylake @ 3 GHz:** 1M × 140 / 3G ≈ **47ms** - **Skylake @ 5 GHz (turbo):** 1M × 140 / 5G ≈ **28ms** - **AMD Zen @ 4 GHz:** 1M × 65 / 4G ≈ **16ms** The total duration scaled inversely with clock speed and varied dramatically across microarchitectures. The Skylake 14x increase was specifically because Intel found that the short pause was causing too much power waste and memory bus contention in spin loops. ### `_tpause` `_tpause(0x0, __rdtsc() + 1000)` waits for a fixed number of TSC ticks. TSC frequency is typically fixed at the processor's base frequency (not turbo), so: - **3 GHz base:** 1000 ticks ≈ 333ns per iteration → 1M iterations ≈ **333ms** - **2 GHz base:** 1000 ticks ≈ 500ns per iteration → 1M iterations ≈ **500ms** The per-iteration time is more predictable than `_mm_pause` (TSC is constant-rate on modern CPUs), but still scales with TSC frequency. The total spin is much longer because each iteration is ~333ns vs ~28–47ns for `_mm_pause` on Skylake+. ### Profiler visibility Both `_tpause` and `_mm_pause` are treated as **CPU busy** in Task Manager and ETW sampling profilers, even though these are low-power CPU states. This ends up looking like Edge consuming all the CPU during speech recognition. ## Solution This PR makes the thread pool spin behavior configurable while **preserving the default (original) behavior** for backward compatibility: - **Default (`-1`)**: Uses the original iteration-count-based spin loop (1M iterations). Unchanged throughput characteristics. - **`0`**: Disables spinning entirely (threads block immediately). - **`> 0`**: Enables time-based spinning for the specified duration in microseconds using `std::chrono::steady_clock`. Recommended for power-sensitive workloads. ### Session option usage ```cpp // Use time-based spinning with 1ms duration (recommended for on-device/client workloads) session_options.AddConfigEntry("session.intra_op.spin_duration_us", "1000"); // Disable spinning entirely session_options.AddConfigEntry("session.intra_op.spin_duration_us", "0"); ``` Both intra-op and inter-op thread pools are independently configurable via `session.intra_op.spin_duration_us` and `session.inter_op.spin_duration_us`. ## Changes ### Core thread pool (EigenNonBlockingThreadPool.h) - `WorkerLoop` now has two spin paths selected by `spin_duration_us_`: - Negative (default): original iteration-count loop, identical to `main` - Positive: time-based spin using `steady_clock` with power-of-2 bitmask optimizations for steal interval and clock-read frequency - Constructor parameter changed from `bool allow_spinning` → `int spin_duration_us` - `ComputeTimeCheckMask()`: dynamically computes clock-read frequency based on spin duration (clamped to [128, 4096] iterations) to keep overhead under 1% ### Configuration plumbing - New session config keys: `session.intra_op.spin_duration_us`, `session.inter_op.spin_duration_us` - `OrtThreadPoolParams.spin_duration_us` field with sentinel default `-1` - `ParseSpinDurationUs()` helper using `TryParseStringWithClassicLocale` for safe parsing - `allow_spinning` and `spin_duration_us` merged at `CreateThreadPoolHelper`: when `allow_spinning=false`, spin duration is forced to `0` ### Test updates - All 8 internal call sites passing `bool true` updated to `concurrency::kSpinDurationDefault` to avoid silent implicit bool-to-int conversion - `onnxruntime_perf_test` supports `--spin_duration_us` CLI flag - Thread pool benchmarks use `kSpinDurationDefault` ## Key design decisions 1. **Default preserves original behavior**: No performance regression for existing users. Benchmarks confirmed the iteration-count path matches `main`. 2. **`steady_clock` over `high_resolution_clock`**: Monotonic guarantee prevents spin-deadline issues from clock jumps. 3. **`unsigned int` loop counter**: Prevents signed overflow in the unbounded time-based spin loop. 4. **Power-of-2 bitmask optimization**: Steal every 128 iterations (`& 0x7F`), clock checks at a separate frequency computed from spin duration — avoids modulo operations in the hot loop. # Results <img width="3838" height="1478" alt="image" src="https://github.com/user-attachments/assets/265a0af0-4ed7-46ae-8263-96553bb592b2" /> LHS shows the problem where 85% of CPU time is spent in SpinWait. RHS shows the same trace with the fix, 50% lower CPU utilization the length of the usage spikes drop from 527ms to 130ms. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
1 parent 4dd5d36 commit 3f74b3c

File tree

15 files changed

+256
-34
lines changed

15 files changed

+256
-34
lines changed

include/onnxruntime/core/common/spin_pause.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,5 +9,11 @@ namespace concurrency {
99
// Intrinsic to use in spin-loops
1010
void SpinPause();
1111

12+
// Measure the average duration of a single SpinPause() call in nanoseconds.
13+
// Runs exactly once per process (thread-safe via function-local static init).
14+
// Used to convert a user-specified spin duration in microseconds into an
15+
// iteration count, avoiding clock reads in the hot spin loop.
16+
int CalibrateSpinPauseNs();
17+
1218
} // namespace concurrency
1319
} // namespace onnxruntime

include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h

Lines changed: 36 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,7 @@
3737
#pragma warning(disable : 4127)
3838
#pragma warning(disable : 4805)
3939
#endif
40+
#include <chrono>
4041
#include <memory>
4142
#include "unsupported/Eigen/CXX11/ThreadPool"
4243

@@ -864,12 +865,13 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
864865

865866
typedef RunQueue<CallbackPolicy, Tag, 1024> Queue;
866867

867-
ThreadPoolTempl(const CHAR_TYPE* name, int num_threads, bool allow_spinning, Environment& env,
868-
const ThreadOptions& thread_options)
868+
ThreadPoolTempl(const CHAR_TYPE* name, int num_threads, int spin_duration_us,
869+
Environment& env, const ThreadOptions& thread_options)
869870
: profiler_(num_threads, name),
870871
env_(env),
871872
num_threads_(num_threads),
872-
allow_spinning_(allow_spinning),
873+
spin_count_(ComputeSpinCount(spin_duration_us)),
874+
steal_interval_(std::max(spin_count_ / 100, 1)),
873875
set_denormal_as_zero_(thread_options.set_denormal_as_zero),
874876
callback_policy_(thread_options),
875877
worker_data_(num_threads),
@@ -1598,9 +1600,30 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
15981600
std::condition_variable cv;
15991601
};
16001602

1603+
// Measure the average duration of a single SpinPause() call in nanoseconds.
1604+
// Runs exactly once per process (thread-safe via function-local static init).
1605+
// The result is used to convert a user-specified spin duration in microseconds
1606+
// into an iteration count, avoiding clock reads in the hot spin loop.
1607+
static int CalibrateSpinPauseNs() {
1608+
return onnxruntime::concurrency::CalibrateSpinPauseNs();
1609+
}
1610+
1611+
// Convert spin_duration_us into an iteration count for the spin loop.
1612+
// -1 (default): use the original fixed iteration count (1 << 20).
1613+
// 0: no spinning.
1614+
// >0: calibrate SpinPause() latency and compute the corresponding count.
1615+
static int ComputeSpinCount(int spin_duration_us) {
1616+
if (spin_duration_us == 0) return 0;
1617+
if (spin_duration_us < 0) return 1 << 20; // ~1M iterations (original default)
1618+
int ns_per_iter = CalibrateSpinPauseNs();
1619+
auto count = static_cast<int64_t>(spin_duration_us) * 1000 / ns_per_iter;
1620+
return static_cast<int>(std::min<int64_t>(count, 1 << 30));
1621+
}
1622+
16011623
Environment& env_;
16021624
const unsigned num_threads_;
1603-
const bool allow_spinning_;
1625+
const int spin_count_; // Number of SpinPause iterations before blocking (0 = no spin)
1626+
const int steal_interval_; // Attempt work steal every steal_interval_ iterations
16041627
const bool set_denormal_as_zero_;
16051628
CallbackPolicy callback_policy_;
16061629
Eigen::MaxSizeVector<WorkerData> worker_data_;
@@ -1642,25 +1665,26 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
16421665

16431666
assert(td.GetStatus() == WorkerData::ThreadStatus::Spinning);
16441667

1645-
constexpr int log2_spin = 20;
1646-
const int spin_count = allow_spinning_ ? (1ull << log2_spin) : 0;
1647-
const int steal_count = spin_count / 100;
1648-
16491668
SetDenormalAsZero(set_denormal_as_zero_);
16501669
profiler_.LogThreadId(thread_id);
16511670

16521671
while (!should_exit) {
16531672
Work w = q.PopFront();
16541673
if (!w) {
1655-
// Spin waiting for work.
1656-
for (int i = 0; i < spin_count && !done_; i++) {
1657-
if (((i + 1) % steal_count == 0)) {
1674+
// Spin waiting for work. spin_count_ is determined at construction:
1675+
// default (-1): 1<<20 iterations (original behavior)
1676+
// 0: no spinning (skip loop entirely)
1677+
// >0 us: iteration count derived from one-time SpinPause() calibration
1678+
// steal_interval_ = max(spin_count_/100, 1), yielding ~100 steal attempts per spin window.
1679+
int steal_countdown = steal_interval_;
1680+
for (int i = 0; i < spin_count_ && !done_; i++) {
1681+
if (--steal_countdown == 0) {
16581682
w = Steal(StealAttemptKind::TRY_ONE);
1683+
steal_countdown = steal_interval_;
16591684
} else {
16601685
w = q.PopFront();
16611686
}
16621687
if (w) break;
1663-
16641688
if (spin_loop_status_.load(std::memory_order_relaxed) == SpinLoopStatus::kIdle) {
16651689
break;
16661690
}

include/onnxruntime/core/platform/threadpool.h

Lines changed: 30 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -129,6 +129,11 @@ struct TensorOpCost {
129129

130130
namespace concurrency {
131131

132+
// Sentinel value for spin_duration_us indicating the default iteration-count-based
133+
// spinning behavior. This preserves the original spin loop performance characteristics
134+
// where the spin duration varies by architecture depending on pause instruction latency.
135+
static constexpr int kSpinDurationDefault = -1;
136+
132137
template <typename Environment, typename CallbackPolicy>
133138
class ThreadPoolTempl;
134139

@@ -145,20 +150,39 @@ class ThreadPool {
145150
#endif
146151
// Constructs a pool for running with with "degree_of_parallelism" threads with
147152
// specified "name". env->StartThread() is used to create individual threads
148-
// with the given ThreadOptions. If "low_latency_hint" is true the thread pool
149-
// implementation may use it as a hint that lower latency is preferred at the
150-
// cost of higher CPU usage, e.g. by letting one or more idle threads spin
151-
// wait. Conversely, if the threadpool is used to schedule high-latency
152-
// operations like I/O the hint should be set to false.
153+
// with the given ThreadOptions. "spin_duration_us" controls idle thread spin behavior:
154+
// -1 (kSpinDurationDefault) = use default iteration-count-based spinning (best throughput,
155+
// but spin duration varies by CPU architecture and pause instruction latency)
156+
// 0 = disable spinning entirely (threads block immediately when idle)
157+
// >0 = calibrated iteration-based spinning for the specified duration in microseconds
158+
// (best-effort duration via one-time SpinPause() calibration; actual spin time
159+
// may vary with CPU frequency changes)
160+
//
161+
// Note: The OrtThreadPoolParams.allow_spinning flag (controlled by the
162+
// session.intra_op.allow_spinning / session.inter_op.allow_spinning config keys or
163+
// the C API) takes priority. When allow_spinning is false, spin_duration_us is forced
164+
// to 0 by CreateThreadPoolHelper regardless of the value passed here.
153165
//
154166
// REQUIRES: degree_of_parallelism > 0
155167
ThreadPool(Env* env,
156168
const ThreadOptions& thread_options,
157169
const NAME_CHAR_TYPE* name,
158170
int degree_of_parallelism,
159-
bool low_latency_hint,
171+
int spin_duration_us = kSpinDurationDefault,
160172
bool force_hybrid = false);
161173

174+
// Backward-compatible overload: maps the legacy bool parameter to the new
175+
// spin_duration_us semantics so that external callers passing true/false
176+
// don't silently get implicit bool-to-int conversion (true -> 1us).
177+
ThreadPool(Env* env,
178+
const ThreadOptions& thread_options,
179+
const NAME_CHAR_TYPE* name,
180+
int degree_of_parallelism,
181+
bool allow_spinning,
182+
bool force_hybrid = false)
183+
: ThreadPool(env, thread_options, name, degree_of_parallelism,
184+
allow_spinning ? kSpinDurationDefault : 0, force_hybrid) {}
185+
162186
// Waits until all scheduled work has finished and then destroy the
163187
// set of threads.
164188
~ThreadPool();

include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -154,6 +154,22 @@ static const char* const kOrtSessionOptionsUseDeviceAllocatorForInitializers = "
154154
static const char* const kOrtSessionOptionsConfigAllowInterOpSpinning = "session.inter_op.allow_spinning";
155155
static const char* const kOrtSessionOptionsConfigAllowIntraOpSpinning = "session.intra_op.allow_spinning";
156156

157+
// Configure the duration in microseconds that threads spin waiting for work before blocking.
158+
// This setting is subordinate to the allow_spinning flags (session.intra_op.allow_spinning /
159+
// session.inter_op.allow_spinning). When allow_spinning is "0", spinning is disabled and
160+
// the spin duration is forced to 0 regardless of this setting.
161+
// By default (when this option is not set), the thread pool uses an iteration-count-based spin loop
162+
// whose wall-clock duration varies by CPU architecture and pause instruction latency. This provides
163+
// the best throughput but may result in high CPU utilization.
164+
// Setting a positive value switches to calibrated iteration-based spinning that targets
165+
// the specified duration. The actual spin time is a best-effort approximation based on a
166+
// one-time measurement of the pause instruction latency; it may vary with CPU frequency
167+
// changes. Recommended for power-sensitive or client/on-device workloads.
168+
// Common values: 500-2000 (0.5-2ms).
169+
// Setting to "0" with spinning enabled effectively disables spinning (equivalent to allow_spinning = false).
170+
static const char* const kOrtSessionOptionsConfigIntraOpSpinDurationUs = "session.intra_op.spin_duration_us";
171+
static const char* const kOrtSessionOptionsConfigInterOpSpinDurationUs = "session.inter_op.spin_duration_us";
172+
157173
// Key for using model bytes directly for ORT format
158174
// If a session is created using an input byte array contains the ORT format model data,
159175
// By default we will copy the model bytes at the time of session creation to ensure the model bytes

onnxruntime/core/common/spin_pause.cc

Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,11 @@
33

44
#include "core/common/spin_pause.h"
55

6+
#include <algorithm>
7+
#include <atomic>
8+
#include <chrono>
9+
#include <cstdint>
10+
611
#if defined(_M_AMD64)
712
#include <intrin.h>
813
#endif
@@ -39,8 +44,60 @@ void SpinPause() {
3944
} else {
4045
_mm_pause();
4146
}
47+
#elif defined(__aarch64__) || defined(_M_ARM64) || defined(_M_ARM64EC)
48+
// ARM64 hint that yields the pipeline without descheduling the thread.
49+
// Emitted as a non-inline asm statement so the optimizer cannot elide it
50+
// from the calibration loop in CalibrateSpinPauseNs().
51+
__asm__ __volatile__("yield" ::: "memory");
52+
#elif defined(__arm__)
53+
__asm__ __volatile__("yield" ::: "memory");
54+
#else
55+
// Generic fallback: a compiler barrier. This prevents the optimizer from
56+
// collapsing the SpinPause() calls in the calibration loop into nothing.
57+
// It is intentionally much cheaper than std::this_thread::yield() so that
58+
// callers in worker spin loops do not pay scheduler overhead.
59+
std::atomic_signal_fence(std::memory_order_seq_cst);
4260
#endif
4361
}
4462

63+
// Measure the average wall-clock cost of one SpinPause() call in nanoseconds.
64+
// This is intentionally done once per process via function-local static init.
65+
//
66+
// Caveats (documented so callers set the right expectations):
67+
// * On heterogeneous architectures (Intel P/E cores, ARM big.LITTLE) the
68+
// calibration runs on whichever core first hits this function. Other
69+
// cores may see a different per-iteration cost, so any value derived
70+
// from this number is best-effort across worker threads.
71+
// * On platforms where SpinPause() has no architecture-specific pause
72+
// instruction we emit a compiler barrier instead, which means the
73+
// measured cost tracks loop + barrier overhead rather than the hardware
74+
// pause latency. This is still the correct quantity to use for scaling
75+
// an iteration count because the worker spin loop executes the same
76+
// SpinPause() call.
77+
int CalibrateSpinPauseNs() {
78+
static const int ns_per_iter = []() {
79+
constexpr int kWarmupIters = 256;
80+
constexpr int kCalibrationIters = 1024;
81+
// Use a volatile sink so the optimizer cannot conclude SpinPause() is
82+
// side-effect-free and delete the calibration loops. This is belt-and-
83+
// suspenders on top of the fallback barrier inside SpinPause() above.
84+
[[maybe_unused]] volatile int sink = 0;
85+
for (int i = 0; i < kWarmupIters; i++) {
86+
SpinPause();
87+
sink = sink + 1;
88+
}
89+
auto start = std::chrono::steady_clock::now();
90+
for (int i = 0; i < kCalibrationIters; i++) {
91+
SpinPause();
92+
sink = sink + 1;
93+
}
94+
auto elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
95+
std::chrono::steady_clock::now() - start)
96+
.count();
97+
return static_cast<int>(std::max<int64_t>(elapsed_ns / kCalibrationIters, 1));
98+
}();
99+
return ns_per_iter;
100+
}
101+
45102
} // namespace concurrency
46103
} // namespace onnxruntime

onnxruntime/core/common/threadpool.cc

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -374,7 +374,7 @@ ThreadPool::ThreadPool(Env* env,
374374
const ThreadOptions& thread_options,
375375
const NAME_CHAR_TYPE* name,
376376
int degree_of_parallelism,
377-
bool low_latency_hint,
377+
int spin_duration_us,
378378
bool force_hybrid)
379379
: thread_options_(thread_options), force_hybrid_(force_hybrid) {
380380
// In the current implementation, a thread pool with degree_of_parallelism==1 uses
@@ -396,7 +396,9 @@ ThreadPool::ThreadPool(Env* env,
396396
using PoolType = ThreadPoolTempl<Env, WorkNoCallbackPolicy>;
397397
#endif
398398
extended_eigen_threadpool_ =
399-
std::make_unique<PoolType>(name, threads_to_create, low_latency_hint, *env, thread_options_);
399+
std::make_unique<PoolType>(name, threads_to_create,
400+
spin_duration_us,
401+
*env, thread_options_);
400402
underlying_threadpool_ = extended_eigen_threadpool_.get();
401403
}
402404
}

onnxruntime/core/session/inference_session.cc

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -103,6 +103,29 @@ using namespace onnxruntime::common;
103103

104104
namespace onnxruntime {
105105
namespace {
106+
107+
// Parse a spin duration config value (in microseconds) from a string.
108+
// Returns kSpinDurationDefault (-1) if the config is not explicitly set.
109+
// Returns the parsed value (>= -1) if valid. Logs a warning and returns
110+
// kSpinDurationDefault on parse failure.
111+
constexpr int kSpinDurationWarnThresholdUs = 10000; // 10ms — warn above this
112+
int ParseSpinDurationUs(std::string_view str, const char* config_key,
113+
const logging::Logger& logger) {
114+
int spin_us = concurrency::kSpinDurationDefault;
115+
if (!TryParseStringWithClassicLocale(str, spin_us) || spin_us < -1) {
116+
LOGS(logger, WARNING) << "Invalid value for " << config_key
117+
<< ": \"" << str << "\", using default spin duration setting";
118+
return concurrency::kSpinDurationDefault;
119+
}
120+
if (spin_us > kSpinDurationWarnThresholdUs) {
121+
LOGS(logger, WARNING) << config_key << " is set to " << spin_us
122+
<< "us (>" << kSpinDurationWarnThresholdUs
123+
<< "us). Large spin durations increase CPU/power usage. "
124+
<< "Typical values are 500-2000us.";
125+
}
126+
return spin_us;
127+
}
128+
106129
template <typename T>
107130
const T* GetDateFormatString();
108131

@@ -455,6 +478,9 @@ void InferenceSession::ConstructorCommon(const SessionOptions& session_options,
455478
// If the thread pool can use all the processors, then
456479
// we set affinity of each thread to each processor.
457480
to.allow_spinning = allow_intra_op_spinning;
481+
to.spin_duration_us = ParseSpinDurationUs(
482+
session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigIntraOpSpinDurationUs, "-1"),
483+
kOrtSessionOptionsConfigIntraOpSpinDurationUs, *session_logger_);
458484
to.dynamic_block_base_ = std::stoi(session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigDynamicBlockBase, "0"));
459485
LOGS(*session_logger_, INFO) << "Dynamic block base set to " << to.dynamic_block_base_;
460486

@@ -502,6 +528,9 @@ void InferenceSession::ConstructorCommon(const SessionOptions& session_options,
502528
to.name = inter_thread_pool_name_.c_str();
503529
to.set_denormal_as_zero = set_denormal_as_zero;
504530
to.allow_spinning = allow_inter_op_spinning;
531+
to.spin_duration_us = ParseSpinDurationUs(
532+
session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigInterOpSpinDurationUs, "-1"),
533+
kOrtSessionOptionsConfigInterOpSpinDurationUs, *session_logger_);
505534
to.dynamic_block_base_ = std::stoi(session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigDynamicBlockBase, "0"));
506535

507536
// Set custom threading functions

onnxruntime/core/util/thread_utils.cc

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ std::ostream& operator<<(std::ostream& os, const OrtThreadPoolParams& params) {
1919
os << " thread_pool_size: " << params.thread_pool_size;
2020
os << " auto_set_affinity: " << params.auto_set_affinity;
2121
os << " allow_spinning: " << params.allow_spinning;
22+
os << " spin_duration_us: " << params.spin_duration_us;
2223
os << " dynamic_block_base_: " << params.dynamic_block_base_;
2324
os << " stack_size: " << params.stack_size;
2425
os << " affinity_str: " << params.affinity_str;
@@ -162,8 +163,9 @@ CreateThreadPoolHelper(Env* env, OrtThreadPoolParams options) {
162163
}
163164
#endif
164165

165-
return std::make_unique<ThreadPool>(env, to, options.name, options.thread_pool_size,
166-
options.allow_spinning);
166+
// Clamp so that invalid negatives (e.g. -5) are treated as the default (-1).
167+
const int spin_us = options.allow_spinning ? std::max(options.spin_duration_us, -1) : 0;
168+
return std::make_unique<ThreadPool>(env, to, options.name, options.thread_pool_size, spin_us);
167169
}
168170

169171
std::unique_ptr<ThreadPool>

onnxruntime/core/util/thread_utils.h

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,14 @@ struct OrtThreadPoolParams {
2727
bool allow_spinning = false;
2828
#endif
2929

30+
// Duration in microseconds that threads spin waiting for work before blocking.
31+
// Subordinate to allow_spinning: when allow_spinning is false, this value is
32+
// ignored and spinning is disabled (equivalent to spin_duration_us = 0).
33+
// -1 (kSpinDurationDefault) = use default iteration-count-based spinning
34+
// 0 = disable spinning (equivalent to allow_spinning = false)
35+
// >0 = calibrated iteration-based spinning for specified duration (best-effort)
36+
int spin_duration_us = onnxruntime::concurrency::kSpinDurationDefault;
37+
3038
// It it is non-negative, thread pool will split a task by a decreasing block size
3139
// of remaining_of_total_iterations / (num_of_threads * dynamic_block_base_)
3240
int dynamic_block_base_ = 0;

onnxruntime/test/onnx/microbenchmark/tptest.cc

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -12,15 +12,15 @@ using namespace onnxruntime::concurrency;
1212

1313
// Thread pool configuration to test.
1414
constexpr int NUM_THREADS = 8;
15-
constexpr bool ALLOW_SPINNING = true;
15+
constexpr int SPIN_DURATION_US = kSpinDurationDefault;
1616

1717
static void BM_CreateThreadPool(benchmark::State& state) {
1818
for (auto _ : state) {
1919
ThreadPool tp(&onnxruntime::Env::Default(),
2020
onnxruntime::ThreadOptions(),
2121
ORT_TSTR(""),
2222
NUM_THREADS,
23-
ALLOW_SPINNING);
23+
SPIN_DURATION_US);
2424
}
2525
}
2626
BENCHMARK(BM_CreateThreadPool)
@@ -53,7 +53,7 @@ static void BM_ThreadPoolParallelFor(benchmark::State& state) {
5353
auto tp = std::make_unique<ThreadPool>(&onnxruntime::Env::Default(),
5454
onnxruntime::ThreadOptions(),
5555
nullptr,
56-
NUM_THREADS, ALLOW_SPINNING);
56+
NUM_THREADS, SPIN_DURATION_US);
5757
for (auto _ : state) {
5858
ThreadPool::TryParallelFor(tp.get(), len, cost, SimpleForLoop);
5959
}
@@ -98,7 +98,7 @@ static void BM_ThreadPoolSimpleParallelFor(benchmark::State& state) {
9898
auto tp = std::make_unique<ThreadPool>(&onnxruntime::Env::Default(),
9999
onnxruntime::ThreadOptions(),
100100
nullptr,
101-
num_threads, ALLOW_SPINNING);
101+
num_threads, SPIN_DURATION_US);
102102
for (auto _ : state) {
103103
for (int j = 0; j < 100; j++) {
104104
ThreadPool::TrySimpleParallelFor(tp.get(), len, [&](size_t) {

0 commit comments

Comments
 (0)