Update worker thread pool to use time based wait. (#27916)

sushraja-msft · github-actions[bot] · yuslepukhin · web-flow · commit 3f74b3cfd335 · 2026-04-16T21:24:54.000-07:00
# Make thread pool spin duration configurable via session option ## Problem The ORT Eigen thread pool's `SpinPause` loop uses a fixed iteration count (`1 << 20` = ~1M iterations) before blocking. The actual wall-clock spin duration varies dramatically by CPU architecture: | Pause Instruction | Architecture | Spin Duration (1M iterations) | |---|---|---| | `_mm_pause` | Pre-Skylake | ~3ms | | `_mm_pause` | Skylake+ @ 3 GHz | ~47ms | | `_tpause` | 3 GHz base | ~333ms | | `_tpause` | 2 GHz base | ~500ms | For client/on-device workloads (e.g., Whisper in Edge), this causes high CPU utilization visible in profilers and Task Manager, even though the CPU is in a low-power spin state. So 1M iterations at 3 GHz: - **Pre-Skylake:** 1M × 10 / 3G ≈ **3.3ms** - **Skylake @ 3 GHz:** 1M × 140 / 3G ≈ **47ms** - **Skylake @ 5 GHz (turbo):** 1M × 140 / 5G ≈ **28ms** - **AMD Zen @ 4 GHz:** 1M × 65 / 4G ≈ **16ms** The total duration scaled inversely with clock speed and varied dramatically across microarchitectures. The Skylake 14x increase was specifically because Intel found that the short pause was causing too much power waste and memory bus contention in spin loops. ### `_tpause` `_tpause(0x0, __rdtsc() + 1000)` waits for a fixed number of TSC ticks. TSC frequency is typically fixed at the processor's base frequency (not turbo), so: - **3 GHz base:** 1000 ticks ≈ 333ns per iteration → 1M iterations ≈ **333ms** - **2 GHz base:** 1000 ticks ≈ 500ns per iteration → 1M iterations ≈ **500ms** The per-iteration time is more predictable than `_mm_pause` (TSC is constant-rate on modern CPUs), but still scales with TSC frequency. The total spin is much longer because each iteration is ~333ns vs ~28–47ns for `_mm_pause` on Skylake+. ### Profiler visibility Both `_tpause` and `_mm_pause` are treated as **CPU busy** in Task Manager and ETW sampling profilers, even though these are low-power CPU states. This ends up looking like Edge consuming all the CPU during speech recognition. ## Solution This PR makes the thread pool spin behavior configurable while **preserving the default (original) behavior** for backward compatibility: - **Default (`-1`)**: Uses the original iteration-count-based spin loop (1M iterations). Unchanged throughput characteristics. - **`0`**: Disables spinning entirely (threads block immediately). - **`> 0`**: Enables time-based spinning for the specified duration in microseconds using `std::chrono::steady_clock`. Recommended for power-sensitive workloads. ### Session option usage ```cpp // Use time-based spinning with 1ms duration (recommended for on-device/client workloads) session_options.AddConfigEntry("session.intra_op.spin_duration_us", "1000"); // Disable spinning entirely session_options.AddConfigEntry("session.intra_op.spin_duration_us", "0"); ``` Both intra-op and inter-op thread pools are independently configurable via `session.intra_op.spin_duration_us` and `session.inter_op.spin_duration_us`. ## Changes ### Core thread pool (EigenNonBlockingThreadPool.h) - `WorkerLoop` now has two spin paths selected by `spin_duration_us_`: - Negative (default): original iteration-count loop, identical to `main` - Positive: time-based spin using `steady_clock` with power-of-2 bitmask optimizations for steal interval and clock-read frequency - Constructor parameter changed from `bool allow_spinning` → `int spin_duration_us` - `ComputeTimeCheckMask()`: dynamically computes clock-read frequency based on spin duration (clamped to [128, 4096] iterations) to keep overhead under 1% ### Configuration plumbing - New session config keys: `session.intra_op.spin_duration_us`, `session.inter_op.spin_duration_us` - `OrtThreadPoolParams.spin_duration_us` field with sentinel default `-1` - `ParseSpinDurationUs()` helper using `TryParseStringWithClassicLocale` for safe parsing - `allow_spinning` and `spin_duration_us` merged at `CreateThreadPoolHelper`: when `allow_spinning=false`, spin duration is forced to `0` ### Test updates - All 8 internal call sites passing `bool true` updated to `concurrency::kSpinDurationDefault` to avoid silent implicit bool-to-int conversion - `onnxruntime_perf_test` supports `--spin_duration_us` CLI flag - Thread pool benchmarks use `kSpinDurationDefault` ## Key design decisions 1. **Default preserves original behavior**: No performance regression for existing users. Benchmarks confirmed the iteration-count path matches `main`. 2. **`steady_clock` over `high_resolution_clock`**: Monotonic guarantee prevents spin-deadline issues from clock jumps. 3. **`unsigned int` loop counter**: Prevents signed overflow in the unbounded time-based spin loop. 4. **Power-of-2 bitmask optimization**: Steal every 128 iterations (`& 0x7F`), clock checks at a separate frequency computed from spin duration — avoids modulo operations in the hot loop. # Results <img width="3838" height="1478" alt="image" src="https://github.com/user-attachments/assets/265a0af0-4ed7-46ae-8263-96553bb592b2" /> LHS shows the problem where 85% of CPU time is spent in SpinWait. RHS shows the same trace with the fix, 50% lower CPU utilization the length of the usage spikes drop from 527ms to 130ms. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
diff --git a/include/onnxruntime/core/common/spin_pause.h b/include/onnxruntime/core/common/spin_pause.h
@@ -9,5 +9,11 @@ namespace concurrency {
 // Intrinsic to use in spin-loops
 void SpinPause();
 
+// Measure the average duration of a single SpinPause() call in nanoseconds.
+// Runs exactly once per process (thread-safe via function-local static init).
+// Used to convert a user-specified spin duration in microseconds into an
+// iteration count, avoiding clock reads in the hot spin loop.
+int CalibrateSpinPauseNs();
+
 }  // namespace concurrency
 }  // namespace onnxruntime
diff --git a/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h b/include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h
@@ -37,6 +37,7 @@
 #pragma warning(disable : 4127)
 #pragma warning(disable : 4805)
 #endif
+#include <chrono>
 #include <memory>
 #include "unsupported/Eigen/CXX11/ThreadPool"
 
@@ -864,12 +865,13 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
 
   typedef RunQueue<CallbackPolicy, Tag, 1024> Queue;
 
-  ThreadPoolTempl(const CHAR_TYPE* name, int num_threads, bool allow_spinning, Environment& env,
-                  const ThreadOptions& thread_options)
+  ThreadPoolTempl(const CHAR_TYPE* name, int num_threads, int spin_duration_us,
+                  Environment& env, const ThreadOptions& thread_options)
       : profiler_(num_threads, name),
         env_(env),
         num_threads_(num_threads),
-        allow_spinning_(allow_spinning),
+        spin_count_(ComputeSpinCount(spin_duration_us)),
+        steal_interval_(std::max(spin_count_ / 100, 1)),
         set_denormal_as_zero_(thread_options.set_denormal_as_zero),
         callback_policy_(thread_options),
         worker_data_(num_threads),
@@ -1598,9 +1600,30 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
     std::condition_variable cv;
   };
 
+  // Measure the average duration of a single SpinPause() call in nanoseconds.
+  // Runs exactly once per process (thread-safe via function-local static init).
+  // The result is used to convert a user-specified spin duration in microseconds
+  // into an iteration count, avoiding clock reads in the hot spin loop.
+  static int CalibrateSpinPauseNs() {
+    return onnxruntime::concurrency::CalibrateSpinPauseNs();
+  }
+
+  // Convert spin_duration_us into an iteration count for the spin loop.
+  //   -1 (default): use the original fixed iteration count (1 << 20).
+  //    0: no spinning.
+  //   >0: calibrate SpinPause() latency and compute the corresponding count.
+  static int ComputeSpinCount(int spin_duration_us) {
+    if (spin_duration_us == 0) return 0;
+    if (spin_duration_us < 0) return 1 << 20;  // ~1M iterations (original default)
+    int ns_per_iter = CalibrateSpinPauseNs();
+    auto count = static_cast<int64_t>(spin_duration_us) * 1000 / ns_per_iter;
+    return static_cast<int>(std::min<int64_t>(count, 1 << 30));
+  }
+
   Environment& env_;
   const unsigned num_threads_;
-  const bool allow_spinning_;
+  const int spin_count_;      // Number of SpinPause iterations before blocking (0 = no spin)
+  const int steal_interval_;  // Attempt work steal every steal_interval_ iterations
   const bool set_denormal_as_zero_;
   CallbackPolicy callback_policy_;
   Eigen::MaxSizeVector<WorkerData> worker_data_;
@@ -1642,25 +1665,26 @@ class ThreadPoolTempl : public onnxruntime::concurrency::ExtendedThreadPoolInter
 
     assert(td.GetStatus() == WorkerData::ThreadStatus::Spinning);
 
-    constexpr int log2_spin = 20;
-    const int spin_count = allow_spinning_ ? (1ull << log2_spin) : 0;
-    const int steal_count = spin_count / 100;
-
     SetDenormalAsZero(set_denormal_as_zero_);
     profiler_.LogThreadId(thread_id);
 
     while (!should_exit) {
       Work w = q.PopFront();
       if (!w) {
-        // Spin waiting for work.
-        for (int i = 0; i < spin_count && !done_; i++) {
-          if (((i + 1) % steal_count == 0)) {
+        // Spin waiting for work. spin_count_ is determined at construction:
+        //   default (-1): 1<<20 iterations (original behavior)
+        //   0: no spinning (skip loop entirely)
+        //   >0 us: iteration count derived from one-time SpinPause() calibration
+        // steal_interval_ = max(spin_count_/100, 1), yielding ~100 steal attempts per spin window.
+        int steal_countdown = steal_interval_;
+        for (int i = 0; i < spin_count_ && !done_; i++) {
+          if (--steal_countdown == 0) {
             w = Steal(StealAttemptKind::TRY_ONE);
+            steal_countdown = steal_interval_;
           } else {
             w = q.PopFront();
           }
           if (w) break;
-
           if (spin_loop_status_.load(std::memory_order_relaxed) == SpinLoopStatus::kIdle) {
             break;
           }
diff --git a/include/onnxruntime/core/platform/threadpool.h b/include/onnxruntime/core/platform/threadpool.h
@@ -129,6 +129,11 @@ struct TensorOpCost {
 
 namespace concurrency {
 
+// Sentinel value for spin_duration_us indicating the default iteration-count-based
+// spinning behavior. This preserves the original spin loop performance characteristics
+// where the spin duration varies by architecture depending on pause instruction latency.
+static constexpr int kSpinDurationDefault = -1;
+
 template <typename Environment, typename CallbackPolicy>
 class ThreadPoolTempl;
 
@@ -145,20 +150,39 @@ class ThreadPool {
 #endif
   // Constructs a pool for running with with "degree_of_parallelism" threads with
   // specified "name". env->StartThread() is used to create individual threads
-  // with the given ThreadOptions. If "low_latency_hint" is true the thread pool
-  // implementation may use it as a hint that lower latency is preferred at the
-  // cost of higher CPU usage, e.g. by letting one or more idle threads spin
-  // wait. Conversely, if the threadpool is used to schedule high-latency
-  // operations like I/O the hint should be set to false.
+  // with the given ThreadOptions. "spin_duration_us" controls idle thread spin behavior:
+  //   -1 (kSpinDurationDefault) = use default iteration-count-based spinning (best throughput,
+  //       but spin duration varies by CPU architecture and pause instruction latency)
+  //    0 = disable spinning entirely (threads block immediately when idle)
+  //   >0 = calibrated iteration-based spinning for the specified duration in microseconds
+  //        (best-effort duration via one-time SpinPause() calibration; actual spin time
+  //         may vary with CPU frequency changes)
+  //
+  // Note: The OrtThreadPoolParams.allow_spinning flag (controlled by the
+  // session.intra_op.allow_spinning / session.inter_op.allow_spinning config keys or
+  // the C API) takes priority. When allow_spinning is false, spin_duration_us is forced
+  // to 0 by CreateThreadPoolHelper regardless of the value passed here.
   //
   // REQUIRES: degree_of_parallelism > 0
   ThreadPool(Env* env,
              const ThreadOptions& thread_options,
              const NAME_CHAR_TYPE* name,
              int degree_of_parallelism,
-             bool low_latency_hint,
+             int spin_duration_us = kSpinDurationDefault,
              bool force_hybrid = false);
 
+  // Backward-compatible overload: maps the legacy bool parameter to the new
+  // spin_duration_us semantics so that external callers passing true/false
+  // don't silently get implicit bool-to-int conversion (true -> 1us).
+  ThreadPool(Env* env,
+             const ThreadOptions& thread_options,
+             const NAME_CHAR_TYPE* name,
+             int degree_of_parallelism,
+             bool allow_spinning,
+             bool force_hybrid = false)
+      : ThreadPool(env, thread_options, name, degree_of_parallelism,
+                   allow_spinning ? kSpinDurationDefault : 0, force_hybrid) {}
+
   // Waits until all scheduled work has finished and then destroy the
   // set of threads.
   ~ThreadPool();
diff --git a/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h b/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h
@@ -154,6 +154,22 @@ static const char* const kOrtSessionOptionsUseDeviceAllocatorForInitializers = "
 static const char* const kOrtSessionOptionsConfigAllowInterOpSpinning = "session.inter_op.allow_spinning";
 static const char* const kOrtSessionOptionsConfigAllowIntraOpSpinning = "session.intra_op.allow_spinning";
 
+// Configure the duration in microseconds that threads spin waiting for work before blocking.
+// This setting is subordinate to the allow_spinning flags (session.intra_op.allow_spinning /
+// session.inter_op.allow_spinning). When allow_spinning is "0", spinning is disabled and
+// the spin duration is forced to 0 regardless of this setting.
+// By default (when this option is not set), the thread pool uses an iteration-count-based spin loop
+// whose wall-clock duration varies by CPU architecture and pause instruction latency. This provides
+// the best throughput but may result in high CPU utilization.
+// Setting a positive value switches to calibrated iteration-based spinning that targets
+// the specified duration. The actual spin time is a best-effort approximation based on a
+// one-time measurement of the pause instruction latency; it may vary with CPU frequency
+// changes. Recommended for power-sensitive or client/on-device workloads.
+// Common values: 500-2000 (0.5-2ms).
+// Setting to "0" with spinning enabled effectively disables spinning (equivalent to allow_spinning = false).
+static const char* const kOrtSessionOptionsConfigIntraOpSpinDurationUs = "session.intra_op.spin_duration_us";
+static const char* const kOrtSessionOptionsConfigInterOpSpinDurationUs = "session.inter_op.spin_duration_us";
+
 // Key for using model bytes directly for ORT format
 // If a session is created using an input byte array contains the ORT format model data,
 // By default we will copy the model bytes at the time of session creation to ensure the model bytes
diff --git a/onnxruntime/core/common/spin_pause.cc b/onnxruntime/core/common/spin_pause.cc
@@ -3,6 +3,11 @@
 
 #include "core/common/spin_pause.h"
 
+#include <algorithm>
+#include <atomic>
+#include <chrono>
+#include <cstdint>
+
 #if defined(_M_AMD64)
 #include <intrin.h>
 #endif
@@ -39,8 +44,60 @@ void SpinPause() {
   } else {
     _mm_pause();
   }
+#elif defined(__aarch64__) || defined(_M_ARM64) || defined(_M_ARM64EC)
+  // ARM64 hint that yields the pipeline without descheduling the thread.
+  // Emitted as a non-inline asm statement so the optimizer cannot elide it
+  // from the calibration loop in CalibrateSpinPauseNs().
+  __asm__ __volatile__("yield" ::: "memory");
+#elif defined(__arm__)
+  __asm__ __volatile__("yield" ::: "memory");
+#else
+  // Generic fallback: a compiler barrier. This prevents the optimizer from
+  // collapsing the SpinPause() calls in the calibration loop into nothing.
+  // It is intentionally much cheaper than std::this_thread::yield() so that
+  // callers in worker spin loops do not pay scheduler overhead.
+  std::atomic_signal_fence(std::memory_order_seq_cst);
 #endif
 }
 
+// Measure the average wall-clock cost of one SpinPause() call in nanoseconds.
+// This is intentionally done once per process via function-local static init.
+//
+// Caveats (documented so callers set the right expectations):
+//   * On heterogeneous architectures (Intel P/E cores, ARM big.LITTLE) the
+//     calibration runs on whichever core first hits this function. Other
+//     cores may see a different per-iteration cost, so any value derived
+//     from this number is best-effort across worker threads.
+//   * On platforms where SpinPause() has no architecture-specific pause
+//     instruction we emit a compiler barrier instead, which means the
+//     measured cost tracks loop + barrier overhead rather than the hardware
+//     pause latency. This is still the correct quantity to use for scaling
+//     an iteration count because the worker spin loop executes the same
+//     SpinPause() call.
+int CalibrateSpinPauseNs() {
+  static const int ns_per_iter = []() {
+    constexpr int kWarmupIters = 256;
+    constexpr int kCalibrationIters = 1024;
+    // Use a volatile sink so the optimizer cannot conclude SpinPause() is
+    // side-effect-free and delete the calibration loops. This is belt-and-
+    // suspenders on top of the fallback barrier inside SpinPause() above.
+    [[maybe_unused]] volatile int sink = 0;
+    for (int i = 0; i < kWarmupIters; i++) {
+      SpinPause();
+      sink = sink + 1;
+    }
+    auto start = std::chrono::steady_clock::now();
+    for (int i = 0; i < kCalibrationIters; i++) {
+      SpinPause();
+      sink = sink + 1;
+    }
+    auto elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
+                          std::chrono::steady_clock::now() - start)
+                          .count();
+    return static_cast<int>(std::max<int64_t>(elapsed_ns / kCalibrationIters, 1));
+  }();
+  return ns_per_iter;
+}
+
 }  // namespace concurrency
 }  // namespace onnxruntime
diff --git a/onnxruntime/core/common/threadpool.cc b/onnxruntime/core/common/threadpool.cc
@@ -374,7 +374,7 @@ ThreadPool::ThreadPool(Env* env,
                        const ThreadOptions& thread_options,
                        const NAME_CHAR_TYPE* name,
                        int degree_of_parallelism,
-                       bool low_latency_hint,
+                       int spin_duration_us,
                        bool force_hybrid)
     : thread_options_(thread_options), force_hybrid_(force_hybrid) {
   // In the current implementation, a thread pool with degree_of_parallelism==1 uses
@@ -396,7 +396,9 @@ ThreadPool::ThreadPool(Env* env,
     using PoolType = ThreadPoolTempl<Env, WorkNoCallbackPolicy>;
 #endif
     extended_eigen_threadpool_ =
-        std::make_unique<PoolType>(name, threads_to_create, low_latency_hint, *env, thread_options_);
+        std::make_unique<PoolType>(name, threads_to_create,
+                                   spin_duration_us,
+                                   *env, thread_options_);
     underlying_threadpool_ = extended_eigen_threadpool_.get();
   }
 }
diff --git a/onnxruntime/core/session/inference_session.cc b/onnxruntime/core/session/inference_session.cc
@@ -103,6 +103,29 @@ using namespace onnxruntime::common;
 
 namespace onnxruntime {
 namespace {
+
+// Parse a spin duration config value (in microseconds) from a string.
+// Returns kSpinDurationDefault (-1) if the config is not explicitly set.
+// Returns the parsed value (>= -1) if valid. Logs a warning and returns
+// kSpinDurationDefault on parse failure.
+constexpr int kSpinDurationWarnThresholdUs = 10000;  // 10ms — warn above this
+int ParseSpinDurationUs(std::string_view str, const char* config_key,
+                        const logging::Logger& logger) {
+  int spin_us = concurrency::kSpinDurationDefault;
+  if (!TryParseStringWithClassicLocale(str, spin_us) || spin_us < -1) {
+    LOGS(logger, WARNING) << "Invalid value for " << config_key
+                          << ": \"" << str << "\", using default spin duration setting";
+    return concurrency::kSpinDurationDefault;
+  }
+  if (spin_us > kSpinDurationWarnThresholdUs) {
+    LOGS(logger, WARNING) << config_key << " is set to " << spin_us
+                          << "us (>" << kSpinDurationWarnThresholdUs
+                          << "us). Large spin durations increase CPU/power usage. "
+                          << "Typical values are 500-2000us.";
+  }
+  return spin_us;
+}
+
 template <typename T>
 const T* GetDateFormatString();
 
@@ -455,6 +478,9 @@ void InferenceSession::ConstructorCommon(const SessionOptions& session_options,
         // If the thread pool can use all the processors, then
         // we set affinity of each thread to each processor.
         to.allow_spinning = allow_intra_op_spinning;
+        to.spin_duration_us = ParseSpinDurationUs(
+            session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigIntraOpSpinDurationUs, "-1"),
+            kOrtSessionOptionsConfigIntraOpSpinDurationUs, *session_logger_);
         to.dynamic_block_base_ = std::stoi(session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigDynamicBlockBase, "0"));
         LOGS(*session_logger_, INFO) << "Dynamic block base set to " << to.dynamic_block_base_;
 
@@ -502,6 +528,9 @@ void InferenceSession::ConstructorCommon(const SessionOptions& session_options,
         to.name = inter_thread_pool_name_.c_str();
         to.set_denormal_as_zero = set_denormal_as_zero;
         to.allow_spinning = allow_inter_op_spinning;
+        to.spin_duration_us = ParseSpinDurationUs(
+            session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigInterOpSpinDurationUs, "-1"),
+            kOrtSessionOptionsConfigInterOpSpinDurationUs, *session_logger_);
         to.dynamic_block_base_ = std::stoi(session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigDynamicBlockBase, "0"));
 
         // Set custom threading functions
diff --git a/onnxruntime/core/util/thread_utils.cc b/onnxruntime/core/util/thread_utils.cc
@@ -19,6 +19,7 @@ std::ostream& operator<<(std::ostream& os, const OrtThreadPoolParams& params) {
   os << " thread_pool_size: " << params.thread_pool_size;
   os << " auto_set_affinity: " << params.auto_set_affinity;
   os << " allow_spinning: " << params.allow_spinning;
+  os << " spin_duration_us: " << params.spin_duration_us;
   os << " dynamic_block_base_: " << params.dynamic_block_base_;
   os << " stack_size: " << params.stack_size;
   os << " affinity_str: " << params.affinity_str;
@@ -162,8 +163,9 @@ CreateThreadPoolHelper(Env* env, OrtThreadPoolParams options) {
   }
 #endif
 
-  return std::make_unique<ThreadPool>(env, to, options.name, options.thread_pool_size,
-                                      options.allow_spinning);
+  // Clamp so that invalid negatives (e.g. -5) are treated as the default (-1).
+  const int spin_us = options.allow_spinning ? std::max(options.spin_duration_us, -1) : 0;
+  return std::make_unique<ThreadPool>(env, to, options.name, options.thread_pool_size, spin_us);
 }
 
 std::unique_ptr<ThreadPool>
diff --git a/onnxruntime/core/util/thread_utils.h b/onnxruntime/core/util/thread_utils.h
@@ -27,6 +27,14 @@ struct OrtThreadPoolParams {
   bool allow_spinning = false;
 #endif
 
+  // Duration in microseconds that threads spin waiting for work before blocking.
+  // Subordinate to allow_spinning: when allow_spinning is false, this value is
+  // ignored and spinning is disabled (equivalent to spin_duration_us = 0).
+  //   -1 (kSpinDurationDefault) = use default iteration-count-based spinning
+  //    0 = disable spinning (equivalent to allow_spinning = false)
+  //   >0 = calibrated iteration-based spinning for specified duration (best-effort)
+  int spin_duration_us = onnxruntime::concurrency::kSpinDurationDefault;
+
   // It it is non-negative, thread pool will split a task by a decreasing block size
   // of remaining_of_total_iterations / (num_of_threads * dynamic_block_base_)
   int dynamic_block_base_ = 0;
diff --git a/onnxruntime/test/onnx/microbenchmark/tptest.cc b/onnxruntime/test/onnx/microbenchmark/tptest.cc
@@ -12,15 +12,15 @@ using namespace onnxruntime::concurrency;
 
 // Thread pool configuration to test.
 constexpr int NUM_THREADS = 8;
-constexpr bool ALLOW_SPINNING = true;
+constexpr int SPIN_DURATION_US = kSpinDurationDefault;
 
 static void BM_CreateThreadPool(benchmark::State& state) {
   for (auto _ : state) {
     ThreadPool tp(&onnxruntime::Env::Default(),
                   onnxruntime::ThreadOptions(),
                   ORT_TSTR(""),
                   NUM_THREADS,
-                  ALLOW_SPINNING);
+                  SPIN_DURATION_US);
   }
 }
 BENCHMARK(BM_CreateThreadPool)
@@ -53,7 +53,7 @@ static void BM_ThreadPoolParallelFor(benchmark::State& state) {
   auto tp = std::make_unique<ThreadPool>(&onnxruntime::Env::Default(),
                                          onnxruntime::ThreadOptions(),
                                          nullptr,
-                                         NUM_THREADS, ALLOW_SPINNING);
+                                         NUM_THREADS, SPIN_DURATION_US);
   for (auto _ : state) {
     ThreadPool::TryParallelFor(tp.get(), len, cost, SimpleForLoop);
   }
@@ -98,7 +98,7 @@ static void BM_ThreadPoolSimpleParallelFor(benchmark::State& state) {
   auto tp = std::make_unique<ThreadPool>(&onnxruntime::Env::Default(),
                                          onnxruntime::ThreadOptions(),
                                          nullptr,
-                                         num_threads, ALLOW_SPINNING);
+                                         num_threads, SPIN_DURATION_US);
   for (auto _ : state) {
     for (int j = 0; j < 100; j++) {
       ThreadPool::TrySimpleParallelFor(tp.get(), len, [&](size_t) {
diff --git a/onnxruntime/test/perftest/command_args_parser.cc b/onnxruntime/test/perftest/command_args_parser.cc
diff --git a/onnxruntime/test/perftest/ort_test_session.cc b/onnxruntime/test/perftest/ort_test_session.cc
diff --git a/onnxruntime/test/perftest/test_configuration.h b/onnxruntime/test/perftest/test_configuration.h
diff --git a/onnxruntime/test/platform/threadpool_test.cc b/onnxruntime/test/platform/threadpool_test.cc
diff --git a/onnxruntime/test/providers/memcpy_test.cc b/onnxruntime/test/providers/memcpy_test.cc