
Implement CUDA EP Plugin profiling API#28216

Open
yuslepukhin wants to merge 2 commits into main from yuslepukhin/cuda_ep_plugin_profiling

Conversation

@yuslepukhin
Member

This pull request adds CUPTI-based GPU profiling support to the CUDA plugin execution provider (EP) in ONNX Runtime. Profiling is available in the plugin EP when it is built with the onnxruntime_ENABLE_CUDA_PROFILING CMake flag, enabling detailed GPU activity tracing and integration with ORT's profiling system. The implementation introduces a new CudaPluginEpProfiler that bridges ORT's profiling API and CUPTI, and updates the build system, plugin interface, and documentation accordingly.

CUDA Plugin Profiling Integration:

  • Added a new CudaPluginEpProfiler class (cuda_profiler_plugin.h/.cc) that implements the OrtEpProfilerImpl interface, delegates to a CUPTIManager singleton for GPU activity tracing, and provides callbacks for profiling lifecycle and event correlation.
  • Updated the plugin EP interface in cuda_ep.h/cuda_ep.cc to conditionally provide a CreateProfilerImpl callback when profiling is enabled, wiring up the new profiler implementation.
  • Modified the CMake build (onnxruntime_providers_cuda_plugin.cmake) to conditionally link against CUDA::cupti and define the necessary compile-time flags for profiling support.

Documentation Updates:

  • Expanded the design documentation (cuda_plugin_ep_design.md) to describe the profiling and observability architecture, CUPTI integration, correlation ID flow, event collection, and differences from the in-tree CUDA EP profiler. Build configuration and relevant source files are also documented.

Miscellaneous:

  • Included the new profiler header in the plugin EP implementation.
  • Minor test and import adjustments (e.g., test_cuda_plugin_ep.py).

These changes enable the CUDA plugin EP to participate fully in ORT's profiling system, allowing users to observe GPU kernel and memory activity in conjunction with CPU-side events when profiling is enabled.
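Once profiling is enabled, ORT writes its profile as a Chrome-trace-style JSON array of event objects. The sketch below shows how a consumer might pick out the GPU "Kernel" events this EP contributes; the event names, values, and file handling here are synthetic and illustrative, not taken from this PR (only the field names follow the Chrome tracing format ORT emits):

```python
import json
import tempfile

# Illustrative Chrome-trace-style events mimicking the shape of an ORT
# profile file; the specific names/values are made up for this sketch.
events = [
    {"cat": "Session", "name": "model_run", "ph": "X", "ts": 0, "dur": 500},
    {"cat": "Kernel", "name": "MatMul_kernel", "ph": "X", "ts": 120, "dur": 80,
     "args": {"correlation_id": 1}},
    {"cat": "Kernel", "name": "Add_kernel", "ph": "X", "ts": 210, "dur": 15,
     "args": {"correlation_id": 2}},
]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(events, f)
    profile_file = f.name

# Same filtering pattern as the test added in this PR: keep only the
# GPU "Kernel" category events from the loaded trace.
with open(profile_file) as f:
    profile_data = json.load(f)

kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
print(len(kernel_events))  # → 2
```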


github-actions bot left a comment


You can commit the suggested changes from lintrunner.

```python
self.assertTrue(profile_file, "No profile file returned")
self.assertTrue(os.path.exists(profile_file), f"Profile file not found: {profile_file}")

with open(profile_file, "r") as f:
```

Suggested change

```diff
-with open(profile_file, "r") as f:
+with open(profile_file) as f:
```

Comment on lines +2443 to +2446
```python
kernel_events = [
    e for e in profile_data
    if isinstance(e, dict) and e.get("cat") == "Kernel"
]
```

Suggested change

```diff
-kernel_events = [
-    e for e in profile_data
-    if isinstance(e, dict) and e.get("cat") == "Kernel"
-]
+kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
```

Comment on lines +2462 to +2465
```python
print(
    "Note: No GPU Kernel events found in profile. "
    "CUDA profiling may not be enabled in this build."
)
```

Suggested change

```diff
-print(
-    "Note: No GPU Kernel events found in profile. "
-    "CUDA profiling may not be enabled in this build."
-)
+print("Note: No GPU Kernel events found in profile. CUDA profiling may not be enabled in this build.")
```

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Fixed

Copilot AI left a comment


Pull request overview

This PR adds CUPTI-backed GPU profiling support to the CUDA plugin Execution Provider so GPU kernel/memcpy activity can be emitted into ONNX Runtime’s profiling JSON when onnxruntime_ENABLE_CUDA_PROFILING is enabled.

Changes:

  • Introduces a plugin-side CudaPluginEpProfiler implementing OrtEpProfilerImpl, using CUPTIManager to collect GPU activity and report it via OrtProfilingEventsContainer.
  • Wires CudaEp::CreateProfiler in the CUDA plugin EP behind ENABLE_CUDA_PROFILING.
  • Updates the CUDA plugin CMake to link CUDA::cupti and adds a Python test + design doc updates for profiling.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Adds a session profiling test that validates basic trace JSON structure and (when enabled) checks for GPU “Kernel” events/metadata.
onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h Declares CudaPluginEpProfiler (plugin-side OrtEpProfilerImpl).
onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc Implements profiling lifecycle + CUPTI correlation + event conversion to Ort::ProfilingEvent.
onnxruntime/core/providers/cuda/plugin/cuda_ep.h Adds CreateProfilerImpl declaration behind ENABLE_CUDA_PROFILING.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Wires CreateProfiler callback and implements CreateProfilerImpl.
docs/cuda_plugin_ep/cuda_plugin_ep_design.md Documents profiling/observability architecture and build configuration.
cmake/onnxruntime_providers_cuda_plugin.cmake Conditionally links CUPTI and defines compile-time flags for profiling build.



```cpp
// Flatten all GPU events and convert to OrtProfilingEvent.
std::vector<Ort::ProfilingEvent> events;
for (auto& [correlation_id, event_list] : event_map) {
```

Copilot AI Apr 23, 2026


In EndProfilingImpl, correlation_id from the structured binding (for (auto& [correlation_id, event_list] : event_map)) is never used. This can trigger unused-variable warnings (and potentially -Werror build breaks). Consider iterating as for (auto& kv : event_map) (using kv.second) or otherwise marking the binding element unused.

Suggested change

```diff
-for (auto& [correlation_id, event_list] : event_map) {
+for (auto& kv : event_map) {
+  auto& event_list = kv.second;
```

Comment on lines +6 to +10
```cpp
#if defined(ENABLE_CUDA_PROFILING)

#include "cuda_plugin_utils.h"
#include "cupti_manager.h"
#include "core/common/gpu_profiler_common.h"
```

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda_profiler_plugin.h is guarded only by ENABLE_CUDA_PROFILING, but it includes cupti_manager.h which itself requires both USE_CUDA and ENABLE_CUDA_PROFILING to expose profiling::CUPTIManager. To avoid configuration-dependent build failures, consider guarding this header with the same condition (defined(USE_CUDA) && defined(ENABLE_CUDA_PROFILING)) or otherwise ensuring USE_CUDA is always defined when this header is compiled.

Comment on lines +671 to +672
```cpp
auto* ep = static_cast<CudaEp*>(this_ptr);
*profiler = new CudaPluginEpProfiler(ep->factory_.GetEpApi());
```

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateProfilerImpl writes to *profiler only after allocation. For safety (and to avoid leaving callers with an indeterminate output value if new throws and the exception is converted to an OrtStatus), consider setting *profiler = nullptr immediately after validating profiler != nullptr, and/or using a std::unique_ptr locally before releasing.

Suggested change

```diff
-auto* ep = static_cast<CudaEp*>(this_ptr);
-*profiler = new CudaPluginEpProfiler(ep->factory_.GetEpApi());
+*profiler = nullptr;
+auto* ep = static_cast<CudaEp*>(this_ptr);
+auto profiler_impl = std::make_unique<CudaPluginEpProfiler>(ep->factory_.GetEpApi());
+*profiler = profiler_impl.release();
```

Comment on lines +878 to +884
```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only; the `PluginEpProfiler` bridge on the ORT side handles merging EP events into the global event timeline.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge merges |
```

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc states that the ORT-side PluginEpProfiler bridge “handles merging EP events into the global event timeline”. In current implementation, PluginEpProfiler::EndProfiling simply appends EP events to the events vector without any merge/sort by timestamp or correlation ID (see core/session/plugin_ep/ep_event_profiling.cc). Please update the wording to match the actual behavior (append-only; any ordering is handled by trace consumers, not by the bridge).

Suggested change

Before:

```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only; the `PluginEpProfiler` bridge on the ORT side handles merging EP events into the global event timeline.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge merges |
```

After:

```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only, and the `PluginEpProfiler` bridge on the ORT side likewise appends EP events to ORT's profiling event collection without merge/sort by timestamp or correlation ID. Any ordering or interleaving into a global timeline is handled by downstream trace consumers.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge appends only, and trace consumers handle ordering |
```

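The corrected wording above leaves ordering to trace consumers. A consumer that wants a single timeline from an append-only profile can simply sort the loaded events by timestamp; the events below are synthetic and only illustrate the shape of the data, not actual PR output:

```python
# Synthetic CPU-side and appended GPU-side events (Chrome-trace style).
# Because the bridge is append-only, the file is not globally time-ordered:
# all CPU events come first, then all GPU events.
cpu_events = [
    {"cat": "Node", "name": "MatMul", "ts": 100, "dur": 50},
    {"cat": "Node", "name": "Add", "ts": 200, "dur": 10},
]
gpu_events = [
    {"cat": "Kernel", "name": "MatMul_kernel", "ts": 110, "dur": 40},
    {"cat": "Kernel", "name": "Add_kernel", "ts": 205, "dur": 5},
]
profile_data = cpu_events + gpu_events

# A trace consumer restores a global timeline by sorting on timestamp.
timeline = sorted(profile_data, key=lambda e: e["ts"])
print([e["name"] for e in timeline])
# → ['MatMul', 'MatMul_kernel', 'Add', 'Add_kernel']
```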