
Implement CUDA EP Plugin profiling API#28216

Open
yuslepukhin wants to merge 2 commits into main from yuslepukhin/cuda_ep_plugin_profiling

Conversation

@yuslepukhin
Member

This pull request adds CUPTI-based GPU profiling support to the CUDA plugin execution provider (EP) in ONNX Runtime. Profiling is available in the plugin EP when it is built with the onnxruntime_ENABLE_CUDA_PROFILING CMake flag, enabling detailed GPU activity tracing and integration with ORT's profiling system. The implementation introduces a new CudaPluginEpProfiler that bridges ORT's profiling API and CUPTI, and updates the build system, plugin interface, and documentation accordingly.

CUDA Plugin Profiling Integration:

  • Added a new CudaPluginEpProfiler class (cuda_profiler_plugin.h/.cc) that implements the OrtEpProfilerImpl interface, delegates to a CUPTIManager singleton for GPU activity tracing, and provides callbacks for profiling lifecycle and event correlation.
  • Updated the plugin EP interface in cuda_ep.h/cuda_ep.cc to conditionally provide a CreateProfilerImpl callback when profiling is enabled, wiring up the new profiler implementation.
  • Modified the CMake build (onnxruntime_providers_cuda_plugin.cmake) to conditionally link against CUDA::cupti and define the necessary compile-time flags for profiling support.

Documentation Updates:

  • Expanded the design documentation (cuda_plugin_ep_design.md) to describe the profiling and observability architecture, CUPTI integration, correlation ID flow, event collection, and differences from the in-tree CUDA EP profiler. Build configuration and relevant source files are also documented.

Miscellaneous:

  • Included the new profiler header in the plugin EP implementation.
  • Minor test and import adjustments (e.g., test_cuda_plugin_ep.py).

These changes enable the CUDA plugin EP to participate fully in ORT's profiling system, allowing users to observe GPU kernel and memory activity in conjunction with CPU-side events when profiling is enabled.
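Once profiling is enabled, ORT writes its profile as a Chrome-trace-style JSON array of event objects. The sketch below shows how a consumer might pick out the GPU "Kernel" events this EP contributes; the event names, values, and file handling here are synthetic and illustrative, not taken from this PR (only the field names follow the Chrome tracing format ORT emits):

```python
import json
import tempfile

# Illustrative Chrome-trace-style events mimicking the shape of an ORT
# profile file; the specific names/values are made up for this sketch.
events = [
    {"cat": "Session", "name": "model_run", "ph": "X", "ts": 0, "dur": 500},
    {"cat": "Kernel", "name": "MatMul_kernel", "ph": "X", "ts": 120, "dur": 80,
     "args": {"correlation_id": 1}},
    {"cat": "Kernel", "name": "Add_kernel", "ph": "X", "ts": 210, "dur": 15,
     "args": {"correlation_id": 2}},
]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(events, f)
    profile_file = f.name

# Same filtering pattern as the test added in this PR: keep only the
# GPU "Kernel" category events from the loaded trace.
with open(profile_file) as f:
    profile_data = json.load(f)

kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
print(len(kernel_events))  # → 2
```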


github-actions bot left a comment


You can commit the suggested changes from lintrunner.

```python
self.assertTrue(profile_file, "No profile file returned")
self.assertTrue(os.path.exists(profile_file), f"Profile file not found: {profile_file}")

with open(profile_file, "r") as f:
```

Suggested change

```diff
-with open(profile_file, "r") as f:
+with open(profile_file) as f:
```

Comment on lines +2443 to +2446
```python
kernel_events = [
    e for e in profile_data
    if isinstance(e, dict) and e.get("cat") == "Kernel"
]
```

Suggested change

```diff
-kernel_events = [
-    e for e in profile_data
-    if isinstance(e, dict) and e.get("cat") == "Kernel"
-]
+kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
```

Comment on lines +2462 to +2465
```python
print(
    "Note: No GPU Kernel events found in profile. "
    "CUDA profiling may not be enabled in this build."
)
```

Suggested change

```diff
-print(
-    "Note: No GPU Kernel events found in profile. "
-    "CUDA profiling may not be enabled in this build."
-)
+print("Note: No GPU Kernel events found in profile. CUDA profiling may not be enabled in this build.")
```

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Fixed

Copilot AI left a comment


Pull request overview

This PR adds CUPTI-backed GPU profiling support to the CUDA plugin Execution Provider so GPU kernel/memcpy activity can be emitted into ONNX Runtime’s profiling JSON when onnxruntime_ENABLE_CUDA_PROFILING is enabled.

Changes:

  • Introduces a plugin-side CudaPluginEpProfiler implementing OrtEpProfilerImpl, using CUPTIManager to collect GPU activity and report it via OrtProfilingEventsContainer.
  • Wires CudaEp::CreateProfiler in the CUDA plugin EP behind ENABLE_CUDA_PROFILING.
  • Updates the CUDA plugin CMake to link CUDA::cupti and adds a Python test + design doc updates for profiling.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Adds a session profiling test that validates basic trace JSON structure and (when enabled) checks for GPU “Kernel” events/metadata.
onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h Declares CudaPluginEpProfiler (plugin-side OrtEpProfilerImpl).
onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc Implements profiling lifecycle + CUPTI correlation + event conversion to Ort::ProfilingEvent.
onnxruntime/core/providers/cuda/plugin/cuda_ep.h Adds CreateProfilerImpl declaration behind ENABLE_CUDA_PROFILING.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Wires CreateProfiler callback and implements CreateProfilerImpl.
docs/cuda_plugin_ep/cuda_plugin_ep_design.md Documents profiling/observability architecture and build configuration.
cmake/onnxruntime_providers_cuda_plugin.cmake Conditionally links CUPTI and defines compile-time flags for profiling build.



```cpp
// Flatten all GPU events and convert to OrtProfilingEvent.
std::vector<Ort::ProfilingEvent> events;
for (auto& [correlation_id, event_list] : event_map) {
```

Copilot AI Apr 23, 2026


In EndProfilingImpl, correlation_id from the structured binding (for (auto& [correlation_id, event_list] : event_map)) is never used. This can trigger unused-variable warnings (and potentially -Werror build breaks). Consider iterating as for (auto& kv : event_map) (using kv.second) or otherwise marking the binding element unused.

Suggested change

```diff
-for (auto& [correlation_id, event_list] : event_map) {
+for (auto& kv : event_map) {
+  auto& event_list = kv.second;
```

Comment on lines +6 to +10
```cpp
#if defined(ENABLE_CUDA_PROFILING)

#include "cuda_plugin_utils.h"
#include "cupti_manager.h"
#include "core/common/gpu_profiler_common.h"
```

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda_profiler_plugin.h is guarded only by ENABLE_CUDA_PROFILING, but it includes cupti_manager.h which itself requires both USE_CUDA and ENABLE_CUDA_PROFILING to expose profiling::CUPTIManager. To avoid configuration-dependent build failures, consider guarding this header with the same condition (defined(USE_CUDA) && defined(ENABLE_CUDA_PROFILING)) or otherwise ensuring USE_CUDA is always defined when this header is compiled.

Comment on lines +671 to +672
```cpp
auto* ep = static_cast<CudaEp*>(this_ptr);
*profiler = new CudaPluginEpProfiler(ep->factory_.GetEpApi());
```

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateProfilerImpl writes to *profiler only after allocation. For safety (and to avoid leaving callers with an indeterminate output value if new throws and the exception is converted to an OrtStatus), consider setting *profiler = nullptr immediately after validating profiler != nullptr, and/or using a std::unique_ptr locally before releasing.

Suggested change

```diff
-auto* ep = static_cast<CudaEp*>(this_ptr);
-*profiler = new CudaPluginEpProfiler(ep->factory_.GetEpApi());
+*profiler = nullptr;
+auto* ep = static_cast<CudaEp*>(this_ptr);
+auto profiler_impl = std::make_unique<CudaPluginEpProfiler>(ep->factory_.GetEpApi());
+*profiler = profiler_impl.release();
```

Comment on lines +878 to +884
```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only; the `PluginEpProfiler` bridge on the ORT side handles merging EP events into the global event timeline.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge merges |
```

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc states that the ORT-side PluginEpProfiler bridge “handles merging EP events into the global event timeline”. In current implementation, PluginEpProfiler::EndProfiling simply appends EP events to the events vector without any merge/sort by timestamp or correlation ID (see core/session/plugin_ep/ep_event_profiling.cc). Please update the wording to match the actual behavior (append-only; any ordering is handled by trace consumers, not by the bridge).

Suggested change

Before:

```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only; the `PluginEpProfiler` bridge on the ORT side handles merging EP events into the global event timeline.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge merges |
```

After:

```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only, and the `PluginEpProfiler` bridge on the ORT side likewise appends EP events to ORT's profiling event collection without merge/sort by timestamp or correlation ID. Any ordering or interleaving into a global timeline is handled by downstream trace consumers.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge appends only, and trace consumers handle ordering |
```

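The corrected wording above leaves ordering to trace consumers. A consumer that wants a single timeline from an append-only profile can simply sort the loaded events by timestamp; the events below are synthetic and only illustrate the shape of the data, not actual PR output:

```python
# Synthetic CPU-side and appended GPU-side events (Chrome-trace style).
# Because the bridge is append-only, the file is not globally time-ordered:
# all CPU events come first, then all GPU events.
cpu_events = [
    {"cat": "Node", "name": "MatMul", "ts": 100, "dur": 50},
    {"cat": "Node", "name": "Add", "ts": 200, "dur": 10},
]
gpu_events = [
    {"cat": "Kernel", "name": "MatMul_kernel", "ts": 110, "dur": 40},
    {"cat": "Kernel", "name": "Add_kernel", "ts": 205, "dur": 5},
]
profile_data = cpu_events + gpu_events

# A trace consumer restores a global timeline by sorting on timestamp.
timeline = sorted(profile_data, key=lambda e: e["ts"])
print([e["name"] for e in timeline])
# → ['MatMul', 'MatMul_kernel', 'Add', 'Add_kernel']
```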