```python
self.assertTrue(profile_file, "No profile file returned")
self.assertTrue(os.path.exists(profile_file), f"Profile file not found: {profile_file}")

with open(profile_file, "r") as f:
```

`"r"` is the default mode for `open` and can be omitted.

Suggested change:

```python
with open(profile_file) as f:
```
```python
kernel_events = [
    e for e in profile_data
    if isinstance(e, dict) and e.get("cat") == "Kernel"
]
```

Suggested change:

```python
kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
```
```python
print(
    "Note: No GPU Kernel events found in profile. "
    "CUDA profiling may not be enabled in this build."
)
```

Suggested change:

```python
print("Note: No GPU Kernel events found in profile. CUDA profiling may not be enabled in this build.")
```
Pull request overview
This PR adds CUPTI-backed GPU profiling support to the CUDA plugin Execution Provider so GPU kernel/memcpy activity can be emitted into ONNX Runtime’s profiling JSON when onnxruntime_ENABLE_CUDA_PROFILING is enabled.
Changes:
- Introduces a plugin-side `CudaPluginEpProfiler` implementing `OrtEpProfilerImpl`, using `CUPTIManager` to collect GPU activity and report it via `OrtProfilingEventsContainer`.
- Wires `CudaEp::CreateProfiler` in the CUDA plugin EP behind `ENABLE_CUDA_PROFILING`.
- Updates the CUDA plugin CMake to link `CUDA::cupti` and adds a Python test + design doc updates for profiling.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/python/transformers/test_cuda_plugin_ep.py | Adds a session profiling test that validates basic trace JSON structure and (when enabled) checks for GPU “Kernel” events/metadata. |
| onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h | Declares CudaPluginEpProfiler (plugin-side OrtEpProfilerImpl). |
| onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc | Implements profiling lifecycle + CUPTI correlation + event conversion to Ort::ProfilingEvent. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.h | Adds CreateProfilerImpl declaration behind ENABLE_CUDA_PROFILING. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | Wires CreateProfiler callback and implements CreateProfilerImpl. |
| docs/cuda_plugin_ep/cuda_plugin_ep_design.md | Documents profiling/observability architecture and build configuration. |
| cmake/onnxruntime_providers_cuda_plugin.cmake | Conditionally links CUPTI and defines compile-time flags for profiling build. |
```cpp
// Flatten all GPU events and convert to OrtProfilingEvent.
std::vector<Ort::ProfilingEvent> events;
for (auto& [correlation_id, event_list] : event_map) {
```

In `EndProfilingImpl`, `correlation_id` from the structured binding (`for (auto& [correlation_id, event_list] : event_map)`) is never used. This can trigger unused-variable warnings (and potentially `-Werror` build breaks). Consider iterating as `for (auto& kv : event_map)` (using `kv.second`) or otherwise marking the binding element unused.

Suggested change:

```cpp
for (auto& kv : event_map) {
  auto& event_list = kv.second;
```
```cpp
#if defined(ENABLE_CUDA_PROFILING)

#include "cuda_plugin_utils.h"
#include "cupti_manager.h"
#include "core/common/gpu_profiler_common.h"
```

`cuda_profiler_plugin.h` is guarded only by `ENABLE_CUDA_PROFILING`, but it includes `cupti_manager.h`, which itself requires both `USE_CUDA` and `ENABLE_CUDA_PROFILING` to expose `profiling::CUPTIManager`. To avoid configuration-dependent build failures, consider guarding this header with the same condition (`defined(USE_CUDA) && defined(ENABLE_CUDA_PROFILING)`) or otherwise ensuring `USE_CUDA` is always defined when this header is compiled.
```cpp
auto* ep = static_cast<CudaEp*>(this_ptr);
*profiler = new CudaPluginEpProfiler(ep->factory_.GetEpApi());
```

`CreateProfilerImpl` writes to `*profiler` only after allocation. For safety (and to avoid leaving callers with an indeterminate output value if `new` throws and the exception is converted to an `OrtStatus`), consider setting `*profiler = nullptr` immediately after validating `profiler != nullptr`, and/or using a `std::unique_ptr` locally before releasing.

Suggested change:

```cpp
*profiler = nullptr;
auto* ep = static_cast<CudaEp*>(this_ptr);
auto profiler_impl = std::make_unique<CudaPluginEpProfiler>(ep->factory_.GetEpApi());
*profiler = profiler_impl.release();
```
```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only; the `PluginEpProfiler` bridge on the ORT side handles merging EP events into the global event timeline.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge merges |
```

The doc states that the ORT-side `PluginEpProfiler` bridge "handles merging EP events into the global event timeline". In the current implementation, `PluginEpProfiler::EndProfiling` simply appends EP events to the events vector without any merge/sort by timestamp or correlation ID (see `core/session/plugin_ep/ep_event_profiling.cc`). Please update the wording to match the actual behavior (append-only; any ordering is handled by trace consumers, not by the bridge).

Suggested change:

```markdown
The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only, and the `PluginEpProfiler` bridge on the ORT side likewise appends EP events to ORT's profiling event collection without merge/sort by timestamp or correlation ID. Any ordering or interleaving into a global timeline is handled by downstream trace consumers.

### 14.5 Design Differences from In-Tree CUDA EP Profiler

| Aspect | In-tree CUDA EP | CUDA Plugin EP |
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge appends only, and trace consumers handle ordering |
```
This pull request adds support for CUPTI-based GPU profiling to the CUDA plugin execution provider (EP) in ONNX Runtime. Profiling is now available in the plugin EP when built with the `onnxruntime_ENABLE_CUDA_PROFILING` CMake flag, enabling detailed GPU activity tracing and integration with ORT's profiling system. The implementation introduces a new `CudaPluginEpProfiler` that bridges between ORT's profiling API and CUPTI, and updates the build system, plugin interface, and documentation accordingly.

CUDA Plugin Profiling Integration:

- New `CudaPluginEpProfiler` class (`cuda_profiler_plugin.h`/`.cc`) that implements the `OrtEpProfilerImpl` interface, delegates to a `CUPTIManager` singleton for GPU activity tracing, and provides callbacks for the profiling lifecycle and event correlation. [1] [2]
- Updates `cuda_ep.h`/`cuda_ep.cc` to conditionally provide a `CreateProfilerImpl` callback when profiling is enabled, wiring up the new profiler implementation. [1] [2] [3]
- Updates the build system (`onnxruntime_providers_cuda_plugin.cmake`) to conditionally link against `CUDA::cupti` and define the necessary compile-time flags for profiling support.

Documentation Updates:

- Updates the design doc (`cuda_plugin_ep_design.md`) to describe the profiling and observability architecture, CUPTI integration, correlation ID flow, event collection, and differences from the in-tree CUDA EP profiler. Build configuration and relevant source files are also documented.

Miscellaneous:

- Adds a session profiling test (`test_cuda_plugin_ep.py`).

These changes enable the CUDA plugin EP to participate fully in ORT's profiling system, allowing users to observe GPU kernel and memory activity in conjunction with CPU-side events when profiling is enabled.