Releases: anakryiko/wprof
wprof v0.3
wprof v0.3 Release Notes
Highlights
- User-defined tracing (utrace) - a comprehensive new subsystem for
custom event capture with a flexible DSL supporting uprobes, kprobes,
USDT, tracepoints, raw tracepoints, and BPF program probes.
- Python function tracing - deterministic Python and PyTorch function
call tracing via injection, with stitched Python + native stack traces.
- JSON output mode - full JSON trace output with documented schema.
- PMU counter collection - hardware and software performance counter
capture per scheduling event.
- Request listing and filtering - post-capture request analysis with
sorting, filtering, and top/bottom-N selection.
New Features
User-Defined Tracing (-U)
A new DSL-based subsystem for capturing custom events alongside
built-in tracing. Define probes with -U '<definition>' or from a
file with -U @filepath.
Supported probe types:
- uprobes (u:, uret:, uspan:) - userspace function entry/exit
- kprobes (k:, kret:, kspan:) - kernel function entry/exit
- USDT (usdt:provider:name) - User Statically-Defined Tracepoints
- classic tracepoints (tp:category:name) - perf tracepoint events
- raw tracepoints (raw_tp:name) - BTF-based kernel tracepoints
- BPF probes (bpf:, bpfret:, bpfspan:) - tracing loaded BPF programs
- generic spans (probe1 ~~ probe2) - arbitrary entry/exit pairs
Features:
- Argument capture by index (arg:0) or name (arg:prev_comm) with
automatic type inference from BTF, tracefs format files, or USDT ELF notes
- Wildcard capture (arg:*) for all available arguments
- Optional explicit type annotation (arg:0:str, arg:1:u32->my_name)
- Stack trace capture (stack parameter)
- Name format templates with argument substitution
(| name:'syscall #{id}' |)
- Custom probe IDs for track grouping (| id:my_probe |)
- Binary/process filtering (path:, pid:)
Events appear as Perfetto slices/instants with argument annotations and
as structured JSON with typed argument values.
See UTRACE.md for full documentation.
Python Stack Traces (-f py-stacks)
BPF-based Python stack trace capture:
- Captures Python call stacks and stitches them with native (C/C++)
stack traces for unified call stacks in timer and off-CPU events.
- -e py-stacks-only to show only Python frames without native stacks.
- Auto-discovers Python processes, or target specific ones with
-f py-stacks=PID or -f py-stacks=nvidia-smi.
Python Function Tracing (-f py-trace, -f py-torch)
Deterministic function-level tracing for Python and PyTorch applications
via library injection:
- Python tracing (-f py-trace): Captures Python function calls and
returns via PyEval_SetProfile, producing exact call trees with
timestamps. Rendered as collapsible tracks under kernel threads in
Perfetto.
- PyTorch tracing (-f py-torch): Captures PyTorch operator execution
via the RecordFunction callback system, covering autograd and C++ threads.
- Supports both statically and dynamically linked Python and libpytorch.
JSON Output Mode (-J)
Full JSON trace output as newline-delimited JSON:
- -J <file> writes JSON trace to file; -J - writes to stdout
- Complete event coverage: scheduling, interrupts, workqueue, task
lifecycle, CUDA, Python, PyTorch, utrace, requests, sched-ext
- Documented schema available via --json-schema flag
- Float-second timestamps with nanosecond precision
- Structured stack traces with symbolized frames and source locations
See JSON_SCHEMA.md for the full data model.
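Because the output is newline-delimited JSON, it can be consumed one record at a time without loading the whole trace. The following is a minimal sketch of such a reader, not part of wprof. It assumes each non-empty line holds one JSON object; the field names `kind` and `ts` in the sample are hypothetical placeholders, so consult JSON_SCHEMA.md for the real ones.

```python
import io
import json
from collections import Counter


def iter_events(stream):
    """Yield one decoded JSON value per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_key(stream, key):
    """Count records by the value stored under `key` (None if absent)."""
    return Counter(event.get(key) for event in iter_events(stream))


# Hypothetical records for illustration only; NOT real wprof output.
sample = io.StringIO(
    '{"kind": "a", "ts": 1.000000001}\n'
    "\n"
    '{"kind": "b", "ts": 1.5}\n'
    '{"kind": "a", "ts": 2.0}\n'
)
counts = count_by_key(sample, "kind")
```

The same `iter_events` helper should work on an open file handle or on `sys.stdin` when the trace is written to stdout with -J -.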
PMU Counter Collection (--pmu)
Hardware and software performance counter capture:
- --pmu r003c - raw PMU event
- --pmu cpu/cpu-cycles/ - named PMU event
- --pmu sw:page-faults - software event
- --pmu L1-icache-loads - cache event
- --pmu derived:ipc=cpu_instructions/cpu_cpu-cycles - derived counters
- PMU values attached to scheduling events (context switches, interrupts)
- Rendered as annotations in Perfetto and as pmus arrays in JSON
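The derived:ipc example above divides an instruction count by a cycle count. As a rough illustration of that arithmetic only, and not of how wprof computes or stores it, the sketch below takes the ratio over the interval between two counter snapshots. The dictionary keys and the assumption that the snapshots are cumulative are hypothetical.

```python
def counter_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def ipc_between(prev, curr):
    """Instructions per cycle over the interval between two snapshots."""
    instructions = curr["instructions"] - prev["instructions"]
    cycles = curr["cycles"] - prev["cycles"]
    return counter_ratio(instructions, cycles)


# Hypothetical cumulative counter snapshots, for illustration only.
prev_sample = {"instructions": 1_000, "cycles": 2_000}
curr_sample = {"instructions": 4_000, "cycles": 4_000}
ipc = ipc_between(prev_sample, curr_sample)  # 3000 instructions / 2000 cycles
```

Returning None for a zero-cycle interval avoids a division error when two snapshots are identical.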
Request Listing (--req-list)
Post-capture analysis of completed requests:
- --req-list - list all completed requests
- --req-sort latency / --req-sort-asc / --req-sort-desc - sort by field
- --req-filter 'latency>1ms' - filter by field expressions
- --req-top-n 10 / --req-bottom-n 10 - limit output
- -S req - capture stack traces at request lifecycle events
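The combination of filtering, sorting, and top-N selection above can be pictured with a small sketch. It does not reflect wprof's implementation or output format: the record layout and the `latency_ns` field are hypothetical, and the code only mirrors the idea of keeping requests slower than 1 ms and returning the N slowest.

```python
import heapq


def top_n_slowest(requests, min_latency_ns, n):
    """Return up to n requests slower than min_latency_ns, slowest first."""
    slow = (r for r in requests if r["latency_ns"] > min_latency_ns)
    return heapq.nlargest(n, slow, key=lambda r: r["latency_ns"])


# Hypothetical request records, for illustration only.
requests = [
    {"name": "req-a", "latency_ns": 500_000},
    {"name": "req-b", "latency_ns": 3_000_000},
    {"name": "req-c", "latency_ns": 1_200_000},
    {"name": "req-d", "latency_ns": 9_000_000},
]
slowest = top_n_slowest(requests, min_latency_ns=1_000_000, n=2)
slowest_names = [r["name"] for r in slowest]
```

`heapq.nlargest` avoids sorting the full list when only a few entries are needed.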
IRQ Collection Control (-f softirq/hardirq/irq)
Fine-grained control over interrupt event capture:
- -f softirq / -f hardirq - enable specific IRQ types
- -f irq - enable both softirq and hardirq
- -f no-softirq / -f no-hardirq - disable specific IRQ types
Custom Metadata (-M)
Attach arbitrary key=value metadata to recordings:
- -M key=value - repeatable, stored in data file
- Appears in JSON header metadata object and replay info
- Session timestamp automatically recorded in UTC
Sandboxing Support
New options for running wprof in partially untrusted environments:
- --record - explicitly enforce recording mode (mutually exclusive
with --replay)
- --seal-output (hidden) - prevent subsequent -D, -T, -J options,
allowing a trusted runner to lock down output paths before passing
control to untrusted arguments
Perfetto Trace Improvements
- Restructured thread tracks: Timer, CUDA API, request, and utrace
events each get their own collapsible child track under the thread's
scheduler track, reducing visual clutter.
- CUDA GPU hierarchy: GPU tracks sorted numerically with GPU #N at top.
- Python/PyTorch tracks: Nested as collapsible rows under kernel threads.
- Request visualization: Per-thread request activity tracks with
-e req-split (default) and -e req-embed options.
Bug Fixes
- Fix per-CPU stack trace scratch buffer clobbering that could corrupt
stack traces under heavy load.
- Warn when not running as root in capture mode.
- Error out on bare -R without an output mode (-T, -J, -I, or
--req-list).
- Numerous other small bug fixes across pytrace, CUDA, request tracking,
sched-ext, ELF symbol resolution, and Perfetto rendering.
Performance
- Significantly sped up BPF ringbuf setup at startup.
- Improved ringbuf usage logic to reduce occasional data drops.
- Bumped default ringbuf size to 16MB for busier hosts.
Internal / Data Format
- Complete redesign of wprof.data persistence format for better
performance and reduced disk space.
- Revamped stack symbolization pipeline.
- Improved --replay-info output with detailed per-event-type
breakdowns, stack trace statistics, and PMU data sizes.
Full Changelog: v0.2.1...v0.3
wprof v0.2.1
Bug fix release that fixes a kernel stack trace symbolization issue on arm64 by updating blazesym to the latest release.
Full Changelog: v0.2...v0.2.1
wprof v0.2
Release notes
- scheduler-centric per-CPU view (-e sched) is now supported;
- GPU tracing support (-f cuda) using ptrace-based code injection into target processes;
- reworked stack trace support in Perfetto traces: stack traces are now attached to events and slices directly (relies on recently added Perfetto support);
- more sched-ext specific metrics are now collected (-f scx-layer).
Full Changelog: v0.1...v0.2
wprof v0.1
First official release, in preparation for packaging.
Full Changelog: https://github.com/anakryiko/wprof/commits/v0.1