Profiling and tracing PyTorch code for CUDA kernels

Hello All,

I am working on performance analysis of deep learning models. While trying out tools to profile and trace a PyTorch program on CUDA, I ran into the issue that PyTorch launches many CUDA kernels of its own for computation on the GPU, and I have not been able to trace where each kernel originates. For example, the kernel ampere_sgemm_32x32_sliced1x4_tn is launched, but I cannot identify the exact function that results in this kernel (I know it comes from cuDNN or cuBLAS).

I have tried HolisticTraceAnalysis as well, which produced a JSON file whose final trace shows only a memory address (e.g. <built-in method foreach_mul of type object at 0x00007FFECA657250>). Here is my script:

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch.profiler import profile, record_function, ProfilerActivity
from torch.utils.tensorboard import SummaryWriter 
from tqdm import tqdm
from hta.trace_analysis import TraceAnalysis
import torch.cuda.nvtx as nvtx 
writer=SummaryWriter()
# Setting GPU or CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Transformer
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Define the model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 100)
        self.fc3 = nn.Linear(100, 10)
        

    def forward(self, x):
        x = x.view(-1, 28 * 28)  # Flatten input images
        nvtx.range_push("relu process")
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        nvtx.range_pop()
        return x
model = Net().to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.7)

# Training loop with profiling
def train(model, device, train_loader, optimizer, criterion, epoch, prof=None):
    model.train()
    total_batches = len(train_loader)
    
    with tqdm(train_loader, unit="batch") as pbar:
        pbar.set_description(f"Epoch {epoch}")
        
        for batch_idx, (data, target) in enumerate(pbar):
            data, target = data.to(device), target.to(device)
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            
            loss.backward()
            optimizer.step()
            

            
            pbar.set_postfix(loss=loss.item())
            if prof:  # Only step if profiler exists
                prof.step()

# Test loop
def test(model, device, test_loader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)  # Get predictions
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader)  # criterion returns a per-batch mean, so average over batches
    accuracy = 100. * correct / len(test_loader.dataset)
    print(f'\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({accuracy:.2f}%)\n')

# Main execution block with unified profiling
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(
        wait=0,       # Start immediately
        warmup=1,     # 1 warmup step
        active=100,   # Record 100 steps; one epoch at batch_size=64 is 938 batches, so raise this to capture all of them
        repeat=1      # Don't repeat cycle
    ),
    record_shapes=True,
    with_stack=True
) as prof:
    for epoch in range(1, 2):  # 1 epoch; use range(1, 5) for 4 epochs
        train(model, device, train_loader, optimizer, criterion, epoch, prof=prof)
        test(model, device, test_loader, criterion)

# Export unified trace file after all epochs
prof.export_chrome_trace("full_training_trace_10.json")

print("The number of GPUs being used is", torch.cuda.device_count())
if torch.cuda.is_available():
    print("The device properties are", torch.cuda.get_device_properties(device=None))

print('Training complete')

# Example of using HTA for analysis
analyzer = TraceAnalysis("full_training_trace_10.json")
print(f"Total trace duration: {analyzer.get_trace_duration()} seconds")
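
One possible way to tie each kernel in the exported Chrome trace back to its Python-side origin is to follow the correlation id from the kernel to the CUDA launch call, then list the CPU spans that enclose that launch on the same thread. The sketch below assumes the layout that torch.profiler appears to emit: a top-level traceEvents list, categories named kernel, cuda_runtime or cuda_driver, cpu_op, python_function and user_annotation, and an args.correlation field. All of these names are assumptions to verify against the actual JSON, and the helper names load_trace_events and kernel_origins are made up for illustration.

```python
import json
from collections import defaultdict

# Category names are assumptions about the exported trace; adjust if yours differ.
GPU_KERNEL = "kernel"
LAUNCH_CATS = ("cuda_runtime", "cuda_driver")
CPU_CATS = ("python_function", "cpu_op", "user_annotation")


def load_trace_events(path):
    """Read the event list from a trace written by prof.export_chrome_trace()."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["traceEvents"]


def kernel_origins(events):
    """Return (kernel name, enclosing CPU spans, outermost first) per GPU kernel."""
    spans = [e for e in events if e.get("ph") == "X" and "ts" in e]

    # Launch calls (e.g. cudaLaunchKernel) indexed by correlation id.
    launches = {}
    for e in spans:
        corr = e.get("args", {}).get("correlation")
        if e.get("cat") in LAUNCH_CATS and corr is not None:
            launches[corr] = e

    # CPU-side spans grouped by the thread that recorded them.
    cpu_by_thread = defaultdict(list)
    for e in spans:
        if e.get("cat") in CPU_CATS:
            cpu_by_thread[(e.get("pid"), e.get("tid"))].append(e)

    origins = []
    for k in spans:
        if k.get("cat") != GPU_KERNEL:
            continue
        launch = launches.get(k.get("args", {}).get("correlation"))
        chain = []
        if launch is not None:
            t = launch["ts"]
            thread = (launch.get("pid"), launch.get("tid"))
            enclosing = [s for s in cpu_by_thread.get(thread, [])
                         if s["ts"] <= t <= s["ts"] + s.get("dur", 0)]
            # Longer spans enclose shorter ones, so sort by duration descending.
            enclosing.sort(key=lambda s: s.get("dur", 0), reverse=True)
            chain = [s.get("name", "?") for s in enclosing]
        origins.append((k.get("name", "?"), chain))
    return origins
```

With the script above this would be called as kernel_origins(load_trace_events("full_training_trace_10.json")). The matching uses the launch timestamp on the CPU rather than the kernel's execution time on the GPU, because kernels run asynchronously and may start after the Python call has already returned.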

I have also tried profiling with Nsight Systems from the command line:

nsys profile -b all   -t cuda,nvtx,cublas -s cpu   --python-sampling-frequency=1000 --python-sampling=true  python .\pytorch_NM.py

This gives me a separate row showing the Python backtrace, but even with a higher sampling rate nothing conclusive comes out of it. It is not useful because I cannot reliably match the Python backtrace samples to the CUDA kernels: many samples and kernels overlap on the timeline (which is expected), but the overlap does not tell me accurately which call launched which kernel.

After that I tried --cudabacktrace=all and --cudabacktrace=auto:

nsys profile -b all --cudabacktrace=all -t cuda,nvtx,cublas -s cpu --python-sampling-frequency=1000 --python-sampling=true python .\pytorch_NM.py
and this gave me an error:

unrecognised option '--cudabacktrace=all'

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.

Details:
Nsight version
NVIDIA Nsight Systems version 2024.2.3.38-242334140272v0
torch 2.5.1
NVIDIA-SMI 556.35 Driver Version: 556.35 CUDA Version: 12.5
(base) PS D:\JGU\Neural_network\Pytorch> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:36:51_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

I have tried using NVTX ranges as well, but they are not satisfactory either, because they only hint at where a kernel might come from by showing the kernel alongside another function in the timeline view.
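
A possible way to make the NVTX attribution explicit rather than visual is to export the report to SQLite (the --export option in the help output lists sqlite as a format) and join kernels to NVTX ranges through the launch call instead of through timeline overlap. The sketch below assumes the exported database contains tables named CUPTI_ACTIVITY_KIND_KERNEL, CUPTI_ACTIVITY_KIND_RUNTIME, NVTX_EVENTS and StringIds with the columns used in the query; this schema is an assumption that should be checked against the actual file, and the function name kernels_per_nvtx_range is hypothetical.

```python
import sqlite3

# Assumed schema of an nsys SQLite export; verify table and column names first.
ATTRIBUTION_QUERY = '''
SELECT n.text              AS nvtx_range,
       s.value             AS kernel,
       k."end" - k.start   AS gpu_ns
FROM   CUPTI_ACTIVITY_KIND_KERNEL  AS k
JOIN   CUPTI_ACTIVITY_KIND_RUNTIME AS r
       ON r.correlationId = k.correlationId      -- kernel -> launch call
JOIN   NVTX_EVENTS AS n
       ON n.globalTid = r.globalTid              -- same CPU thread
      AND n.start <= r.start
      AND r.start <= n."end"                     -- launch happened inside the range
JOIN   StringIds AS s
       ON s.id = k.shortName                     -- resolve the kernel name
ORDER  BY k.start
'''


def kernels_per_nvtx_range(conn):
    """Return (NVTX range text, kernel name, GPU duration in ns) rows."""
    return conn.execute(ATTRIBUTION_QUERY).fetchall()
```

It would be used as kernels_per_nvtx_range(sqlite3.connect("report1.sqlite")), where the file name is a placeholder. Nested ranges yield one row per enclosing range, so a kernel launched inside two nested ranges appears twice.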

In general, is my approach to tracing the kernels correct? Why is --cudabacktrace not working even though I have enabled CPU sampling? Is there a better method for identifying which PyTorch calls the CUDA kernels originate from?
Thanks

Did you check the --help option to see which arguments are expected for the release you are using?
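
A quick way to check this without reading the whole help text is to extract the option names from it and test for the flags in question. The snippet below is only a sketch: the function names list_long_options and installed_nsys_options are hypothetical, and the regular expression assumes each option is introduced on its own line in the form "-b, --backtrace=" or "--python-sampling=".

```python
import re
import subprocess

# An option heading: optional short flag ("-b, ") followed by a long flag.
_OPTION_RE = re.compile(r"^\s*(?:-\w,\s+)?(--[A-Za-z0-9][A-Za-z0-9-]*)", re.MULTILINE)


def list_long_options(help_text):
    """Collect the long option names that head a line of the help text."""
    return set(_OPTION_RE.findall(help_text))


def installed_nsys_options(nsys="nsys"):
    """Run 'nsys profile --help' and return the long options it advertises."""
    result = subprocess.run([nsys, "profile", "--help"],
                            capture_output=True, text=True, check=False)
    return list_long_options(result.stdout + result.stderr)


# Small excerpt in the same layout as the help output, used as a demo input.
SAMPLE_HELP = """usage: nsys profile [<args>] [application] [<application args>]

        -b, --backtrace=

           Possible values are 'auto', or 'none'.
           Applicable only when used along
           with --capture-range option.

        --python-sampling=

           Possible values are 'true' or 'false'.
"""

available = list_long_options(SAMPLE_HELP)
for flag in ("--backtrace", "--python-sampling", "--cudabacktrace"):
    print(flag, "supported" if flag in available else "not listed")
```

On a machine with Nsight Systems on the PATH, installed_nsys_options() can replace the sample text. Options that are only mentioned inside a description are ignored because they do not start a line.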

Thank you for the reply.
I could not find --cudabacktrace in nsys profile --help; it only shows --backtrace and --trace. I have already tried those options, and the resulting backtrace only points at native modules such as ucrtbase.dll, which belongs to Microsoft's Universal C Runtime library. I have also tried --cuda-graph-trace=graph, but that did not help either.

**I am unable to connect the CUDA kernels with the Python functions that call them.**

Also, the user guide for version 2024.2 differs from nsys profile --help: --cudabacktrace is documented there but does not appear in the --help output.

NVIDIA Doc for version 2024.2

How can I trace an executed CUDA kernel back to the Python call that launched it? Thanks!

From the docs of nsys profile --help you will see that --cudabacktrace and --python-backtrace can be used.
Profiling this simple code:

import torch
import torch.nn as nn

device = "cuda"
x = torch.randn(1, 1, device=device)
lin = nn.Linear(1, 1).to(device)

for _ in range(5):
    out = lin(x)

for _ in range(5):
    x = x.sin()

torch.cuda.synchronize()

using nsys profile --cudabacktrace all:500 --python-backtrace python tmp.py shows proper tooltips with stack traces pointing from the Python calls through the backend into the actual math library.

I went through the output of nsys profile --help and there is no --cudabacktrace or --python-backtrace flag in it.

I also tried the command line you shared with me:

nsys profile --cudabacktrace all:500 --python-backtrace python pytorch_NM.py
unrecognised option '--cudabacktrace'

usage: nsys profile [<args>] [application] [<application args>]

Here is the output from --help

 nsys profile --help

usage: nsys profile [<args>] [application] [<application args>]

        --auto-report-name=

           Possible values are 'true' or 'false'.
           Derive report file name from collected data, uses details of profiled graphics application.
           Format: [Process Name][GPU Name][Window Resolution][Graphics API] Timestamp.nsys-rep
           If true, automatically generate report file names.
           Default is 'false'. Application scope.

        -b, --backtrace=

           Possible values are 'auto', or 'none'.
           Select the backtrace method to use while sampling.
           Select 'none' to disable backtrace collection.
           Default is 'auto'.

        -c, --capture-range=

           Possible values are none, cudaProfilerApi, nvtx, hotkey.
           When '-c cudaProfilerApi' is used, profiling will start only when cudaProfilerStart API is
           invoked in the application.
           When '-c nvtx' is used, profiling will start only when the specified NVTX range is
           started in the application.
           When '-c hotkey' is used, profiling will start only when the hotkey
           set by '--hotkey-capture' is pressed in the application. This works for graphic apps only.
           Note that you must enable CUDA or NVTX tracing of the target application
           for '-c cudaProfilerApi' or '-c nvtx' to work.
           When '-capture-range none' is used, cudaProfilerStart/Stop APIs and hotkey will
           be ignored and NVTX ranges will be ignored as collection start/stop triggers.
           Default is none.

        --capture-range-end=

           Possible values are 'none', 'stop', 'stop-shutdown', 'repeat[:N]' or 'repeat-shutdown:N'.
           Specify the desired behavior when a capture range ends. Applicable only when used along
           with --capture-range option.
           If 'none', capture range end will be ignored.
           If 'stop', collection will stop at capture range end. Any subsequent capture ranges will be
           ignored. Target app will continue running.
           If 'stop-shutdown', collection will stop at capture range end and session will be shutdown.
           If 'repeat[:N]', collection will stop at capture range end and subsequent capture ranges
           will trigger more collections.
           Use the optional ':N' to specify max number of capture ranges to be honored. Any subsequent
           capture ranges will be ignored once N capture ranges are collected.
           If 'repeat-shutdown:N', same behavior as 'repeat:N' but session will be shutdown after N
           ranges.
           For 'stop-shutdown' and 'repeat-shutdown:N', use --kill option to specify whether target
           app should be terminated when shutting down session.
           Default is 'stop-shutdown'.

        --command-file=

           Open a file that contains nsys switches and parse the switches. Note that
           command line switches will override switches found in the command-file.

        --cpuctxsw=

           Possible values are 'process-tree', 'system-wide', or 'none'.
           Trace OS thread scheduling activity. Select 'none' to disable tracing CPU context switches.
           'process-tree' or 'system-wide' requires administrative privileges.
           If a target app is specified, the default is 'process-tree'.
           Otherwise the default is 'system-wide'.

        --cuda-flush-interval=

           Set the interval, in milliseconds, when buffered CUDA data is automatically saved to
           storage. CUDA data buffer saves may cause profiler overhead. Buffer save behavior can be
           controlled with this switch.

           If the CUDA flush interval is set to 0 on systems running CUDA 11.0 or newer, buffers are
           saved when they fill. If a flush interval is set to a non-zero value on such systems,
           buffers are saved only when the flush interval expires. If a flush interval is set and the
           profiler runs out of available buffers before the flush interval expires, additional buffers
           will be allocated as needed. In this case, setting a flush interval can reduce buffer
           save overhead but increase memory use by the profiler.

           If the flush interval is set to 0 on systems running older versions of CUDA,
           buffers are saved at the end of the collection. If the profiler runs out of available
           buffers, additional buffers are allocated as needed. If a flush interval is set to a
           non-zero value on such systems, buffers are saved when the flush interval expires.
           A cuCtxSynchronize call may be inserted into the workflow before the buffers
           are saved which will cause application overhead. In this case, setting a flush interval
           can reduce memory use by the profiler but may increase save overhead.
           Default is '0'. Application scope.

        --cuda-graph-trace=<granularity>[:<launch origin>]

           Set the granularity and launch origin for CUDA graph trace.
           Applicable only when CUDA tracing is enabled.

           Possible values for <granularity> are 'graph' or 'node'.
           If 'graph' is selected, CUDA graphs will be traced as a whole and node
           activities will not be collected. This can reduce overhead to the minimal,
           but requires CUDA driver version 11.7 or higher.
           If 'node' is selected, node activities will be collected, but CUDA graphs
           will not be traced as a whole. This may cause significant runtime overhead.
           If CUDA driver version is 11.7 or higher, default is 'graph', otherwise default is 'node'.

           Possible values for <launch origin> are 'host-only' or 'host-and-device'.
           If 'host-only' is selected, only CUDA graphs launched from host codes will be traced.
           If 'host-and-device' is selected, CUDA graphs launched from host codes and device codes
           will both be traced. This is only supported when the granularity is set to 'graph' and
           the CUDA driver is version 12.3 or higher. This may cause significant runtime overhead.
           If granularity is set to 'graph' and the CUDA driver version is 12.3 or higher,
           the default is 'host-and-device', otherwise default is 'host-only'.'.

           Application scope.

        --cuda-memory-usage=

           Possible values are 'true' or 'false'.
           Track the GPU memory usage. Applicable only when CUDA tracing is enabled.
           This feature may cause significant runtime overhead.
           Default is 'false'. Application scope.

        -d, --duration=

           Collection duration in seconds.
           Default is 0 seconds.

        --duration-frames=

           Stop the recording session after this many frames have been captured.
           Minimum supported frame is '60'.
           Note when it is selected cannot include any other stop options.
           If not specified the default is disabled. Application scope.

        --dx-force-declare-adapter-removal-support=

           Possible values are 'true' or 'false'.
           The Nsight Systems trace initialization involves creating a D3D
           device and discarding it. Enabling this flag makes a call to
           DXGIDeclareAdapterRemovalSupport() before device creation.
           Default is 'false'.

        --dx12-gpu-workload=

           Possible values are 'individual', 'batch', 'none', 'true' or 'false'.
           If individual or true, trace each DX12 workload's GPU activity individually.
           If batch, trace DX12 workloads' GPU activity in ExecuteCommandLists call batches.
           If none or false, do not trace DX12 workloads' GPU activity.
           Note that this switch is applicable only when --trace=dx12 is specified.
           Default is 'individual'. Application scope.

        --dx12-wait-calls=

           Possible values are 'true' or 'false'.
           If true, trace wait calls that block on fences for DX12.
           Note that this switch is applicable only when --trace=dx12 is specified.
           Default is 'false'. Application scope.

        -e, --env-var=

           Set environment variable(s) for application process to be launched.
           Environment variable(s) should be defined as 'A=B'.
           Multiple environment variables can be specified as 'A=B,C=D'

        --etw-provider=

           Add custom ETW trace provider(s).
           Possible values are '<name>,<guid>' or JSON configuration file path.
           If you want to specify more attributes than Name and GUID, provide a JSON
           configuration file.
           Find 'C:\Program Files\NVIDIA Corporation\Nsight Systems 2024.2.3\target-windows-x64\etw_providers_template.json' 
           as a template.
           This switch can be used multiple times to add multiple providers.

        --export=<format>[,<format>...]

           Possible formats are: none arrow sqlite hdf text json arrow arrowdir parquetdir
           Create additional output file(s) based on the data collected.
           If 'none' is selected, no additional files are created.
           Default is 'none'. This option can be given more than once.

        -f, --force-overwrite=

           Possible values are 'true' or 'false'.
           If true, overwrite all existing result files with same output filename
           (QDSTRM, nsys-rep, SQLITE, HDF, TEXT, JSON, ARROW, ARROWDIR, PARQUETDIR).
           Default is 'false'.

        --flush-on-cudaprofilerstop=

           If set to 'true', any call to cudaProfilerStop() will
           cause the CUDA trace buffers to be flushed. Note that the CUDA trace
           buffers will be flushed when the collection ends, irrespective of the
           value of this switch. Default value is 'true'.

        --gpu-metrics-device=

           Collect GPU Metrics from specified devices.
           The option argument must be 'none' or one of GPU IDs reported
           by '--gpu-metrics-device=help' switch.
           Default is 'none'. System scope.

        --gpu-metrics-frequency=

           Specify GPU Metrics sampling frequency.
           Minimum supported frequency is '10' (Hz).
           Maximum supported frequency is '200000' (Hz).
           Default is '10000'. System scope.

        --gpu-metrics-set=

           Specify metric set for GPU Metrics sampling.
           The option argument must be one of indices reported by '--gpu-metrics-set=help' switch.
           Default is the first metric set that supports selected GPU. System scope.

        --gpu-video-device=

           Collect GPU video accelerator traces from specified devices.
           The argument must be 'none' or one or more GPU IDs reported by '--gpu-video-device=help'.
           Default is 'none'. System scope.

        --gpuctxsw=

           Possible values are 'true' or 'false'.
           Trace GPU context switches. This switch requires CUDA driver r435.17 or higher.
           Requires root privileges.
           Default is 'false'. System scope.

        -h, --help=[<tag>]

           Print the command's help menu. The switch can take one optional
           argument that will be used as a tag. If a tag is provided, only options
           relevant to the tag will be printed.
           The available help menu tags for this command are:

           app, application, backtrace, capture, cli, command, cuda, driver, dx, dx12,
           env, environment, etw, events, export, file, filter, frame, gpu, hotkey,
           injection, interactive, interrupt, isr, log, logs, memory, nvtx, opengl,
           output, profile, profiling, range, report, sample, sampling, session, stats,
           switch, symbol, symbols, trace, vulkan, wait, wddm, and windows.

        --hotkey-capture=

           Possible values are `F1` to `F12`.
           Note that on Windows platforms `F10` is not supported.
           Hotkey to trigger the profiling session.
           Note that this switch is applicable only when --capture-range=hotkey is specified.
           Default is `F12`.

        --injection-use-detours=

           Possible values are 'true' or 'false'.
           Use detours for injection.
           Equivalent to setting the --system-wide option to the
           inverse value.
           Default is 'true'.

        --isr=

           Possible values are 'true' or 'false'.
           Trace Interrupt Service Routines (ISRs) and Deferred Procedure Calls (DPCs).
           Requires administrative privileges. Available only on Windows devices.
           Default is 'false'.

        --kill=

           Possible values are 'true' or 'false'.
           Terminate the target application when ending/shutting down profiling
           session.
           Default is 'true', so the application is terminated when profiling session ends/is
           shutdown.

        -n, --inherit-environment=

           Possible values are 'true' or 'false'.
           Inherit environment variables.
           Default is 'true'.

        --nvtx-domain-[include|exclude]=

           Possible values are a comma-separated list of NVTX domains.
           Choose the include or exclude option to (only) include or exclude the specified domains. The
           options are mutually exclusive. 'default' filters the NVTX default domain. A domain with
           this name and commas in a domain name have to be escaped with '\'.
           Note that both switches are applicable only when --trace=nvtx is specified.

        -o, --output=

           Output report filename.
           Any %q{ENV_VAR} pattern in the filename will be substituted with the value of the
           environment variable.
           Any %h pattern in the filename will be substituted with the hostname of the system.
           Any %p pattern in the filename will be substituted with the PID of the target process or
           the PID of the root process if there is a process tree.
           Any %n pattern in the filename will be substituted with the minimal positive integer that is
           not already occupied.
           Any %% pattern in the filename will be substituted with %.
           Default is 'report%n'.

        --opengl-gpu-workload=

           Possible values are 'true' or 'false'.
           If true, trace the OpenGL workload's GPU activity.
           Note that this switch is applicable only when --trace=opengl is specified.
           Default is 'true'. Application scope.

        -p, --nvtx-capture=

           Possible values are: `range@domain' to specify both range and domain, 
           `range' to specify range in default domain, `range@*' to specify a range in any domain.
           NVTX message and domain to trigger the profiling session.
           '@' can be escaped with backslash '\'.
           Note that this switch is applicable only when --capture-range=nvtx is specified.

        --python-functions-trace=

           Specify the path to the json file containing the requested
           Python functions to trace.
           Note that nvtx package must be installed on the target Python.
           See 'C:\Program Files\NVIDIA Corporation\Nsight Systems 2024.2.3\host-windows-x64\PythonFunctionsTrace/annotations.json' as an example.       

        --python-sampling=

           Possible values are 'true' or 'false'.
           Sample Python backtrace.
           Default is 'false'.
           Note: This feature provides meaningful backtraces for Python processes.
           When profiling Python-only workflows, consider disabling the CPU sampling option to reduce overhead.

        --python-sampling-frequency=

           Specify Python sampling frequency.
           Minimum supported frequency is '1' (Hz).
           Maximum supported frequency is '2000' (Hz).
           Default is '1000' (Hz).

        --resolve-symbols=

           Possible values are 'true' or 'false'.
           Resolve symbols of captured samples and backtraces.
           Default is 'false' on Windows, 'true' on other platforms.

        --retain-etw-files=

           Possible values are 'true' or 'false'.
           Retain ETW files.
           If true, retains ETW files generated by the trace, merges and moves the files to the output directory.
           Default is 'false'.

        -s, --sample=

           Possible values are 'process-tree', 'system-wide' or 'none'.
           Collect CPU IP/backtrace samples. Select 'none' to disable sampling. 'process-tree' or 'system-wide' requires administrative privileges. 
           If a target application is launched, the default is 'process-tree', otherwise the default
           is 'none'.

        --sampling-frequency=

           Specify sampling/backtracing frequency.
           Minimum supported frequency is '100' (Hz).
           Maximum supported frequency is '8000' (Hz).
           Default is '1000' (Hz).

        --session-new=

           Start the collection in a new named session. The option  argument represents the session
           name.
           The session name must start with an alphabetical character followed by printable or space
           characters.
           Any '%q{ENV_VAR}' pattern in the session name will be substituted with the value of the
           environment variable.
           Any '%h' pattern in the option argument will be substituted with the hostname of the system.
           Any '%%' pattern in the option argument will be substituted with '%'.

        --start-frame-index=

           Start the recording session when the frame index reaches the frame number preceding the
           start frame index. Minimum supported frame is '1'.
           Note when it is selected cannot include any other start options.
           If not specified the default is disabled. Application scope.

        --stats=

           Possible values are 'true' or 'false'.
           Generate summary statistics after the collection.
           When set to true, an SQLite database file will be created after the collection.
           Default is 'false'.

        --system-wide=

           Possible values are 'true' or 'false'.
           Perform system-wide injection using Windows hooks.
           Equivalent to setting the --injection-use-detours option to the
           inverse value.
           Default is 'false'.

        -t, --trace=

           Possible values are 'cuda', 'nvtx', 'cublas', 'cublas-verbose', 'cusolver', 
           'cusolver-verbose', 'cusparse', 'cusparse-verbose', 'opengl',
           'opengl-annotations', 'nvvideo', 'vulkan', 'vulkan-annotations', 'dx11',
           'dx11-annotations', 'dx12', 'dx12-annotations', 'openxr',
           'openxr-annotations', 'wddm', 'python-gil' or 'none'.
           Select the API(s) to trace. Multiple APIs can be selected, separated by commas only
           (no spaces).
           If '<api>-annotations' is selected, the corresponding API will also be traced.
           If 'none' is selected, no APIs are traced.
           Default is 'cuda,nvtx,opengl'. Application scope.

        --vulkan-gpu-workload=

           Possible values are 'individual', 'batch', 'none', 'true' or 'false'.
           If individual or true, trace each Vulkan workload's GPU activity individually.
           If batch, trace Vulkan workloads' GPU activity in vkQueueSubmit call batches.
           If none or false, do not trace Vulkan workloads' GPU activity.
           Note that this switch is applicable only when --trace=vulkan is specified.
           Default is 'individual'. Application scope.

        -w, --show-output=

           Possible values are 'true' or 'false'.
           If true, send target process's stdout and stderr streams to both the console and
           stdout/stderr files which are added to the report file.
           If false, only send target process stdout and stderr streams to the stdout/stderr files
           which are added to the report file.
           Default is 'true'.

        --wait=

           Possible values are 'primary' or 'all'.
           If 'primary', the CLI will wait on the application process termination.
           If 'all', the CLI will additionally wait on re-parented processes created by the
           application.
           Default is 'all'.

        --wddm-additional-events=

           Possible values are 'true' or 'false'.
           If true, collect additional range of ETW events, including context status, allocations, sync wait and signal events, etc.
           Requires administrative privileges.
           Note that this switch is applicable only when --trace=wddm is specified.
           Default is 'true'. System scope.

        --wddm-backtraces=

           Possible values are 'true' or 'false'.
           If true, collect backtraces of WDDM events.
           Requires administrative privileges.
           Disabling this data collection can reduce overhead for target
           applications that generate many DxgKrnl WDDM Events.
           Note that this switch is applicable only when --trace=wddm is specified.
           Default is 'false'.

        -x, --stop-on-exit=

           Possible values are 'true' or 'false'.
           Stop profiling when the launched application exits.
           If stop-on-exit=false, duration must be greater than 0.
           Default is 'true'.

        -Y, --start-later=

           Possible values are 'true' or 'false'.
           Delays collection indefinitely until the nsys start
           command is executed for this session.
           Enabling this option overrides the --delay option.
           Default is 'false'.

        -y, --delay=

           Collection start delay in seconds.
           Default is 0.

Your nsys version might be too old as I’m using the latest one.

I was using nsys version 2024.2 earlier and have now updated to 2025.2.1. nsys --version reports:

NVIDIA Nsight Systems version 2025.2.1.130-252135690618v0

When I ran nsys profile --help again, this is the full list of options it printed:

usage: nsys profile [<args>] [application] [<application args>]

        --auto-report-name=

           Possible values are 'true' or 'false'.
           Derive report file name from collected data, uses details of profiled graphics application.
           Format: [Process Name][GPU Name][Window Resolution][Graphics API] Timestamp.nsys-rep
           If true, automatically generate report file names.
           Default is 'false'. Application scope.

        -b, --backtrace=

           Possible values are 'auto', or 'none'.
           Select the backtrace method to use while sampling.
           Select 'none' to disable backtrace collection.
           Default is 'auto'.

        -c, --capture-range=

           Possible values are none, cudaProfilerApi, nvtx, hotkey.
           When '-c cudaProfilerApi' is used, profiling will start only when cudaProfilerStart API is
           invoked in the application.
           When '-c nvtx' is used, profiling will start only when the specified NVTX range is
           started in the application.
           When '-c hotkey' is used, profiling will start only when the hotkey
           set by '--hotkey-capture' is pressed in the application. This works for graphic apps only.
           Note that you must enable CUDA or NVTX tracing of the target application
           for '-c cudaProfilerApi' or '-c nvtx' to work.
           When '-capture-range none' is used, cudaProfilerStart/Stop APIs and hotkey will
           be ignored and NVTX ranges will be ignored as collection start/stop triggers.
           Default is none.

        --capture-range-end=

           Possible values are 'none', 'stop', 'stop-shutdown', 'repeat[:N]' or 'repeat-shutdown:N'.
           Specify the desired behavior when a capture range ends. Applicable only when used along
           with --capture-range option.
           If 'none', capture range end will be ignored.
           If 'stop', collection will stop at capture range end. Any subsequent capture ranges will be
           ignored. Target app will continue running.
           If 'stop-shutdown', collection will stop at capture range end and session will be shutdown.
           If 'repeat[:N]', collection will stop at capture range end and subsequent capture ranges
           will trigger more collections. 
           Use the optional ':N' to specify max number of capture ranges to be honored. Any subsequent
           capture ranges will be ignored once N capture ranges are collected.
           If 'repeat-shutdown:N', same behavior as 'repeat:N' but session will be shutdown after N
           ranges.
           For 'stop-shutdown' and 'repeat-shutdown:N', use --kill option to specify whether target
           app should be terminated when shutting down session.
           Default is 'stop-shutdown'.

        --command-file=

           Open a file that contains nsys switches and parse the switches. Note that
           command line switches will override switches found in the command-file.

        --cpuctxsw=

           Possible values are 'process-tree', 'system-wide', or 'none'.
           Trace OS thread scheduling activity. Select 'none' to disable tracing CPU context switches.
           'process-tree' or 'system-wide' requires administrative privileges.
           If a target app is specified, the default is 'process-tree'.
           Otherwise the default is 'system-wide'.

        --cuda-event-trace=

           Possible values are 'auto', 'true' or 'false'.
           Trace CUDA Event completion on the device side, and get better correlation
           support among CUDA Event APIs. Applicable only when CUDA tracing is enabled.
           Note that 'CUDA Event' refers to the synchronization mechanism (cudaEventRecord,
           cudaStreamWaitEvent etc.).
           Enabling this feature may increase runtime overhead and the likelihood of false
           dependencies across CUDA Streams, similar to CUDA Event's timing functionality
           when cudaEventDisableTiming is not disabled.
           'auto' will automatically turn off the trace if a target process has
           CUDA_DEVICE_MAX_CONNECTIONS set to 1.
           This switch requires CUDA driver 12.8 or higher.
           Default is 'false'. Application scope.

        --cuda-flush-interval=

           Set the interval, in milliseconds, when buffered CUDA data is automatically saved to
           storage. CUDA data buffer saves may cause profiler overhead. Buffer save behavior can be
           controlled with this switch.

           If the CUDA flush interval is set to 0 on systems running CUDA 11.0 or newer, buffers are
           saved when they fill. If a flush interval is set to a non-zero value on such systems,
           buffers are saved only when the flush interval expires. If a flush interval is set and the
           profiler runs out of available buffers before the flush interval expires, additional buffers
           will be allocated as needed. In this case, setting a flush interval can reduce buffer
           save overhead but increase memory use by the profiler.

           If the flush interval is set to 0 on systems running older versions of CUDA,
           buffers are saved at the end of the collection. If the profiler runs out of available
           buffers, additional buffers are allocated as needed. If a flush interval is set to a
           non-zero value on such systems, buffers are saved when the flush interval expires.
           A cuCtxSynchronize call may be inserted into the workflow before the buffers
           are saved which will cause application overhead. In this case, setting a flush interval
           can reduce memory use by the profiler but may increase save overhead.
           Default is '0'. Application scope.

        --cuda-graph-trace=<granularity>[:<launch origin>]

           Set the granularity and launch origin for CUDA graph trace.
           Applicable only when CUDA tracing is enabled.

           Possible values for <granularity> are 'graph' or 'node'.
           If 'graph' is selected, CUDA graphs will be traced as a whole and node
           activities will not be collected. This can reduce overhead to the minimal,
           but requires CUDA driver version 11.7 or higher.
           If 'node' is selected, node activities will be collected, but CUDA graphs
           will not be traced as a whole. This may cause significant runtime overhead.
           If CUDA driver version is 11.7 or higher, default is 'graph', otherwise default is 'node'.

           Possible values for <launch origin> are 'host-only' or 'host-and-device'.
           If 'host-only' is selected, only CUDA graphs launched from host codes will be traced.
           If 'host-and-device' is selected, CUDA graphs launched from host codes and device codes
           will both be traced. This is only supported when the granularity is set to 'graph' and
           the CUDA driver is version 12.3 or higher. This may cause significant runtime overhead.
           If granularity is set to 'graph' and the CUDA driver version is 12.3 or higher,
           the default is 'host-and-device', otherwise default is 'host-only'.'.

           Application scope.

        --cuda-memory-usage=

           Possible values are 'true' or 'false'.
           Track the GPU memory usage. Applicable only when CUDA tracing is enabled.
           This feature may cause significant runtime overhead.
           Default is 'false'. Application scope.

        -d, --duration=

           Collection duration in seconds.
           Default is 0 seconds.

        --dask=

            Possible values are 'functions-trace' or 'none'.
            'functions-trace' implies '--python-functions-trace=C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.2.1\target-windows-x64\PythonFunctionsTrace/dask.json',
            and will rename relevant threads to 'Dask Worker' and 'Dask Scheduler'.
            Default is 'none'.
            Implies '--trace=nvtx'.

        --debug-symbols=

           Specify the paths to directories with symbol files.
           Multiple directories can be selected, separated by a colon (':') only (no spaces).

        --duration-frames=

           Stop the recording session after this many frames have been captured.
           Minimum supported frame is '60'.
           Note when it is selected cannot include any other stop options.
           If not specified the default is disabled. Application scope.

        --dx-force-declare-adapter-removal-support=

           Possible values are 'true' or 'false'.
           The Nsight Systems trace initialization involves creating a D3D
           device and discarding it. Enabling this flag makes a call to
           DXGIDeclareAdapterRemovalSupport() before device creation.
           Default is 'false'.

        --dx12-gpu-workload=

           Possible values are 'individual', 'batch', 'none', 'true' or 'false'.
           If individual or true, trace each DX12 workload's GPU activity individually.
           If batch, trace DX12 workloads' GPU activity in ExecuteCommandLists call batches.
           If none or false, do not trace DX12 workloads' GPU activity.
           Note that this switch is applicable only when --trace=dx12 is specified.
           Default is 'individual'. Application scope.

        --dx12-wait-calls=

           Possible values are 'true' or 'false'.
           If true, trace wait calls that block on fences for DX12.
           Note that this switch is applicable only when --trace=dx12 is specified.
           Default is 'false'. Application scope.

        -e, --env-var=

           Set environment variable(s) for application process to be launched.
           Environment variable(s) should be defined as 'A=B'.
           Multiple environment variables can be specified as 'A=B,C=D'

        (Experimental) --enable=<plugin_name>[,arg1,arg2,...]

           Use the specified plugin.
           The option can be specified multiple times to enable multiple plugins.
           Plugin arguments are separated by commas only (no spaces).
           Commas can be escaped with a backslash '\'. The backslash itself can be
           escaped by another backslash '\\'. To include spaces in an argument,
           enclose the argument in double quotes '"'.
           To list all available plugins, use '--enable=help' command.

        --etw-provider=

           Add custom ETW trace provider(s).
           Possible values are '<name>,<guid>' or JSON configuration file path.
           If you want to specify more attributes than Name and GUID, provide a JSON
           configuration file.
           Find 'C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.2.1\target-windows-x64\etw_providers_template.json'
           as a template.
           This switch can be used multiple times to add multiple providers.

        --export=<format>[,<format>...]

           Possible formats are: none sqlite hdf text json arrow arrowdir parquetdir
           Create additional output file(s) based on the data collected.
           If 'none' is selected, no additional files are created.
           Default is 'none'. This option can be given more than once.

        -f, --force-overwrite=

           Possible values are 'true' or 'false'.
           If true, overwrite all existing result files with same output filename
           (QDSTRM, nsys-rep, SQLITE, HDF, TEXT, JSON, ARROW, ARROWDIR, PARQUETDIR).
           Default is 'false'.

        --flush-on-cudaprofilerstop=

           If set to 'true', any call to cudaProfilerStop() will
           cause the CUDA trace buffers to be flushed. Note that the CUDA trace
           buffers will be flushed when the collection ends, irrespective of the
           value of this switch. Default value is 'true'.

        --gpu-metrics-devices=

           Collect GPU Metrics from the specified devices.
           Possible values are:
             'none', 'cuda-visible', 'all',
           or a comma separated list of GPU IDs reported by '--gpu-metrics-devices=help' switch.
           Default is 'none'. System scope.

        --gpu-metrics-frequency=

           Specify the sampling frequency for GPU Metrics.
           Minimum supported frequency is '10' (Hz).
           Maximum supported frequency is '200000' (Hz).
           Default is '10000'. System scope.

        --gpu-metrics-set=

           Specify the metric set for GPU Metrics.
           The option argument must be one of aliases reported by '--gpu-metrics-set=help' switch.
           Default is the first metric set that supports all selected GPU. System scope.

        --gpu-video-device=

           Collect GPU video accelerator traces from specified devices.
           The argument must be 'none' or one or more GPU IDs reported by '--gpu-video-device=help'.
           Default is 'none'. System scope.

        --gpuctxsw=

           Possible values are 'true' or 'false'.
           Trace GPU context switches. This switch requires CUDA driver r435.17 or higher.
           Requires root privileges.
           Default is 'false'. System scope.

        -h, --help=[<tag>]

           Print the command's help menu. The switch can take one optional
           argument that will be used as a tag. If a tag is provided, only options
           relevant to the tag will be printed.
           The available help menu tags for this command are:

           app, application, backtrace, capture, cli, command, cuda, driver, dx, dx12,
           env, environment, etw, events, export, file, filter, frame, gpu, hotkey,
           injection, interactive, interrupt, isr, log, logs, memory, nvtx, opengl,
           output, profile, profiling, range, report, sample, sampling, session, stats,
           switch, symbol, symbols, trace, vulkan, wait, wddm, and windows.

        --hotkey-capture=

           Possible values are `F1` to `F12`.
           Note that on Windows platforms `F10` is not supported.
           Hotkey to trigger the profiling session.
           Note that this switch is applicable only when --capture-range=hotkey is specified.
           Default is `F12`.

        --injection-use-detours=

           Possible values are 'true' or 'false'.
           Use detours for injection.
           Equivalent to setting the --system-wide option to the
           inverse value.
           Default is 'true'.

        --isr=

           Possible values are 'true' or 'false'.
           Trace Interrupt Service Routines (ISRs) and Deferred Procedure Calls (DPCs).
           Requires administrative privileges. Available only on Windows devices.
           Default is 'false'.

        --kill=

           Possible values are 'true' or 'false'.
           Terminate the target application when ending/shutting down profiling
           session.
           Default is 'true', so the application is terminated when profiling session ends/is
           shutdown.

        -n, --inherit-environment=

           Possible values are 'true' or 'false'.
           Inherit environment variables.
           Default is 'true'.

        --nvtx-domain-[include|exclude]=

           Possible values are a comma-separated list of NVTX domains.
           Choose the include or exclude option to (only) include or exclude the specified domains. The
           options are mutually exclusive. 'default' filters the NVTX default domain. A domain with
           this name and commas in a domain name have to be escaped with '\'.
           Note that both switches are applicable only when --trace=nvtx is specified.

        -o, --output=

           Output report filename.
           Any %q{ENV_VAR} pattern in the filename will be substituted with the value of the
           environment variable.
           Any %h pattern in the filename will be substituted with the hostname of the system.
           Any %p pattern in the filename will be substituted with the PID of the target process or
           the PID of the root process if there is a process tree.
           Any %n pattern in the filename will be substituted with the minimal positive integer that is
           not already occupied.
           Any %% pattern in the filename will be substituted with %.
           Default is 'report%n'.

        --opengl-gpu-workload=

           Possible values are 'true' or 'false'.
           If true, trace the OpenGL workload's GPU activity.
           Note that this switch is applicable only when --trace=opengl is specified.
           Default is 'true'. Application scope.

        -p, --nvtx-capture=

           Possible values are: `range@domain' to specify both range and domain,
           `range' to specify range in default domain, `range@*' to specify a range in any domain.
           NVTX message and domain to trigger the profiling session.
           '@' can be escaped with backslash '\'.
           Note that this switch is applicable only when --capture-range=nvtx is specified.

        --python-functions-trace=

           Specify the path to the json file containing the requested
           Python functions to trace.
           Note that nvtx package must be installed on the target Python.
           See 'C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.2.1\target-windows-x64\PythonFunctionsTrace/annotations.json' as an example.
           For PyTorch application, see predefined annotations at 'C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.2.1\target-windows-x64\PythonFunctionsTrace/pytorch.json'.
           For Dask application, see predefined annotations at 'C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.2.1\target-windows-x64\PythonFunctionsTrace/dask.json'.

        --python-sampling=

           Possible values are 'true' or 'false'.
           Sample Python backtrace.
           Default is 'false'.
           Note: This feature provides meaningful backtraces for Python processes.
           When profiling Python-only workflows, consider disabling the CPU sampling option to reduce overhead.

        --python-sampling-frequency=

           Specify Python sampling frequency.
           Minimum supported frequency is '1' (Hz).
           Maximum supported frequency is '2000' (Hz).
           Default is '1000' (Hz).

        --pytorch=

            Possible values are 'autograd-nvtx', 'autograd-shapes-nvtx', 'functions-trace' or 'none'.
            If 'autograd-nvtx' is used, nsys will call
            torch.autograd.profiler.emit_nvtx(record_shapes=False)
            when pytorch is imported.
            If 'autograd-shapes-nvtx' is used, nsys will call
            torch.autograd.profiler.emit_nvtx(record_shapes=True)
            when pytorch is imported.
            'functions-trace' is an alias to '--python-functions-trace=C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.2.1\target-windows-x64\PythonFunctionsTrace/pytorch.json'.
            The 'autograd-nvtx' and 'autograd-shapes-nvtx' options can be combined
            with the 'functions-trace' option by adding them separated by a comma.
            Default is 'none'.
            Implies '--trace=nvtx'.

        --reflex-events=

           Possible values are 'true' or 'false'.
           If true, collect Reflex SDK ETW events.
           Default is 'false'. System scope.

        --resolve-symbols=

           Possible values are 'true' or 'false'.
           Resolve symbols of captured samples and backtraces.
           Default is 'false' on Windows, 'true' on other platforms.

        --retain-etw-files=

           Possible values are 'true' or 'false'.
           Retain ETW files.
           If true, retains ETW files generated by the trace, merges and moves the files to the output directory.
           Default is 'false'.

        -s, --sample=

           Possible values are 'process-tree', 'system-wide' or 'none'.
           Collect CPU IP/backtrace samples. Select 'none' to disable sampling. 'process-tree' or 'system-wide' requires administrative privileges.
           If a target application is launched, the default is 'process-tree', otherwise the default
           is 'none'.

        --sampling-frequency=

           Specify sampling/backtracing frequency.
           Minimum supported frequency is '100' (Hz).
           Maximum supported frequency is '8000' (Hz).
           Default is '1000' (Hz).

        --session-new=

           Start the collection in a new named session. The option  argument represents the session
           name.
           The session name must start with an alphabetical character followed by printable or space
           characters.
           Any '%q{ENV_VAR}' pattern in the session name will be substituted with the value of the
           environment variable.
           Any '%h' pattern in the option argument will be substituted with the hostname of the system.
           Any '%%' pattern in the option argument will be substituted with '%'.

        --start-frame-index=

           Start the recording session when the frame index reaches the frame number preceding the
           start frame index. Minimum supported frame is '1'.
           Note when it is selected cannot include any other start options.
           If not specified the default is disabled. Application scope.

        --stats=

           Possible values are 'true' or 'false'.
           Generate summary statistics after the collection.
           When set to true, an SQLite database file will be created after the collection.
           Default is 'false'.

        --system-wide=

           Possible values are 'true' or 'false'.
           Perform system-wide injection using Windows hooks.
           Equivalent to setting the --injection-use-detours option to the 
           inverse value.
           Default is 'false'.

        -t, --trace=

           Possible values are 'cuda', 'cuda-hw', 'nvtx', 'cublas', 'cublas-verbose',
           'cusolver', 'cusolver-verbose', 'cusparse', 'cusparse-verbose', 'opengl',
           'opengl-annotations', 'nvvideo', 'vulkan', 'vulkan-annotations', 'dx11',
           'dx11-annotations', 'dx12', 'dx12-annotations', 'openxr',
           'openxr-annotations', 'wddm', 'python-gil' or 'none'.
           Select the API(s) to trace. Multiple APIs can be selected, separated by commas only
           (no spaces).
           If '<api>-annotations' is selected, the corresponding API will also be traced.
           If 'none' is selected, no APIs are traced.
           Default is 'cuda,nvtx,opengl'. Application scope.

        --vulkan-gpu-workload=

           Possible values are 'individual', 'batch', 'none', 'true' or 'false'.
           If individual or true, trace each Vulkan workload's GPU activity individually.
           If batch, trace Vulkan workloads' GPU activity in vkQueueSubmit call batches.
           If none or false, do not trace Vulkan workloads' GPU activity.
           Note that this switch is applicable only when --trace=vulkan is specified.
           Default is 'individual'. Application scope.

        -w, --show-output=

           Possible values are 'true' or 'false'.
           If true, send target process's stdout and stderr streams to both the console and
           stdout/stderr files which are added to the report file.
           If false, only send target process stdout and stderr streams to the stdout/stderr files
           which are added to the report file.
           Default is 'true'.

        --wait=

           Possible values are 'primary' or 'all'.
           If 'primary', the CLI will wait on the application process termination.
           If 'all', the CLI will additionally wait on re-parented processes created by the
           application.
           Default is 'all'.

        --wddm-additional-events=

           Possible values are 'true' or 'false'.
           If true, collect additional range of ETW events, including context status, allocations, sync wait and signal events, etc.
           Requires administrative privileges.
           Note that this switch is applicable only when --trace=wddm is specified.
           Default is 'true'. System scope.

        --wddm-backtraces=

           Possible values are 'true' or 'false'.
           If true, collect backtraces of WDDM events.
           Requires administrative privileges.
           Disabling this data collection can reduce overhead for target
           applications that generate many DxgKrnl WDDM Events.
           Note that this switch is applicable only when --trace=wddm is specified.
           Default is 'false'.

        -x, --stop-on-exit=

           Possible values are 'true' or 'false'.
           Stop profiling when the launched application exits.
           If stop-on-exit=false, duration must be greater than 0.
           Default is 'true'.

        -Y, --start-later=

           Possible values are 'true' or 'false'.
           Delays collection indefinitely until the nsys start
           command is executed for this session.
           Enabling this option overrides the --delay option.
           Default is 'false'.

        -y, --delay=

           Collection start delay in seconds.
           Default is 0.

The nsys profile --help output is the same as with the previous version, and it still does not list the options

--cudabacktrace and --python-backtrace

This again does not match the NVIDIA Nsight Systems 2025 User Guide.

I have not been able to identify why this is happening.
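To compare the two installations without reading the whole dump by eye, a small script can list the long options that a given `nsys profile --help` output advertises. This is only a minimal sketch: the helper names (`list_long_options`, `missing_options`, `read_nsys_help`) are made up for illustration, and the regular expression assumes the header layout shown in the output above (an indented line starting with an optional short flag followed by the long flag).

```python
import re
import subprocess

# Matches an option header line such as "  -b, --backtrace=" or "  --stats=".
# An optional "(Experimental)" style prefix and an optional short flag are skipped.
_OPTION_HEADER = re.compile(
    r"^\s*(?:\([^)]*\)\s*)?(?:-\w,\s*)?(--[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)"
)


def list_long_options(help_text):
    """Return the set of long option names that appear as header lines."""
    options = set()
    for line in help_text.splitlines():
        match = _OPTION_HEADER.match(line)
        if match:
            options.add(match.group(1))
    return options


def missing_options(help_text, wanted):
    """Return the entries of `wanted` that the help text does not list."""
    available = list_long_options(help_text)
    return [name for name in wanted if name not in available]


def read_nsys_help():
    """Run `nsys profile --help`; returns "" if nsys is not on PATH."""
    try:
        result = subprocess.run(
            ["nsys", "profile", "--help"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout
```

Calling `missing_options(read_nsys_help(), ["--cudabacktrace", "--python-backtrace"])` on each machine should show which of the two flags a particular build does not list.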

Update 20-03:
I am running this on a Windows 11 system.

I reinstalled Nsight Systems, but that did not make a difference. nsys profile --help shows the same output as before.

Did the command I posted still work?

No, that does not work for me. I tried it as well, and it shows

unrecognised option '--cudabacktrace'

Update:

I tried reinstalling PyTorch and set the environment variable path for nsys, but the problem persists.

I have not been able to make progress on this issue. What could be a way to solve it?

I don’t use Windows, but let me check if there are some limitations.

Thanks. If you find anything, please let me know.

I think this issue is specific to the Windows version (Windows 11, x64), because on Ubuntu 22.04 nsys profile --help does list --cudabacktrace. Could you please look into it, or does it only happen for me?
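In the meantime, one possible fallback is to rely only on switches that the Windows help output above does list (`--trace`, `--pytorch`, `--python-sampling`, `--force-overwrite`, `--output`). The sketch below just assembles such a command line. The helper name `build_nsys_command`, the script name `train.py`, and the report name are placeholders, and I have not confirmed that this gives the same kernel-to-source attribution as `--cudabacktrace` would.

```python
def build_nsys_command(script, output="pytorch_trace", python_sampling=True):
    """Assemble an `nsys profile` argv using only options from the help text above."""
    sampling = "true" if python_sampling else "false"
    return [
        "nsys", "profile",
        # Trace CUDA API calls/kernels together with NVTX ranges.
        "--trace=cuda,nvtx",
        # Let nsys enable autograd NVTX ranges (with shapes) and the
        # predefined PyTorch function annotations.
        "--pytorch=autograd-shapes-nvtx,functions-trace",
        # Sample Python backtraces, as described under --python-sampling.
        f"--python-sampling={sampling}",
        "--force-overwrite=true",
        f"--output={output}",
        "python", script,
    ]


# "train.py" is a placeholder for the actual training script.
command = build_nsys_command("train.py")
# None of the arguments contain spaces, so a plain join is enough for display.
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command)` or pasted into a terminal. In the timeline, the kernels should then appear under the NVTX ranges of the autograd operations that launched them.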