Can 'torch.profiler' support GB20X NVIDIA GPU architectures?

arminzhu · February 5, 2026, 3:39pm

For more details, please see the open issue #174195.

I use torch.profiler like this:

prof = torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
schedule=torch.profiler.schedule(wait=3, warmup=3, active=3, repeat=1),
on_trace_ready=torch.profiler.tensorboard_trace_handler(f’./xxx’),
record_shapes=True,
with_stack=True,
acc_events=True,)

But I got an error:

WARNING:2026-02-03 20:37:18 334947:334947 init.cpp:115] function cbapi->getCuptiStatus() failed with error CUPTI_ERROR_INVALID_DEVICE (2)
WARNING:2026-02-03 20:37:18 334947:334947 init.cpp:116] CUPTI initialization failed - CUDA profiler activities will be missing
INFO:2026-02-03 20:37:18 334947:334947 init.cpp:118] If you see CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, refer to NVIDIA Development Tools Solutions - CUPTI_ERROR_INSUFFICIENT_PRIVILEGES: CUPTI Permission issue with Performance Counters | NVIDIA Developer

My env:

PyTorch version: 2.10.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 8 2026, 11:30:50) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.8.93
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5060 Laptop GPU
Nvidia driver version: 590.48.01
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

arminzhu · February 5, 2026, 3:41pm

github.com/pytorch/pytorch

A bug about `torch.profiler` (torch 2.10.0)

opened 12:45PM - 03 Feb 26 UTC

arminzhu

module: cuda oncall: profiler bot-triaged

### 🐛 Describe the bug Use `torch.profiler` like this: ``` prof = torch.profile…r.profile( activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA], schedule=torch.profiler.schedule(wait=3, warmup=3, active=3, repeat=1), on_trace_ready=torch.profiler.tensorboard_trace_handler(f'./xxx'), record_shapes=True, with_stack=True, acc_events=True,) ``` The error message: ``` WARNING:2026-02-03 20:37:18 334947:334947 init.cpp:115] function cbapi->getCuptiStatus() failed with error CUPTI_ERROR_INVALID_DEVICE (2) WARNING:2026-02-03 20:37:18 334947:334947 init.cpp:116] CUPTI initialization failed - CUDA profiler activities will be missing INFO:2026-02-03 20:37:18 334947:334947 init.cpp:118] If you see CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, refer to https://developer.nvidia.com/nvidia-development-tools-solutions-err-nvgpuctrperm-cupti ``` ### Versions PyTorch version: 2.10.0+cu128 Is debug build: False CUDA used to build PyTorch: 12.8 ROCM used to build PyTorch: N/A OS: Ubuntu 24.04.3 LTS (x86_64) GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Clang version: Could not collect CMake version: version 3.28.3 Libc version: glibc-2.39 Python version: 3.12.3 (main, Jan 8 2026, 11:30:50) [GCC 13.3.0] (64-bit runtime) Python platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39 Is CUDA available: True CUDA runtime version: 12.8.93 CUDA_MODULE_LOADING set to: GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5060 Laptop GPU Nvidia driver version: 590.48.01 cuDNN version: Could not collect Is XPU available: False HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True Caching allocator config: N/A CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 7945HX with Radeon Graphics CPU family: 25 Model: 97 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 2 Frequency boost: enabled CPU(s) scaling MHz: 70% CPU max MHz: 5462.0000 CPU min MHz: 400.0000 BogoMIPS: 4990.69 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze Virtualization: AMD-V L1d cache: 512 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 16 MiB (16 instances) L3 cache: 64 MiB (2 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Gather data sampling: Not affected Vulnerability Ghostwrite: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Mitigation; Safe RET Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsa: Mitigation; Clear CPU buffers Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Mitigation; IBPB before exit to userspace Versions of relevant libraries: [pip3] numpy==2.4.1 [pip3] nvidia-cublas-cu12==12.8.4.1 [pip3] nvidia-cuda-cupti-cu12==12.8.90 [pip3] nvidia-cuda-nvrtc-cu12==12.8.93 [pip3] nvidia-cuda-runtime-cu12==12.8.90 [pip3] nvidia-cudnn-cu12==9.10.2.21 [pip3] nvidia-cufft-cu12==11.3.3.83 [pip3] nvidia-curand-cu12==10.3.9.90 [pip3] nvidia-cusolver-cu12==11.7.3.90 [pip3] nvidia-cusparse-cu12==12.5.8.93 [pip3] nvidia-cusparselt-cu12==0.7.1 [pip3] nvidia-nccl-cu12==2.27.5 [pip3] nvidia-nvjitlink-cu12==12.8.93 [pip3] nvidia-nvtx-cu12==12.8.90 [pip3] torch==2.10.0 [pip3] torch-tb-profiler==0.4.3 [pip3] torchvision==0.25.0 [pip3] triton==3.6.0 cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @robieta @chaekit @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise @mwootton

arminzhu · February 8, 2026, 7:05pm

Use CUDA 13, and new driver can solve this.