Segmentation fault when using torch.profiler

:bug: Bug

When using torch.profiler, CUPTI appears to hit a segmentation fault in my environment.

To Reproduce

My code:

import math

import torch
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import torch.utils.data.distributed
import torchvision.models as models

train_dataset = \
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.Resize((224, 224)),
                       transforms.Lambda(lambda image: image.convert('RGB')),
                       transforms.ToTensor(),
                       transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
                   ]))

torch.cuda.set_device(0)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=1, num_workers=2)

# hyperparameters
epoch_ = 100
lr_ = 0.001
momentum_ = 0.9
milestones = [30, 80, 120]
log_interval = 10

model = models.alexnet(num_classes=10)
if torch.cuda.is_available():
    model.cuda()

optimizer = optim.SGD(model.parameters(), lr=lr_, momentum=momentum_)

scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=1),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    ) as p: 
    for epoch in range(0, epoch_):
        model.train()
        for step, (data, target) in enumerate(train_loader):
            if torch.cuda.is_available():
                data, target = data.cuda(), target.cuda()
            output = model(data)
            loss = F.cross_entropy(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % log_interval == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, step * len(data), len(train_loader),
                    100.0 * step / len(train_loader), loss.item()))
            p.step()
        scheduler.step()
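
A shorter script may be enough to narrow this down. The following is an untested sketch, not something confirmed to crash: it assumes the segfault depends only on opening the profiler, not on the model or the DataLoader. The helper names `activity_names` and `run_minimal_profile` are made up for illustration, and whether a CPU-only run actually avoids the CUPTI path on this PyTorch version is also an assumption.

```python
def activity_names(include_cuda):
    """Names of the ProfilerActivity members to request (CPU is always included)."""
    return ["CPU", "CUDA"] if include_cuda else ["CPU"]


def run_minimal_profile(include_cuda=True):
    """Profile a single matmul and return a short summary table as a string."""
    # Imported lazily so activity_names() stays usable without torch installed.
    import torch
    from torch.profiler import ProfilerActivity, profile

    activities = [getattr(ProfilerActivity, name)
                  for name in activity_names(include_cuda)]
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(256, 256, device=device)

    # Profiler setup happens when the context manager is entered.
    with profile(activities=activities) as prof:
        y = x @ x
        if device == "cuda":
            # Make sure the kernel has finished before the profiler stops.
            torch.cuda.synchronize()
    return prof.key_averages().table(row_limit=5)

# Suggested order: run_minimal_profile(include_cuda=False) first,
# then run_minimal_profile(include_cuda=True) in a fresh process.
```

If the CPU-only call succeeds and the call with the CUDA activity crashes, that would point at the CUDA/CUPTI side of the profiler rather than at the training loop.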

Problem:

root@# gdb python3
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...(no debugging symbols found)...done.
(gdb) r hvd_pytorch_mnist-profile.py 
Starting program: /usr/bin/python3 hvd_pytorch_mnist-profile.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fff9d92d700 (LWP 1308)]
[New Thread 0x7fff9d12c700 (LWP 1309)]
[New Thread 0x7fff9892b700 (LWP 1310)]
[New Thread 0x7fff9612a700 (LWP 1311)]
[New Thread 0x7fff93929700 (LWP 1312)]
[New Thread 0x7fff91128700 (LWP 1313)]
[New Thread 0x7fff8e927700 (LWP 1314)]
[New Thread 0x7fff8c126700 (LWP 1315)]
[New Thread 0x7fff89925700 (LWP 1316)]
[New Thread 0x7fff87124700 (LWP 1317)]
[New Thread 0x7fff84923700 (LWP 1318)]
[New Thread 0x7fff82122700 (LWP 1319)]
[New Thread 0x7fff7f921700 (LWP 1320)]
[New Thread 0x7fff7d120700 (LWP 1321)]
[New Thread 0x7fff7c91f700 (LWP 1322)]
[New Thread 0x7fff7811e700 (LWP 1323)]
[New Thread 0x7fff7591d700 (LWP 1324)]
[New Thread 0x7fff7311c700 (LWP 1325)]
[New Thread 0x7fff7291b700 (LWP 1326)]
[New Thread 0x7fff6e11a700 (LWP 1327)]
[New Thread 0x7fff6b919700 (LWP 1328)]
[New Thread 0x7fff69118700 (LWP 1329)]
[New Thread 0x7fff66917700 (LWP 1330)]
[New Thread 0x7fff66116700 (LWP 1331)]
[New Thread 0x7fff61915700 (LWP 1332)]
[New Thread 0x7fff5f114700 (LWP 1333)]
[New Thread 0x7fff5c913700 (LWP 1334)]
[New Thread 0x7fff5a112700 (LWP 1335)]
[New Thread 0x7fff57911700 (LWP 1336)]
[New Thread 0x7fff55110700 (LWP 1337)]
[New Thread 0x7fff5290f700 (LWP 1338)]
[New Thread 0x7fff5010e700 (LWP 1339)]
[New Thread 0x7fff4d90d700 (LWP 1340)]
[New Thread 0x7fff4b10c700 (LWP 1341)]
[New Thread 0x7fff4890b700 (LWP 1342)]
[New Thread 0x7fff4610a700 (LWP 1343)]
[New Thread 0x7fff43909700 (LWP 1344)]
[New Thread 0x7fff43108700 (LWP 1345)]
[New Thread 0x7fff40907700 (LWP 1346)]
[New Thread 0x7fff3e106700 (LWP 1347)]
[New Thread 0x7fff3b905700 (LWP 1348)]
[New Thread 0x7fff37104700 (LWP 1349)]
[New Thread 0x7fff34903700 (LWP 1350)]
[New Thread 0x7fff32102700 (LWP 1351)]
[New Thread 0x7fff2f901700 (LWP 1352)]
[New Thread 0x7fff2d100700 (LWP 1353)]
[New Thread 0x7fff2a8ff700 (LWP 1354)]
[New Thread 0x7fff280fe700 (LWP 1355)]
[New Thread 0x7fff258fd700 (LWP 1356)]
[New Thread 0x7fff230fc700 (LWP 1357)]
[New Thread 0x7fff208fb700 (LWP 1358)]
[New Thread 0x7fff1e0fa700 (LWP 1359)]
[New Thread 0x7fff1b8f9700 (LWP 1360)]
[New Thread 0x7fff190f8700 (LWP 1361)]
[New Thread 0x7fff168f7700 (LWP 1362)]
[New Thread 0x7fff140f6700 (LWP 1363)]
[New Thread 0x7fff118f5700 (LWP 1364)]
[New Thread 0x7fff0f0f4700 (LWP 1365)]
[New Thread 0x7fff0c8f3700 (LWP 1366)]
[New Thread 0x7fff0a0f2700 (LWP 1367)]
[New Thread 0x7fff078f1700 (LWP 1368)]
[New Thread 0x7fff050f0700 (LWP 1369)]
[New Thread 0x7fff048ef700 (LWP 1370)]
[New Thread 0x7ffef126b700 (LWP 1371)]
[New Thread 0x7ffef0a2a700 (LWP 1372)]
[Thread 0x7fff0a0f2700 (LWP 1367) exited]
[Thread 0x7fff050f0700 (LWP 1369) exited]
[Thread 0x7fff048ef700 (LWP 1370) exited]
[Thread 0x7fff078f1700 (LWP 1368) exited]
[Thread 0x7fff0c8f3700 (LWP 1366) exited]
[Thread 0x7fff0f0f4700 (LWP 1365) exited]
[Thread 0x7fff118f5700 (LWP 1364) exited]
[Thread 0x7fff140f6700 (LWP 1363) exited]
[Thread 0x7fff168f7700 (LWP 1362) exited]
[Thread 0x7fff190f8700 (LWP 1361) exited]
[Thread 0x7fff1b8f9700 (LWP 1360) exited]
[Thread 0x7fff1e0fa700 (LWP 1359) exited]
[Thread 0x7fff208fb700 (LWP 1358) exited]
[Thread 0x7fff230fc700 (LWP 1357) exited]
[Thread 0x7fff258fd700 (LWP 1356) exited]
[Thread 0x7fff280fe700 (LWP 1355) exited]
[Thread 0x7fff2a8ff700 (LWP 1354) exited]
[Thread 0x7fff2d100700 (LWP 1353) exited]
[Thread 0x7fff2f901700 (LWP 1352) exited]
[Thread 0x7fff32102700 (LWP 1351) exited]
[Thread 0x7fff34903700 (LWP 1350) exited]
[Thread 0x7fff37104700 (LWP 1349) exited]
[Thread 0x7fff3b905700 (LWP 1348) exited]
[Thread 0x7fff3e106700 (LWP 1347) exited]
[Thread 0x7fff40907700 (LWP 1346) exited]
[Thread 0x7fff43108700 (LWP 1345) exited]
[Thread 0x7fff43909700 (LWP 1344) exited]
[Thread 0x7fff4610a700 (LWP 1343) exited]
[Thread 0x7fff4890b700 (LWP 1342) exited]
[Thread 0x7fff4b10c700 (LWP 1341) exited]
[Thread 0x7fff4d90d700 (LWP 1340) exited]
[Thread 0x7fff5010e700 (LWP 1339) exited]
[Thread 0x7fff5290f700 (LWP 1338) exited]
[Thread 0x7fff55110700 (LWP 1337) exited]
[Thread 0x7fff57911700 (LWP 1336) exited]
[Thread 0x7fff5a112700 (LWP 1335) exited]
[Thread 0x7fff5c913700 (LWP 1334) exited]
[Thread 0x7fff5f114700 (LWP 1333) exited]
[Thread 0x7fff61915700 (LWP 1332) exited]
[Thread 0x7fff66116700 (LWP 1331) exited]
[Thread 0x7fff66917700 (LWP 1330) exited]
[Thread 0x7fff69118700 (LWP 1329) exited]
[Thread 0x7fff6b919700 (LWP 1328) exited]
[Thread 0x7fff6e11a700 (LWP 1327) exited]
[Thread 0x7fff7291b700 (LWP 1326) exited]
[Thread 0x7fff7311c700 (LWP 1325) exited]
[Thread 0x7fff7591d700 (LWP 1324) exited]
[Thread 0x7fff7811e700 (LWP 1323) exited]
[Thread 0x7fff7c91f700 (LWP 1322) exited]
[Thread 0x7fff7d120700 (LWP 1321) exited]
[Thread 0x7fff7f921700 (LWP 1320) exited]
[Thread 0x7fff82122700 (LWP 1319) exited]
[Thread 0x7fff84923700 (LWP 1318) exited]
[Thread 0x7fff87124700 (LWP 1317) exited]
[Thread 0x7fff89925700 (LWP 1316) exited]
[Thread 0x7fff8c126700 (LWP 1315) exited]
[Thread 0x7fff8e927700 (LWP 1314) exited]
[Thread 0x7fff91128700 (LWP 1313) exited]
[Thread 0x7fff93929700 (LWP 1312) exited]
[Thread 0x7fff9612a700 (LWP 1311) exited]
[Thread 0x7fff9892b700 (LWP 1310) exited]
[Thread 0x7fff9d12c700 (LWP 1309) exited]
[Thread 0x7fff9d92d700 (LWP 1308) exited]
[New Thread 0x7fff048ef700 (LWP 1470)]
[New Thread 0x7fff050f0700 (LWP 1471)]
[New Thread 0x7fff078f1700 (LWP 1571)]
Train Epoch: 0 [0/60000 (0%)]   Loss: 2.309546
[New Thread 0x7fff0a0f2700 (LWP 1572)]

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fffeead723d in _nv002114cupti () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
(gdb) bt
#0  0x00007fffeead723d in _nv002114cupti () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#1  0x00007fffeebb2870 in _nv000969cupti () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#2  0x00007fffee8fb7ab in _nv000434cupti () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#3  0x00007ffef7624dc3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffee8fb4a8 in _nv029272cupti () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#5  0x00007fffee8fcea9 in _nv029260cupti () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#6  0x00007fffee8fd35f in _nv029287cupti () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#7  0x00007fffee8ffe11 in cuptiSubscribe () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#8  0x00007fffe562913b in libkineto_init () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#9  0x00007fffe3ac44f8 in torch::autograd::profiler::prepareProfiler(torch::autograd::profiler::ProfilerConfig const&, std::set<torch::autograd::profiler::ActivityType, std::less<torch::autograd::profiler::ActivityType>, std::allocator<torch::autograd::profiler::ActivityType> > const&) ()
   from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so
#10 0x00007ffff34dec85 in void pybind11::cpp_function::initialize<void (*&)(torch::autograd::profiler::ProfilerConfig const&, std::set<torch::autograd::profiler::ActivityType, std::less<torch::autograd::profiler::ActivityType>, std::allocator<torch::autograd::profiler::ActivityType> > const&), void, torch::autograd::profiler::ProfilerConfig const&, std::set<torch::autograd::profiler::ActivityType, std::less<torch::autograd::profiler::ActivityType>, std::allocator<torch::autograd::profiler::ActivityType> > const&, pybind11::name, pybind11::scope, pybind11::sibling>(void (*&)(torch::autograd::profiler::ProfilerConfig const&, std::set<torch::autograd::profiler::ActivityType, std::less<torch::autograd::profiler::ActivityType>, std::allocator<torch::autograd::profiler::ActivityType> > const&), void (*)(torch::autograd::profiler::ProfilerConfig const&, std::set<torch::autograd::profiler::ActivityType, std::less<torch::autograd::profiler::ActivityType>, std::allocator<torch::autograd::profiler::ActivityType> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const () from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so
#11 0x00007ffff317bd39 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so
#12 0x00000000004e7a53 in _PyMethodDef_RawFastCallKeywords ()
#13 0x00000000004e77d0 in _PyCFunction_FastCallKeywords ()
#14 0x000000000055e548 in _PyEval_EvalFrameDefault ()
#15 0x00000000004e90aa in _PyFunction_FastCallKeywords ()
#16 0x0000000000559e31 in _PyEval_EvalFrameDefault ()
#17 0x00000000004e90aa in _PyFunction_FastCallKeywords ()
#18 0x0000000000559e31 in _PyEval_EvalFrameDefault ()
#19 0x00000000004e90aa in _PyFunction_FastCallKeywords ()
#20 0x0000000000559e31 in _PyEval_EvalFrameDefault ()
#21 0x0000000000558c80 in _PyEval_EvalCodeWithName ()
#22 0x0000000000558a03 in PyEval_EvalCode ()
#23 0x0000000000624542 in ?? ()
#24 0x00000000006248ca in PyRun_FileExFlags ()
#25 0x0000000000624767 in PyRun_SimpleFileExFlags ()
#26 0x00000000005fce7b in ?? ()
#27 0x00000000005fcb1a in _Py_UnixMain ()
#28 0x00007ffff6cbd840 in __libc_start_main (main=0x4e4260 <main>, argc=2, argv=0x7fffffff9168, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffff9158) at ../csu/libc-start.c:291
#29 0x00000000005fca09 in _start ()
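
The interpreter frames in the backtrace above (#12 to #27) carry no Python source information, so it does not show which Python line was executing. One way to get that, sketched here with the standard-library `faulthandler` module, is to have the interpreter dump the Python traceback when SIGSEGV arrives. The function name `enable_crash_traceback` is made up for illustration, and the script should be run outside gdb, since gdb stops on the signal first.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump Python tracebacks to `stream` on fatal signals such as SIGSEGV."""
    target = sys.stderr if stream is None else stream
    # all_threads=True also prints the other Python threads, not only
    # the one that received the signal.
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_traceback()` at the top of the script, before the profiler is created, should be equivalent to starting the script with `python3 -X faulthandler`.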

Environment

environment collected:

PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.6 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.5.1

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 440.64.00
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.8.1
[pip3] torchvision==0.8.2
[conda] Could not collect

@ptrblck, any idea what might be causing this?