PyTorch profiler with Tensorboard not capturing Dataloader time

Issue → The PyTorch profiler is not capturing DataLoader time or runtime; both always show 0.
Code used → I used the code given in the official PyTorch profiler documentation (PyTorch documentation)

Hardware used → Nvidia AI100 GPU
PyTorch version → 1.13.0+cu117
PyTorch TensorBoard profiler version → 0.4.1

@ptrblck can you please help me out here.

I’m not familiar enough with the Kineto profiler and don’t know why it’s not showing the DataLoader workload. As an alternative, you could use nvtx ranges and profile your workload with Nsight Systems as described in this post.

Hi @ptrblck, thanks for suggesting the alternative. I ran the nsys command and opened the generated output in Nsight Systems, but got nvtx and cuda errors.

Did you follow my tutorial and were you able to profile the example code using the provided commands?

Yes @ptrblck ,
I needed to modify the command a little bit as your command
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python
was giving the following error:
unrecognised option '--stop-on-range-end=true'

so I changed it to
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown --cudabacktrace=true -x true -o my_profile python
With the example you shared, I am getting warnings such as “Not all NVTX events might have been collected”.

@ptrblck it worked now, thanks once again. One last question: can we add custom labels to parallel/async tasks as well, for example when num_workers=2 in the DataLoader?
The end goal is to see visually whether a particular task runs synchronously or asynchronously.

I would assume you could add nvtx ranges inside the Dataset.__getitem__ and use the worker id for the range tag. This should show up in the timeline for each worker of the DataLoader.
I haven’t tried this out yet, so let me know if it works.
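
The suggestion above can be sketched roughly as follows. This is a minimal sketch, not a tested profiling setup: `TaggedDataset`, `range_tag`, and `nvtx_range` are hypothetical names introduced for illustration, and the fallback to a no-op assumes that NVTX calls raise `RuntimeError` on a PyTorch build without CUDA support.

```python
import contextlib

import torch
from torch.utils.data import Dataset, get_worker_info


def range_tag(index):
    """Build an NVTX label that contains the DataLoader worker id."""
    info = get_worker_info()  # None when called in the main process
    worker = "main" if info is None else f"worker{info.id}"
    return f"getitem/{worker}/idx{index}"


@contextlib.contextmanager
def nvtx_range(name):
    """Open an NVTX range; fall back to a no-op if NVTX is unavailable."""
    try:
        torch.cuda.nvtx.range_push(name)
        pushed = True
    except RuntimeError:  # assumption: raised on a CPU-only PyTorch build
        pushed = False
    try:
        yield
    finally:
        if pushed:
            torch.cuda.nvtx.range_pop()  # keep pushes and pops balanced


class TaggedDataset(Dataset):
    """Toy dataset (hypothetical) that wraps each sample fetch in a range."""

    def __init__(self, num_samples=20):
        self.x = torch.randn(num_samples, 10)
        self.y = torch.randn(num_samples, 1)

    def __getitem__(self, index):
        with nvtx_range(range_tag(index)):
            return self.x[index], self.y[index]

    def __len__(self):
        return len(self.x)
```

Passing `TaggedDataset()` to a `DataLoader` with `num_workers=2` would then label each fetch with the worker that performed it. Whether those ranges appear in the Nsight Systems timeline depends on the profiler being able to follow the worker processes.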

@ptrblck it worked, but the amount of data being loaded doesn’t change when I change the prefetch factor while keeping the number of workers at 3.
For example, in the image below each worker made 128 __getitem__ calls, and the first worker made 64 additional ones, as there were only 128*3+64 images. With a prefetch factor of 2, shouldn’t the 64 __getitem__ calls under batch 0 happen during data loading?
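
One way to reason about this is to work out which batches are handed to which worker when the iterator starts. The sketch below is pure arithmetic and does not call PyTorch. It assumes batch_size=128 (not stated in the post) and assumes batches are assigned to workers in round-robin order. The DataLoader documentation states that prefetch_factor is the number of batches loaded in advance by each worker, so prefetch_factor * num_workers batches are requested up front. `initial_dispatch` is a hypothetical helper.

```python
import math


def initial_dispatch(num_samples, batch_size, num_workers, prefetch_factor):
    """Map worker id -> list of (batch_index, batch_length) queued at start."""
    num_batches = math.ceil(num_samples / batch_size)
    # At most prefetch_factor batches per worker are requested up front.
    queued = min(num_batches, num_workers * prefetch_factor)
    plan = {worker: [] for worker in range(num_workers)}
    for batch in range(queued):
        start = batch * batch_size
        length = min(batch_size, num_samples - start)
        plan[batch % num_workers].append((batch, length))  # round-robin (assumed)
    return plan


# 128*3+64 = 448 samples, 3 workers, as in the post; batch_size=128 is assumed.
plan = initial_dispatch(448, 128, num_workers=3, prefetch_factor=2)
for worker, batches in plan.items():
    print(worker, batches)
```

Under these assumptions there are only 4 batches, and a prefetch factor of 2 already allows 6 to be queued, so the final 64-sample batch is queued for the first worker at startup, behind its first 128 samples. Raising the prefetch factor further would not change this picture.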

I have the same symptoms :

  • DataLoader time is not reported/captured
  • I capture 20 iterations, but TensorBoard shows only one long iteration (with all the time added up)


  • pytorch 1.13.1
  • Nvidia Titan, ubuntu 20.04, tensorboard 2.11.2
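
A single long iteration can appear when the profiler is never told where one iteration ends. The thread does not confirm that this is the cause here. The sketch below assumes a CPU-only run with a toy model and dataset (both hypothetical) and shows the pattern of a schedule plus one `prof.step()` call per iteration. The `on_ready` callback is a stand-in; a real run would more likely pass `torch.profiler.tensorboard_trace_handler` with a log directory.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule
from torch.utils.data import DataLoader, TensorDataset

traces_ready = []  # filled by the callback each time a trace is completed


def on_ready(prof):
    # Stand-in for torch.profiler.tensorboard_trace_handler("<log dir>").
    traces_ready.append(True)


def run(num_iters=6):
    """Profile a few iterations of a toy loop and count completed traces."""
    loader = DataLoader(TensorDataset(torch.randn(64, 10)), batch_size=4)
    model = torch.nn.Linear(10, 3)
    with profile(
        activities=[ProfilerActivity.CPU],
        # Skip 1 step, warm up for 1, then record 3 steps, once.
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=on_ready,
    ) as prof:
        for step, (x,) in enumerate(loader):
            if step >= num_iters:
                break
            model(x).sum().backward()
            prof.step()  # marks the boundary between profiled iterations
    return len(traces_ready)
```

The loop iterates over the DataLoader inside the `profile` context so that data fetching falls within the recorded steps.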

@Maxime_G please use this method

if you are starting the training in a separate thread make sure you use the “spawn” as multiprocessing startup method.
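
A minimal sketch of selecting the start method with the standard library, assuming it is called once before any worker processes are created. `use_spawn` is a hypothetical helper name.

```python
import multiprocessing as mp


def use_spawn():
    """Select 'spawn' as the global start method for new processes."""
    # force=True overrides a start method that was already chosen earlier.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()
```

In a training script this would typically be called at the top of the `if __name__ == "__main__":` block, before the DataLoader is iterated.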

TensorBoard didn’t work for me; there is a bug in the output file generated by the PyTorch profiler.

Could you give me some advice on how to show the "DATA LOADING" range? I tried PyTorch 1.10 and 1.13, and neither worked. Below are my environment and test code.

>>> nsys status -e
Timestamp counter supported: Yes
Sampling Environment Check
Linux Kernel Paranoid Level = 1: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 5.14.0-1056-oem: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK

import os
import torch
import nvtx
import time
from torch import nn
from torch.optim import SGD
from torch.utils.data import Dataset, DataLoader, DistributedSampler

class TestDataset(Dataset):
    def __init__(self) -> None:
        self.x = torch.randn(num_samples, 10)
        self.y = torch.randn(num_samples, 1)

    # @nvtx.annotate("data loading",color='yellow')
    def __getitem__(self, index):
        torch.cuda.nvtx.range_push("data loading")
        x = self.x[index]
        y = self.y[index]
        torch.cuda.nvtx.range_pop()  # close the range opened above so pushes and pops stay balanced
        return x,y

    def __len__(self):
        return num_samples

class Model(nn.Module):
    def __init__(self) -> None:
        super(Model, self).__init__()
        self.classifer = nn.Linear(10, 3)
        self.pred = nn.Linear(10, 3)

    # @nvtx.annotate("forward",color='blue')
    def forward(self, x):
        pred = self.pred(x)
        classifier_out = self.classifer(x)
        classifer_score = torch.softmax(classifier_out, dim=1)
        _, index = torch.max(classifer_score, dim=1)
        return pred[torch.arange(index.shape[0]), index]

def train():
    model = Model().to(device)
    optimizer = SGD(model.parameters(), 0.1)
    loss_fn = nn.MSELoss()
    dataset = TestDataset()
    # train_sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size = batch_size, num_workers = 2, sampler=None)#,sampler=train_sampler
    for epoch in range(1):
        for batch, (x, y) in enumerate(dataloader):
            if batch == 1: torch.cuda.cudart().cudaProfilerStart()
            if batch >= 1: torch.cuda.nvtx.range_push("iteration{}".format(batch))

            x = x.to(device)
            y = y.to(device)

            if batch >= 1: torch.cuda.nvtx.range_push("forward")
            output = model(x)
            if batch >= 1: torch.cuda.nvtx.range_pop()

            loss = loss_fn(output, y)

            if batch >= 1: torch.cuda.nvtx.range_push("backward")
            optimizer.zero_grad()
            loss.backward()
            if batch >= 1: torch.cuda.nvtx.range_pop()
            if batch >= 1: torch.cuda.nvtx.range_push("opt.step()")
            optimizer.step()
            if batch >= 1: torch.cuda.nvtx.range_pop()

            # print('pred.bias:', model.state_dict()['pred.bias'])
            print('classifer.bias:', model.state_dict()['classifer.bias'])
            if batch >= 1: torch.cuda.nvtx.range_pop()# iteration


if __name__ == "__main__":
    batch_size = 4
    num_samples = 20
    device = torch.device('cuda')
    train()  # run the profiled training loop

Run code:

nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --force-overwrite=true --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown --cudabacktrace=true -x true --output=quickstart python

@ptrblck Hi, I found a way to make the Dataset ‘data loading’ label visible: pass multiprocessing_context='spawn' to the DataLoader. However, this increases memory usage. Is it the only way to use nvtx in a PyTorch DataLoader?
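
For reference, a minimal sketch of the per-loader setting described above, using a toy `TensorDataset` (hypothetical data). Only this loader's workers use 'spawn'; the process-wide start method is left alone. Spawned workers start a fresh interpreter and receive a pickled copy of the dataset, which may account for the higher memory use.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(num_samples=20, batch_size=4):
    """Build a DataLoader whose workers are started with 'spawn'."""
    dataset = TensorDataset(
        torch.randn(num_samples, 10), torch.randn(num_samples, 1)
    )
    # Worker processes are created only when the loader is iterated.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=2,
        multiprocessing_context="spawn",
    )


loader = make_loader()
```

With 'spawn', a custom Dataset class must be picklable and defined at module level, and the script's entry point should sit under `if __name__ == "__main__":`.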