PyTorch profiler with TensorBoard not capturing DataLoader time

Issue → The PyTorch profiler is not capturing DataLoader time or runtime. Both always show 0.
Code used → I used the code given in the official PyTorch profiler documentation (PyTorch documentation)

Hardware used → NVIDIA A100 GPU
PyTorch version → 1.13.0+cu117
PyTorch tensorboard profiler version → 0.4.1

@ptrblck, could you please help me out here?

I’m not familiar enough with the Kineto profiler and don’t know why it’s not showing the DataLoader workload. As an alternative, you could use nvtx ranges and profile your workload with Nsight Systems as described in this post.


Hi @ptrblck, thanks for suggesting the alternative. I ran the nsys command and generated the output, but when I opened it in Nsight Systems I got NVTX and CUDA errors.

Did you follow my tutorial and were you able to profile the example code using the provided commands?

Yes @ptrblck,
I had to modify the command a little, because your command
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python
was giving the following error:
unrecognised option '--stop-on-range-end=true'

so I changed it to
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown --cudabacktrace=true -x true -o my_profile python
With the example you shared, I am now getting warnings such as "Not all NVTX events might have been collected".

@ptrblck it works now, thanks once again. One last question: can we add custom labels to parallel/async tasks as well, for example when num_workers=2 in the DataLoader?
The end goal is to see visually whether a particular task runs synchronously or asynchronously.

I would assume you could add nvtx ranges inside the Dataset.__getitem__ and use the worker id for the range tag. This should show up in the timeline for each worker of the DataLoader.
I haven’t tried this out yet, so let me know if it works.
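A minimal sketch of that suggestion, untested under Nsight Systems here. The names TaggedDataset, range_label and nvtx_range are hypothetical, the tag format is arbitrary, and the RuntimeError fallback assumes that a build without NVTX support raises that exception from range_push:

```python
from contextlib import contextmanager

import torch
from torch.utils.data import Dataset, get_worker_info


def range_label(index):
    """Build a tag such as 'worker1/getitem[42]' (the format is arbitrary)."""
    info = get_worker_info()  # None when called from the main process
    owner = "main" if info is None else f"worker{info.id}"
    return f"{owner}/getitem[{index}]"


@contextmanager
def nvtx_range(label):
    """Push and pop an NVTX range; skip it on builds without NVTX support."""
    try:
        torch.cuda.nvtx.range_push(label)
        pushed = True
    except RuntimeError:  # assumption: raised when NVTX is not available
        pushed = False
    try:
        yield
    finally:
        if pushed:
            torch.cuda.nvtx.range_pop()


class TaggedDataset(Dataset):
    """Hypothetical toy dataset; put the real loading code inside the range."""

    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Everything inside this block is covered by one tagged range.
        with nvtx_range(range_label(index)):
            return self.data[index] * 2
```

Wrapping this dataset in a DataLoader with num_workers=2 and profiling with nvtx included in the -t list should, if the approach works as described, give each worker its own tagged ranges in the timeline.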


@ptrblck it worked, but the amount of data being loaded doesn't change when I change the prefetch factor while keeping the number of workers at 3.
For example, in the image below, each worker made 128 __getitem__ calls and the first worker made 64 additional ones, since there were only 128*3+64 images. With a prefetch factor of 2, shouldn't the 64 __getitem__ calls shown under batch 0 happen during data loading instead?
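The arithmetic behind those numbers can be checked with a small sketch. It assumes a batch size of 128 (inferred from the per-worker call count, not stated in the post) and that batches are handed to workers in round-robin order. The helper items_per_worker is hypothetical and only models the bookkeeping, not the DataLoader itself:

```python
import math


def items_per_worker(num_items, batch_size, num_workers):
    """Count __getitem__ calls per worker under round-robin batch assignment."""
    counts = [0] * num_workers
    num_batches = math.ceil(num_items / batch_size)
    for batch in range(num_batches):
        start = batch * batch_size
        size = min(batch_size, num_items - start)  # the last batch may be short
        counts[batch % num_workers] += size
    return num_batches, counts


num_workers = 3
prefetch_factor = 2
num_batches, counts = items_per_worker(128 * 3 + 64, 128, num_workers)

# The DataLoader docs describe a total of prefetch_factor * num_workers
# batches prefetched across all workers.
prefetch_budget = prefetch_factor * num_workers

print(num_batches)      # 4
print(counts)           # [192, 128, 128]
print(prefetch_budget)  # 6
```

Under these assumptions there are only 4 batches while the prefetch budget is 6, so the prefetch factor cannot change how much data is loaded in total. The extra 64 items still go to the first worker, which can only start them after finishing its first batch.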