Understanding torch.utils.bottleneck results

I just finished running torch.utils.bottleneck on my script and I’m pretty lost with one section. There is a section called autograd profiler output (CUDA mode) and another called autograd profiler output (CPU mode). The top events are IndexBackward and index_put_impl. I’m not sure what those refer to, but what matters most to me are the times it lists for them.
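For reference, bottleneck is run as a wrapper around the training script, and the two profiler sections come from separate profiled runs (one with CUDA event timing enabled, one without). Assuming an entry point named train.py (a stand-in for your own script):

```shell
# Profile a short training run; keep it to a few iterations, since the whole
# script executes under the profiler and produces the CPU- and CUDA-mode tables.
python -m torch.utils.bottleneck train.py
```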

In CPU mode it shows this:

Self CPU time total: 341.996s
CUDA time total: 0.000us

In CUDA mode it shows this:

Self CPU time total: 355.418s
CUDA time total: 15.872ms

What exactly are these times? Should I be concerned that in CUDA mode there is still so much time spent on the CPU?
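As a rough guide: “Self CPU time total” sums the host-side time spent inside each operator itself (excluding time attributed to child operators), while “CUDA time total” is what the CUDA event timers recorded on the GPU. A huge CPU total next to a tiny CUDA total usually means the work is happening on the CPU rather than the GPU. You can see where those footer lines come from with a minimal profiler run (a sketch using torch.autograd.profiler directly, rather than through bottleneck):

```python
import torch
from torch.autograd import profiler

x = torch.randn(256, 256)

# Profile a couple of CPU ops; in CUDA mode, bottleneck does the same run
# again with CUDA event timing enabled.
with profiler.profile() as prof:
    y = (x @ x).relu()

# Per-op "Self CPU" excludes time spent in child ops; the table footer sums
# it into the "Self CPU time total" line you see in the bottleneck output.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(table)
```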

This model runs very slowly with very high memory use, and I’m only using a batch size of 16; anything larger crashes.

Any help would be greatly appreciated.

This might indicate that you are running into a CPU bottleneck, e.g. from data loading.
You could try to adapt the data loading measurements from the ImageNet example to check whether that’s the case.
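Concretely, the measurement in the ImageNet example splits each iteration into the time spent waiting on the DataLoader versus the full iteration time. A minimal, framework-agnostic sketch of that pattern (names like `timed_epoch` and `step_fn` are mine, not from the example):

```python
import time

class AverageMeter:
    """Running average, as used in the PyTorch ImageNet example."""
    def __init__(self):
        self.sum, self.count = 0.0, 0
    def update(self, val, n=1):
        self.sum += val * n
        self.count += n
    @property
    def avg(self):
        return self.sum / max(self.count, 1)

def timed_epoch(loader, step_fn):
    """Split each iteration's wall time into data-waiting vs. total."""
    data_time, batch_time = AverageMeter(), AverageMeter()
    end = time.perf_counter()
    for batch in loader:
        data_time.update(time.perf_counter() - end)   # time blocked on the loader
        step_fn(batch)                                # forward/backward/optimizer step
        batch_time.update(time.perf_counter() - end)  # full iteration time
        end = time.perf_counter()
    return data_time.avg, batch_time.avg
```

If `data_time.avg` is a sizable fraction of `batch_time.avg`, the GPU is being starved by data loading, and more loader workers or faster storage would help.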

So I played around with the timers in that code (which is awesome by the way, thank you) and also ran a few small experiments with torch.save and torch.load. I tried to be slick and save the part that took the longest each time so that I could just pull it back up and train. Yeah, it doesn’t work that way.

What I did learn though was that my code takes the longest time when I do the following:

loss_dict = model(imgs, annotations)

When that part gets executed it takes almost all my memory and things start to slow down. Maybe I should mention that I have an NVIDIA Xavier. I have the 32 GB model with a 512 GB drive attached (this is where I store all my stuff). My dataset is about 8600 images (512 x 640). It’s a COCO-format dataset and I’m using Faster R-CNN with a ResNet-50 backbone.

It takes me about 30 hours to train this model. Perhaps that’s just what to expect from my hardware, but looking at others’ examples, it seems slow. If it is, then I’ve definitely done something wrong in my code.
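For a rough sanity check on the 30 hours: with 8600 images and a batch size of 16 there are about 538 iterations per epoch, so the per-iteration time depends entirely on the epoch count, which the thread doesn’t state; the values below are hypothetical:

```python
import math

images, batch_size = 8600, 16
iters_per_epoch = math.ceil(images / batch_size)  # 538 iterations per epoch

total_seconds = 30 * 3600
# Epoch count is not given in the thread, so try a few hypothetical values
# to see what wall time per iteration each would imply:
for epochs in (10, 26, 50):
    per_iter = total_seconds / (iters_per_epoch * epochs)
    print(f"{epochs} epochs -> {per_iter:.2f} s/iteration")
```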

Am I just expecting more than I should?

Which Xavier platform are you using? Xaviers are embedded devices, which will not yield the same training performance as a server GPU.

I got the Xavier AGX Developer kit.

Looking at it now, this little guy will work wonders for my edge application when doing inference, but when I look at the specs of a server-class GPU, it makes sense that it would be slower for training.

I’ll have to make do with the times I’m seeing.