I just finished running torch.utils.bottleneck on my script and I’m pretty lost about one section. There is a section called autograd profiler output (CUDA mode) and another called autograd profiler output (CPU mode). The top events are IndexBackward and index_put_impl. I’m not sure what those refer to, but most important are the times listed for them.
In CPU mode it shows this:
Self CPU time total: 341.996s
CUDA time total: 0.000us
In CUDA mode it shows this:
Self CPU time total: 355.418s
CUDA time total: 15.872ms
What exactly are these times? Should I be concerned that in CUDA mode there is still so much time spent on the CPU?
This model runs very slowly with very high memory use and I’m only using a batch size of 16. Any more and it crashes.
This might indicate that you are running into a CPU bottleneck, e.g. from data loading.
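Regarding the numbers themselves: “Self CPU time total” sums the time each operator spent on the CPU excluding its child ops, while “CUDA time total” is the time measured inside CUDA kernels, so a mostly-CPU total like yours does point at host-side work. You can reproduce the IndexBackward / index_put_ pattern with a tiny snippet (the shapes here are made up for illustration):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Tiny repro of the kind of op that shows up as IndexBackward /
# index_put_: advanced (tensor) indexing in the forward pass, whose
# backward scatters gradients with index_put_.
x = torch.randn(1000, 64, requires_grad=True)
idx = torch.randint(0, 1000, (500,))

with profile(activities=[ProfilerActivity.CPU]) as prof:
    y = x[idx].sum()   # advanced indexing -> IndexBackward
    y.backward()       # backward accumulates grads via index_put_

# "Self CPU time" is per-op time excluding child ops; the totals at
# the bottom of the bottleneck report are the sums of these columns.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

Sorting by `self_cpu_time_total` mirrors the ordering in the bottleneck report, which is why the indexing ops float to the top for you.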
You could try adapting the data loading measurements from the ImageNet example to check if that’s the case.
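The pattern from the ImageNet example boils down to timing how long each batch takes to arrive from the loader, separately from the forward/backward work. A minimal self-contained sketch (the helper names are just illustrative, not part of any library):

```python
import time

class AverageMeter:
    """Running average, in the style of the PyTorch ImageNet example."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def time_loader(loader, num_batches=50):
    """Measure the average time spent waiting on the data loader."""
    data_time = AverageMeter()
    end = time.perf_counter()
    for i, batch in enumerate(loader):
        # Time elapsed since the last step ended == time waiting for data.
        data_time.update(time.perf_counter() - end)
        # ... forward / backward / optimizer step would go here ...
        end = time.perf_counter()
        if i + 1 >= num_batches:
            break
    return data_time.avg
```

If `data_time` stays close to zero while each step is still slow, the bottleneck is in the model itself rather than in loading.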
So I played around with the timers in that code (which is awesome, by the way, thank you) and also ran a few small experiments with torch.save and torch.load. I tried to be slick and save the part that took the longest each time so that I could just pull it back up and train. Yeah, it doesn’t work that way.
What I did learn though was that my code takes the longest time when I do the following:
loss_dict = model(imgs, annotations)
When that part gets executed it takes almost all my memory and things start to slow down. Maybe I should mention that I have an Nvidia Xavier. It’s the 32 GB model with a 512 GB drive attached (this is where I store all my stuff). My dataset is about 8600 images (512 × 640). It’s a COCO-format dataset and I’m training a Faster R-CNN with a ResNet-50 backbone.
It takes me about 30 hours to train this model. Perhaps that’s just what to expect from my gear, but looking at others’ examples, it seems slow. If it is, then I’ve definitely done something wrong in my code.
Looking at it now, this little guy will work wonders for my edge application when doing inference, but when I look at the specs of a server-class GPU, it makes sense that it would be slower for training.