Find bottleneck in DL training

Hi, I am training a deep learning model that applies data transformations on the fly. However, training is very slow, and I am trying to figure out the bottleneck - disk I/O, CPU, and/or GPU.

I am running on GCP with 16 vCPUs, a P100 GPU, and a 200 GB persistent disk.

This is the output of htop

This is the output of iotop

Can anyone help me understand what the bottleneck is?
From the htop output, CPU usage seems to be at ~100%, and from the iotop output, I/O doesn't seem to be the bottleneck.

I ran cProfile on a smaller dataset. The DataLoader seems to be taking up 80% of the time.
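In case it helps, this is roughly the kind of setup I used to profile (simplified sketch; the dataset, model, and loop here are just stand-ins for my actual code):

```python
import cProfile
import pstats
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model, just to make the sketch runnable.
dataset = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10))

def run_epoch():
    # One pass over the data; this is what gets profiled.
    for images, target in loader:
        loss = torch.nn.functional.cross_entropy(model(images), target)
        loss.backward()

# Profile one epoch and print the 20 most expensive calls by cumulative time.
cProfile.run("run_epoch()", "train.prof")
pstats.Stats("train.prof").sort_stats("cumulative").print_stats(20)
```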

So, what exactly is the bottleneck here?

I would recommend timing the DataLoader using the AverageMeter from the ImageNet example.
I'm not sure cProfile is giving you the right picture, since CUDA operations are asynchronous, so the time might get accumulated in the next blocking operation on the CPU side.
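Something along these lines (a simplified sketch adapted from the ImageNet example; the model, criterion, optimizer, and loader are placeholders for your own code). If `data_time.avg` is close to `batch_time.avg`, the DataLoader is the bottleneck:

```python
import time
import torch

# AverageMeter as used in the PyTorch ImageNet example
# (https://github.com/pytorch/examples/tree/main/imagenet).
class AverageMeter:
    """Keeps a running average of a measured value."""
    def __init__(self):
        self.val = self.avg = self.sum = self.count = 0
    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def train_one_epoch(loader, model, criterion, optimizer, device):
    data_time = AverageMeter()   # time spent waiting on the DataLoader
    batch_time = AverageMeter()  # total time per iteration
    model.train()
    end = time.time()
    for images, target in loader:
        # Everything between the end of the previous iteration and this
        # point is data loading / augmentation time.
        data_time.update(time.time() - end)

        images, target = images.to(device), target.to(device)
        loss = criterion(model(images), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # CUDA kernels run asynchronously, so synchronize before reading
        # the clock to get a meaningful per-iteration time.
        if device.type == "cuda":
            torch.cuda.synchronize()
        batch_time.update(time.time() - end)
        end = time.time()

    print(f"avg data time {data_time.avg:.3f}s / avg batch time {batch_time.avg:.3f}s")
```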
