I have a simple, shallow neural network. When I train this model, GPU utilisation is only ~35%. Could anyone suggest what might be causing this under-utilisation of resources?

Training examples are fed to the model from 32 worker processes using the `torch.utils.data.DataLoader` API with pinned memory. I use `torch.nn.Embedding` with sparse gradients as the input layer. The model is trained on a Tesla P100 on GCP; the host machine has 16 vCPUs.
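For reference, here is a minimal sketch of the setup described above (the vocabulary size, embedding dimension, batch size, and model body are illustrative placeholders, not my actual values):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; my real data is loaded differently.
dataset = TensorDataset(
    torch.randint(0, 100_000, (1_000_000, 16)),  # token indices
    torch.rand(1_000_000, 1),                    # targets
)

# 32 worker processes, pinned host memory for faster host-to-GPU copies.
loader = DataLoader(
    dataset,
    batch_size=512,
    num_workers=32,
    pin_memory=True,
    shuffle=True,
)

class ShallowNet(nn.Module):
    def __init__(self, num_embeddings=100_000, dim=64):
        super().__init__()
        # sparse=True: only the embedding rows touched in a batch
        # receive a gradient.
        self.emb = nn.Embedding(num_embeddings, dim, sparse=True)
        self.fc = nn.Linear(dim, 1)

    def forward(self, idx):
        # Mean-pool the embedded tokens, then one linear layer.
        return self.fc(self.emb(idx).mean(dim=1))

model = ShallowNet().cuda()
```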
According to cProfile, the `run_backward` method takes the most time, and the CPU -> GPU transfer (the `cuda` method) is second. According to the autograd profiler, `pin_memory` takes the most time; matrix multiplication (`bmm`) is second.
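For completeness, the autograd numbers come from a profiling run roughly like the sketch below, assuming the `model` and `loader` from the snippet above; the optimizer and loss function here are stand-ins, not my actual ones:

```python
import torch

# Stand-ins: SGD without momentum is one of the optimizers that
# accepts the sparse gradients produced by the embedding layer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for step, (idx, target) in enumerate(loader):
        # non_blocking=True only overlaps the copy with compute
        # because the DataLoader pins its batches (pin_memory=True).
        idx = idx.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)
        loss = loss_fn(model(idx), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step == 100:  # profile a limited number of batches
            break

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```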
I would appreciate any suggestions for improving GPU utilisation.