I have a simple, shallow neural network. When I train this model, GPU utilisation is only ~35%. Could anyone suggest what might cause this under-utilisation of resources?

Training examples are fed to the model from 32 worker processes via the torch.utils.data.DataLoader API with pinned memory. I use torch.nn.Embedding with sparse gradients as the input layer. The model is trained on a Tesla P100 in GCP; the host machine has 16 vCPUs.

According to cProfile, the run_backward method takes the most time, and the CPU -> GPU transfer (the cuda method) is second. According to the autograd profiler, pin_memory takes the most time, and matrix multiplication (bmm) is second.
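For reference, here is a minimal sketch of the setup described above. The dataset, model shape, and dimensions are placeholders I made up for illustration; only the DataLoader settings (32 workers, pinned memory) and the sparse-gradient embedding match my actual configuration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data standing in for the real training set (assumption).
indices = torch.randint(0, 1000, (4096, 10))
targets = torch.randn(4096, 1)
dataset = TensorDataset(indices, targets)

# DataLoader configured as described: 32 feeding processes,
# pinned host memory for faster CPU -> GPU transfers.
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=32,   # 32 worker processes
    pin_memory=True,  # pinned memory
)

class ShallowNet(nn.Module):
    """Shallow model with a sparse-gradient embedding input layer."""
    def __init__(self, num_embeddings=1000, dim=64):
        super().__init__()
        # sparse=True makes the embedding emit sparse gradients
        self.emb = nn.Embedding(num_embeddings, dim, sparse=True)
        self.fc = nn.Linear(dim, 1)

    def forward(self, idx):
        # average the embedded tokens, then project to one output
        return self.fc(self.emb(idx).mean(dim=1))
```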
I would appreciate any suggestions for improving GPU utilisation.