Suboptimal GPU utilisation

I have a simple, shallow neural network. When I train it, GPU utilisation sits at only ~35%. Could anyone suggest what might be causing this under-utilisation of resources?

Some details about the setup: training examples are fed to the model from 32 worker processes via the torch.utils.data.DataLoader API with pinned memory, and the input layer is a torch.nn.Embedding with sparse gradients. The model is trained on a Tesla P100 in GCP, and the host machine has 16 vCPUs. According to cProfile, the run_backward method takes the most time, with the CPU-to-GPU transfer (the cuda method) second. According to the autograd profiler, pin_memory takes the most time, with batched matrix multiplication (bmm) second.
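For reference, here is a minimal sketch of the setup described above. The dataset, dimensions, hyperparameters, and model structure are illustrative placeholders rather than my actual code; only the DataLoader settings, the sparse embedding input layer, and the profiling calls reflect the real setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class ShallowNet(nn.Module):
    """Shallow model: sparse embedding input layer plus one linear head."""

    def __init__(self, vocab_size=100_000, embed_dim=128, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # Mean-pool the embedded tokens, then classify.
        return self.fc(self.embedding(x).mean(dim=1))


if __name__ == "__main__":
    # Illustrative placeholder data: 10k examples of 20 token ids each.
    token_ids = torch.randint(0, 100_000, (10_000, 20))
    labels = torch.randint(0, 10, (10_000,))

    # 32 worker processes with pinned host memory, as described above.
    loader = DataLoader(
        TensorDataset(token_ids, labels),
        batch_size=256,
        shuffle=True,
        num_workers=32,
        pin_memory=True,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = ShallowNet().to(device)
    # Plain SGD (no momentum) accepts the sparse gradients from the embedding.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    with torch.autograd.profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
        for batch_tokens, batch_labels in loader:
            # Host -> GPU copy: the `cuda` transfer step that cProfile flags.
            batch_tokens = batch_tokens.to(device, non_blocking=True)
            batch_labels = batch_labels.to(device, non_blocking=True)

            optimizer.zero_grad()
            loss = criterion(model(batch_tokens), batch_labels)
            loss.backward()  # run_backward, the top entry in cProfile
            optimizer.step()

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```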

I would appreciate any suggestions for improving GPU utilisation.