Backward time consumption increases linearly with batch_size

When training a simple ResNet (convolutional forward and backward passes) on the CelebA dataset, I find that the time spent in loss.backward() (MSE loss) increases linearly with batch_size. A simple profile is shown below, followed by a rough sketch of the setup:

  • Here is part of the backward profiling with batch_size set to 128; GPU memory consumption is about 1.5 GB / 12 GB.

  • Here is another backward profile with batch_size set to 512; GPU memory consumption is about 4.7 GB / 12 GB.
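For reference, here is a minimal sketch of the kind of script that produces these profiles. It is not my exact training code: the torchvision resnet18, the 40-dimensional regression target, the CelebA-like input size, and the use of torch.profiler are all placeholders/assumptions, but the structure (forward, MSE loss, profiled backward) matches what I am measuring.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity
from torchvision.models import resnet18

device = torch.device("cuda")
model = resnet18(num_classes=40).to(device)   # placeholder: 40 regression targets
criterion = nn.MSELoss()

def profile_backward(batch_size):
    # Random stand-ins for a CelebA-sized batch (3 x 218 x 178 images).
    x = torch.randn(batch_size, 3, 218, 178, device=device)
    y = torch.randn(batch_size, 40, device=device)
    loss = criterion(model(x), y)
    # Profile only the backward pass, on both CPU and CUDA.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        loss.backward()
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

profile_backward(128)
profile_backward(512)
```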

I am sure the bottleneck is loss.backward(). However, neither the CPU nor the GPU is fully utilized, so it is strange that the backward time grows linearly with batch_size. Parallel processing does not seem to be helping here.
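For completeness, this is roughly how I time the backward call (a sketch, assuming single-GPU training as above). The torch.cuda.synchronize() calls are there so that CUDA's asynchronous kernel launches are actually finished before the timer is read, i.e. the measured interval should really belong to backward():

```python
import time
import torch

def time_backward(loss):
    # Make sure kernels from the forward pass are done before starting the timer,
    # so their runtime is not attributed to backward().
    torch.cuda.synchronize()
    start = time.perf_counter()
    loss.backward()
    # backward() only enqueues CUDA kernels; wait for them to finish
    # before stopping the clock.
    torch.cuda.synchronize()
    return time.perf_counter() - start
```

Usage would be something like `print(time_backward(criterion(model(x), y)))` inside the training loop.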