When I am training a simple ResNet (convolutional forward and backward passes) on the CelebA dataset, I find that the time consumed by loss.backward() (MSE loss) increases linearly with batch_size. A simple profile is as follows:
Here is another backward profile. The batch_size is set to 512, and GPU RAM consumption is about 4.7 GB out of 12 GB.
I am sure the bottleneck is loss.backward(). However, neither the CPU nor the GPU is fully utilized, so it is strange that the backward time grows linearly; the parallel processing does not seem to be working here.
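In case it matters, here is roughly how the backward time can be measured in isolation (a minimal sketch, not my exact training script; the torchvision resnet18 stand-in, the single regression output, and the CelebA-sized inputs are assumptions). torch.cuda.synchronize() is included because CUDA kernels launch asynchronously, so timing backward() without it can be misleading:

```python
import time
import torch
import torch.nn as nn
import torchvision.models as models

# Sketch: time loss.backward() alone at several batch sizes.
device = torch.device("cuda")
model = models.resnet18(num_classes=1).to(device)  # stand-in for the actual resnet
criterion = nn.MSELoss()

for batch_size in (64, 128, 256, 512):
    x = torch.randn(batch_size, 3, 218, 178, device=device)   # CelebA-sized dummy input
    target = torch.randn(batch_size, 1, device=device)

    model.zero_grad(set_to_none=True)
    loss = criterion(model(x), target)

    torch.cuda.synchronize()          # make sure the forward pass has finished
    start = time.time()
    loss.backward()
    torch.cuda.synchronize()          # wait for all backward kernels to complete
    print(f"batch_size={batch_size}: backward took {time.time() - start:.4f}s")
```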