I wonder does the GPU memory usage rough has a linear relationship with the batch size used in training?
I was fine tune ResNet152. With a batch size 8, the total GPU memory used is around 4G and when the batch size is increased to 16 for training, the total GPU memory used is around 6G. The model itself takes about 2G. It seems to me the GPU memory consumption of training ResNet 152 is approximately 2G + 2G * batch_size / 8?
The batch size would increase the activation sizes during the forward pass, while the model parameter (and gradients) would still use the same amount of memory as they are not depending on the used batch size. This post explains the memory usage in more detail.