SGD: Batch-size - Convergence tradeoff on single GPU?

Hi everyone, PyTorch and ML newbie here,

I’m trying to do matrix factorization via SGD, where each row in my dataset describes one entry in the target matrix. I want to make the most of my GPU by maximizing parallel computation, which means using a larger batch size. This runs into a tradeoff between batch size and accuracy: the larger the batch size, the better the GPU is utilized, but in exchange my model converges to a poor local minimum because the larger batches have less gradient noise.

So my question is: is there some way I can get the performance gains of larger batch sizes while retaining the convergence behavior of smaller batch sizes?

Any input would be appreciated. Thanks!

Hi Alex!

You could try just adding some noise to the gradients computed
using the larger batch sizes.

(For additive noise this is the same as subtracting learning-rate
times noise from your parameters after the SGD optimizer step.)

You could also consider applying multiplicative noise to your
gradients so that the size of your noise scales, in effect, with
the size of your gradients.
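
Something along these lines, as a rough sketch (the toy
matrix-factorization setup, the sizes, and the noise values below
are just made-up placeholders to illustrate the idea):

```python
import torch

# toy matrix-factorization model: one embedding per row and per column
n_rows, n_cols, rank = 100, 80, 8
row_emb = torch.nn.Embedding(n_rows, rank)
col_emb = torch.nn.Embedding(n_cols, rank)
params = list(row_emb.parameters()) + list(col_emb.parameters())

lr = 0.01
optimizer = torch.optim.SGD(params, lr=lr)

noise_std = 1e-3   # scale of additive gaussian noise (placeholder value)
rel_noise = 1e-2   # scale of multiplicative (relative) noise (placeholder)

# one toy batch of observed (row, col, value) entries
rows = torch.randint(0, n_rows, (1024,))
cols = torch.randint(0, n_cols, (1024,))
vals = torch.randn(1024)

optimizer.zero_grad()
pred = (row_emb(rows) * col_emb(cols)).sum(dim=1)
loss = torch.nn.functional.mse_loss(pred, vals)
loss.backward()

# inject noise into the gradients before the optimizer step
with torch.no_grad():
    for p in params:
        if p.grad is None:
            continue
        # additive noise: fixed scale, independent of the gradient
        p.grad += noise_std * torch.randn_like(p.grad)
        # multiplicative noise: scales with the gradient itself
        p.grad *= 1.0 + rel_noise * torch.randn_like(p.grad)

optimizer.step()

# (the additive version is equivalent to subtracting
#  lr * noise_std * randn from the parameters after optimizer.step())
```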

You would now have three interacting parameters to play with:
batch size, size (and type) of noise, and learning rate.

(I don’t know how well this is likely to work, but it is certainly
something people experiment with.)

Good luck.

K. Frank

Hi Frank!

Originally, I thought L2 regularization already acted as additive noise, but that is applied to the loss rather than to the gradient itself. I’ll definitely play around with adding noise to runs with larger batch sizes and see how it goes. Thanks a lot for the suggestions!
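
For my own reference, a quick sketch of the difference as I understand it (the linear model here is just a stand-in for illustration): weight_decay in torch.optim.SGD folds a deterministic weight_decay * p term into each parameter’s gradient, whereas the suggestion above adds a random term.

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model, just for illustration

# L2 / weight_decay: a deterministic weight_decay * p term is folded into
# each parameter's gradient inside the optimizer step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# gradient noise (the suggestion above): a random term added before the
# step, e.g. p.grad += noise_std * torch.randn_like(p.grad)
```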

I also had another idea: using multiprocessing during the training phase, where I have a Pool of processes and send multiple small mini-batches to it to forward to the GPU, so that the gradients can be calculated in parallel. This is in contrast to Hogwild!, where multiple processes train the same shared model, each on its own mini-batches. Have you seen a pattern like this?