I’m trying to do matrix factorization via SGD, where each row in my dataset describes one entry in the target matrix. To make the most of my GPU, I need to run with larger batches so more of the computation happens in parallel. That puts me in a batch size vs. accuracy tradeoff: the larger the batch, the higher the GPU utilization, but my model converges to a poor local minimum because larger batches average away the gradient noise that smaller batches provide.
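For concreteness, a minimal sketch of this kind of setup might look like the following (PyTorch, with toy dimensions and a random batch standing in for my real data; all names here are placeholders):

```python
import torch

# Toy embedding-based factorization: each training example is one
# (row_index, col_index, value) entry of the target matrix.
class MF(torch.nn.Module):
    def __init__(self, n_rows, n_cols, k=32):
        super().__init__()
        self.row_emb = torch.nn.Embedding(n_rows, k)
        self.col_emb = torch.nn.Embedding(n_cols, k)

    def forward(self, rows, cols):
        # Predicted entry is the dot product of the two factor vectors.
        return (self.row_emb(rows) * self.col_emb(cols)).sum(dim=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MF(n_rows=10_000, n_cols=10_000).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 4096  # large batch to keep the GPU busy
rows = torch.randint(0, 10_000, (batch_size,), device=device)
cols = torch.randint(0, 10_000, (batch_size,), device=device)
vals = torch.rand(batch_size, device=device)

opt.zero_grad()
loss = torch.nn.functional.mse_loss(model(rows, cols), vals)
loss.backward()
opt.step()
```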
So my question is: is there a way to get the throughput benefits of a larger batch size while retaining the convergence behavior of smaller batches?
Originally I thought L2 regularization already acts as additive noise, but that applies to the loss rather than to the gradient itself. I’ll definitely play around with adding noise to runs with larger batch sizes and see how it goes. Thanks a lot for the suggestions!
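For anyone following along, the gradient-noise injection step might look like this sketch. The annealed sigma schedule loosely follows Neelakantan et al. (2015), and the `eta`/`gamma` values are placeholders that would need tuning:

```python
import torch

def add_gradient_noise(model, step, eta=0.01, gamma=0.55):
    # Annealed Gaussian gradient noise with variance eta / (1 + step)^gamma,
    # loosely following Neelakantan et al. (2015). eta and gamma are
    # guesses that would need tuning for a given problem.
    std = (eta / (1 + step) ** gamma) ** 0.5
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad) * std)

# Usage inside the training loop, between backward() and step():
#   loss.backward()
#   add_gradient_noise(model, step)
#   opt.step()
```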
I also had another idea: use multiprocessing during the training phase, where I have a Pool of processes and send multiple small mini-batches through it to the GPU, so that gradients for several small batches are computed in parallel. This is different from Hogwild!, where multiple processes apply lock-free updates to the same shared model, each on its own mini-batches. Have you seen a pattern like this?
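For reference, my understanding of the Hogwild pattern is something like the CPU sketch below, using `torch.multiprocessing` (the PyTorch examples repo has a similar demo). One caveat I’ve seen mentioned: a fork-based `multiprocessing.Pool` doesn’t mix well with CUDA, so GPU workers would need the `spawn` start method. All sizes and step counts here are toy values:

```python
import torch
import torch.multiprocessing as mp

class MF(torch.nn.Module):
    def __init__(self, n_rows, n_cols, k=32):
        super().__init__()
        self.row_emb = torch.nn.Embedding(n_rows, k)
        self.col_emb = torch.nn.Embedding(n_cols, k)

    def forward(self, rows, cols):
        return (self.row_emb(rows) * self.col_emb(cols)).sum(dim=1)

def worker(model, n_steps=200, batch_size=64):
    # Each process draws its own small mini-batches and applies
    # lock-free SGD updates to the shared parameters (Hogwild-style).
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(n_steps):
        rows = torch.randint(0, 10_000, (batch_size,))  # toy data
        cols = torch.randint(0, 10_000, (batch_size,))
        vals = torch.rand(batch_size)
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(rows, cols), vals)
        loss.backward()
        opt.step()

if __name__ == "__main__":
    model = MF(10_000, 10_000)
    model.share_memory()  # parameters live in shared memory across workers
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```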