Training with a large batch size on a single node with multiple GPUs

According to the Facebook paper “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” (https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf), we can use a large batch size if we scale the learning rate correctly. I want to run the following experiment:

  1. Use a single machine with 4 x 1080 Ti GPUs.
  2. Define a resnet101 model.
  3. For each training iteration, sample a batch of 4096 images.
  4. Divide the batch among the 4 GPUs, i.e. each GPU processes 4096/4 = 1024 images.
  5. I cannot fit 1024 images onto a GPU at one time, so the 1024 images are divided into 16 sub-batches of 64.
    Each GPU accumulates the gradients of its 16 sub-batches.
  6. After each GPU has accumulated its 16 sub-batches, it sends the accumulated gradients to a master GPU.
  7. The master GPU accumulates the gradients from all GPUs and does a parameter update.
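
The numbers in the steps above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are mine, and the learning-rate line assumes the paper's linear scaling rule with its reference setting of 0.1 for a batch of 256 (the paper also pairs this rule with a warmup phase, which is not shown here).

```python
total_batch = 4096   # images sampled per training iteration
num_gpus = 4         # 4 x 1080 Ti
sub_batch = 64       # largest chunk that fits on one GPU at a time

per_gpu = total_batch // num_gpus     # images each GPU handles per iteration
accum_steps = per_gpu // sub_batch    # backward passes per GPU before one update

# Linear scaling rule (assumed reference: lr 0.1 at batch size 256).
base_lr, base_batch = 0.1, 256
scaled_lr = base_lr * total_batch / base_batch

print(per_gpu, accum_steps, scaled_lr)
```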

How can I do this in PyTorch?

Just run it in a normal training loop (making sure to wrap your net in DataParallel when you create it), and only call optimizer.step() and zero_grad() once every N steps (it looks like N is 16 in the case above, if you mean each sub-batch has 64 images). You'll want to scale either the learning rate or the loss, because backward() sums the gradients across the sub-batches rather than averaging them. Given that you're following that paper, I suspect you're already aware of these scalings.
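
For reference, here is a minimal sketch of that loop. The tiny nn.Linear model, the random data, and the sizes are placeholders chosen so that it also runs on a CPU; they are assumptions, not the resnet101 setup from the question. The loss is divided by the number of accumulation steps so that the summed gradients equal the average over the effective batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny stand-in model (placeholder for resnet101).
model = nn.Linear(10, 2).to(device)
if torch.cuda.device_count() > 1:
    # DataParallel splits every sub-batch across the visible GPUs.
    model = nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()  # averages over each sub-batch

accum_steps = 16       # N in the reply above
sub_batch_size = 8     # kept small so the sketch runs anywhere
num_sub_batches = 32   # enough synthetic data for two parameter updates

num_updates = 0
optimizer.zero_grad()
for i in range(num_sub_batches):
    # Synthetic inputs and labels (placeholder for a real DataLoader).
    x = torch.randn(sub_batch_size, 10, device=device)
    y = torch.randint(0, 2, (sub_batch_size,), device=device)

    loss = loss_fn(model(x), y)
    # backward() adds into .grad, so scale each loss to get an average.
    (loss / accum_steps).backward()

    if (i + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps sub-batches
        optimizer.zero_grad()  # start the next accumulation window
        num_updates += 1
```

Scaling the loss and scaling the learning rate are interchangeable for plain SGD; scaling the loss keeps the optimizer settings independent of how the batch is split.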

This is the approach I am currently following, but I think this method does the following:

  • For each sub-iteration from 1 to 16:
    • Scatter data (256 images to 4 GPUs).
    • Each GPU computes on its share (64 images).
    • Gather gradients: the master GPU accumulates gradients from all 4 GPUs.
  • The master GPU updates the parameters.

I am trying to cut down the scatter/gather time. How can I do the following instead?

  • Scatter data (4096 images to 4 GPUs).
  • Each GPU computes and accumulates 16 times (16 x 64 images).
  • Gather gradients: the master GPU accumulates gradients from the other GPUs.
  • The master GPU updates the parameters.

Thanks!

Ah, you'd probably want to look at the internals of nn.DataParallel. The forward method has a very straightforward (pun intended) scatter-replicate-apply-gather loop. I think it's possible to scatter a large batch, apply N times on N slices of the scattered batch, and then gather the gradients. I'm not sure, though; you might need to ask Adam or another pro.
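
One way to picture the scatter-once, accumulate-locally, gather-once idea is the sketch below. It is not how nn.DataParallel is implemented and has not been tuned: it keeps one independent copy of the model per device (made with copy.deepcopy), and the names large_batch_step, master, and replicas are made up for illustration. The replicas are visited one after another in a plain Python loop, so a real speedup would still need one thread or process per GPU. The model, data, and sizes are small placeholders, and the device list falls back to two CPU "devices" so the code runs without GPUs.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
n_gpus = torch.cuda.device_count()
# Fall back to two CPU "devices" so the sketch runs anywhere.
devices = [f"cuda:{i}" for i in range(n_gpus)] if n_gpus > 1 else ["cpu", "cpu"]

master = nn.Linear(10, 2).to(devices[0])  # stand-in for resnet101
replicas = [copy.deepcopy(master).to(d) for d in devices]
optimizer = torch.optim.SGD(master.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss(reduction="sum")  # summed, averaged once at the end
sub_batch = 4  # placeholder for the 64-image sub-batch

def large_batch_step(x, y):
    # 1. Scatter: one chunk of the large batch per device, moved once.
    x_chunks = [c.to(d) for c, d in zip(x.chunk(len(devices)), devices)]
    y_chunks = [c.to(d) for c, d in zip(y.chunk(len(devices)), devices)]
    # 2. Each replica accumulates gradients locally over its sub-batches.
    for replica, xc, yc in zip(replicas, x_chunks, y_chunks):
        replica.zero_grad()
        for xs, ys in zip(xc.split(sub_batch), yc.split(sub_batch)):
            loss_fn(replica(xs), ys).backward()
    # 3. Gather: sum the replica gradients on the master device, once per step.
    rep_params = [r.parameters() for r in replicas]
    for p_master, *p_reps in zip(master.parameters(), *rep_params):
        total = sum(p.grad.to(devices[0]) for p in p_reps)
        p_master.grad = total / x.shape[0]  # average over the whole large batch
    # 4. Update on the master, then copy the new weights back to the replicas.
    optimizer.step()
    with torch.no_grad():
        for replica in replicas:
            for p_rep, p_master in zip(replica.parameters(), master.parameters()):
                p_rep.copy_(p_master)

x = torch.randn(64, 10)          # placeholder for the 4096-image batch
y = torch.randint(0, 2, (64,))
reference = copy.deepcopy(master)  # snapshot of the weights before the update
large_batch_step(x, y)

# Sanity check: compare against the gradient of one plain full-batch pass.
ref_loss = loss_fn(reference(x.to(devices[0])), y.to(devices[0])) / x.shape[0]
ref_loss.backward()
max_diff = (master.weight.grad - reference.weight.grad).abs().max().item()
```

Here the data and the gradients each cross devices once per large batch instead of once per sub-batch, which is the saving asked about above. The price is holding a full extra copy of the model and its gradients on every device and re-broadcasting the weights after each update.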