Use a single machine with 4 x 1080 Ti GPU cards.
Define a ResNet-101 model.
For each training iteration, sample a batch of 4096 images.
Divide the batch among the 4 GPU cards, i.e. each GPU processes 4096 / 4 = 1024 images.
I cannot fit 1024 images on one GPU at a time, so the 1024 images are divided into 16 sub-batches of 64.
Each GPU accumulates the gradients of its 16 sub-batches.
After each GPU has accumulated its 16 sub-batches, it sends the result to a master GPU card.
The master card accumulates the gradients from all cards and does a parameter update.
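Roughly, the scheme above would look like the following sketch (just to make the question concrete; it assumes a torchvision ResNet-101 with SGD and cross-entropy loss, and the names are placeholders rather than working code):

```python
import copy
import torch
import torchvision

devices = [torch.device(f"cuda:{i}") for i in range(4)]
master = devices[0]

model = torchvision.models.resnet101().to(master)       # master copy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# One replica per card; the master copy itself is replica 0.
replicas = [model] + [copy.deepcopy(model).to(d) for d in devices[1:]]

def train_iteration(batch_images, batch_labels):
    # Batch of 4096 images -> 4 cards x 16 sub-batches x 64 images.
    per_card_images = batch_images.chunk(4)
    per_card_labels = batch_labels.chunk(4)

    for replica in replicas:
        replica.zero_grad()

    for images, labels, replica, dev in zip(per_card_images, per_card_labels,
                                            replicas, devices):
        for sub_images, sub_labels in zip(images.chunk(16), labels.chunk(16)):
            loss = criterion(replica(sub_images.to(dev)), sub_labels.to(dev))
            # Scale so the accumulated gradient is the mean over all 4096 images.
            (loss / (4 * 16)).backward()

    # "Send to master": sum every replica's accumulated gradients onto card 0.
    for p_master, *p_rest in zip(*(r.parameters() for r in replicas)):
        for p in p_rest:
            p_master.grad += p.grad.to(master)

    optimizer.step()

    # Broadcast the updated weights back to the other cards.
    with torch.no_grad():
        for replica, dev in zip(replicas[1:], devices[1:]):
            for p_dst, p_src in zip(replica.parameters(), model.parameters()):
                p_dst.copy_(p_src.to(dev))
```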
- Just run it in a normal training loop (making sure to wrap your net in DataParallel when you create it) and only call optimizer.step() and zero_grad() after every N steps (it looks like N is 16 in the above case, if each sub-batch has 64 images per card). You'll want to either scale the learning rate or the loss to account for the fact that the gradients aren't properly averaged over all the sub-batches, but given that you're following that paper I suspect you're already aware of these scalings.
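As a rough sketch of what that looks like (assuming each loader batch is 4 x 64 = 256 images, so 16 accumulation steps give the 4096-image effective batch; dummy data stands in for the real dataset, and loss scaling is used rather than learning-rate scaling):

```python
import torch
import torchvision

accumulation_steps = 16                      # sub-batches per parameter update
model = torch.nn.DataParallel(torchvision.models.resnet101().cuda())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Dummy data as a stand-in for the real ImageNet-style dataset.
dataset = torch.utils.data.TensorDataset(
    torch.randn(4096, 3, 224, 224), torch.randint(0, 1000, (4096,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=256)

optimizer.zero_grad()
for step, (images, labels) in enumerate(loader):
    outputs = model(images.cuda())           # DataParallel splits 256 -> 4 x 64
    loss = criterion(outputs, labels.cuda())
    # Scale the loss so the accumulated gradient is averaged over the
    # 16 sub-batches instead of summed.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```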
Ah, you’d probably want to look at the internals of nn.DataParallel: the forward method has a very straightforward (pun intended) scatter-replicate-apply-gather loop. I think it’s possible to scatter a large batch, apply N times on N slices of the scattered batch, and then gather the gradients? Not sure, might need to ask Adam or another pro.
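Something along these lines might work, using the same primitives DataParallel's forward uses (replicate / scatter / parallel_apply / gather from torch.nn.parallel). This is an untested sketch, and the "gather of gradients" isn't an explicit call here: backward through the replicate graph is what accumulates the gradients into the original module's parameters.

```python
import torch
import torchvision
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1, 2, 3]
model = torchvision.models.resnet101().cuda(device_ids[0])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

def accumulate_big_batch(big_images, big_labels, n_slices=16):
    # Scatter one large batch (e.g. 4096 images) across the cards once...
    inputs = scatter(big_images, device_ids)      # 4 chunks of 1024 images
    targets = scatter(big_labels, device_ids)
    optimizer.zero_grad()
    for i in range(n_slices):
        # ...then apply N times on N slices of the scattered batch.
        # Re-replicating each step rebuilds the autograd graph that routes
        # gradients back into `model`'s parameters.
        replicas = replicate(model, device_ids)
        slice_inputs = [x.chunk(n_slices)[i] for x in inputs]
        slice_targets = [t.chunk(n_slices)[i] for t in targets]
        outputs = parallel_apply(replicas, slice_inputs, devices=device_ids)
        losses = [criterion(o, t) for o, t in zip(outputs, slice_targets)]
        # Sum the per-card losses on the master card and backprop; gradients
        # accumulate in model.parameters() via the replicate graph.
        total = gather([l.unsqueeze(0) for l in losses], device_ids[0]).sum()
        (total / (len(device_ids) * n_slices)).backward()
    optimizer.step()
```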