I’m not sure which setup would work best, but I would try to set batch_size=4*32 (i.e. 128), such that each of the 4 GPUs gets a chunk of 32 samples.
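For illustration, here is a minimal sketch of that setup using nn.DataParallel; the dataset, model, and 4-GPU count are placeholders for your actual code:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

num_gpus = 4          # assumed GPU count
per_gpu_batch = 32    # desired samples per GPU

# Dummy data and model purely for illustration.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=num_gpus * per_gpu_batch, shuffle=True)

model = nn.Linear(10, 2)
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() >= num_gpus:
    # nn.DataParallel splits each batch of 128 along dim 0,
    # so every replica receives a chunk of 32 samples.
    model = nn.DataParallel(model, device_ids=list(range(num_gpus)))
model.to(device)

for data, target in loader:
    output = model(data.to(device))  # scattered across the GPUs by DataParallel
    break
```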
Your training will most likely differ a bit, as described here.
PS: I’m not a huge fan of tagging people, since this might discourage others from answering in this thread.