Distributed Data Parallel master memory overhead

I am training with a large amount of data per batch, such that for batch_size = 1 I observe the following:

  1. I cannot fit it on a single GPU.
  2. I can fit it on a machine with 2 GPUs when using DataParallel. This is a little confusing to me, since I have read that batch_size=1 cannot be used with DataParallel. Does this mean half of a batch is on each GPU, and therefore “disconnected” when seen by the model?
  3. I cannot fit it using DistributedDataParallel.

I then modified my data so that each batch contains a smaller amount of data – I cannot go any smaller in terms of data per batch. I observe that:

A) I can fit batch_size=2 on a single GPU, but not batch_size=3.
B) I can fit batch_size=4 on a machine with 2 GPUs using DataParallel, but not batch_size=5.
C) I can fit batch_size=1 using DistributedDataParallel, but not batch_size=2.

From the tutorials I have read, I was under the impression that DistributedDataParallel is more efficient than DataParallel, which runs somewhat counter to what I am observing. I am wondering whether DistributedDataParallel has memory overhead that DataParallel does not, which would explain what I am seeing; I have not been able to find any references on this. It is also a bit odd to me that a single GPU using DistributedDataParallel fits less than a single GPU standalone. Is it correct to assume that some kind of workload imbalance is afflicting DistributedDataParallel? Even with a workload imbalance, wouldn't it still fare better than DataParallel, which seems to be able to fit more data?
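
For reference, the two multi-GPU setups I am comparing are wrapped roughly like this (a minimal sketch with a placeholder `torch.nn.Linear` in place of my real model and data):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# DataParallel: a single process that splits each batch across both GPUs.
dp_model = torch.nn.DataParallel(
    torch.nn.Linear(1024, 1024).cuda(), device_ids=[0, 1]
)

# DistributedDataParallel: one process per GPU, each processing its own batch.
# Assumes launch via torch.distributed.launch / torchrun, which sets
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
ddp_model = DDP(
    torch.nn.Linear(1024, 1024).to(local_rank), device_ids=[local_rank]
)
```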

By default, DistributedDataParallel uses extra buffers to synchronize gradients, so it consumes more memory (one extra copy of all grads) than local training does.
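
As a rough back-of-envelope estimate of that extra copy (a sketch with a placeholder model): in fp32 the gradient buckets add about 4 bytes per parameter on top of the regular `.grad` tensors.

```python
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the real model
n_params = sum(p.numel() for p in model.parameters())

# The DDP reducer keeps flattened bucket buffers alongside each parameter's
# .grad, i.e. roughly one extra copy of all gradients (4 bytes each in fp32).
extra_bytes = 4 * n_params
print(f"~{extra_bytes / 2**20:.1f} MiB of extra gradient buckets")
```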

In PyTorch 1.8 we added a new prototype flag, gradient_as_bucket_view. If you set it to True, it saves this extra copy of all grads, so DDP should consume almost the same amount of memory as local training.
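
A minimal sketch of how the flag is passed to the DistributedDataParallel constructor (placeholder model; assumes the process group is launched via torchrun / torch.distributed.launch):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # placeholder model
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    # Let each parameter's .grad be a view into the communication buckets,
    # avoiding the extra copy of all gradients.
    gradient_as_bucket_view=True,
)
```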