I am training with a large amount of data per sample, so even with batch_size = 1 I observe the following:
- I cannot fit it on a single GPU.
- I can fit it on a machine with 2 GPUs when using DataParallel. This is a little confusing to me, since I have read that batch_size=1 cannot be used with DataParallel. Does this mean half of the batch ends up on each GPU, and is therefore “disconnected” when seen by the model?
- I cannot fit it using DistributedDataParallel.
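For reference, this is roughly how I understand the DataParallel case (the `nn.Linear` is just a placeholder for my much larger model, and the `torch.chunk` call is only my approximation of what the scatter step does, not the actual internals):

```python
import torch
from torch import nn

# Toy stand-in for my actual model (which is much larger).
model = nn.Linear(10, 10)

# DataParallel replicates the model on each GPU and scatters the input
# batch along dim 0, so e.g. batch_size=4 on 2 GPUs is 2 samples per GPU.
# With batch_size=1 there is only one chunk, so only GPU 0 gets work.
batch = torch.zeros(4, 10)
chunks = torch.chunk(batch, 2, dim=0)  # my approximation of the scatter step
print([c.shape[0] for c in chunks])  # [2, 2]

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    dp_model = nn.DataParallel(model.cuda(), device_ids=[0, 1])
    out = dp_model(batch.cuda())  # forward pass runs on both GPUs
```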
I then modified my data so that each sample contains a smaller amount of data; I cannot reduce the data per sample any further. I observe that:
A) I can fit batch_size=2 on a single GPU, but not batch_size=3.
B) I can fit batch_size=4 on a machine with 2 GPUs using DataParallel, but not batch_size=5.
C) I can fit batch_size=1 using DistributedDataParallel, but not batch_size=2.
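In case my setup matters, here is a minimal sketch of how I am wrapping the model for DDP. I have reduced it to a single CPU process with the gloo backend so it runs anywhere; my real job launches one process per GPU with nccl, and the `MASTER_PORT` value is just an arbitrary free port:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process, CPU, gloo backend so this sketch runs anywhere;
# the real job uses one nccl process per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(10, 10)  # placeholder for my actual model
ddp_model = DDP(model)     # DDP allocates its gradient buffers here

# Under DDP, batch_size is per process: batch_size=1 with 2 GPU
# processes means 2 samples per optimizer step in total.
x = torch.zeros(1, 10)
out = ddp_model(x)
print(out.shape)  # torch.Size([1, 10])

dist.destroy_process_group()
```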
From the tutorials I have read, I was under the impression that DistributedDataParallel is more efficient than DataParallel, which seems counter to what I am observing. Does DistributedDataParallel have memory overhead that DataParallel does not, which would explain this? I have not been able to find any references on this. It also seems odd that a single GPU under DistributedDataParallel fits less than a standalone single GPU. Is it correct to assume that some kind of workload imbalance is afflicting DistributedDataParallel? And even with workload imbalance, shouldn't it still do better than DataParallel, which seems to be able to fit more data?
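To put numbers on "fits", I could log peak memory per GPU after the forward/backward pass in each configuration; `report_peak_mib` is a hypothetical helper name I am using for illustration:

```python
import torch

def report_peak_mib() -> float:
    # Hypothetical helper: peak GPU memory allocated in MiB,
    # or -1.0 when no CUDA device is present.
    if not torch.cuda.is_available():
        return -1.0
    return torch.cuda.max_memory_allocated() / 2**20

# After the largest batch that still fits, in each setup, log:
print(report_peak_mib())
```

Comparing this number between the single-GPU, DataParallel, and DistributedDataParallel runs should show where the extra memory is going.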