I used the DataParallel implementation from the tutorial at
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
(Optional: Data Parallelism — PyTorch Tutorials 1.12.0+cu102 documentation) and DistributedDataParallel from the Distributed Data Parallel page of the PyTorch master documentation.
As shown in the DataParallel tutorial, I printed the tensor sizes inside and outside the model for both DataParallel and DistributedDataParallel. The batch size is 4 and the images have 1 channel of size 512x512; the results are shown below.
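The prints come from code roughly like this (following the tutorial; the model and dataset here are simplified placeholders for my real ones):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class Model(nn.Module):
    # simplified stand-in for my actual network (1-channel in, 1-channel out)
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):
        print("In Model: input size", x.size())
        return self.conv(x)

device = torch.device("cuda:0")
model = nn.DataParallel(Model()).to(device)  # wrapped with DistributedDataParallel in the second run

dataset = torch.randn(9, 1, 512, 512)        # dummy data; 9 samples give batches of 4, 4, 1
loader = DataLoader(dataset, batch_size=4)

for data in loader:
    inp = data.to(device)
    out = model(inp)
    print("Outside: input size", inp.size(), "output_size", out.size())
```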
------------------ DataParallel ------------------
In Model: input size torch.Size([2, 1, 512, 512])
In Model: input size torch.Size([2, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])
In Model: input size torch.Size([2, 1, 512, 512])
In Model: input size torch.Size([2, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])
In Model: input size torch.Size([1, 1, 512, 512])
Outside: input size torch.Size([1, 1, 512, 512]) output_size torch.Size([1, 1, 512, 512])
------------------ Distributed Data Parallel ------------------
In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])
In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])
In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])
In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])
In Model: input size torch.Size([1, 1, 512, 512])
Outside: input size torch.Size([1, 1, 512, 512]) output_size torch.Size([1, 1, 512, 512])
In Model: input size torch.Size([1, 1, 512, 512])
Outside: input size torch.Size([1, 1, 512, 512]) output_size torch.Size([1, 1, 512, 512])
As seen above, DataParallel splits each batch of 4 into 2 + 2 across the two GPUs, whereas DistributedDataParallel feeds the full batch of 4 to each GPU.
As a result, DistributedDataParallel takes exactly twice as long as DataParallel. Should I implement the Partition, DataPartitioner, partition_dataset, and average_gradients functions shown in the linked example (dist_tuto.pth/train_dist.py at gh-pages · seba-1511/dist_tuto.pth · GitHub) to make DistributedDataParallel work correctly?
And is following only what is specified in this tutorial (Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.12.0+cu102 documentation) not enough? It does not seem to help on its own.
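For reference, here is a minimal sketch of what I understand a DistributedSampler-based setup would look like on a single machine (the dataset, model, master port, and GPU count below are just placeholders). Is this roughly what is needed instead of the manual partitioning helpers?

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def run(rank, world_size):
    # one process per GPU; address/port are placeholders for a single-node run
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    dataset = torch.randn(8, 1, 512, 512)  # dummy stand-in for my real dataset
    # DistributedSampler gives each rank a disjoint shard of the data, so with
    # batch_size=2 per process the two GPUs together still consume 4 samples per step
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)

    model = torch.nn.Conv2d(1, 1, 3, padding=1).to(rank)  # stand-in for my real model
    ddp_model = DDP(model, device_ids=[rank])

    for epoch in range(1):
        sampler.set_epoch(epoch)  # so each epoch shuffles the shards differently
        for data in loader:
            inp = data.to(rank)
            out = ddp_model(inp)
            print("Outside: input size", inp.size(), "output_size", out.size())

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)  # 2 = number of GPUs
```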