I have been experimenting with multi-GPU training on 8 GPUs and am running into inconsistencies between my expected and actual results. I have tried both Windows (Server 2019) and Ubuntu 18.04 LTS: data parallel on both Windows and Ubuntu, and distributed data parallel on Ubuntu (single node + 8 GPUs).
I have run some timing tests to assess training speed, shown below, measured over one epoch.
Windows Data Parallel: ~1000 ms/step
Ubuntu Data Parallel: ~300 ms/step
Ubuntu Distributed Data Parallel: ~350 ms/step
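For reference, the ms/step numbers come from timing each training step roughly like the sketch below (the loop body is a placeholder for the real forward/backward/optimizer work; on GPU one would also call `torch.cuda.synchronize()` before reading the timer, since CUDA kernels run asynchronously):

```python
import time

def time_steps(num_steps):
    """Toy stand-in for a training loop: times each step and
    returns the mean time in ms/step."""
    step_times = []
    for _ in range(num_steps):
        start = time.perf_counter()
        # ... forward / backward / optimizer.step() would go here ...
        time.sleep(0.001)  # placeholder for the actual work
        step_times.append((time.perf_counter() - start) * 1000.0)
    return sum(step_times) / len(step_times)

mean_ms = time_steps(10)
print(f"~{mean_ms:.0f} ms/step")
```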
There are a few areas of confusion:
Why does running the same code in Windows and Linux produce such different results?
I would expect distributed data parallel to be faster than data parallel or perhaps the same speed, but this is not the case in my situation.
Perhaps the parallelization is optimized differently on Windows versus Ubuntu, which may explain the gap, though I am still surprised it is this large.
I am much more surprised at the difference between distributed data parallel and data parallel. There could be a problem in how I am distributing the data in the distributed data parallel approach. My situation may be a bit atypical: I am working with 3D data, but I use patches of this data as the input, so a single 3D file may contain, for example, 30 patches. I use a custom dataset whose __getitem__ loads a 3D file from a given directory. Then I use a custom sampler that loads the patch indices and exposes an iterator that yields patches when called. I feed this sampler to the DistributedSampler, which goes to the DataLoader of each GPU process.
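To make the patch indexing concrete, here is a simplified, hypothetical sketch of my flat patch index space and of a DistributedSampler-style round-robin split over it (without the shuffling/padding the real sampler also does). The names and the 30-patches-per-file figure are illustrative; the point is that consecutive patches of one file end up on different ranks:

```python
PATCHES_PER_FILE = 30  # e.g. one 3D file yields 30 patches

def patch_to_file(flat_index):
    """Map a flat patch index to (file_id, patch_id within that file)."""
    return divmod(flat_index, PATCHES_PER_FILE)

def shard_indices(total_patches, rank, world_size):
    """Round-robin split of the flat patch index space across ranks,
    roughly what DistributedSampler does (minus shuffling/padding)."""
    return list(range(rank, total_patches, world_size))

world_size = 8
total = 2 * PATCHES_PER_FILE  # two files
# With 8 ranks, rank 0 gets patches 0, 8, 16, ... -- which span both files,
# so several processes end up reading the same file concurrently.
files_seen_by_rank0 = {patch_to_file(i)[0] for i in shard_indices(total, 0, world_size)}
print(files_seen_by_rank0)
```

Running this shows that rank 0 touches both files, and the other ranks do too, which is exactly the concurrent-access situation I describe below.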
On the surface this appears to work without conflict. However, since different patches can belong to the same file, multiple GPU processes could be trying to access the same file at the same time. To be honest, I am not 100% sure this is valid, though I believe the GPUs would be accessing the files read-only, so I would think it is OK. Still, could this cause inefficiency and delays? Would I be better off distributing the 3D files among the GPUs instead, so that multiple GPUs never access the same file?
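The file-level alternative I have in mind would look something like this (hypothetical sketch; file names are made up): assign whole 3D files to ranks so that each file is opened by exactly one process, and each rank then iterates the patch indices of its own files locally.

```python
def shard_files(file_ids, rank, world_size):
    """Assign whole 3D files to ranks round-robin, so each file is
    opened by exactly one GPU process (file-level sharding sketch)."""
    return [f for i, f in enumerate(file_ids) if i % world_size == rank]

# Example: 10 files across 4 ranks -- every file belongs to exactly one rank.
files = [f"volume_{i}.dat" for i in range(10)]
shards = [shard_files(files, r, 4) for r in range(4)]
print(shards[0])
```

One consequence of this scheme is that shuffling would then happen at the file level rather than the patch level, which changes what an "epoch" looks like, so I am not sure it is a strict improvement.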
I suppose the overhead of distributing the work could also be a factor, but according to the PyTorch documentation, “Thus, even for single machine training, where your data is small enough to fit on a single machine, DistributedDataParallel is expected to be faster than DataParallel.” (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
I will further note that I am using the native PyTorch DistributedDataParallel implementation rather than NVIDIA Apex, and I am wondering whether this would make a difference.
I could also experiment with the number of DataLoader workers, but is there a general guideline for tuning this parameter in multi-GPU training?
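For what it's worth, the rule of thumb I have seen suggested (not an official guideline, so please correct me) is a few workers per GPU process, capped so the total worker count does not exceed the machine's CPU cores. Something like:

```python
import os

def suggest_num_workers(num_gpu_processes):
    """Community rule of thumb, not an official recommendation:
    up to ~4 DataLoader workers per GPU process, capped by the
    CPU cores available on the machine."""
    cpus = os.cpu_count() or 1
    per_process = max(1, cpus // max(1, num_gpu_processes))
    return min(4, per_process)

print(suggest_num_workers(8))
```

I would still expect the right value to depend on how expensive loading a 3D file and cutting patches from it is, which is exactly what I am unsure about here.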
I would appreciate any insight and pointers. Thanks in advance.