Data splitting in DistributedDataParallel

Hi,

I’m trying to use DistributedDataParallel on a CPU-only machine with multiple cores.

The documentation for DDP (https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py) states: "For multi-device modules and CPU modules, device_ids must be None or an empty list, and input data for the forward pass must be placed on the correct device. (default: all devices for single-device modules)."

I want to parallelize training across CPU processes in a single machine. My dataset is an in-memory numpy array.

Would I have to manually split this dataset into subsets and load a different subset in each CPU process? Or does splitting the input along the batch dimension work for CPU modules as well? I am using torch's multiprocessing module to spawn the processes for my DDP model.
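For context, my setup looks roughly like this (a minimal sketch; the model, dataset shapes, and the fixed port are placeholders for my real code):

```python
import os

import numpy as np
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size, data):
    # CPU-only DDP uses the gloo backend.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # device_ids is left unset (None), as required for CPU modules.
    model = DDP(nn.Linear(4, 1))

    inputs = torch.from_numpy(data).float()
    loss = model(inputs).sum()
    loss.backward()  # gradients are all-reduced across processes here

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    data = np.random.rand(8, 4)  # in-memory numpy dataset (placeholder)
    mp.spawn(worker, args=(world_size, data), nprocs=world_size)
```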

Thank you.

PS. What’s the best practice for sharing an in-memory array across torch processes? That would be helpful as well.
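For reference, the kind of sharing I have in mind is something like this (a rough sketch using Tensor.share_memory_(); I'm not sure this is the recommended pattern):

```python
import numpy as np
import torch
import torch.multiprocessing as mp


def worker(rank, shared):
    # Each process sees the same underlying storage; the tensor is not
    # copied when it is passed through torch.multiprocessing.
    shared[rank] = rank  # writes are visible to the parent process


if __name__ == "__main__":
    arr = np.zeros((2, 3))
    # Move the array's storage into shared memory once, up front.
    shared = torch.from_numpy(arr).share_memory_()
    mp.spawn(worker, args=(shared,), nprocs=2)
    print(shared[:, 0])  # each row was written by a different process
```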


Some additional observations:

When I print the loss in each process, the loss value is identical. If the data were being split properly by DDP, wouldn't each process report a different loss value?

From my experiments, it appears that DDP with CPU processes does not split the input along the batch dimension across processes.

In the source code as well, if the model’s device_ids is None, then scattering is not performed in the forward() pass of the model.

Can someone more authoritative confirm this behavior?

In the source code as well, if the model’s device_ids is None, then scattering is not performed in the forward() pass of the model.

Yes, this is correct.

Input data split only occurs in two situations:

  1. When using DataParallel (single-process multi-thread)
  2. When using DistributedDataParallel (DDP) with a device_ids list containing multiple CUDA devices. In this case, each DDP process operates on multiple devices and multiple model replicas, and hence needs to split the input data. (This is not recommended, as it can be slow.)

For the recommended use case of DDP (one device/replica per DDP process), DDP will NOT split the input or distribute it across processes. Each DDP process needs to read its own input data independently. You could manually split the data (say, on rank 0) and pass the shards to the other processes, if they are on the same machine. Many people also use DistributedSampler to load input data.
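A minimal sketch of the DistributedSampler approach (the dataset, batch size, and hard-coded num_replicas/rank are placeholders; in real code the last two usually come from the process group):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Inside each DDP process, after init_process_group:
dataset = TensorDataset(torch.arange(8).float())
# DistributedSampler gives each rank a disjoint shard of the indices.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(1):
    sampler.set_epoch(epoch)  # reshuffles per epoch when shuffle=True
    for (batch,) in loader:
        pass  # forward/backward here; each rank sees different samples
```

With shuffle=False the shards are deterministic, so two ranks together cover the whole dataset exactly once per epoch.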
