DistributedDataParallel with single-process slower than single-gpu

Jack5358 · January 13, 2020, 9:28am

Hi. I have a machine with multiple GPUs.
I wrote training code in Single-Process Multi-GPU mode, following the docs:

Single-Process Multi-GPU
In this case, a single process will be spawned on each host/node and each process will operate on all the GPUs of the node where it’s running. To use DistributedDataParallel in this way, you can simply construct the model as the following:

But I found that it is slower than just using a single GPU. There must be something wrong in my code.

code

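# file: main.py
import time

import torch
import torchvision
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

torch.distributed.init_process_group(backend="nccl")

# dataset: random images with a constant label
class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.randn(3, 255, 255), 0

    def __len__(self):
        return 100

datasets = RandomDataset()
sampler = DistributedSampler(datasets)
dataloader = DataLoader(datasets, 16, sampler=sampler)

# model
model = torch.nn.Sequential(
    torchvision.models.resnet101(False),
    torch.nn.Linear(1000, 2)
).cuda()
model = DistributedDataParallel(model)

# The post did not show these two definitions; the choices below are assumptions.
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

begin_time = time.time()
# training loop
for i in range(10):
    for x, y in dataloader:
        x = x.cuda()
        y = y.reshape(-1).cuda()
        optimizer.zero_grad()

        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
print('Cost:', time.time() - begin_time)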

launch with

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 main.py

Time cost:

DistributedDataParallel with single-process and 2-gpu: 22s
Single-gpu: 19s

GPU memory cost:

DistributedDataParallel with single-process and 2-gpu: 3101MiB (GPU 0) / 2895MiB (GPU 1)
Single-gpu: 4207MiB

I’ve been debugging and reading the docs for hours. I’d appreciate it if somebody could have a look.
Thanks.

mrshenli · January 17, 2020, 9:30pm

Hi @Jack5358

DistributedDataParallel’s single-process-multi-gpu mode is not recommended, because it does parameter replication, input split, output gather, etc. in every iteration, and the Python GIL might get in the way. If you just have one machine, with one process per machine, then it will be very similar to DataParallel.

The recommended solution is to use single-process-single-gpu, which means, in your use case with two GPUs, you can spawn two processes, and each process exclusively works on one GPU. This should be faster than the current setup.

gmondaut · February 7, 2020, 9:59am

Hi @mrshenli
Do you have a code reference for the recommended solution you are proposing (single-process-single-gpu)?
Thanks
gmondaut

pritamdamania87 · February 9, 2020, 2:41am

You can find some examples under the “Multi-Process Single-GPU” section in https://pytorch.org/docs/stable/nn.html#distributeddataparallel.

xiefeiwhu · April 16, 2020, 3:23pm

Hello, I have the same slowness problem. I use multiple nodes and multiple GPUs, also with spawn. After checking the code, I found that most of the time is spent in optimizer.step(). Any solutions?

mrshenli · April 16, 2020, 7:51pm

Hey @xiefeiwhu

optimizer.step() is not part of the DDP forward-backward pass. Which optimizer are you using? And do you observe the same slowness in local training?

BTW, how did you measure the delay? You might need to use CUDA events to get accurate timing measures, as there could be pending ops in the CUDA stream so that time.time() cannot faithfully represent the time cost.
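A minimal sketch of the CUDA-event timing suggested above. The helper name time_ms is hypothetical, and the CPU fallback is only there so the snippet also runs on a machine without a GPU:

```python
import time

import torch


def time_ms(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds).

    Hypothetical helper: uses CUDA events when a GPU is available,
    otherwise falls back to a wall-clock timer.
    """
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()                 # enqueue a start marker on the current stream
        result = fn(*args, **kwargs)
        end.record()                   # enqueue an end marker after fn's kernels
        torch.cuda.synchronize()       # wait until both markers have been reached
        return result, start.elapsed_time(end)
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - t0) * 1000.0
```

For example, `_, ms = time_ms(optimizer.step)` would time a single optimizer step. Because kernels launched before the start marker run ahead of it on the same stream, their cost is not attributed to the timed call, which is the problem with a bare time.time().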

mrshenli · April 16, 2020, 7:53pm

Sorry, I missed this. Yes, please check out this example that uses device_ids=[rank] to specify which device DDP should use.
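For readers without the linked example at hand, here is a rough sketch of the one-process-per-GPU pattern with device_ids. It is not the linked example: the tiny Linear model, the random data, the train function, and the environment-variable defaults are all assumptions made so the script also runs as a single CPU process.

```python
import os
import socket

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def _free_port() -> str:
    # Ask the OS for an unused TCP port (only used for a plain single-process run).
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return str(s.getsockname()[1])


def train(steps: int = 3) -> float:
    # A launcher normally sets these variables; the defaults allow a plain `python` run.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = os.environ.get("MASTER_PORT") or _free_port()

    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(local_rank)  # this process owns exactly one GPU
    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")
    backend = "nccl" if use_cuda and dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend, init_method=f"tcp://{addr}:{port}",
                            rank=rank, world_size=world_size)
    try:
        model = torch.nn.Linear(8, 2).to(device)  # tiny stand-in for the real model
        # device_ids pins DDP to this process's single GPU (None on CPU).
        ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        criterion = torch.nn.CrossEntropyLoss()
        loss = torch.zeros((), device=device)
        for _ in range(steps):
            x = torch.randn(16, 8, device=device)
            y = torch.zeros(16, dtype=torch.long, device=device)
            optimizer.zero_grad()
            loss = criterion(ddp_model(x), y)
            loss.backward()   # gradients are all-reduced across processes here
            optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(f"final loss: {train():.4f}")
```

With a recent PyTorch, something like `torchrun --nproc_per_node=2 main.py` should start one such process per GPU and set RANK, WORLD_SIZE, and LOCAL_RANK for each of them.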

xiefeiwhu · April 17, 2020, 1:38am

I use Adam as the optimizer. After carefully checking the code, the slowness seems to come from the communication time between the nodes, as my model has about 180M params, i.e. 760MB. Computation takes less time than communication. I then expanded from 2 nodes to 4: the communication time increased, but by less than 2x, so the training is accelerated to some extent.
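One possible reading of the "bigger but not 2 times" observation, as a back-of-the-envelope sketch. It assumes the gradients are synchronized with a ring all-reduce, in which each rank sends roughly 2(N-1)/N times the gradient size per step; whether the backend actually uses a ring here is an assumption, and the function name is hypothetical.

```python
def ring_allreduce_mb_per_rank(grad_mb: float, n_ranks: int) -> float:
    """Approximate data each rank sends in one ring all-reduce, in MB.

    Assumes the textbook ring algorithm: a reduce-scatter followed by an
    all-gather, each moving (n_ranks - 1) / n_ranks of the buffer.
    """
    return 2 * (n_ranks - 1) / n_ranks * grad_mb


# 760 MB is the model size quoted in the post above.
for n in (2, 4):
    print(n, "ranks:", ring_allreduce_mb_per_rank(760.0, n), "MB per rank")
```

Under this model, going from 2 to 4 ranks raises per-rank traffic from 760 MB to 1140 MB, a factor of 1.5 rather than 2. Latency, link bandwidth, and the number of GPUs per node are ignored here.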

pritamdamania87 · April 18, 2020, 12:50am

You can try ProcessGroupRoundRobin and see if it helps in reducing the communication time. Example usage: https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d.py#L1511. Note that this API is not officially supported yet.