DistributedDataParallel with single-process slower than single-GPU

Hi. I have a machine with multiple GPUs, and I wrote training code using the Single-Process Multi-GPU approach described in the docs:

Single-Process Multi-GPU
In this case, a single process will be spawned on each host/node and each process will operate on all the GPUs of the node where it’s running. To use DistributedDataParallel in this way, you can simply construct the model as the following:

But I found that it is slower than just using a single GPU, so there must be something wrong in my code.

code

# file: main.py
import time

import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

torch.distributed.init_process_group(backend="nccl")

# dataset
class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.randn(3, 255, 255), 0

    def __len__(self):
        return 100

datasets = RandomDataset()
sampler = DistributedSampler(datasets)
dataloader = DataLoader(datasets, batch_size=16, sampler=sampler)

# model
model = torch.nn.Sequential(
    torchvision.models.resnet101(pretrained=False),
    torch.nn.Linear(1000, 2),
).cuda()
model = DistributedDataParallel(model)

# criterion and optimizer (not shown in the original post; typical choices assumed)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

begin_time = time.time()
# training loop
for i in range(10):
    for x, y in dataloader:
        x = x.cuda()
        y = y.reshape(-1).cuda()
        optimizer.zero_grad()

        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
print('Cost:', time.time() - begin_time)

launch with

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 main.py

Time cost:

DistributedDataParallel with single process and 2 GPUs: 22s
Single GPU: 19s

GPU memory cost:

DistributedDataParallel with single process and 2 GPUs: 3101MiB (GPU 0) / 2895MiB (GPU 1)
Single GPU: 4207MiB

I’ve been debugging and reading the docs for hours. I’d appreciate it if somebody could have a look.
Thanks.

Hi @Jack5358

DistributedDataParallel’s single-process-multi-GPU mode is not recommended, because it does parameter replication, input splitting, output gathering, etc. in every iteration, and the Python GIL might get in the way. If you have just one machine with one process on it, it behaves very similarly to DataParallel.

The recommended solution is single-process-single-GPU, which means that, in your use case with two GPUs, you spawn two processes and each process works exclusively on one GPU. This should be faster than the current setup; see the sketch below.
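For example, a minimal sketch of that setup (assuming one machine with two GPUs; the model, hyperparameters, and port here are illustrative, not from the original post):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # one process per GPU; each process binds to exactly one device
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 2).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])  # single-GPU mode of DDP

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(10):
        x = torch.randn(16, 10, device=rank)
        y = torch.zeros(16, dtype=torch.long, device=rank)
        optimizer.zero_grad()
        loss = criterion(ddp_model(x), y)
        loss.backward()   # gradients are all-reduced across the two processes
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # two GPUs -> two processes
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)

With real data you would also keep the DistributedSampler, so that each process sees a different shard of the dataset.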


Hi @mrshenli,
Do you have a code reference for the recommended solution you are proposing, single-process-single-GPU?
Thanks
gmondaut

You can find some examples under the “Multi-Process Single-GPU” section in https://pytorch.org/docs/stable/nn.html#distributeddataparallel.
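If you want to keep the torch.distributed.launch workflow from the original post, a sketch of the per-process setup could look like this (the --local_rank handling follows that utility's convention; the model is a placeholder):

# file: main.py (multi-process single-GPU variant)
import argparse
import torch
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

torch.distributed.init_process_group(backend="nccl")  # reads MASTER_ADDR/PORT etc. from env
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(10, 2).cuda(args.local_rank)
model = DistributedDataParallel(model, device_ids=[args.local_rank])

launched with one process per GPU:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 main.py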

Hello, I’m hitting the same slowness. I use multiple nodes and multiple GPUs, also with spawn. After checking the code, I found that most of the time is spent in optimizer.step(). Any solutions?

Hey @xiefeiwhu

optimizer.step() is not part of the DDP forward-backward pass. Which optimizer are you using, and do you observe the same slowness in local training?

BTW, how did you measure the delay? You might need to use CUDA events to get accurate timing measures, as there could be pending ops in the CUDA stream so that time.time() cannot faithfully represent the time cost.
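For example, something along these lines (a sketch of the usual CUDA-event timing pattern; the model and input are stand-ins):

import torch

model = torch.nn.Linear(10, 2).cuda()   # stand-in for the real model
x = torch.randn(16, 10, device='cuda')

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
out = model(x)                          # region to time, e.g. forward/backward/step
end.record()

torch.cuda.synchronize()                # wait until all queued kernels have finished
print('elapsed:', start.elapsed_time(end), 'ms')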

Sorry, I missed this. Yes, please check out this example, which uses device_ids=[rank] to specify which device DDP should use.

I use Adam as the optimizer. After checking the code carefully, the slowness seems to come from the communication time between nodes, as my model has about 180M parameters, i.e. roughly 760MB. The computation time is shorter than the communication time. I then expanded from 2 nodes to 4; the communication time grew, but by less than 2x, which speeds up training to some extent.
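(As a quick sanity check, assuming fp32 parameters, that size is roughly consistent with the parameter count:)

params = 180e6
print(params * 4 / 1e6, 'MB')  # ~720 MB of fp32 gradients all-reduced per iteration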

You can try ProcessGroupRoundRobin and see if it helps in reducing the communication time. Example usage: https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d.py#L1511. Note that this API is not officially supported yet.