DistributedDataParallel with single-process slower than single-gpu

Hi. I have a machine with multiple GPUs, and I wrote training code using the Single-Process Multi-GPU mode according to this section of the docs:

> Single-Process Multi-GPU
> In this case, a single process will be spawned on each host/node and each process will operate on all the GPUs of the node where it’s running. To use DistributedDataParallel in this way, you can simply construct the model as the following:
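
The construction the docs refer to is wrapping the model without `device_ids`, so that the single process drives all visible GPUs (a minimal reconstruction of the docs snippet):

```python
torch.distributed.init_process_group(backend="nccl")
# with no device_ids argument, the one process uses all visible GPUs
model = DistributedDataParallel(model)
```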

But I found that it is slower than just using a single GPU, so there must be something wrong in my code.

Code:

```python
# file: main.py
import time

import torch
import torchvision
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

torch.distributed.init_process_group(backend="nccl")

# dataset
class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.randn(3, 255, 255), 0

    def __len__(self):
        return 100

dataset = RandomDataset()
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=16, sampler=sampler)

# model
model = torch.nn.Sequential(
    torchvision.models.resnet101(False),
    torch.nn.Linear(1000, 2),
).cuda()
model = DistributedDataParallel(model)

# loss and optimizer (not shown in the original snippet; any standard choice)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

begin_time = time.time()
# training loop
for epoch in range(10):
    for x, y in dataloader:
        x = x.cuda()
        y = y.reshape(-1).cuda()
        optimizer.zero_grad()

        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
print('Cost:', time.time() - begin_time)
```

Launched with:

```
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 main.py
```

Time cost:

DistributedDataParallel with single process and 2 GPUs: 22s
Single GPU: 19s

GPU memory cost:

DistributedDataParallel with single process and 2 GPUs: 3101MiB (GPU 0) / 2895MiB (GPU 1)
Single GPU: 4207MiB

I’ve been debugging and reading the docs for hours. I’d appreciate it if somebody could take a look.
Thanks.

Hi @Jack5358

DistributedDataParallel’s single-process multi-GPU mode is not recommended, because it performs parameter replication, input scatter, output gather, etc. in every iteration, and the Python GIL can get in the way. If you have just one machine with one process, it behaves very similarly to DataParallel.
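
In other words, for your single-machine setup the following single-process alternative (a sketch, assuming the same model object as above) would behave much the same:

```python
# DataParallel does the same per-iteration replicate/scatter/gather,
# all inside one process
model = torch.nn.DataParallel(model.cuda())
```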

The recommended solution is single-process-single-GPU: in your use case with two GPUs, spawn two processes, each working exclusively on one GPU. This should be faster than the current setup.

Hi @mrshenli
Do you have a code reference for the recommended solution you are proposing, single-process-single-GPU?
Thanks
gmondaut

You can find some examples under the “Multi-Process Single-GPU” section in https://pytorch.org/docs/stable/nn.html#distributeddataparallel.
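
To make it concrete, here is a minimal sketch of how the script above could be adapted to one process per GPU (the file name `main_ddp.py` and the criterion/optimizer choices are just for illustration; `--local_rank` is supplied by `torch.distributed.launch`):

```python
# file: main_ddp.py — one process per GPU (sketch)
import argparse
import time

import torch
import torchvision
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # each process drives exactly one GPU
torch.distributed.init_process_group(backend="nccl")

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.randn(3, 255, 255), 0

    def __len__(self):
        return 100

dataset = RandomDataset()
sampler = DistributedSampler(dataset)  # shards the data across the processes
dataloader = DataLoader(dataset, batch_size=16, sampler=sampler)

model = torch.nn.Sequential(
    torchvision.models.resnet101(False),
    torch.nn.Linear(1000, 2),
).cuda(args.local_rank)
# pinning the replica to one GPU avoids the per-iteration
# replication/scatter/gather of the single-process multi-GPU mode
model = DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank
)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

begin_time = time.time()
for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for x, y in dataloader:
        x = x.cuda(args.local_rank)
        y = y.reshape(-1).cuda(args.local_rank)
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
print('Cost:', time.time() - begin_time)
```

Launch it with one process per GPU:

```
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 main_ddp.py
```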