Hi. I have a machine with multiple GPUs.
I wrote training code in Single-Process Multi-GPU mode, following this section of the docs:
Single-Process Multi-GPU
In this case, a single process will be spawned on each host/node and each process will operate on all the GPUs of the node where it’s running. To use DistributedDataParallel in this way, you can simply construct the model as the following:
But I found that it is slower than just using a single GPU. There must be something wrong in my code.
Code:
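# file: main.py
import time

import torch
import torchvision
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

torch.distributed.init_process_group(backend="nccl")

# dataset: random images with a constant label
class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.randn(3, 255, 255), 0

    def __len__(self):
        return 100

datasets = RandomDataset()
sampler = DistributedSampler(datasets)
dataloader = DataLoader(datasets, 16, sampler=sampler)

# model
model = torch.nn.Sequential(
    torchvision.models.resnet101(False),
    torch.nn.Linear(1000, 2)
).cuda()
# device_ids omitted: the single-process multi-GPU mode from the docs quoted above
model = DistributedDataParallel(model)

# optimizer and loss were not shown in the original post;
# SGD and cross-entropy are assumed here
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

begin_time = time.time()
# training loop
for i in range(10):
    for x, y in dataloader:
        x = x.cuda()
        y = y.reshape(-1).cuda()
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
print('Cost:', time.time() - begin_time)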
DistributedDataParallel’s single-process multi-GPU mode is not recommended, because it replicates parameters, splits inputs, and gathers outputs in every iteration, and the Python GIL might get in the way. If you have just one machine, with one process per machine, it behaves very much like DataParallel.
The recommended solution is to use single-process-single-gpu, which means, in your use case with two GPUs, you can spawn two processes, and each process exclusively works on one GPU. This should be faster than the current setup.
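A rough sketch of what the one-process-per-GPU setup could look like is given here. The names train_worker and launch, the port number, and the tiny Linear model standing in for the ResNet are all illustrative assumptions, not code from the original post. The CPU/gloo fallback is there only so the sketch can also run on a machine without GPUs.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def train_worker(rank, world_size, port=29500, steps=2):
    """Run a few training steps in one process bound to one device."""
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",  # gloo is only a CPU fallback
        init_method=f"tcp://127.0.0.1:{port}",   # assumed free local port
        rank=rank,
        world_size=world_size,
    )
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Tiny stand-in model (assumption); each process owns exactly one replica.
    model = torch.nn.Linear(8, 2).to(device)
    ddp_model = DistributedDataParallel(
        model, device_ids=[rank] if use_cuda else None
    )
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    final_loss = float("nan")
    for _ in range(steps):
        x = torch.randn(4, 8, device=device)
        y = torch.zeros(4, dtype=torch.long, device=device)
        optimizer.zero_grad()
        loss = criterion(ddp_model(x), y)
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()
        final_loss = loss.item()

    dist.destroy_process_group()
    return final_loss


def launch(world_size=None):
    """Spawn one worker per visible GPU (or a single CPU worker)."""
    if world_size is None:
        world_size = max(torch.cuda.device_count(), 1)
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```

To try it, call launch() under an `if __name__ == "__main__":` guard. With a real dataset, each process would still wrap it in a DistributedSampler, as in the original script, so that every process sees a different shard.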
Hello, I ran into the same slowdown. I use multiple nodes with multiple GPUs, also launched with spawn. After checking the code, I found that most of the time is spent in optimizer.step(). Any solutions?
optimizer.step() is not part of the DDP forward-backward pass. Which optimizer are you using? And do you observe the same slowness in local training?
BTW, how did you measure the delay? You might need to use CUDA events to get accurate timings: CUDA ops run asynchronously, so there could still be pending ops in the CUDA stream, and time.time() would then not faithfully represent the time cost.
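For reference, timing with CUDA events might look roughly like the following. The helper name timed is made up for illustration, and the CPU branch exists only so the snippet also runs without a GPU.

```python
import time

import torch


def timed(fn, *args):
    """Return (result, elapsed milliseconds) for fn(*args)."""
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = fn(*args)
        end.record()
        # Block until all queued GPU work, including `end`, has finished.
        torch.cuda.synchronize()
        return result, start.elapsed_time(end)  # elapsed_time is in ms
    # CPU fallback: no asynchronous stream, so a wall clock is sufficient.
    t0 = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - t0) * 1000.0
```

Wrapping optimizer.step() and loss.backward() separately in such a helper should show where the time actually goes, since work queued by backward() can otherwise be attributed to whichever later call happens to synchronize.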
I use Adam as the optimizer. After carefully checking the code, the slowness seems to come from the communication between the nodes, as my model has about 180M params, i.e. 760MB. The computation takes less time than the communication. I then increased the number of nodes from 2 to 4: the communication time grew, but by less than 2x, so training still speeds up to some extent.
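A back-of-the-envelope estimate is consistent with that observation, assuming the gradients are synchronized with a bandwidth-bound ring all-reduce (NCCL may pick a different algorithm, so treat this as a sketch only). In a ring all-reduce over N participants, each one sends about 2 * (N - 1) / N times the gradient size per step. The function name below is illustrative.

```python
def ring_allreduce_traffic_mb(size_mb, nodes):
    """Approximate MB sent per node for one ring all-reduce of size_mb."""
    if nodes < 2:
        return 0.0  # nothing to exchange with a single participant
    return 2 * (nodes - 1) / nodes * size_mb


grad_mb = 760  # gradient size quoted in the post above
two = ring_allreduce_traffic_mb(grad_mb, 2)
four = ring_allreduce_traffic_mb(grad_mb, 4)
print(f"2 nodes: {two:.0f} MB per node per step")
print(f"4 nodes: {four:.0f} MB per node per step")
print(f"ratio:   {four / two:.2f}x")
```

Under these assumptions the per-node traffic grows by 1.5x rather than 2x when going from 2 to 4 nodes, while each node processes only half as much data per epoch.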