Thanks for your reply. Yes, DataParallel will split the batch in dim0 and send the chunks to each device. It introduces additional overhead via the per-iteration model copies and should be slower than DDP. Could you share a minimal, executable code snippet showing that DDP is slower than DP, please?
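For reference, here is a minimal sketch of that dim0 split (the TinyNet module and the two-GPU setup are assumptions for illustration, not your model):

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def forward(self, x):
        # Each replica sees only its chunk of the batch.
        print(f"device={x.device}, chunk shape={tuple(x.shape)}")
        return x * 2

# DataParallel replicates the module onto every visible GPU on each
# forward pass and scatters the input along dim 0.
model = nn.DataParallel(TinyNet().cuda())
out = model(torch.randn(8, 4).cuda())  # with two GPUs: two chunks of 4
print(out.shape)  # outputs are gathered back to (8, 4) on the default device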
I have the following code, which runs slowly on my data. I haven’t tested it on small sample data, so I’m sorry that my code is not executable as posted.
import os

import torch
import torch.distributed
import torch.multiprocessing
import torch.utils.data


def main(gpu, cfg):
    """
    `gpu` runs from 0 to world_size - 1 and is used as the global rank.
    The local rank (the device index on this node) is derived from it:
    gpu % len(cfg.gpus) gives 0..7 on every node, and equals gpu itself
    in the single-node case.
    """
    cfg.gpu = gpu % len(cfg.gpus)  # local rank (device index on this node)
    cfg.rank = gpu                 # global rank
    torch.distributed.init_process_group(backend='nccl',
                                         init_method=cfg.dist_url,
                                         world_size=cfg.world_size,
                                         rank=cfg.rank)
    torch.cuda.set_device(cfg.gpu)
    torch.distributed.barrier()
    train_dataset = getDataset('train')
    valid_dataset = getDataset('valid')
    # Each rank draws a disjoint shard of the training set; shuffle=False
    # keeps a fixed sample order across epochs.
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                                    num_replicas=cfg.world_size,
                                                                    rank=cfg.rank,
                                                                    shuffle=False,
                                                                    drop_last=True)
    train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                                   collate_fn=my_collate,
                                                   worker_init_fn=worker_init_fn,
                                                   sampler=train_sampler)
    valid_dataloader = torch.utils.data.DataLoader(valid_dataset,
                                                   collate_fn=my_collate,
                                                   sampler=None)
    model = Net()
    model.cuda(cfg.gpu)
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[cfg.gpu],
                                                      output_device=cfg.gpu,
                                                      gradient_as_bucket_view=True)
    optimizer = optimizer()                # placeholder: actual optimizer omitted
    criterion = criterion().cuda(cfg.gpu)  # placeholder: actual loss omitted
    scheduler = scheduler()                # placeholder: actual scheduler omitted
    device = torch.device("cuda:{}".format(cfg.gpu))
    runner = Runner(model,
                    optimizer,
                    criterion,
                    scheduler,
                    train_dataloader,
                    valid_dataloader,
                    model_device=device)
    runner.train()
    torch.distributed.destroy_process_group()
if __name__ == "__main__":
    cfg = getConfig()
    os.environ["MASTER_ADDR"] = hostname        # hostname/portNumber defined elsewhere
    os.environ["MASTER_PORT"] = str(portNumber)
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, cfg.gpus))  # cfg.gpus = [0, 1, 2, 3, 4, 5, 6, 7] per node
    cfg.dist_url = "env://"  # with env://, MASTER_ADDR/MASTER_PORT are read from the environment
    torch.multiprocessing.spawn(main, nprocs=cfg.world_size, args=(cfg,))
I guess you are accidentally creating CUDA contexts on the default device and it’s running out of memory. Try to use torch.cuda.set_device or mask the devices via CUDA_VISIBLE_DEVICES.
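Roughly along these lines (a minimal sketch; the worker function and the local_rank argument coming from your spawn call are assumptions):

import os
import torch

def worker(local_rank):
    # Bind this process to its GPU before any other CUDA call, so that
    # no context is created on the default device (cuda:0).
    torch.cuda.set_device(local_rank)
    x = torch.zeros(1, device="cuda")  # allocated on cuda:<local_rank>

# Alternatively, mask the devices before CUDA is initialized; the process
# then sees only its own GPU, exposed as cuda:0.
# os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)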
I am using both of these (as shown in the code above). I don’t know what else I should do; I’m stuck.