I use the following code, but I think it may be wrong!
lr_broadcast = torch.tensor([0.0]).cuda()
if distribute:
if local_rank == 0:
lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
dist.all_reduce(lr_broadcast)
optimizer.state_dict()['param_groups'][0]['lr'] = lr_broadcast.item()
print(local_rank, optimizer.state_dict()['param_groups'][0]['lr'])
Could you please format the code with proper indentation?
If the dist.all_reduce call is within the if local_rank == 0: block, then this is indeed wrong. You need to call all_reduce on all processes, as it is a collective communication. In other words, this placement is the problem:
if local_rank == 0:
    dist.all_reduce(lr_broadcast)
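If that is the case, moving the call out of the rank check should fix it; here is a minimal sketch of what I mean (assuming lr_broadcast is initialized to a zero tensor on every rank, as in your snippet):

if distribute:
    if local_rank == 0:
        # only rank 0 fills in the real lr; the other ranks keep 0.0
        lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
    # every rank must enter the collective, otherwise the others never receive the value
    dist.all_reduce(lr_broadcast)  # sum across ranks == rank 0's lr, since the rest contribute 0.0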
Here is my code with proper indentation:

if distribute and local_rank == 0:
    acc = verification(...)
    # lr_sched is torch.optim.lr_scheduler.ReduceLROnPlateau()
    lr_sched.step(acc)

lr_broadcast = torch.tensor([0.0]).cuda()
if distribute:
    if local_rank == 0:
        lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
    dist.all_reduce(lr_broadcast)
    optimizer.state_dict()['param_groups'][0]['lr'] = lr_broadcast.item()
    print(local_rank, optimizer.state_dict()['param_groups'][0]['lr'])
When I run this code with 2 GPUs, I find two problems.
First, dist.all_reduce() fails: when I print lr_broadcast after all_reduce, the values differ on GPU0 and GPU1.
Second, after I write the lr_broadcast value into optimizer.state_dict(), I print the lr in the optimizer, but the value differs from lr_broadcast.
So I am confused.
I noticed the code called .cuda() without specifying a GPU id. Did you set CUDA_VISIBLE_DEVICES or call torch.cuda.set_device()? Could you please share a self-contained repro that shows the error?
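One more note on the second issue: optimizer.state_dict() returns a copy of the param group settings, so writing into that dict will not change the lr the optimizer actually uses. A small sketch of what I would try instead (just a sketch, assuming the single param group from your snippet):

# write the reduced lr back into the live optimizer, not into a state_dict() copy
new_lr = lr_broadcast.item()
for param_group in optimizer.param_groups:
    param_group['lr'] = new_lr
print(local_rank, optimizer.param_groups[0]['lr'])  # should now match lr_broadcast on every rank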
Well, I have figured out how to change my code so that all_reduce succeeds, but I'm still a little confused.
My code before (which fails):
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    processes = []
    for rank in range(world_size):
        p = Process(target=run, args=(rank, world_size))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

def run(local_rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    torch.cuda.set_device(local_rank)
    dist.init_process_group(world_size=world_size, rank=local_rank, backend='nccl')
    net = Net().cuda()
    criterion = ...
    optimizer = ...
    net = DDP(net, device_ids=[local_rank])
    for ep in range(2):
        # train
        net.train()
        for i, (images, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            images = images.cuda()
            labels = labels.cuda()
            outputs = net(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # verification
        if local_rank != 0:
            val_acc = torch.tensor([0.0]).cuda()
        # only do verification on rank 0
        if local_rank == 0:
            net.eval()
            # simplified verification simulation
            input = torch.randn(8, 3, 112, 112).cuda()
            with torch.no_grad():
                output = net(input)
            val_acc = torch.tensor([0.0009]).cuda()
        dist.all_reduce(val_acc)
        print(local_rank, val_acc)

main()
Running the code above prints:
1 tensor([0.0], device='cuda:1')
0 tensor([0.0009], device='cuda:0')
And in epoch 2, local_rank 1 hangs.
My solution was to change the line "output = net(input)" to "output = net.module(input)", and then it succeeds:
0 tensor([0.0009], device='cuda:0')
1 tensor([0.0009], device='cuda:1')
So I want to know why this happens. Could you please explain it?
Hey @MRGAO1996, does your model use any buffers (e.g., the running mean in BatchNorm)?
As you have wrapped the second forward in with torch.no_grad():, rank 0 will skip the code in the if torch.is_grad_enabled() and ... branch (see the DDP forward below). The next forward will then behave differently on the two ranks, as rank 0 would skip _sync_params() but rank 1 would execute it. This should only make a difference when your model has buffers, though.
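That is also why calling the underlying module directly fixes it: net.module(input) does not go through DDP's forward at all, so it cannot leave rank 0 in a different state than rank 1 for the next forward. A rough sketch of that pattern, under the same assumptions as your repro:

# rank-0-only evaluation that does not go through the DDP wrapper's forward
if local_rank == 0:
    net.eval()
    with torch.no_grad():
        output = net.module(input)  # plain module call, no DDP bookkeeping is touched
    val_acc = torch.tensor([0.0009]).cuda()
else:
    val_acc = torch.tensor([0.0]).cuda()
dist.all_reduce(val_acc)  # every rank reaches the collective, so nothing hangs in the next epoch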