How to decay the learning rate with distributed training using test accuracy?

I use the following code, but I think it may be wrong!
lr_broadcast = torch.tensor([0.0]).cuda()
if distribute:
if local_rank == 0:
lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
dist.all_reduce(lr_broadcast)
optimizer.state_dict()['param_groups'][0]['lr'] = lr_broadcast.item()
print(local_rank, optimizer.state_dict()['param_groups'][0]['lr'])

@MRGAO1996

Could you please format the code with proper indentation?

If the dist.all_reduce(lr_broadcast) call is within the if local_rank == 0 block, then it is indeed wrong. You need to call all_reduce on all processes, since it is a collective communication:

# wrong: only rank 0 reaches the collective call
if local_rank == 0:
    dist.all_reduce(lr_broadcast)
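For example, here is a minimal sketch (reusing the names from your snippet) where every process reaches the collective, and the reduced value is written to the live optimizer.param_groups rather than to the copy returned by optimizer.state_dict():

lr_broadcast = torch.tensor([0.0]).cuda()
if local_rank == 0:
    # only rank 0 knows the new lr; the other ranks contribute 0.0 to the sum
    lr_broadcast = torch.tensor([optimizer.param_groups[0]['lr']]).cuda()
dist.all_reduce(lr_broadcast)  # every process must reach this call
for param_group in optimizer.param_groups:
    # optimizer.param_groups is the live list; optimizer.state_dict() returns
    # a copy, so assigning into the state_dict does not change the optimizer
    param_group['lr'] = lr_broadcast.item()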
Here is the formatted code:

if distribute and local_rank == 0:
    acc = verification(...)
    # lr_sched is torch.optim.lr_scheduler.ReduceLROnPlateau()
    lr_sched.step(acc)
lr_broadcast = torch.tensor([0.0]).cuda()
if distribute:
    if local_rank == 0:
        lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
    dist.all_reduce(lr_broadcast)
    optimizer.state_dict()['param_groups'][0]['lr'] = lr_broadcast.item()
    print(local_rank, optimizer.state_dict()['param_groups'][0]['lr'])

When I run this code with 2 GPUs, I find two problems.
First, dist.all_reduce() seems to fail: I print lr_broadcast after all_reduce, and it is different on GPU0 and GPU1.
Second, after I assign the lr_broadcast value back through optimizer.state_dict(), I print the lr in the optimizer, and the value differs from lr_broadcast.
So I am confused.

I noticed the code called .cuda() without specifying a GPU id. Did you set CUDA_VISIBLE_DEVICES or call torch.cuda.set_device()? Could you please share a self-contained repro that shows the error?
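For example, with one process per GPU, something along these lines pins each process to its own device (just a sketch; local_rank is assumed to be the GPU index for the process):

torch.cuda.set_device(local_rank)          # make cuda:<local_rank> the default CUDA device
lr_broadcast = torch.tensor([0.0]).cuda()  # now created on cuda:<local_rank>
# or pass the device explicitly:
lr_broadcast = torch.tensor([0.0], device=f'cuda:{local_rank}')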

Well, I have figured out how to change my code so that all_reduce succeeds, but I am still a little confused.

My code before (which fails):

import os

import torch
import torch.distributed as dist
from torch.multiprocessing import Process
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    world_size = 2
    processes = []
    for rank in range(world_size):
        p = Process(target=run, args=(rank, world_size))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

def run(local_rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    torch.cuda.set_device(local_rank)
    dist.init_process_group(world_size=world_size, rank=local_rank, backend='nccl')
    net = Net().cuda()  # Net and train_loader are defined elsewhere
    criterion = ...
    optimizer = ...
    net = DDP(net, device_ids=[local_rank])
    for ep in range(2):
        # train
        net.train()
        for i, (images, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            images = images.cuda()
            labels = labels.cuda()
            outputs = net(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # verification
        if local_rank != 0:
            val_acc = torch.tensor([0.0]).cuda()
        # only do verification on rank 0
        if local_rank == 0:
            net.eval()
            # simplified verification simulation
            input = torch.randn(8, 3, 112, 112).cuda()
            with torch.no_grad():
                output = net(input)
            val_acc = torch.tensor([0.0009]).cuda()
        dist.all_reduce(val_acc)
        print(local_rank, val_acc)

if __name__ == '__main__':
    main()

The output is:

1 tensor([0.0], device='cuda:1')
0 tensor([0.0009], device='cuda:0')

And in epoch 2, local_rank 1 hangs.

My solution is to change the line "output = net(input)" to "output = net.module(input)", and then it succeeds:

0 tensor([0.0009], device='cuda:0')
1 tensor([0.0009], device='cuda:1')

So I want to know why this happens.

Could you please explain this? 🙂

Hey @MRGAO1996

Does your model use any buffers (e.g., running mean in BatchNorm)?

Since you wrapped the second forward in with torch.no_grad():, rank 0 will skip the code in the if torch.is_grad_enabled() and ... branch of DDP's forward. As a result, the behavior of the next forward will differ on the two ranks: rank 0 would skip _sync_params(), but rank 1 would execute it. However, this should only make a difference when your model has buffers.
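For reference, here is a minimal sketch of the rank-0-only evaluation pattern from your code, but calling net.module for the evaluation forward (as in your fix) so that DDP's forward, and any buffer synchronization it triggers, never runs on a single rank; all ranks then meet at the same all_reduce:

val_acc = torch.tensor([0.0]).cuda()
if local_rank == 0:
    net.eval()
    with torch.no_grad():
        # net.module is the wrapped model; its forward does not go through
        # DDP, so no collective is issued that the other ranks never join
        output = net.module(torch.randn(8, 3, 112, 112).cuda())
    val_acc = torch.tensor([0.0009]).cuda()  # placeholder for the real accuracy
    net.train()
dist.all_reduce(val_acc)  # executed by every rank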