I use the following code, but I think it may be wrong!
lr_broadcast = torch.tensor([0.0]).cuda()
if distribute:
if local_rank == 0:
lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
dist.all_reduce(lr_broadcast)
optimizer.state_dict()['param_groups'][0]['lr'] = lr_broadcast.item()
print(local_rank, optimizer.state_dict()['param_groups'][0]['lr'])
Could you please format the code with proper indentation?
If the dist.all_reduce call is within the if local_rank == 0: block, then this is indeed wrong. You need to call all_reduce on all processes, as it is a collective communication. In other words, this placement is the problem:
if local_rank == 0:
    dist.all_reduce(lr_broadcast)
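If that is the case, moving the call out of the rank check should fix it; here is a minimal sketch of what I mean (assuming lr_broadcast is initialized to a zero tensor on every rank, as in your snippet):

if distribute:
    if local_rank == 0:
        # only rank 0 fills in the real lr; the other ranks keep 0.0
        lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
    # every rank must enter the collective, otherwise the others never receive the value
    dist.all_reduce(lr_broadcast)  # sum across ranks == rank 0's lr, since the rest contribute 0.0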
Here is my code with proper indentation:

if distribute and local_rank == 0:
    acc = verification(...)
    # lr_sched is torch.optim.lr_scheduler.ReduceLROnPlateau()
    lr_sched.step(acc)

lr_broadcast = torch.tensor([0.0]).cuda()
if distribute:
    if local_rank == 0:
        lr_broadcast = torch.tensor([optimizer.state_dict()['param_groups'][0]['lr']]).cuda()
    dist.all_reduce(lr_broadcast)
    optimizer.state_dict()['param_groups'][0]['lr'] = lr_broadcast.item()
    print(local_rank, optimizer.state_dict()['param_groups'][0]['lr'])
When I run this code with 2 GPUs, I find two problems.
First, dist.all_reduce() fails: when I print lr_broadcast after all_reduce, the values differ on GPU0 and GPU1.
Second, after I write the lr_broadcast value into optimizer.state_dict(), I print the lr in the optimizer, but the value differs from lr_broadcast.
So I am confused.
I noticed the code called .cuda() without specifying a GPU id. Did you set CUDA_VISIBLE_DEVICES or call torch.cuda.set_device()? Could you please share a self-contained repro that shows the error?
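One more note on the second issue: optimizer.state_dict() returns a copy of the param group settings, so writing into that dict will not change the lr the optimizer actually uses. A small sketch of what I would try instead (just a sketch, assuming the single param group from your snippet):

# write the reduced lr back into the live optimizer, not into a state_dict() copy
new_lr = lr_broadcast.item()
for param_group in optimizer.param_groups:
    param_group['lr'] = new_lr
print(local_rank, optimizer.param_groups[0]['lr'])  # should now match lr_broadcast on every rank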
Well, I have figured out how to change my code so that all_reduce succeeds, but I'm still a little confused.
My code before (which fails):
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    processes = []
    for rank in range(world_size):
        p = Process(target=run, args=(rank, world_size))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

def run(local_rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    torch.cuda.set_device(local_rank)
    dist.init_process_group(world_size=world_size, rank=local_rank, backend='nccl')
    net = Net().cuda()
    criterion = ...
    optimizer = ...
    net = DDP(net, device_ids=[local_rank])
    for ep in range(2):
        # train
        net.train()
        for i, (images, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            images = images.cuda()
            labels = labels.cuda()
            outputs = net(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # verification
        if local_rank != 0:
            val_acc = torch.tensor([0.0]).cuda()
        # only do verification on rank 0
        if local_rank == 0:
            net.eval()
            # simplified verification simulation
            input = torch.randn(8, 3, 112, 112).cuda()
            with torch.no_grad():
                output = net(input)
            val_acc = torch.tensor([0.0009]).cuda()
        dist.all_reduce(val_acc)
        print(local_rank, val_acc)

main()
Running the code above prints:
1 tensor([0.0], device='cuda:1')
0 tensor([0.0009], device='cuda:0')
And in epoch 2, local_rank 1 hangs.
My solution was to change the line "output = net(input)" to "output = net.module(input)", and then it succeeds:
0 tensor([0.0009], device='cuda:0')
1 tensor([0.0009], device='cuda:1')
So I want to know why this happens. Could you please explain it?
Hey @MRGAO1996, does your model use any buffers (e.g., the running mean in BatchNorm)?
As you have wrapped the second forward in with torch.no_grad():, rank 0 will skip the code in the if torch.is_grad_enabled() and ... branch (see the DDP forward below). The next forward will then behave differently on the two ranks, as rank 0 would skip _sync_params() but rank 1 would execute it. This should only make a difference when your model has buffers, though.
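That is also why calling the underlying module directly fixes it: net.module(input) does not go through DDP's forward at all, so it cannot leave rank 0 in a different state than rank 1 for the next forward. A rough sketch of that pattern, under the same assumptions as your repro:

# rank-0-only evaluation that does not go through the DDP wrapper's forward
if local_rank == 0:
    net.eval()
    with torch.no_grad():
        output = net.module(input)  # plain module call, no DDP bookkeeping is touched
    val_acc = torch.tensor([0.0009]).cuda()
else:
    val_acc = torch.tensor([0.0]).cuda()
dist.all_reduce(val_acc)  # every rank reaches the collective, so nothing hangs in the next epoch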