Mac/Linux discrepancy and gradient clipping issue

I’m working on a recurrent model and was running into NaN and inf loss values, but only when training on an Ubuntu machine (with a GPU). On the Ubuntu machine the loss jumps to inf and then nan within the first few batches (a per-batch finiteness check I plan to add to pin down the first bad batch is sketched just after this log):

[epoch 0]
0/4731 [00:00<?, ?it/s]  
1/4731 [00:00<1:13:42,  1.07it/s, acc=6.970%, acc_avg=6.970%, loss=5.097, loss_avg=5.097]  
2/4731 [00:01<1:12:55,  1.08it/s, acc=20.000%, acc_avg=13.485%, loss=00inf, loss_avg=00inf]  
4/4731 [00:03<1:01:55,  1.27it/s, acc=11.667%, acc_avg=15.909%, loss=00nan, loss_avg=00nan]  
5/4731 [00:03<58:01,  1.36it/s, acc=6.667%, acc_avg=14.061%, loss=00nan, loss_avg=00nan]  
6/4731 [00:04<57:29,  1.37it/s, acc=30.000%, acc_avg=16.717%, loss=00nan, loss_avg=00nan]  
7/4731 [00:05<52:24,  1.50it/s, acc=16.667%, acc_avg=16.710%, loss=00nan, loss_avg=00nan]  
8/4731 [00:06<1:00:44,  1.30it/s, acc=33.333%, acc_avg=18.788%, loss=00nan, loss_avg=00nan]  
10/4731 [00:07<56:47,  1.39it/s, acc=38.333%, acc_avg=20.697%, loss=00nan, loss_avg=00nan]  
11/4731 [00:08<56:05,  1.40it/s, acc=36.667%, acc_avg=22.149%, loss=00nan, loss_avg=00nan]
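
To pin down where it blows up, the check I’m planning to drop into the training loop is roughly the following (a minimal sketch; torch.isfinite and the parameter scan are standard PyTorch calls, and the names just mirror the train() function further down):

import torch

def assert_finite(model, loss, batch_idx):
    # Meant to be called right after loss.backward(): flag the first batch where
    # either the loss or any parameter gradient goes inf/nan.
    if not torch.isfinite(loss).all():
        raise RuntimeError('non-finite loss {} at batch {}'.format(loss.item(), batch_idx))
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError('non-finite grad in {} at batch {}'.format(name, batch_idx))

(torch.autograd.set_detect_anomaly(True) would also localize the offending op in the backward pass, but it slows the epoch down considerably, so the explicit check is easier to leave in.)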

However, when running the identical code with identical training parameters on a Mac (CPU rather than GPU), training proceeds seemingly fine, just far more slowly. Strangely, the training accuracy is also much lower on the Mac (a seeding setup I can use to rule out RNG differences between the two runs is sketched after this log):

[epoch 0]
0/4731 [00:00<?, ?it/s]  
1/4731 [00:05<7:48:52,  5.95s/it, acc=7.727%, acc_avg=7.727%, loss=5.022, loss_avg=5.022]  
2/4731 [00:11<7:40:02,  5.84s/it, acc=10.455%, acc_avg=9.091%, loss=5.131, loss_avg=5.077]  
3/4731 [00:17<7:36:24,  5.79s/it, acc=10.000%, acc_avg=9.394%, loss=5.053, loss_avg=5.069]  
4/4731 [00:22<7:20:40,  5.59s/it, acc=8.182%, acc_avg=9.091%, loss=5.032, loss_avg=5.060]  
5/4731 [00:27<7:21:14,  5.60s/it, acc=11.515%, acc_avg=9.576%, loss=5.156, loss_avg=5.079]  
6/4731 [00:33<7:21:09,  5.60s/it, acc=9.848%, acc_avg=9.621%, loss=5.127, loss_avg=5.087]  
7/4731 [00:38<7:14:57,  5.52s/it, acc=8.333%, acc_avg=9.437%, loss=5.068, loss_avg=5.084]  
8/4731 [00:45<7:30:22,  5.72s/it, acc=6.818%, acc_avg=9.110%, loss=5.114, loss_avg=5.088]  
9/4731 [00:50<7:28:37,  5.70s/it, acc=4.848%, acc_avg=8.636%, loss=5.072, loss_avg=5.086]  
10/4731 [00:56<7:29:38,  5.71s/it, acc=8.333%, acc_avg=8.606%, loss=5.001, loss_avg=5.078]
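
One thing I haven’t fully ruled out is a difference in random state between the two runs, so for completeness this is the kind of seeding/determinism setup I can apply on both machines (standard PyTorch flags; the cuDNN lines only matter on the Ubuntu/GPU side):

import random
import numpy as np
import torch

def set_determinism(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # seeds the PyTorch RNGs
    torch.cuda.manual_seed_all(seed)           # harmless no-op on the CPU-only Mac
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning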

Regardless, to remedy the issue on the Ubuntu machine, which is the machine that will actually run the training, I tried adding gradient clipping as follows:

from torch.nn import utils   # torch.nn.utils, provides clip_grad_norm_
from tqdm import tqdm

# RunningAverage and accuracy are small project helpers (not shown here).

def train(model, loader, criterion, optimizer, scheduler, device, clip=None, summary=None):
    loss_avg = RunningAverage()
    acc_avg = RunningAverage()

    model.train()

    with tqdm(total=len(loader)) as t:
        for i, (frames, label_map, centers, _) in enumerate(loader):
            frames, label_map, centers = frames.to(device), label_map.to(device), centers.to(device)

            outputs = model(frames, centers)
            loss = criterion(outputs, label_map)
            acc = accuracy(outputs, label_map)

            optimizer.zero_grad()
            loss.backward()
            if clip is not None:
                # Clip the total gradient norm before the optimizer step.
                utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
            scheduler.step()   # scheduler is stepped every batch

            loss_avg.update(loss.item())
            acc_avg.update(acc)

            if summary is not None:
                summary.add_scalar_value('Train Accuracy', acc)
                summary.add_scalar_value('Train Loss', loss.item())

            t.set_postfix(loss='{:05.3f}'.format(loss.item()), acc='{:05.3f}%'.format(acc * 100),
                          loss_avg='{:05.3f}'.format(loss_avg()), acc_avg='{:05.3f}%'.format(acc_avg() * 100))
            t.update()

        return loss_avg(), acc_avg()

I then tried training with clip values of 100, 10, 1, 0.25, 0.1, and 0.01, all the way down to 0.000001, and got exactly the same output every time. I verified that clip_grad_norm_ was actually being called by placing a print statement inside that if block, and I inspected model.named_parameters() to confirm that all the expected layers were included (they were).
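
For reference, the check inside that if block looked roughly like this (a sketch of my own print, not the final code; the useful detail is that clip_grad_norm_ returns the total gradient norm measured before clipping):

            if clip is not None:
                # clip_grad_norm_ returns the total norm of all parameter gradients,
                # computed *before* any clipping is applied.
                total_norm = utils.clip_grad_norm_(model.parameters(), clip)
                print('batch {}: pre-clip grad norm = {:.3e} (clip={})'.format(i, float(total_norm), clip))

If that printed norm is already inf or nan, it would at least confirm that the gradients are non-finite before the clip is ever applied.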

Any ideas as to (1) what’s causing training to proceed fine on the Mac while the gradients explode on the Ubuntu machine, (2) why the accuracy is so much lower on the Mac, and (3) why gradient clipping isn’t mitigating the issue?