Torch.cuda.amp and Accuracy

I converted my training loop to use AMP, and I notice my accuracy numbers are now all 0. What needs to be changed to work with AMP? z was calculated within a with autocast(): block, so I thought it should be fine, but apparently it is not.

        print("Training..............................\r", end='') 
        train_iter = iter(train_loader)
        next_batch = next(train_iter)
        next_batch = [_.cuda(non_blocking=True) for _ in next_batch ]
        for idx in range(len(train_loader)):
            image, meta, y = next_batch
            if(param['cache_on']):
                if(epoch == 0):
                    print(f"Loading Cache using train_loader {idx*train_loader.batch_size + image.shape[0]}\r", end='')   

            if idx + 1 != len(train_loader): 
                # start copying data of next batch
                next_batch = next(train_iter)
                next_batch = [ _.cuda(non_blocking=True) for _ in next_batch]


            optim.zero_grad()
            # AMP
            with autocast():
                z = model((image, meta))
                y_smo = y.float() * (1 - param['label_smoothing']) + 0.5 * param['label_smoothing']
                loss = criterion(z, y_smo.unsqueeze(1))

            # Before AMP
#             loss.backward()
            # AMP
            scaler.scale(loss).backward()


            # Before AMP
#             optim.step()
            # AMP
            scaler.step(optim)
            scaler.update()
            
            pred = torch.sigmoid(torch.round(z)) # round off sigmoid to obtain predictions  
            correct += (pred.cpu() == y.cpu().unsqueeze(1)).sum().item()  # tracking number of correctly predicted samples
            epoch_loss += loss.item()
        train_acc = correct / len(train_idx)

Do you suspect the accuracy calculation to be wrong, or is the loss also not decreasing when using amp?
In the latter case, could you check the target calculation and see if the expected values are produced? If not, could you move the target calculation outside of the autocast region and recheck it?

Also, what kind of model are you using at the moment? And I assume the training was working fine without amp?

@ptrblck Accuracy is fine if I remove the autocast clause. It's just an EfficientNet B1 model. The output of Accuracy is just 0, which is fine since I am not actually using that metric for this model, but obviously something has happened to z, and perhaps there is a process to reverse it so I can use it to calculate my accuracy. I can dig in and see where the number is being fudged, but I figure there may be something obvious I was doing wrong.

Which value are you using for the label smoothing and which criterion are you using?

Are you using a sigmoid at the end of your model and nn.BCELoss?
If so, could you remove the sigmoid and use nn.BCEWithLogitsLoss?
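If it helps, here is a quick sanity check that the two formulations match numerically (the logits and targets below are made up):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-1.0], [0.5], [2.0]])   # raw model outputs, no sigmoid
target = torch.tensor([[0.0], [1.0], [1.0]])

# BCEWithLogitsLoss applies the sigmoid internally (via log-sum-exp),
# which is the numerically stable, autocast-safe form
loss = nn.BCEWithLogitsLoss()(logits, target)

# equivalent to the sigmoid + nn.BCELoss combination it replaces
ref = nn.BCELoss()(torch.sigmoid(logits), target)
print(torch.allclose(loss, ref, atol=1e-6))  # True
```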

The pred calculation looks a bit weird. If you are already using a sigmoid output,

pred = torch.sigmoid(torch.round(z))

would create values in [0.5, 0.73], wouldn’t it?
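That is easy to verify (the probabilities below are made up): for inputs already in [0, 1], torch.round produces only 0 or 1, and sigmoid of those is 0.5 or ~0.7311.

```python
import torch

p = torch.tensor([0.1, 0.49, 0.51, 0.9])   # already-sigmoided outputs in [0, 1]
out = torch.sigmoid(torch.round(p))        # round -> {0, 1}, sigmoid -> {0.5, 0.7311}
print(out)  # tensor([0.5000, 0.5000, 0.7311, 0.7311])
```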

@ptrblck I am using BCEWithLogitsLoss because regular BCE is not half-precision supported (I got an error to that effect when switching to half precision). So I am using BCEWithLogitsLoss and just tacking a sigmoid on. I suspected sigmoid, but I created a float16 tensor and ran it through sigmoid fine on the GPU. Sigmoid on the CPU complains about half precision. I am going to capture every value and find out what exactly is the issue.

I have since moved my label smoothing outside of the clause, so I modify y before the autocast block. My label smoothing is just basic: I am modifying y in a simple way, nothing more to it than that. For the loss it's BCEWithLogitsLoss.
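For concreteness, the smoothing from the loop above, sketched with a made-up epsilon of 0.1:

```python
import torch

eps = 0.1                          # hypothetical param['label_smoothing'] value
y = torch.tensor([0., 1.])         # hard binary labels
y_smo = y * (1 - eps) + 0.5 * eps  # 0 -> 0.05, 1 -> 0.95
print(y_smo)  # tensor([0.0500, 0.9500])
```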

How stupid of me! When switching from BCE to BCEWithLogitsLoss I mixed it up, and I am doing sigmoid(round(x)) instead of round(sigmoid(x))!!!
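For reference, a minimal sketch of the corrected prediction and accuracy count (the logits and labels below are made up):

```python
import torch

z = torch.tensor([[-1.2], [0.7], [2.5]])  # raw logits from the model
y = torch.tensor([0., 1., 1.])            # ground-truth labels

# round(sigmoid(logits)) thresholds the probabilities at 0.5 -> hard 0/1 predictions
pred = torch.round(torch.sigmoid(z))
correct = (pred.cpu() == y.cpu().unsqueeze(1)).sum().item()
print(correct)  # 3
```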
