Noob question about how Loss functions work in PyTorch

qavvv · February 11, 2020, 6:21am

I am new to PyTorch and am transitioning from TF/Keras, so I apologize if my question is very basic.

I have started with some tutorials, and in one of them I have the following training loop function, which iterates over the dataset, trains the model for 1 epoch, computes the loss, and updates the weights.

So far so good. The problem is that when I try to return the loss value from this function and print it after each epoch (see below), the model no longer seems to train.

    for epoch in range(EPOCHS):
        train_loop_fn(train_data_loader, model, optimizer, device, scheduler)
        # loss = train_loop_fn(train_data_loader, model, optimizer, device, scheduler)
        outputs, targets = eval_loop_fn(valid_data_loader, model, device)
        
        spear = []
        for jj in range(targets.shape[1]):
            p1 = list(targets[:, jj])
            p2 = list(outputs[:, jj])
            
            coef, _ = np.nan_to_num(stats.spearmanr(p1, p2))
            spear.append(coef)
        
        spear_mean = np.mean(spear)
        
        # print(f'spear={spear_mean}\tLoss:{loss.item()}')
        print(f'spear={spear_mean}')

When I don’t return the loss, I can clearly see that the model is training, because spear_mean keeps increasing epoch by epoch. But after I uncomment the two commented lines above and try to print loss.item(), training no longer follows any pattern: I get seemingly random values of spear_mean and loss.item() in every epoch. I don’t see what is going wrong here or why returning the loss value would break things. Any help is greatly appreciated. Thank you in advance.

ptrblck · February 11, 2020, 6:26am

If you didn’t change the training loop besides returning the loss, nothing should change.
How reproducible is this effect? I.e. if you compare the initial code to the modified one (with returned loss) for e.g. 10 runs, what’s the mean and stddev of the final accuracy?
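
A minimal sketch of that comparison, assuming the final metric from each run has been collected into a plain list. The helper name `summarize_runs` and the numbers are hypothetical placeholders, not values from this thread.

```python
import statistics


def summarize_runs(final_scores):
    """Return (mean, sample stddev) of the final metric over repeated runs.

    Hypothetical helper; needs at least two runs for the sample stddev.
    """
    return statistics.mean(final_scores), statistics.stdev(final_scores)


# Placeholder values for illustration only, NOT measurements from the notebook.
placeholder_scores = [0.1, 0.2, 0.3]

mean_score, std_score = summarize_runs(placeholder_scores)
print(f"mean={mean_score:.3f} stddev={std_score:.3f}")
```

Running this once for the original code and once for the modified code would show whether the gap between them is larger than the run-to-run noise.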

qavvv · February 11, 2020, 3:50pm

Thank you for your quick reply. It is reproducible. See https://www.kaggle.com/mhviraf/qa-bert-pytorch?scriptVersionId=28476451; versions 2 and 8 work perfectly fine, but versions 3 through 7 don’t. I don’t know how familiar you are with Kaggle notebooks, but if you click the “8 commits” link at the top left of the page, you can see the difference between versions 7 and 8.

I ran the original version twice, and both times mean_spearman went up to 0.54 after about 10 epochs. I ran the modified version, where the loss is returned, 5 times, and every time spear_mean jumped back and forth between -0.05 and 0.16 across epochs.

One thing that is very suspicious is that when I return the loss, it iterates over epochs very fast: the runtime drops to about 10 sec/epoch, while the original code takes about 400 sec/epoch.

ptrblck · February 12, 2020, 6:36am

Thanks for the notebook!
The indentation of your return loss statement is wrong: it sits inside the batch loop, so the function returns after a single iteration, which also explains the runtime difference.
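
To illustrate the effect, here is a framework-free sketch rather than the notebook's actual code. The function names, `fake_step`, and the five-element `batches` are all hypothetical; `step_fn` stands in for the forward pass, backward pass, and optimizer step on one batch.

```python
def train_one_epoch_buggy(batches, step_fn):
    """Hypothetical loop with `return` indented inside the batch loop."""
    for batch in batches:
        loss = step_fn(batch)  # stand-in for forward/backward/optimizer step
        return loss            # BUG: exits the function after the first batch


def train_one_epoch_fixed(batches, step_fn):
    """Same loop with `return` dedented so it runs after all batches."""
    loss = None
    for batch in batches:
        loss = step_fn(batch)
    return loss  # loss of the last batch, reached only after the full epoch


seen = []  # records which batches were actually processed


def fake_step(batch):
    """Hypothetical training step: logs the batch and returns a dummy loss."""
    seen.append(batch)
    return float(batch)


batches = range(5)  # hypothetical stand-in for a data loader with 5 batches

train_one_epoch_buggy(batches, fake_step)
buggy_count = len(seen)

seen.clear()
train_one_epoch_fixed(batches, fake_step)
fixed_count = len(seen)

print(buggy_count, fixed_count)  # prints "1 5"
```

With the misplaced `return`, each "epoch" trains on only one batch, which is consistent with both the much shorter runtime and the metric no longer improving.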