torch.no_grad() affecting outputs/loss

Hi all. I am running an autoencoder-type model with MSELoss at the end. The problem arose when I noticed that my training loss was on the order of 100k while my validation loss was around 0.8. As a sanity check I ran the validation code on the training set and still saw the same dramatic difference. I also tried removing all my batch norm layers as another sanity check; that didn't change anything.

However, when I removed the with torch.no_grad(): block and re-ran the validation code, my errors started to make sense (for the validation data as well as the training data). This looks like a PyTorch problem, though I may be misusing something. I've used torch.no_grad() before and it worked fine; the only difference is that I usually use Python 3, whereas right now I'm on Python 2.7 with PyTorch 0.4.1.

For now I'll remove the no_grad() scope, but I would like to keep it since it speeds up the validation computation. Does anyone have any ideas?
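
For reference, here is a minimal sanity check (standalone, not from my setup) of what no_grad() is supposed to do: it skips building the autograd graph but should leave the forward values of a deterministic module untouched:

import torch

lin = torch.nn.Linear(3, 3).eval()  # deterministic module in eval mode
x = torch.randn(2, 3)

y = lin(x)
with torch.no_grad():
    y_ng = lin(x)  # same forward computation, no graph recorded

print(torch.allclose(y, y_ng))               # True: values are unchanged
print(y.requires_grad, y_ng.requires_grad)   # True False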

Edit: Using miniconda2

UPDATE:

After looking carefully at the outputs, it seems the loss computed inside the with torch.no_grad(): scope is actually the correct one, and that outside it both MSELoss and L1Loss return a summed loss rather than a mean. My outputs are of size (bsize, 5000, 3) and the inputs are the same size; when I take the L1 or MSE loss and divide it by bsize x 5000 x 3, the result is on the order of what I see inside the no_grad() scope.
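
As a sketch of the arithmetic (a standalone snippet using the modern reduction argument; in 0.4.x the equivalent switches were size_average/reduce):

import torch
import torch.nn.functional as F

bsize = 4
out = torch.randn(bsize, 5000, 3)
target = torch.randn(bsize, 5000, 3)

mean_loss = F.mse_loss(out, target)                  # default: mean over all elements
sum_loss = F.mse_loss(out, target, reduction='sum')  # summed over all elements

print(torch.allclose(sum_loss / (bsize * 5000 * 3), mean_loss))  # True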

This is a bigger problem and I am not sure how to solve it. I don't really want to just divide by the size and dimensions of the input; that seems too hacky.

Could you post a small code snippet showing this problem?
I would like to reproduce this issue, as it might be a bug, but couldn't reproduce it so far.

for epoch in range(n_epochs):
    model.train()
    tloss = []
    for b in range(train_data.shape[0] // bsize):
        tx = torch.from_numpy(train_data[b*bsize:b*bsize+bsize]).to(device)
        tx_hat = model(tx, ts, tD, tU)  # the input (tx) is of size (bsize, 5000, 3), and output is the same
        # note: nn losses are documented as loss_fn(input, target); the model output
        # is passed as the target here, which may matter: 0.4.x dispatches to a
        # different internal code path when the target requires grad
        loss = loss_fn(tx, tx_hat)
        optim.zero_grad()
        loss.backward()
        optim.step()
        scheduler.step()
        tloss.append(loss.item())
        
    # validate
    model.eval()
    vloss = []
    with torch.no_grad(): 
        for b in range(val_data.shape[0] // bsize):
            # deliberately reusing train_data here, as part of the sanity check described above
            tx = torch.from_numpy(train_data[b*bsize:b*bsize+bsize]).to(device)
            tx_hat = model(tx,ts,tD,tU)
            loss = loss_fn(tx,tx_hat)
            vloss.append(loss.item())
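
One quick way to localize a discrepancy like this (a hypothetical diagnostic, not part of the original code) is to compare the library loss against a manual mean on detached values inside the training loop:

with torch.no_grad():
    manual = ((tx_hat - tx) ** 2).mean()  # element-wise mean, computed by hand
print(loss.item(), manual.item())         # should agree if the loss is really a mean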

It started working fine when I wrote my own L1 loss:

def loss_fn(out, targets):
    # summed absolute error, averaged over every element
    return torch.abs(out - targets).sum() / (out.size(0) * out.size(1) * out.size(2))
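
For what it's worth, that denominator is just out.numel(), and on plain tensors the result should match the default, mean-reduced nn.L1Loss:

import torch

out = torch.randn(2, 5000, 3)
targets = torch.randn(2, 5000, 3)

manual = torch.abs(out - targets).sum() / out.numel()
builtin = torch.nn.L1Loss()(out, targets)  # default reduction is the mean
print(torch.allclose(manual, builtin))     # True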

I have also never seen this problem before, and when I just generated random tensors of the same size and applied the L1Loss() function I got sensible answers. The fact that it works correctly inside the torch.no_grad() scope makes me believe it must be some weird bug. My network is kind of complicated and involves a lot of slicing, but I can't share the details publicly, sorry.

Edit: I also used the same model in Python 3 (but on a different dataset) and was getting sensible outputs, so I think it might be Python 2 specific.

I recently stumbled upon the same issue.
The exact same training and validation procedure, same parameters, same network, same data (as a sanity check), and yet the training loss was three orders of magnitude higher than the validation loss, e.g. 50 during training and 0.05 in validation.

While trying to figure out what is happening, I tried the following:

  • As mentioned above, removing torch.no_grad() during validation did bring the two losses into the same range, namely the larger one.
  • Calculating the training error with a custom MSE function instead of F.mse_loss() brought them both to the smaller value.

What I am working on right now involves a 3-step training procedure, and I am really not sure what causes this behavior, as I never encountered the problem before the last step. The only difference is that the network for the first two steps has no recurrent layer and is not required to do multi-step prediction by feeding its own predictions back in as the next inputs.

Could it be related to the fact that I am concatenating the recursive predictions and then applying the MSE function to that concatenated structure?
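
A quick check of that hypothesis on plain tensors (nothing from the actual model): with equal-sized steps, the MSE of the concatenation is just the mean of the per-step MSEs, so concatenation by itself should not change the scale:

import torch
import torch.nn.functional as F

preds = [torch.randn(30) for _ in range(4)]    # four 30-element prediction steps
targets = [torch.randn(30) for _ in range(4)]

cat_loss = F.mse_loss(torch.cat(preds), torch.cat(targets))
per_step = sum(F.mse_loss(p, t) for p, t in zip(preds, targets)) / 4

print(torch.allclose(cat_loss, per_step))  # True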

I think the important thing to note is that the larger value (for me at least) was the incorrect one, and it is probably the same in your case. So the easiest fix is to use your own custom loss function (which is what I did, and it's been working ever since).

I guess you're right, but it still strikes me as very peculiar, particularly since it doesn't look like you had the same "conditions" as me.
We might be missing something.

Could you check the shapes of the output and target you are passing to nn.MSELoss?
We recently had some issues regarding unwanted broadcasting, which ran without an error but calculated a wrong loss value.
E.g. if your output is [10, 1] while your target is [10], the intermediate difference will be broadcast to [10, 10], which is probably not what you expect.
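
To illustrate with hypothetical shapes matching the example above:

import torch
import torch.nn.functional as F

output = torch.randn(10, 1)  # e.g. a model output that was never squeezed
target = torch.randn(10)

print((output - target).shape)                       # torch.Size([10, 10]) -- silent broadcasting
print(F.mse_loss(output, target).item())             # averages 100 pairwise terms (newer versions warn)
print(F.mse_loss(output.squeeze(1), target).item())  # the intended element-wise loss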

I checked, but the sizes are consistent: [150] for the network output and [150] for the target value.

That's weird. As far as I understand, the training and validation losses differ if you use nn.MSELoss but are similar if you implement the loss function manually. Is that correct?
If so, could you post your implementation, and if possible a small code snippet reproducing this issue?

I tried to remove the unnecessary parts; I hope I didn't make it more confusing than it has to be.
For the following two functions, both using F.mse_loss() and running on exactly the same data, I get the output below:

def train_multistep_single_block(training_dataset_dict, testing_dataset_dict, optimizer, network, epochs, scaler):

    # Setting up parameters
    block_size = 10
    training_losses = []
    validation_losses = []
    horizon = 5
    network.train()  # note: set only once; after the first validation round calls
                     # network.eval(), the network stays in eval mode for later epochs

    for epoch in range(epochs):
        dataset_losses = []

        for d_idx, dataset1 in training_dataset_dict.items():

            # note : dataset related initialization omitted
            
            block_losses = []

            for block in range(1, horizon-1):

                # note: input pre-processing omitted

                block_input = torch.tensor(current_relative).view(60)
                control_input = torch.tensor(current_relative[:,3:]).view(30)

                pred_out = torch.zeros(30)
                real_out = torch.zeros(30)

                optimizer.zero_grad()
                net_hidden = network.initHidden()

                # Initializing the recurrent inputs
                # (torch.tensor(existing_tensor) copies and detaches; clone().detach()
                # is the recommended spelling)
                rec_in = torch.tensor(block_input)
                rec_ffo = torch.tensor(control_input)

                for bl_ind in range(1, horizon):

                    rec_in = rec_in.view(60)
                    # re-wrapping rec_in in torch.tensor() detaches it, so gradients
                    # cannot flow back through earlier prediction steps
                    out = network(torch.tensor(rec_in).float(), rec_ffo.float(), net_hidden)
                    pred_out = torch.cat((pred_out,out))

                    # note: pre-processing of the target value is omitted

                    target = torch.tensor(next_relative[:,:3]).float()
                    real_out = torch.cat((real_out,target.view(30)))

                    # Reconstructing inputs based on the previous prediction
                    rec_in = torch.cat((out.view(10,3), torch.tensor(next_relative[:,3:]).float()), 1)
                    rec_ffo = torch.tensor(next_relative[:,3:]).view(30)

                # note: F.mse_loss is documented as (input, target); the prediction
                # (pred_out) is passed as the target here, and the torch.tensor()
                # re-wrapping copies and detaches both arguments
                loss = F.mse_loss(torch.tensor(real_out), torch.tensor(pred_out))
                # loss = custom_loss.mse(torch.tensor(real_out), torch.tensor(pred_out))
                block_losses.append(loss.item())
                loss.backward()

                optimizer.step()
                data_past = current_block

            d_loss = sum(block_losses)/len(block_losses)
            print('(%d, %f)' %(d_idx,d_loss))
            dataset_losses.append(d_loss)

        epoch_loss = sum(dataset_losses)/len(dataset_losses)
        training_losses.append(epoch_loss)
        print('epoch loss',epoch, epoch_loss)

        if epoch % 10 == 0:
            print("----------------------------------------------------")
            print("\t Validation round.")
            print("----------------------------------------------------")
            validation_losses.append(testing_multi_step_prediction(testing_dataset_dict, network, scaler))

    return training_losses, validation_losses

def testing_multi_step_prediction(testing_dataset_dict, network, scaler):

    network.eval()
    dataset_losses = []
    block_size = 10
    horizon = 5
    
    with torch.no_grad():
        for d_idx, dataset1 in testing_dataset_dict.items():

            # note : dataset related initialization omitted

            block_losses = []

            for block in range(1, tot_blocks-horizon-1):

                # note: input pre-processing omitted

                block_input = (torch.tensor(current_relative)).view(60)
                control_input = (torch.tensor(current_relative[:,3:])).view(30)

                pred_out = torch.zeros(30)
                real_out = torch.zeros(30)

                net_hidden = network.initHidden()
                
                # Initializing the recurrent inputs
                rec_in = torch.tensor(block_input)
                rec_ffo = torch.tensor(actual_forces)

                for bl_ind in range(1, horizon):

                    rec_in = rec_in.view(60)
                    out = network(torch.tensor(rec_in).float(), rec_ffo.float(), net_hidden)

                    pred_out = torch.cat((pred_out,out))
                    
                    # note: pre-processing of the target value is omitted

                    target = torch.tensor(next_relative[:,:3]).float()
                    real_out = torch.cat((real_out,target.view(30)))
                      
                    # Reconstructing inputs based on the previous predictions
                    rec_in = torch.cat((out.view(10,3), torch.tensor(next_relative[:,3:]).float()), 1)
                    rec_ffo = torch.tensor(next_relative[:,3:]).view(30)

                # same (input, target) ordering caveat as in the training function
                loss = F.mse_loss(torch.tensor(real_out), torch.tensor(pred_out))
                block_losses.append(loss.item())

                data_past = current_block        
            d_loss = sum(block_losses)/len(block_losses)
            print('(%d, %f)' %(d_idx,d_loss))
            dataset_losses.append(d_loss)
            
    validation_loss = sum(dataset_losses)/len(dataset_losses)
    print("Mean %d-step prediction error is :  %s" % (horizon, validation_loss))
    return validation_loss

Output:

(0, 53.842871)
(1, 5.945963)
(2, 7.037170)
(3, 6.868914)
(4, 3.127133)
(5, 13.591754)
(6, 10.479313)
(7, 3.686508)
(8, 4.541882)
(9, 3.013874)
(10, 36.150578)
(11, 26.826874)
(12, 20.820959)
(13, 8.622623)
(14, 14.105398)
('epoch loss', 0, 14.57745425715238)
----------------------------------------------------
	 Validation round.
----------------------------------------------------
(0, 0.626295)
(1, 0.123750)
(2, 0.321406)
(3, 0.196078)
(4, 0.119878)
(5, 0.228515)
(6, 0.172134)
(7, 0.085186)
(8, 0.151922)
(9, 0.049054)
(10, 0.337698)
(11, 0.206890)
(12, 0.219769)
(13, 0.109976)
(14, 0.109200)
Mean 5-step prediction error is :  0.249765358297

Here each line of the output is of the form (dataset_id, dataset_loss).
When I use this function instead:

def mse(prediction, target):
    loss = (prediction - target)**2
    # for a 1-D tensor, dividing the sum by len(target) is exactly the mean
    ls = loss.sum() / len(target)
    return ls

The losses are in the same range.
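
As an aside, on plain detached 1-D tensors the two implementations should agree exactly (a quick check using the mse function above, not from the original thread), which makes the discrepancy all the more puzzling; loss.sum() / len(target) is exactly the mean for a [150]-shaped tensor:

import torch
import torch.nn.functional as F

pred = torch.randn(150)
target = torch.randn(150)

print(torch.allclose(mse(pred, target), F.mse_loss(pred, target)))  # True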