Loss Changes with torch.no_grad() ?!?

Daniel_Dsouza · November 28, 2018, 5:53pm

Hi,
I’m using an _evaluate_model method I wrote up to try and log the training, dev and test data loss during each epoch.

The function is :

def _evaluate_model(some_datacut, name_of_datacut):
    curr_epoch_loss = 0
    with torch.no_grad():
        for sentence, tags in some_datacut:
            sentence_in = prepare_sequence(sentence, word_to_ix)
            targets = torch.LongTensor([tag_to_ix[t] for t in tags])

            loss = model.neg_log_likelihood(sentence_in, targets)
            curr_epoch_loss += loss
    print(name_of_datacut + " Loss ", curr_epoch_loss.item())
    return

I was seeing something funny when printing the values out.
The EPOCH loss is on the training dataset.
The 3 evaluate calls on the training data are just to show how the loss values keep changing, even though I iterate through the same training data and the model does not change.

for epoch in range(300):
    _train_model(epoch)
    _evaluate_model(training_data, "TRAINING_1")
    _evaluate_model(training_data, "TRAINING_2")
    _evaluate_model(training_data, "TRAINING_3")
    _evaluate_model(dev_data, "DEV")
    _evaluate_model(testing_data, "TESTING")

I get this!

Epoch 1 Loss : 26.947420120239258
TRAINING_1 Loss 26.533092498779297
TRAINING_2 Loss 26.546777725219727
TRAINING_3 Loss 26.50927734375
DEV Loss 8.607915878295898
TESTING Loss 9.82412338256836

Epoch 2 Loss : 26.41505241394043
TRAINING_1 Loss 25.940399169921875
TRAINING_2 Loss 26.062667846679688
TRAINING_3 Loss 25.959278106689453
DEV Loss 8.684854507446289
TESTING Loss 9.67626667022705

Epoch 3 Loss : 26.273456573486328
TRAINING_1 Loss 25.535442352294922
TRAINING_2 Loss 25.73399543762207
TRAINING_3 Loss 25.672330856323242
DEV Loss 8.726755142211914
TESTING Loss 9.62713623046875

I was expecting the EPOCH loss and the 3 iterations of the Training loss to be exactly equal! Help!

dpernes · November 28, 2018, 5:56pm

Are you computing the training loss in a single forward pass or are you using mini-batches?

Daniel_Dsouza · November 28, 2018, 5:58pm

Single Forward pass. It’s actually just 2 sentences in the training data. A toy example to build out the architecture. There’s no mini-batching happening!

Do tell how that’d affect it, though. I’d love to know why you asked me that.

ptrblck · November 28, 2018, 9:09pm

Are you using any nn.BatchNorm or nn.Dropout layers?
These layers might change the loss even during evaluation.
Call model.eval() before the evaluation and check the losses again.
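To illustrate the suggestion above, here is a minimal sketch, assuming a hypothetical toy model (a linear layer followed by dropout) rather than the tagger from the question. It shows that torch.no_grad() only disables gradient tracking, while model.eval() is what switches Dropout off:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model (an assumption, not the poster's tagger).
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.ones(4, 8)

# Training mode: dropout samples a fresh random mask on every forward pass,
# and torch.no_grad() does not change that.
model.train()
with torch.no_grad():
    train_a = model(x)
    train_b = model(x)

# Evaluation mode: dropout acts as the identity, so repeated passes agree.
model.eval()
with torch.no_grad():
    eval_a = model(x)
    eval_b = model(x)

same_in_train = torch.equal(train_a, train_b)
same_in_eval = torch.equal(eval_a, eval_b)
print("identical outputs in train mode:", same_in_train)
print("identical outputs in eval mode: ", same_in_eval)
```

If the two evaluation passes still differ after model.eval(), the randomness is probably coming from somewhere other than these layers.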

dpernes · November 29, 2018, 11:25am

Well, a common (erroneous) approach to computing the loss over the dataset is to sum the losses of all minibatches and divide this sum by the total number of minibatches. Unless the size of the dataset is a multiple of the batch size, this will not be the same as summing the loss for each example in the dataset and dividing this sum by the total number of examples. Moreover, if minibatches are sampled randomly, computing the loss using the former approach may produce different results in two consecutive calls.

Toy example:

Consider a dataset with N = 3 examples A, B, C with losses LA = 0.5, LB = 0.1, LC = 1.1 and a batch size of M=2. Note that N is not a multiple of M.

Scenario 1: The first minibatch is (A, B) and the second is C. Thus, the loss of the first minibatch is (0.5+0.1)/2 = 0.3 and the loss of the second is 1.1. Therefore, the average loss per minibatch is (0.3+1.1)/2=0.7

Scenario 2: The first mini batch is (B, C) and the second is A. Using analogous computations, you get an average loss per minibatch of 0.55.
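The two scenarios can be checked with a few lines of plain Python. This is only a sketch of the arithmetic above; the helper names are made up for illustration:

```python
# Per-example losses from the toy example above.
losses = {"A": 0.5, "B": 0.1, "C": 1.1}


def mean(values):
    return sum(values) / len(values)


def mean_of_batch_means(batches):
    """Average the per-minibatch mean losses (the error-prone approach)."""
    return mean([mean([losses[name] for name in batch]) for batch in batches])


def mean_over_examples(batches):
    """Sum every example's loss and divide by the number of examples."""
    return mean([losses[name] for batch in batches for name in batch])


scenario_1 = [("A", "B"), ("C",)]
scenario_2 = [("B", "C"), ("A",)]

for label, batches in (("Scenario 1", scenario_1), ("Scenario 2", scenario_2)):
    print(
        label,
        "| mean of batch means:", round(mean_of_batch_means(batches), 4),
        "| mean over examples:", round(mean_over_examples(batches), 4),
    )
```

The per-example mean is the same for both splits (1.7 / 3, roughly 0.567), while the mean of batch means depends on how the examples happen to be grouped.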

Regarding your problem, make sure you call model.eval() before computing the loss if you are using any Dropout or BatchNorm layers, like @ptrblck said.

Daniel_Dsouza · December 3, 2018, 8:04pm

Alright! So model.eval() made sense for the loss changing due to Dropout (since that layer is bypassed in eval mode), but I initially had no Dropout and calling model.eval() still didn’t fix it for me.

Turns out I was randomly initializing the “initial state” of my LSTM, hence the randomness in the losses for the same data over and over again. I fixed that by initializing with zeros, added a Dropout layer (all just to learn how it works), and then used the eval method, and sure enough I get this!
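The effect of a random initial state can be reproduced in isolation. The following is a minimal sketch, assuming a bare nn.LSTM with arbitrary sizes as a stand-in for the tagger's recurrent layer; the run helper is hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the tagger's recurrent layer; sizes are arbitrary.
lstm = nn.LSTM(input_size=4, hidden_size=3)
lstm.eval()
sentence = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)


def run(random_init):
    """Run the same sentence through the LSTM with a freshly built initial state."""
    make = torch.randn if random_init else torch.zeros
    # Both h_0 and c_0 have shape (num_layers, batch, hidden_size).
    state = (make(1, 1, 3), make(1, 1, 3))
    with torch.no_grad():
        output, _ = lstm(sentence, state)
    return output


random_runs_match = torch.equal(run(True), run(True))
zero_runs_match = torch.equal(run(False), run(False))
print("random initial state, identical outputs:", random_runs_match)
print("zero initial state, identical outputs:  ", zero_runs_match)
```

Neither eval mode nor torch.no_grad() removes this source of randomness, because the random numbers are drawn outside the model's layers.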

The code (from the POS tagging example in the PyTorch docs):

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

def _train(epoch):
    epoch_loss = 0
    for sentence, tags in training_data:
        model.zero_grad()
        model.hidden = model.init_hidden()

        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_in)
        loss = loss_function(tag_scores, targets)
        epoch_loss += loss

        loss.backward()
        optimizer.step()

    print("EPOCH ", epoch, ":", epoch_loss.item())
    return

def _evaluate(dataset):
    epoch_loss = 0
    for sentence, tags in dataset:
        model.zero_grad()
        model.hidden = model.init_hidden()

        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_in)
        loss = loss_function(tag_scores, targets)
        epoch_loss += loss
        # REMOVED THE BACKPROP LOSS CODE FOUND IN _TRAIN METHOD

    print("EPOCH RERUN ON TRAINING :", epoch_loss.item())

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    model.train()
    _train(epoch)
    model.eval()
    _evaluate(training_data)
    _evaluate(training_data)
    print("\n")

And I get this :

EPOCH 0 : 2.2410192489624023
EPOCH RERUN ON TRAINING : 2.171501636505127
EPOCH RERUN ON TRAINING : 2.171501636505127

EPOCH 1 : 2.1357598304748535
EPOCH RERUN ON TRAINING : 2.1666126251220703
EPOCH RERUN ON TRAINING : 2.1666126251220703

EPOCH 2 : 2.1036343574523926
EPOCH RERUN ON TRAINING : 2.1608502864837646
EPOCH RERUN ON TRAINING : 2.1608502864837646

EPOCH 3 : 2.153252363204956
EPOCH RERUN ON TRAINING : 2.1550378799438477
EPOCH RERUN ON TRAINING : 2.1550378799438477

EPOCH 4 : 2.242952823638916
EPOCH RERUN ON TRAINING : 2.1500635147094727
EPOCH RERUN ON TRAINING : 2.1500635147094727

EPOCH 5 : 2.2232353687286377
EPOCH RERUN ON TRAINING : 2.1464695930480957
EPOCH RERUN ON TRAINING : 2.1464695930480957

EPOCH 6 : 2.2370853424072266
EPOCH RERUN ON TRAINING : 2.1424150466918945
EPOCH RERUN ON TRAINING : 2.1424150466918945

EPOCH 7 : 2.152524948120117
EPOCH RERUN ON TRAINING : 2.1402010917663574
EPOCH RERUN ON TRAINING : 2.1402010917663574

EPOCH 8 : 2.0829806327819824
EPOCH RERUN ON TRAINING : 2.137516736984253
EPOCH RERUN ON TRAINING : 2.137516736984253

EPOCH 9 : 2.1681880950927734
EPOCH RERUN ON TRAINING : 2.1328389644622803
EPOCH RERUN ON TRAINING : 2.1328389644622803

Which is great! Since the model backprops after the EPOCH loss is computed, we know that the RERUN loss won’t be the same as the EPOCH loss. We also see that the RERUN loss is deterministic in eval mode!

My last question is: what changes in between epochs?
Why isn’t the current EPOCH loss equal to the RERUN loss from the previous epoch?