Output evaluation loss after every n batches instead of epochs with PyTorch

I have 2 epochs, each with around 150,000 batches. I would like to output the evaluation loss every 10,000 batches.

How can I do so?

My train loop:

best_valid_loss = float('inf')
train_losses=[]
valid_losses=[]

for epoch in range(params['epochs']):
     
      print('\n Epoch {:} / {:}'.format(epoch + 1, params['epochs']))
      
      #train model
      train_loss = train(scheduler, optimizer)
      
      #evaluate model
      valid_loss = evaluate()
      
      #save the best model
      if valid_loss < best_valid_loss:
          best_valid_loss = valid_loss
          torch.save(model.state_dict(), model_file)
      
      # append training and validation loss
      train_losses.append(train_loss)
      valid_losses.append(valid_loss)
      
      print(f'\nTraining Loss: {train_loss:.3f}')
      print(f'Validation Loss: {valid_loss:.3f}')

This is the train() function called above:

def train(scheduler, optimizer):

  t0 = datetime.datetime.utcnow()
  
  model.train()

  total_loss, total_accuracy = 0, 0  
  step_loss = 0
  
  # iterate over batches
  for step, batch in enumerate(train_data_loader):
    
    # progress update after every 1000 batches.
    if step % 1000 == 0 and step != 0:

      # Calculate elapsed time in seconds.
      elapsed = (datetime.datetime.utcnow() - t0).total_seconds()

      # step_loss has accumulated over the last 1000 batches, so average over that window
      print('  Batch {:>5,}  of  {:>5,} in {}. Step loss={}'.format(step, len(train_data_loader), elapsed, step_loss / 1000))
      step_loss = 0

    # push the batch to gpu
    batch = [r.to(device) for r in batch]
 
    textq, maskq, text1, mask1, text2, mask2, labels = batch

    # clear previously calculated gradients 
    model.zero_grad()        

    # get model predictions for the current batch
    v1, v2 = model(textq, maskq, text1, mask1, text2, mask2)

    sim = cos_sim(v1, v2)
    nan_count = torch.count_nonzero(torch.isnan(sim))
    if nan_count.item() > 0:
      print("Oops, have {} nans in similarity".format(nan_count))

    # compute the loss between actual and predicted values
    loss = cosine_embedding_loss(v1, v2, labels, margin=MARGIN)

    # add on to the total loss
    total_loss += loss.item()
    step_loss += loss.item()

    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradients to 1.0; this helps prevent the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()

    scheduler.step()

  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_data_loader)

  #returns the loss
  return avg_loss

Hi,

You should change your train function. If you have an issue doing this, please share your train function and we can adapt it to run evaluation after every few batches. In any case, I think your train function looks something like this:

def train(scheduler, optimizer):
    for x, y in train_loader:
        x = x.to(device)
        y = y.to(device)

        optimizer.zero_grad()
        outs = model(x)
        loss = criterion(outs, y)

        loss.backward()

        optimizer.step()
        scheduler.step()


You can update it to something like this:

def train(scheduler, optimizer):
    for batch_idx, (x, y) in enumerate(train_loader):
        x = x.to(device)
        y = y.to(device)

        optimizer.zero_grad()
        outs = model(x)
        loss = criterion(outs, y)

        loss.backward()

        optimizer.step()
        scheduler.step()

        # run evaluation every log_freq batches
        # (log_freq, best_valid_loss and model_file are assumed to be defined outside this function)
        if batch_idx % log_freq == log_freq - 1:
            valid_loss = evaluate()
            print(f"Batch {batch_idx} / {len(train_loader)} | Valid Loss = {valid_loss}")
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
                torch.save(model.state_dict(), model_file)

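Note that evaluate() is not shown anywhere in this thread; a minimal sketch of what it could look like for the generic example above (assuming a valid_loader, and the same model, criterion and device as in the train snippet) is:

def evaluate():
    model.eval()
    total_loss = 0

    # no gradients needed during validation
    with torch.no_grad():
        for x, y in valid_loader:
            x = x.to(device)
            y = y.to(device)

            outs = model(x)
            total_loss += criterion(outs, y).item()

    # switch back to training mode before returning to the training loop
    model.train()

    return total_loss / len(valid_loader)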

Great, thanks so much! I added the train function in my original post!

I added the following to the train function but it doesn’t work.

  if step % 200 == 0 and not step == 0:
    valid_loss = evaluate() 
    print(f"Batch {step} / {len(train_data_loader)} | Valid Loss = {valid_loss}")
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), model_file)

Would be very happy if you could help me with this one, thanks!

Hello,

What do you mean by "it doesn't work"? Maybe 200 is larger than the number of batches in your dataset; try a smaller value.

The output stays the same as before; the added part doesn't seem to influence the output.
Batch-wise, 200 should work. I changed it to 2 anyway, but there is still no change in the output. Not sure what's wrong at this point…

Nevermind, I think I found my mistake! I added the code outside of the loop :’), now it works, thanks!!
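For anyone who hits the same thing: the evaluation block has to sit inside the for step, batch in enumerate(train_data_loader) loop (e.g. right after scheduler.step()), not after it. A rough sketch, using the 10000-batch interval from the original question; note that since best_valid_loss is assigned here but defined outside train(), the function also needs a global best_valid_loss declaration (or the value passed in and returned) for the comparison and update to take effect:

    # inside the batch loop of train(), right after scheduler.step()
    if step % 10000 == 0 and step != 0:
        valid_loss = evaluate()
        print(f"Batch {step} / {len(train_data_loader)} | Valid Loss = {valid_loss}")
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), model_file)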

Hello,

I am not sure if I understand you, but it seems to me that the code is working as expected; it logs every 100 batches.

Maybe your question is why the loss is not decreasing. If so, you might try changing the learning rate or checking that the architecture is correct. It also seems that you are building a text retrieval system, so check that your batches are drawn correctly.
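If you do experiment with the learning rate, a quick sanity check (a small sketch, assuming a standard torch.optim.lr_scheduler scheduler) is to print the rate that is actually being applied:

# learning rate as last set by the scheduler
print(scheduler.get_last_lr())

# or read it straight from the optimizer's parameter groups
for group in optimizer.param_groups:
    print(group['lr'])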


@omarfoq sorry for the confusion! It works now! :partying_face: I had added the code block outside of the loop, so it was never executed. :sweat_smile: Now everything works, thank you!
