Output evaluation loss after every n batches instead of every epoch with PyTorch

I am training for 2 epochs of around 150,000 batches each. I would like to output the evaluation loss every 10,000 batches.

How can I do so?

My train loop:

best_valid_loss = float('inf')

for epoch in range(params['epochs']):
      print('\n Epoch {:} / {:}'.format(epoch + 1, params['epochs']))
      #train model
      train_loss = train(scheduler, optimizer)
      #evaluate model
      valid_loss = evaluate()
      #save the best model
      if valid_loss < best_valid_loss:
          best_valid_loss = valid_loss
          torch.save(model.state_dict(), model_file)
      # append training and validation loss
      print(f'\nTraining Loss: {train_loss:.3f}')
      print(f'Validation Loss: {valid_loss:.3f}')

This is the train() function called above:

def train(scheduler, optimizer):

  t0 = datetime.datetime.utcnow()

  total_loss, total_accuracy = 0, 0  
  step_loss = 0
  # iterate over batches
  for step, batch in enumerate(train_data_loader):
    # progress update after every 1000 batches.
    if step % 1000 == 0 and not step == 0:

      # Calculate elapsed time in seconds.
      elapsed = (datetime.datetime.utcnow() - t0).total_seconds()

      # step_loss accumulates over the last 1000 batches, so average over 1000
      print('  Batch {:>5,}  of  {:>5,} in {}. Step loss={}'.format(step, len(train_data_loader), elapsed, step_loss / 1000))
      step_loss = 0

    # push the batch to gpu
    batch = [r.to(device) for r in batch]
    textq, maskq, text1, mask1, text2, mask2, labels = batch

    # clear previously calculated gradients
    optimizer.zero_grad()

    # get model predictions for the current batch
    v1, v2 = model(textq, maskq, text1, mask1, text2, mask2)

    sim = cos_sim(v1, v2)
    nan_count = torch.count_nonzero(torch.isnan(sim))
    if nan_count.item() > 0:
      print("Oops, have {} nans in similarity".format(nan_count))

    # compute the loss between actual and predicted values
    loss = cosine_embedding_loss(v1, v2, labels, margin=MARGIN)

    # add on to the total loss
    total_loss += loss.item()
    step_loss += loss.item()

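    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradients to 1.0. It helps in preventing the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()
    # advance the learning-rate schedule (assuming it is stepped once per batch)
    scheduler.step()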


  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_data_loader)

  #returns the loss
  return avg_loss


You should change your train function. If you have trouble doing this, please share the function and we can adapt it to evaluate every few batches. In any case, I assume your train function looks something like this:

def train(scheduler, optimizer):
    for x, y in train_loader:
        x = x.to(device)
        y = y.to(device)
        outs = model(x)
        loss = criterion(outs, y)


You can update it and have something like

def train(scheduler, optimizer):
    global best_valid_loss  # assigned below, so it must be declared global
    for batch_idx, (x, y) in enumerate(train_loader):
        x = x.to(device)
        y = y.to(device)
        outs = model(x)
        loss = criterion(outs, y)

        # evaluate after every log_freq batches
        if batch_idx % log_freq == log_freq - 1:
            valid_loss = evaluate()
            print(f"Batch {batch_idx + 1} / {len(train_loader)} | Valid Loss = {valid_loss}")
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
                torch.save(model.state_dict(), model_file)
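
For reference, here is the same idea as a self-contained sketch in plain Python, without PyTorch, so it runs anywhere. The names `run_epoch`, `train_step`, `evaluate`, `save_checkpoint` and `eval_every` are hypothetical stand-ins for your own functions, not part of any library. It passes the best loss in and returns it, which is one way to avoid relying on a global variable.

```python
def run_epoch(batches, train_step, evaluate, save_checkpoint,
              eval_every, best_valid_loss=float("inf")):
    """Train on every batch and evaluate after every `eval_every` batches.

    All callables are hypothetical placeholders for your own code.
    Returns (average training loss, best validation loss seen so far).
    """
    total_loss = 0.0
    num_batches = 0
    for step, batch in enumerate(batches):
        total_loss += train_step(batch)
        num_batches += 1

        # (step + 1) counts the batches processed so far, so this fires
        # after batch eval_every, 2 * eval_every, and so on.
        if (step + 1) % eval_every == 0:
            valid_loss = evaluate()
            print(f"Batch {step + 1} | Valid Loss = {valid_loss:.3f}")
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
                save_checkpoint()

    avg_loss = total_loss / max(num_batches, 1)
    return avg_loss, best_valid_loss


# Toy usage: 10 fake batches, evaluation after every 4 batches.
valid_losses = iter([0.9, 0.7])
saved = []
avg, best = run_epoch(
    batches=range(10),
    train_step=lambda batch: 1.0,          # pretend every batch has loss 1.0
    evaluate=lambda: next(valid_losses),   # pretend validation improves
    save_checkpoint=lambda: saved.append("checkpoint"),
    eval_every=4,
)
```

One thing to watch in the real PyTorch version: if your `evaluate()` switches the model to evaluation mode with `model.eval()`, call `model.train()` again before continuing with the next batch.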

Great, thanks so much! I added the train function to my original post!

I added the following to the train function but it doesn’t work.

  if step % 200 == 0 and not step == 0:
    valid_loss = evaluate() 
    print(f"Batch {step} / {len(train_data_loader)} | Valid Loss = {valid_loss}")
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), model_file)

Would be very happy if you could help me with this one, thanks!


What do you mean by “it doesn’t work”? Maybe 200 is larger than the number of batches in your dataset; try a smaller value.
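
A quick way to check this, as a rough sketch in plain Python: list the steps at which your condition would fire. The helper `eval_steps` is a hypothetical name; in your code `num_batches` would be `len(train_data_loader)`.

```python
def eval_steps(num_batches, eval_every):
    """Return the step indices where `step % eval_every == 0 and step != 0`."""
    return [s for s in range(num_batches)
            if s % eval_every == 0 and s != 0]


too_large = eval_steps(150, 200)  # interval larger than the number of batches
smaller = eval_steps(150, 50)     # a smaller interval fires inside the epoch
```

If the first list comes back empty for your real batch count, the interval is the problem; otherwise the cause is somewhere else.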

The output stays the same as before. The added part doesn’t seem to influence the output.
With this many batches, an interval of 200 should work. I changed it to 2 anyway, but there is still no change in the output. Not sure what’s wrong at this point…

Never mind, I think I found my mistake! I added the code outside of the loop :’), now it works, thanks!!
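
For anyone hitting the same thing, here is a small sketch of why the placement matters. It is plain Python with hypothetical function names: when the check sits after the loop, it runs at most once per call, using the final value of `step`.

```python
def count_evals_inside(num_batches, eval_every):
    """Check placed inside the loop: it is tested once per batch."""
    evals = 0
    for step in range(num_batches):
        if step % eval_every == 0 and step != 0:
            evals += 1  # stands in for a call to evaluate()
    return evals


def count_evals_outside(num_batches, eval_every):
    """Check placed after the loop: tested once, with the last value of step."""
    evals = 0
    step = 0
    for step in range(num_batches):
        pass  # training on the batch would happen here
    if step % eval_every == 0 and step != 0:
        evals += 1
    return evals


inside = count_evals_inside(1000, 200)    # fires at steps 200, 400, 600, 800
outside = count_evals_outside(1000, 200)  # last step is 999, so it never fires
```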


I am not sure I understand you, but it seems to me that the code is working as expected: it logs every 100 batches.

Maybe your question is why the loss is not decreasing. If so, I think you should try changing the learning rate or check that the architecture is correct. It also seems that you are trying to build a text retrieval system, so check that your batches are drawn correctly.

@omarfoq sorry for the confusion! It works now! :partying_face: I added the code block outside of the loop, so it was never run. :sweat_smile: Now everything works, thank you!

