Model learns very slowly but what is this solution?!

I stumbled upon this post and it's mostly the usual advice (learning rate too low, etc.), but then I read:

-----
I thus modified the PyTorch implementation by moving the 3 lines optimizer.zero_grad(), loss.backward(), and optimizer.step() inside the (training) batch loop instead of running them after the loop. As a result, this is the edited code that performs training and testing at each epoch:

# Iterate over epochs
for epoch in range(1, n_epochs+1):
    train_loss = 0
    model.train()

    train_predictions = []
    train_true_labels = []

    # Iterate over training batches
    for i, (inputs, labels) in enumerate(train_loader):
        inputs, labels = Variable(inputs).to(device), Variable(labels).to(device)

        optimizer.zero_grad() # set gradients to zero

        preds = model(inputs)
        preds.to(device)

        # Compute the loss and accumulate it to print it afterwards
        loss = loss_criterion(preds, labels)
        train_loss += loss.detach()

        pred_values, pred_encoded_labels = torch.max(preds.data, 1)
        pred_encoded_labels = pred_encoded_labels.cpu().numpy()

        train_predictions.extend(pred_encoded_labels)
        train_true_labels.extend(labels)

        loss.backward()       # backpropagate and compute gradients
        optimizer.step()      # perform a parameter update


    # Evaluate on development test
    predictions = []
    true_labels = []
    dev_loss = 0

    model.eval()
    for i, (inputs, labels) in enumerate(dev_loader):
        inputs, labels = Variable(inputs).to(device), Variable(labels).to(device)

        preds = model(inputs)
        preds.to(device)

        loss = loss_criterion(preds, labels)
        dev_loss += loss.detach()

        pred_values, pred_encoded_labels = torch.max(preds.data, 1)
        pred_encoded_labels = pred_encoded_labels.cpu().numpy()

        predictions.extend(pred_encoded_labels)
        true_labels.extend(labels)

I thus retrained the network using my default configuration (Adam, lr=0.001) and, surprisingly, obtained convergence at epoch 22 (see images below). I think the issue was there, do you agree? Do you have any additional advice? Thanks again!

-----
What? What? So this person reordered optimizer.zero_grad() to the start of the next loop iteration, just after the data point has been loaded? How could this affect convergence at all? That can’t be real, right?
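Just to be explicit about what I think changed, here is the difference as I read it, written as a minimal sketch reusing the names from the quoted code (my reconstruction, not OP’s actual before/after):

# ordering OP seems to have ended up with: gradients zeroed right after the batch is loaded
for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = loss_criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

# the ordering I would have considered equivalent: gradients zeroed at the end of the iteration
for inputs, labels in train_loader:
    loss = loss_criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()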

Just some quick comments:

  • You’re not even stating what you’re training to learn. For example, a basic classifier is generally easier to train than, say, a Variational Autoencoder.

  • Depending on your setup (i.e., what you’re trying to learn), are you sure you’re using the correct loss_criterion? Most likely you are, but it’s an easy trap to fall into.

  • You didn’t post the code of your model. There are a bunch of things that can go wrong even if there’s no error and the model is at least somewhat training (e.g., see this post).

  • Your code looks “old”. From the docs: “The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with requires_grad set to True.”
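For reference, the same inner training step can be written without Variable on any recent PyTorch version. This is only a sketch reusing the names from the quoted code (model, optimizer, loss_criterion, train_loader, device) and assuming a plain classification loss such as nn.CrossEntropyLoss, which expects raw logits and integer class labels:

# one modern training iteration -- tensors are moved to the device directly, no Variable wrapper
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)

    optimizer.zero_grad()
    logits = model(inputs)                    # raw logits; CrossEntropyLoss applies log-softmax itself
    loss = loss_criterion(logits, labels)
    loss.backward()
    optimizer.step()

    train_loss += loss.item()                 # .item() gives a plain Python float for logging
    pred_encoded_labels = logits.argmax(dim=1).cpu().numpy()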

To address the comments:
This question is primarily based on the linked SO post; the (-----)-separated section above is an excerpt from the post illustrating what was really surprising to me.

You’re not even stating what you’re training to learn

a relation classification task for NLP; quoting the SO post:
“convolutional neural network using PyTorch, and I’m trying to select the best hyper-parameters […]”

are you sure you’re using the correct loss_criterion

I am not sure what loss OP used, but the validation loss still seemed to decrease from epoch to epoch.

there’s no error and the model is at least somewhat training

the gripe OP had was mostly about why the model was training so slowly

Your code looks “old”

Yes, the post is from almost 4 years ago according to SO.

My question is mainly about the (-----)-separated section above, which can also be found in context in update 2 of the SO post. OP was allegedly experimenting with moving optimizer.zero_grad() around.

To my understanding, the resulting training loop is semantically equivalent to the more common ordering where the computed gradients are zeroed at the END of each step, yet OP allegedly achieved significantly faster convergence.
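To check that intuition, here is a small self-contained toy comparison (my own example, not from the SO post): starting from identical weights and iterating over the same batches, the two orderings should end up with identical parameters.

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy classification data: 4 batches of 8 samples, 10 features, 3 classes
batches = [(torch.randn(8, 10), torch.randint(0, 3, (8,))) for _ in range(4)]

model_a = nn.Linear(10, 3)
model_b = copy.deepcopy(model_a)                          # identical initial weights
opt_a = torch.optim.Adam(model_a.parameters(), lr=0.001)
opt_b = torch.optim.Adam(model_b.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# variant A: zero the gradients at the START of each iteration
for x, y in batches:
    opt_a.zero_grad()
    criterion(model_a(x), y).backward()
    opt_a.step()

# variant B: zero the gradients at the END of each iteration
# (a fresh model has no accumulated gradients, so the first step matches as well)
for x, y in batches:
    criterion(model_b(x), y).backward()
    opt_b.step()
    opt_b.zero_grad()

# both orderings should yield identical parameters
print(all(torch.allclose(pa, pb)
          for pa, pb in zip(model_a.parameters(), model_b.parameters())))   # True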

Q: is there a misunderstanding on my part, or did OP most likely change something else that actually fixed the problem, so that the reordering of optimizer.zero_grad() was pointless?