Loss increases and decreases randomly. Where am I supposed to call backward on the loss?

I am implementing a dependency parsing model in PyTorch and am a little confused about the situation described below.
I tried different ways of calculating the loss and calling backward on it.

  • When I use the code below as is, with a batch size of 1 (1 batch per iteration):
    • The loss seems to decrease; however, the predictions do not improve after 20 epochs.
  • When I use the code below as is, with a batch size of 100(0):
    • I get an error: RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
    • As the error message suggests, I used inside_loss.backward(retain_graph=True) instead of inside_loss.backward(). With that, execution takes too long, and the loss increases and decreases randomly.
  • When I comment out the inside_loss.backward() call and uncomment the lines after the for loop:
    • The loss does not change.

The code is here:

import numpy as np
import torch

def forward(self, x):
    # x holds the scores of my vocab. It has grad, so I cannot change its values.
    # So take the clone
    x_prime = x.clone()

    # loss = Variable(torch.zeros(1), requires_grad=True)
    loss_v = torch.zeros(1)

    for i in range(x_prime.size(0)):

        # Some operations that change x_prime's values
        # Calculate sentence_probs and sentence_scores from x_prime's values

        eisner_values = eisner_torch(sentence_probs)

        # Changed x_prime above; so update the x
        x = x_prime

        # Get the gold dependencies
        gold_deps = return_gold_deps(i, sentence_to_dependencies)

        if gold_deps is None:
            continue  # assumed: skip sentences without gold dependencies

        mask = np.greater(np.asarray(eisner_values), -1)

        # Calculate hinge loss
        inside_loss = hinge(sentence_scores, eisner_values, gold_deps, mask, 1)

        # Calculate total loss in the batch
        loss_v += inside_loss.data

        # Optimizer step
        if self.opt is not None:
            inside_loss.backward()
            # followed by the optimizer step (these lines are missing from the post)

    # loss.data = loss_v.data
    # loss.backward()
    # Optimizer Step
    return loss_v

I use the Adam optimizer for this task:

model_opt = NoamOpt(model_size=d_model, factor=1, warmup=200,
                    optimizer=torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

What is wrong here?
How can I solve it?

Thanks in advance.

I’m not sure how your code works exactly, but the way you handle the loss is wrong.
Variables have been deprecated since PyTorch 0.4.0, so you can use plain tensors now.
You also don’t need to initialize the loss tensor with requires_grad=True: the loss is usually the output of your criterion and will require gradients as long as you don’t detach the computation graph at some point.
Using .data is dangerous: at the moment you are manipulating the underlying data without Autograd tracking any of these operations.
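
To illustrate the difference, here is a minimal sketch, not your actual model. The linear layer, the random inputs, and the squared-output "loss" are hypothetical stand-ins for your scorer and hinge loss. It only shows that a sum built through .data carries no graph, while a sum of the loss tensors themselves can be backpropagated once for the whole batch:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny linear "scorer" and three per-sentence inputs.
model = torch.nn.Linear(4, 1)
sentences = [torch.randn(4) for _ in range(3)]

# Accumulating through .data: autograd never sees these additions.
loss_data = torch.tensor(0.0)
for s in sentences:
    inside_loss = model(s).pow(2).sum()  # placeholder for the hinge loss
    loss_data += inside_loss.data

# Accumulating the tensors themselves: the sum stays attached to the graph.
loss_graph = torch.tensor(0.0)
for s in sentences:
    inside_loss = model(s).pow(2).sum()
    loss_graph = loss_graph + inside_loss

print(loss_data.requires_grad)   # False
print(loss_graph.requires_grad)  # True

# A single backward call on the summed loss fills the parameter gradients.
loss_graph.backward()
print(model.weight.grad is not None)  # True
```

Both sums hold the same number; only the second one knows how it was computed, so only that one can drive an optimizer step.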

It seems you are calculating inside_loss in a loop, which would explain the “backward a second time” error.
E.g. if you feed some outputs of one iteration as inputs to the next iteration, these tensors will still be connected to the computation graph. The next backward call will thus try to compute the gradients through the complete computation graph, which spans both iterations.
Depending on your use case, you could .detach() these tensors to stop the backward pass at that point.
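
A small sketch of that situation, under the assumption that something is carried over from one loop iteration to the next (the single parameter w and the carried tensor state are made up for illustration and are not taken from your code). The first loop reproduces the error; the second detaches the carried tensor so each backward call only covers the current iteration:

```python
import torch

w = torch.nn.Parameter(torch.tensor(2.0))  # hypothetical one-parameter "model"

# Case 1: the result of iteration 0 is reused in iteration 1, so the second
# backward call has to walk back through the already freed first graph.
state = torch.tensor(1.0)
error_message = None
try:
    for _ in range(2):
        state = state * w            # still attached to earlier iterations
        loss = (state - 1.0) ** 2
        loss.backward()
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# Case 2: detaching the carried tensor cuts the graph between iterations.
w.grad = None
state = torch.tensor(1.0)
for _ in range(2):
    state = state.detach() * w       # gradients stop at the detached value
    loss = (state - 1.0) ** 2
    loss.backward()                  # covers the current iteration only
print(w.grad)
```

Note that the detached tensor is the one carried between iterations, not the loss itself; the loss stays attached to the current iteration's graph, so backward still reaches the parameters.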

@ptrblck, thank you for your answer. I have a few follow-up questions about what you said.

  • inside_loss requires gradients. However, within the batch I sum the inside_loss values to build the “loss” variable, so that loss does not require any gradient. That is why I used that one and then commented it out. What should I do?

  • You say that “the usage of .data is dangerous”. I also use code like x = x_prime; is that dangerous too?

  • And the last one: yes, the “backward a second time” error makes sense. However, if I “detach” the inside_loss tensor, how can I call backward on it? Detaching means “take inside_loss out of the graph”, doesn’t it?

Thank you for your answers.

Have a nice day.