RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn after 22k iterations

Hello everyone, I keep getting this error but couldn't find the reason. The error occurs during the training step, but only after more than 22k iterations of training.

import torch

def train_one_step(data, model, optimizer, device):
    # data is a list of images
    
    model.train()

    avg_loss = torch.zeros((1)).to(device)
    avg_loss2 = torch.zeros((1)).to(device)
    
    x = data[0]
    hidden_state = None

    for i in range(1, len(data)):
        # feed the previous prediction and the current frame through the model
        x, hidden_state, state_loss2 = model(x, data[i], hidden_state, train=True)

        avg_loss += calculate_loss(x, data[i])
        avg_loss2 += state_loss2

    avg_loss /= len(data)
    avg_loss2 /= len(data)

    loss = 10 * avg_loss + avg_loss2

    optimizer.zero_grad()
    print(loss.requires_grad)
    loss.backward()
    optimizer.step()
    
    return avg_loss.item(), avg_loss2.item(), loss.item()

Here calculate_loss is a function that simply computes the mean squared error via torch.mean((x - data[i])**2). As I have previously stated, the model trains successfully for 22k iterations, but afterwards loss.requires_grad suddenly turns to False (I have verified that it was True in the previous iterations).
I have not provided any further info about my model, but are there any reasons for requires_grad to turn to False without manually setting it?
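For completeness, a minimal sketch of the loss helper as described above (the name and signature are assumed from the call site, not copied from my code):

def calculate_loss(prediction, target):
    import torch
    # plain mean squared error; stays differentiable as long as
    # prediction is still attached to the autograd graph
    return torch.mean((prediction - target) ** 2)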

Detaching e.g. the output from the computation graph would set the requires_grad attribute to False, which should not happen “automatically” after a specific number of iterations.
Are you using any conditions inside the forward method that could pick another code path and either directly detach tensors or use non-differentiable operations, or any data-dependent control flow that could do the same?
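For illustration, here is a minimal, hypothetical forward method with such a data-dependent branch (the model and the NaN check are made up to show the failure mode, not taken from your code):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        out = self.linear(x)
        # data-dependent branch: an "unexpected" input takes a path
        # that detaches the output from the computation graph ...
        if torch.isnan(x).any():
            return out.detach()
        # ... while the usual path keeps it attached
        return out

model = ToyModel()
good = torch.randn(2, 4)
bad = torch.full((2, 4), float('nan'))
print(model(good).requires_grad)  # True
print(model(bad).requires_grad)   # False -> backward() would raise the error above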


Hey Piotr, thanks for the swift answer. Since I had no further explanation, I tried to debug it myself. The error was completely on me: I made a mistake while loading the data, and that somehow caused this error. But again, thanks for the help.
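For anyone hitting the same error, one way to catch a bad sample early (rather than after 22k iterations) is to guard the backward call; this is only a debugging suggestion, not part of the original code:

# inside train_one_step, just before loss.backward()
if not loss.requires_grad:
    # a detached/bad sample slipped through the data pipeline;
    # fail loudly with some context instead of the generic autograd error
    raise RuntimeError(
        f"loss got detached from the graph (loss={loss.item():.4f}); "
        "inspect the current batch of input data"
    )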