Weights not updating during training


I’m having a strange issue where not a single one of my model’s parameters is getting updated. I saw another answer that’s similar to the problem I have (Model.parameters() is None while training), but the solution that’s given doesn’t quite solve my problem, because notably every one of my model’s parameter grads is None.

for p in model.parameters():
    if p.grad is not None:
        print(p.grad)

I.e. the above loop won’t print out anything for my model. And even after calling loss.backward() and optimizer.step(), the parameter grads all remain None.

I’ve verified that a loss is being calculated; it just seems that optimization isn’t taking place, even though backward() and step() are definitely being run.
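
A quick way to narrow this kind of problem down is to check whether the loss is attached to a computation graph at all, and which trainable parameters still have no gradient after backward(). The following is only a minimal sketch: toy_model and the helper params_missing_grad are hypothetical stand-ins, not part of the code in this thread.

```python
import torch
import torch.nn as nn

def params_missing_grad(model):
    """Return the names of trainable parameters whose .grad is still None."""
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]

torch.manual_seed(0)
toy_model = nn.Linear(4, 2)             # hypothetical stand-in for the real network
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

missing_before = params_missing_grad(toy_model)   # nothing has a grad yet
loss = nn.CrossEntropyLoss()(toy_model(x), y)
# If grad_fn is None here, the loss is cut off from the parameters
# and backward() cannot reach them.
has_graph = loss.grad_fn is not None
loss.backward()
missing_after = params_missing_grad(toy_model)    # should be empty for a healthy model
print(missing_before, has_graph, missing_after)
```

If missing_after still lists parameters on the real model, the graph is being broken somewhere between those parameters and the loss.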

This is my optimization code:

def step(batch, model, criterion, optimizer=None):
    # let go of old gradients
    model.zero_grad()

    X = batch["X"].to(DEVICE)
    y = batch["y"].to(DEVICE)

    ## Forward Pass ##
    predictions = model(inputs)

    ## Calculate Loss ##
    loss = criterion(predictions, y)

    if optimizer is not None:
        # backward pass + optimize
        loss.backward()
        optimizer.step()
    return loss

def train_model(model=None, lr=0.01):
    criterion = nn.CrossEntropyLoss().to(DEVICE)
    params = list(filter(lambda p: p.requires_grad, model.parameters()))
    optimizer = torch.optim.Adam(params=params, lr=lr)

    for epoch in range(1, N_EPOCHS+1):
        for i, batch in enumerate(tqdm(train_loader)):
            loss = step(batch, model, criterion, optimizer=optimizer)

model = NeuralNet()

The only thing I can think of is that the problem comes from how I’ve split my optimization code into functions, rather than keeping it all in the same scope. I was able to get a toy neural net with toy randn()-generated data to train properly in a single training loop that isn’t spread out across multiple functions (the model was initialized within the same scope it was trained in).

I have a couple of pre-trained torch.nn modules in NeuralNet(): a pre-trained Embedding layer and a pre-trained ResNet, both of which have their weights frozen by setting requires_grad to False.
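
For reference, this is roughly how freezing interacts with the requires_grad filter used in train_model. It is a minimal sketch under assumptions: ToyNet is a hypothetical model with a randomly initialized Embedding standing in for the pre-trained one, and no ResNet. Frozen parameters are expected to keep grad equal to None, while the trainable ones should receive gradients and be updated.

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """Hypothetical model: a frozen embedding followed by a trainable classifier."""
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(10, 4)   # stands in for a pre-trained layer
        self.classifier = nn.Linear(4, 3)
        for p in self.embedding.parameters():
            p.requires_grad = False            # freeze

    def forward(self, tokens):
        return self.classifier(self.embedding(tokens).mean(dim=1))

torch.manual_seed(0)
net = ToyNet()
# Same idea as the filter in train_model: only hand trainable params to the optimizer.
trainable = [p for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.01)

tokens = torch.randint(0, 10, (5, 6))
labels = torch.randint(0, 3, (5,))
emb_before = net.embedding.weight.detach().clone()
clf_before = net.classifier.weight.detach().clone()

loss = nn.CrossEntropyLoss()(net(tokens), labels)
loss.backward()
optimizer.step()

frozen_untouched = torch.equal(net.embedding.weight, emb_before)
classifier_moved = not torch.equal(net.classifier.weight, clf_before)
```

So frozen layers on their own explain some grads being None, but not all of them.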

What could be causing this issue?

Can you try this step function?

def step(batch, model, criterion, optimizer=None):
    x = batch["X"].to(DEVICE)
    y = batch["y"].to(DEVICE)

    ## Forward Pass ##
    predictions = model(inputs)

    ## Calculate Loss ##
    loss = criterion(predictions, y)

    if optimizer is not None:
        # backward pass + optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss

I don’t know whether it makes a difference if you zero the grads with the optimizer’s function or the model’s function, but I usually do it with the optimizer and this works for me.

Thanks for replying! I tried that, and it didn’t make a difference. The parameter grad attributes are all still None.

Where do you get inputs from? Is it a typo and it should be x instead?
Also, remove the if condition on the optimizer and just call loss.backward() and optimizer.step().

Oops, yes. That’s a typo. It was supposed to be x, I tried to make my code more generic.

Sorry, I left this out. The reason I have the if condition is that I reuse the same step() function for validation, except with step(optimizer=None), so that neither loss.backward() nor optimizer.step() gets called.
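
A shared train/validation step along those lines could look like the sketch below. Several things here are assumptions rather than the code from this thread: the toy model and batch are hypothetical, gradient tracking is switched off during validation with torch.set_grad_enabled, the model is put into eval mode when no optimizer is passed, and the function returns loss.item() instead of the loss tensor.

```python
import torch
import torch.nn as nn

def step(batch, model, criterion, optimizer=None):
    training = optimizer is not None
    model.train(training)                     # eval mode when validating
    # Only build a computation graph when we intend to backpropagate.
    with torch.set_grad_enabled(training):
        loss = criterion(model(batch["X"]), batch["y"])
    if training:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()                        # plain float, holds no graph

torch.manual_seed(0)
toy = nn.Linear(4, 2)                         # hypothetical stand-in model
batch = {"X": torch.randn(8, 4), "y": torch.randint(0, 2, (8,))}
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy.parameters(), lr=0.1)

w0 = toy.weight.detach().clone()
val_loss = step(batch, toy, criterion)                          # validation
unchanged_after_val = torch.equal(toy.weight, w0)
train_loss = step(batch, toy, criterion, optimizer=optimizer)   # training
changed_after_train = not torch.equal(toy.weight, w0)
```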

Thanks for the clarification.
Your code seems to work. I’ve created a small dummy example where a very simple model fits some random noise.
Both approaches work, i.e. the plain code and using your functions.
As the code is quite long, I’ve created a gist.
Could you think about other issues, where you might have reset your gradients or the optimizer?

Thank you so much for your help! I also tried training a simple mock network last night and came to a similar conclusion: my optimization code is valid, but my actual (not mock) network definition must be faulty in some way. I’d rather not post my network definition classes (classes that inherit from nn.Module) at the moment. I will create mock nn.Module classes to track down which part of my network is causing the bug, and I’ll post again for future readers once I figure it out.

@ptrblck I discovered the source of the issue!

I’ve had a memory leak that has plagued me while using an LSTM, and my fix was to call copy.copy() (from the Python standard lib) on tensors in a few places. But using copy.copy() appears to have disrupted the end-to-end nature of the network, preventing proper backpropagation. When I removed copy.copy(), backpropagation started working.

But I still have the memory leak problem and can’t complete a single batch if I’m not copy()ing, so this isn’t a perfect solution. Do you have a suggestion for an alternative solution to the memory leak that doesn’t break backpropagation?

A simplified version of my code is in the following gist:

In particular, pay attention to lines 49-52. This is where I’m calling copy.copy().

Thank you again for your help.

Skimming through your code, I couldn’t find anything obviously wrong.

Could your issue be related to this one?

To investigate it further, could you tell me which OS, GPU (if used), and CUDA + cuDNN versions you are using?

Right, and I don’t have the memory leak when I run the code on the CPU, only when I switch to the GPU.

I don’t think this is related to that particular bug, because my leak happens in the forward pass (that issue is specifically about .backward()). My memory leak occurs anytime I accumulate tensors. “Accumulate” could mean, e.g.: assigning a tensor at the index of another tensor (lines 49-52), iteratively summing losses with +=, or appending a tensor to a list. Somehow this accumulation prevents the tensors from being garbage collected.

These are the details of my hardware, drivers, and software:

  • OS
    • Ubuntu 16.04.5 LTS
  • GPU
    • NVIDIA Quadro GP100
  • CUDA
    • 9.0.176
  • cuDNN
    • 6.0.21
  • PyTorch
    • 0.4.0

If this problem comes from below the code as you’re suggesting, is there a short-term fix I can use in the code to get around the memory leak without disrupting backprop?

Perhaps copy.copy() is copying too shallowly, leaving out some important attributes that make backprop work. Do you know what attributes those might be?

Given this hypothesis of shallow copying being the culprit for the backprop bug, I’ve tried to use copy.deepcopy() for lines 49-52. But using copy.deepcopy() leads to the following error:

RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
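
If the goal is a copy of an intermediate (non-leaf) tensor that still takes part in backprop, one option to consider is Tensor.clone(), which is itself a differentiable operation. This is a suggestion not taken from the thread, and the sketch below uses hypothetical toy tensors. Whether it helps with the memory leak is a separate question.

```python
import torch

w = torch.ones(3, requires_grad=True)   # leaf tensor, e.g. a parameter
h = w * 2                               # non-leaf: the result of an operation
h_copy = h.clone()                      # new memory, but still recorded in the graph

h_copy.sum().backward()                 # gradients flow back through the clone to w
print(h.is_leaf, h_copy.grad_fn is not None, w.grad)
```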

Remember that each node you save in a list stores the whole computation graph assigned to it.
This is quite a common issue regarding memory leaks.
E.g. if you save the loss for debugging/printing purposes with losses += loss instead of losses += loss.item(), the whole computation graph will be stored and cannot be freed.
Could you check for these issues?
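
To illustrate the difference, here is a minimal sketch with a hypothetical toy model and random data. The names graph_total and float_total are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 1)      # hypothetical stand-in model
criterion = nn.MSELoss()

graph_total = 0.0   # becomes a tensor that references every iteration's graph
float_total = 0.0   # stays a plain Python float

for _ in range(3):
    loss = criterion(toy_model(torch.randn(8, 4)), torch.randn(8, 1))
    graph_total += loss          # keeps each iteration's graph reachable
    float_total += loss.item()   # copies out the number only

print(type(graph_total), graph_total.grad_fn is not None, type(float_total))
```

Both totals hold the same value, but only the first one keeps the graphs from being freed.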

That makes perfect sense for appending to lists and summing. But what about lines 49-52, where I assign tensors at an index in another tensor? There shouldn’t be a memory leak there, and in that case we do want to preserve the computation graph.
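
As a small check that index assignment does keep the graph, and that collecting the pieces in a list and calling torch.stack gives the same gradients, here is a sketch with hypothetical toy tensors (this is not the code from the gist):

```python
import torch

w = torch.ones(2, requires_grad=True)

# Variant 1: write each step's result into a preallocated tensor.
outputs = torch.zeros(3, 2)
for t in range(3):
    outputs[t] = w * (t + 1)     # in-place assignment is tracked by autograd
outputs.sum().backward()
grad_indexed = w.grad.clone()

# Variant 2: collect results in a list, then stack once at the end.
w.grad = None
stacked = torch.stack([w * (t + 1) for t in range(3)])
stacked.sum().backward()
grad_stacked = w.grad.clone()

print(grad_indexed, grad_stacked)
```

Either way the graph stays alive for as long as the resulting tensor is referenced, which is what we want up to the backward() call.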

For some additional information, when I tried turning off cudnn, the memory leak still occurred:
torch.backends.cudnn.enabled = False

I just wanted to add that I also found that the parameters from my model did not update when I did model2 = copy.deepcopy(model).
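
One plausible explanation, assuming the optimizer was created before the copy (the code isn’t shown, so this is a guess): copy.deepcopy(model) creates brand-new parameter tensors, while the optimizer still holds references to the original model’s parameters. A minimal sketch with a hypothetical toy model:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # bound to model's tensors

model2 = copy.deepcopy(model)          # independent parameter tensors
w2_before = model2.weight.detach().clone()

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model2(x), y)
loss.backward()                        # fills grads on model2 only

# The old optimizer only knows model's parameters, whose grads are None,
# so this step changes nothing in model2.
optimizer.step()
stale = torch.equal(model2.weight, w2_before)

# An optimizer built from the copy's own parameters does update it.
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
optimizer2.step()
updated = not torch.equal(model2.weight, w2_before)
print(stale, updated)
```

If that matches your setup, creating the optimizer after the deepcopy, from model2.parameters(), should fix it.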