Gradient becomes None

Hello everyone. I am facing an issue; let me explain what I am trying to do.
I have a Traffic and Road sign dataset that contains 43 classes, and I am trying to classify the images using the pre-trained resnet34 model. I have an AMD RX 6600 GPU that I use for running the model. To run the model on my AMD GPU I am using PyTorch DirectML, with this code

```python
import torch_directml
dml = torch_directml.device()
```

to find the device. Using this dml instance, I push the model and training data to the GPU. The problem is that the weights do not update. After a lot of debugging, I found that the model's gradients become None in the training loop when using the GPU, but on the CPU it works totally fine.
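Roughly, my setup looks like this (a simplified sketch; the 43-class head replacement and the dummy batch here are just for illustration):

```python
import torch
import torchvision
import torch_directml

dml = torch_directml.device()

# Pre-trained resnet34, with the final layer replaced for 43 classes.
base_model = torchvision.models.resnet34(pretrained=True)
base_model.fc = torch.nn.Linear(base_model.fc.in_features, 43)
base_model = base_model.to(dml)

# A dummy batch stands in for the real dataloader output here.
images = torch.randn(8, 3, 224, 224).to(dml)
labels = torch.randint(0, 43, (8,)).to(dml)
outputs = base_model(images)
```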
When I try to inspect the grad values, I hit this issue: with the model on the CPU, the print statement prints numbers, but when I run the same code on the GPU, an error shows that p.grad has become None. That's why, when I call optimizer.step(), nothing is updated and the model does not learn anything. Can anyone help me with this issue?
Thanks in advance

Hi,
Please try to post a minimal executable snippet, enclosed within ```, that reproduces the error.

Ok, I will. Thank you for the information.

Your training loop is wrong: torch.no_grad() turns off the dynamic graph (gradient tracking), so of course no gradients are updated.

You use torch.no_grad() when you want to test (or validate) the model, not when training it.

Take a look at this tutorial.
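For example, here is a minimal sketch of where torch.no_grad() belongs (toy model and data, just to show the pattern):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the sketch runs on its own.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_loader = DataLoader(TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,))), batch_size=8)
val_loader = DataLoader(TensorDataset(torch.randn(16, 10), torch.randint(0, 2, (16,))), batch_size=8)

# Training: gradients must be tracked, so no torch.no_grad() here.
model.train()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Validation: no parameter updates, so torch.no_grad() skips building
# the autograd graph and saves memory.
model.eval()
with torch.no_grad():
    for inputs, targets in val_loader:
        outputs = model(inputs)
```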

General practice: use as much text as possible instead of images:

  1. Text can be copied and pasted more easily.
  2. Text is more standard and robust, whereas image size and resolution might affect readability.

Ok, so I have figured out something. After one backward pass, I printed the gradient values using this code:

```python
for param in base_model.parameters():
    print((param.grad.data).cpu().sum())
```

When the model is on the CPU, it prints some numbers.

But when the model is on the GPU, this error occurs:

```
tensor(-4.5475e-13)
tensor(2.2737e-13)
tensor(4.5475e-13)
tensor(-1.3588e-06)
tensor(-1.8833e-06)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[91], line 2
      1 for param in base_model.parameters():
----> 2     print((param.grad.data).cpu().sum())

AttributeError: 'NoneType' object has no attribute 'data'
```

The gradients of some model parameters are None. I think this is the root cause of the issue.
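To see which parameters are affected, I can print names alongside a None check instead of crashing on .data (a small sketch using the same base_model):

```python
# Print each parameter's name and its gradient sum; flag the None
# entries instead of raising an AttributeError.
for name, param in base_model.named_parameters():
    if param.grad is None:
        print(name, "-> grad is None")
    else:
        print(name, "->", param.grad.detach().cpu().sum().item())
```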

My manual single-iteration code is given below:

```python
num_classes = 2
num_epochs = 40
learning_rate = 1e-4
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(base_model.parameters(), lr=learning_rate)

# Take only the first batch
for batch in dataloaders['train']:
    batch = {k: v.to(dml) for k, v in batch.items()}
    break

# Forward pass
base_model.train()
outputs = base_model(**batch)
labels = batch['Type']

# Backward pass
loss = criterion(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Check whether any gradients were populated
for param in base_model.parameters():
    print((param.grad.data).cpu().sum())
```

Can you please tell me why this happens when I run the model on the GPU?

Is the model on the GPU? If so, what's the point of .cpu()?

When the model is on the GPU, I cannot print the model parameters directly. That's why I have to use .cpu() to print the values; otherwise it throws an error.
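For now, one way I can verify whether optimizer.step() actually changes anything, without touching .grad at all, is to compare CPU snapshots of the parameters before and after the step (a sketch reusing base_model, batch, criterion, and optimizer from my snippet above):

```python
# Snapshot all parameters on the CPU before the update.
before = [p.detach().cpu().clone() for p in base_model.parameters()]

outputs = base_model(**batch)
loss = criterion(outputs, batch['Type'])
optimizer.zero_grad()
loss.backward()
optimizer.step()

# A nonzero difference means the parameter actually changed.
for prev, p in zip(before, base_model.parameters()):
    print((p.detach().cpu() - prev).abs().sum().item())
```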