Model param.grad is None, how to debug?

Hello @ptrblck,
Can you please review the following snippet for where I might be going wrong?

model = back_bone_model # A 4 Layered CONV model.
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(reduction='mean')
for iteration in range(iterations):
    train_loss, train_acc = 0, 0
    for task in range(meta_batch_size):
        # Copy model parameters. You don't want to train on model in inner loop :)
        learner = deepcopy(model) 
        batch = tasksets.train.sample() 
        loss_val, acc_val = inner_loop(learner, batch, loss_fn, 1, device) # Gradient steps: 1
        train_loss += loss_val/meta_batch_size
        train_acc += acc_val/meta_batch_size
        del learner
    train_loss.backward()
    meta_opt.step()

A few details on the inner_loop().
I am taking a few samples (a batch) and updating the parameters of learner() in the inner_loop(). I am accumulating the loss over a few iterations of these batches, and then updating the weights of model() (main model) using the accumulated loss.
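Roughly, inner_loop() is of this shape (a simplified, hypothetical sketch for illustration, not my exact code):

# Simplified sketch of the inner loop (illustrative only).
# Assumes the sampled batch is an (x, y) pair that can be split into
# support (adaptation) and query (evaluation) samples.
def inner_loop(learner, batch, loss_fn, steps, device):
    x, y = batch
    x, y = x.to(device), y.to(device)
    x_support, y_support = x[::2], y[::2]    # adaptation samples
    x_query, y_query = x[1::2], y[1::2]      # evaluation samples

    inner_opt = torch.optim.SGD(learner.parameters(), lr=1e-2)
    for _ in range(steps):
        inner_opt.zero_grad()
        support_loss = loss_fn(learner(x_support), y_support)
        support_loss.backward()
        inner_opt.step()                     # updates only the copied learner

    query_logits = learner(x_query)
    query_loss = loss_fn(query_logits, y_query)
    query_acc = (query_logits.argmax(dim=1) == y_query).float().mean()
    return query_loss, query_acc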

Now, the graph seems to be breaking somewhere in between, as the model parameters have .grad as None (for all parameters), meaning that train_loss.backward() is not updating the weights. I am not able to figure out where I might be going wrong, so I would really appreciate it if you could point out the mistake in my code. Thank you!

Based on the code snippet it seems that this line of code might be causing the issue:

        # Copy model parameters. You don't want to train on model in inner loop :)
        learner = deepcopy(model) 

which, as its own comment says, means that model won’t be trained.
This example also shows the issue:

import copy

import torch
import torch.nn as nn

# standard training
model = nn.Linear(1, 1)
x = torch.randn(1, 1)
out = model(x)
loss = out.mean()
loss.backward()

for name, param in model.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.1550]])
  bias tensor([1.])

# set .grad attributes to None
model.zero_grad(set_to_none=True)
for name, param in model.named_parameters():
    print(name, param.grad)
> weight None
  bias None

# use deepcopy
learner = copy.deepcopy(model)
out = learner(x)
loss = out.mean()
loss.backward()

for name, param in model.named_parameters():
    print(name, param.grad)
> weight None
  bias None

for name, param in learner.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.1550]])
  bias tensor([1.])

Thanks for replying @ptrblck.

I am trying to implement MAML. As you suggested, deepcopy might be causing the issue, but I need it in my code. Furthermore, when I try the example you added above, the parameters of both model() and learner() get gradients if I don’t set .grad to None. I guess that is because I was not trying to use the loss computed from learner() to update model().

What I am trying to do is update the weights of the deep-copied model learner() using the training samples. I then use these updated weights to compute the loss on a new set of samples. I accumulate this loss and use it to update the weights of the original model model(). Thus, as you might have guessed, the outer loop involves a second-order derivative:
grad(theta - alpha * grad(learner)).

Would I have to store all the gradients of model() before computing the loss on learner()? Or would I somehow be able to maintain two computational graphs that share a common loss? If I am approaching the problem naively, could you point me to a rough sketch?
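Is keeping the inner update inside the graph the right direction? For example, something like this toy sketch (a single inner step on a tiny linear model, just to illustrate what I mean, not my real code):

# Toy second-order sketch: the adapted weights stay connected to theta,
# so the outer (query) loss can backprop all the way into theta.
import torch
import torch.nn as nn
import torch.nn.functional as F

theta = nn.Parameter(torch.randn(1, 1))      # meta-parameters of the toy "model"
meta_opt = torch.optim.SGD([theta], lr=1e-2)
alpha = 0.1                                  # inner-loop learning rate

x_support, y_support = torch.randn(4, 1), torch.randn(4, 1)
x_query, y_query = torch.randn(4, 1), torch.randn(4, 1)

# Inner step: compute adapted weights WITHOUT detaching them from theta.
inner_loss = F.mse_loss(x_support @ theta, y_support)
grad_theta, = torch.autograd.grad(inner_loss, theta, create_graph=True)
theta_adapted = theta - alpha * grad_theta   # still part of theta's graph

# Outer step: the query loss through the adapted weights updates theta itself.
outer_loss = F.mse_loss(x_query @ theta_adapted, y_query)
meta_opt.zero_grad()
outer_loss.backward()
meta_opt.step()
print(theta.grad)                            # now populated via the second-order path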

Once again, thanks for replying!


I feel incredibly stupid for getting stuck on this.

I’m coding an alternate residual block for a CNN that naively learns to weigh the skip connection.
Unfortunately the weight never changes during training (even though loss drops), and upon inspection weight.grad is always None.

What am I doing wrong?

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.conv0 = Conv2d()
        self.conv1 = Conv2d()
        self.conv2 = Conv2d()
        self.bn0 = nn.BatchNorm2d()
        self.bn1 = nn.BatchNorm2d()
        self.bn2 = nn.BatchNorm2d()
        
        self.weight = nn.Parameter(torch.zeros(1))

    def forward(self, inputs):
        # Convolutions
        x = relu(self.bn0(self.conv0(inputs)))
        x = relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))

        # Weighted skip connection
        w = torch.sigmoid(self.weight)
        x = w*x + (1-w)*inputs
        
        return x

Your block seems to work fine and the weight parameter also gets a valid gradient using:

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.conv0 = nn.Conv2d(1, 1, 1)
        self.conv1 = nn.Conv2d(1, 1, 1)
        self.conv2 = nn.Conv2d(1, 1, 1)
        self.bn0 = nn.BatchNorm2d(1)
        self.bn1 = nn.BatchNorm2d(1)
        self.bn2 = nn.BatchNorm2d(1)
        
        self.weight = nn.Parameter(torch.zeros(1))

    def forward(self, inputs):
        # Convolutions
        x = F.relu(self.bn0(self.conv0(inputs)))
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))

        # Weighted skip connection
        w = torch.sigmoid(self.weight)
        x = w*x + (1-w)*inputs
        
        return x


model = Block()
x = torch.randn(1, 1, 4, 4)
out = model(x)
out.mean().backward()
model.weight.grad
> tensor([0.0996])

I’d been trying to debug by sending inputs through the entire model, but your response inspired me to do some unit testing, and I soon discovered a leftover debug line in another module that was causing a detach.
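For anyone who hits the same thing, the offending pattern was roughly of this shape (a reconstructed toy example, not the actual module):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.first = nn.Linear(1, 1)
        self.second = nn.Linear(1, 1)

    def forward(self, x):
        x = self.first(x)
        x = x.detach()      # leftover debug call: the graph is cut here
        return self.second(x)

model = Net()
out = model(torch.randn(1, 1))
out.mean().backward()
print(model.first.weight.grad)   # None -> everything before the detach gets no gradient
print(model.second.weight.grad)  # a tensor -> later layers still train, so the loss drops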
Thanks!

Hi @ptrblck, I have a similar problem. My model’s gradients are always None; I have tried many approaches but cannot solve it. Here is my code:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset


class GNNM(torch.nn.Module):
    def __init__(self):
        super(GNNM, self).__init__()
        self.layer1 = torch.nn.Linear(44, 60)
        self.relu1 = torch.nn.Sigmoid()
        self.layer2 = torch.nn.Linear(60, 11)
        self.relu2 = torch.nn.Sigmoid()
        self.layer4 = torch.nn.Linear(11, 5)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.relu2(x)
        x = self.layer4(x)
        return x


class Data_reader(Dataset):
    def __init__(self, train_fea, train_lab):
        super().__init__()
        self.train_fea = train_fea
        self.train_lab = train_lab

    def __getitem__(self, index):
        fea = torch.tensor(self.train_fea[index]).to(torch.float32)
        lab = int(self.train_lab[index])
        lab = torch.tensor(lab).long() - 1
        return fea, lab

    def __len__(self):
        return int(len(self.train_fea))


for i, (trs_fea, trs_lab) in enumerate(train_loader):
    pred = model(trs_fea)

    loss = F.cross_entropy(pred, trs_lab, reduce='none')
    loss = torch.tensor(loss, requires_grad=True)

    print(loss.is_leaf)
    print('loss', loss)
    print(type(loss))

    loss.requires_grad = True
    loss.retain_grad()
    loss.backward(gradient=loss.grad)
    print('grad', loss.grad)

    for name, param in model.named_parameters():
        if param.grad is not None:
            print('grad')
            print(name, param.grad.sum())
        else:
            print('no grad')
            print(name, param.grad)

    optimizer.step()
    optimizer.zero_grad()

Here is my result:

[screenshot of the printed output]

When you recast the loss onto itself with loss = torch.tensor(loss, requires_grad=True), you break the computational graph (which is why you’re getting None as the gradient).

Try,

        loss = F.cross_entropy(pred,trs_lab,reduce='none')
        loss.backward(gradient=loss.grad) 

and see if param.grad is equal to None.
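To see why, here is a toy sketch (not your exact code): rewrapping a tensor creates a brand-new leaf with no history, so backward never reaches the model:

import torch
import torch.nn as nn

model = nn.Linear(2, 2)
loss = model(torch.randn(1, 2)).mean()
print(loss.grad_fn)          # <MeanBackward0 ...> -> still attached to the graph

rewrapped = torch.tensor(loss, requires_grad=True)  # copies the value only (and warns)
print(rewrapped.grad_fn)     # None -> a brand-new leaf with no history
rewrapped.backward()
print(model.weight.grad)     # None, because backward never reached the model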

Oh, how could I forget that! I will try it, thanks.

@ptrblck could you help me with this problem? Thanks a lot.
The input shape is batch * max_length * 3; however, after loss.backward(), param.grad is None.
The code is as follows:

import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable


class Classifier(nn.Module):
    def __init__(self, max_len, dropout=0.1):
        super().__init__()
        # ua = torch.FloatTensor([1])
        self.para = nn.Parameter(Variable(torch.FloatTensor(np.random.randint(1, 100, size=(2, 1))),
                                          requires_grad=True))

        # c = torch.nn.Parameter(torch.FloatTensor([1]))
        # self.c_var = nn.Parameter(Variable(torch.FloatTensor(np.random.randint(1, 100, size=1)), requires_grad=True))

        self.lstm_layer = torch.nn.LSTM(2, 64, 2, batch_first=True)
        self.linear_1 = nn.Linear(2, 1)
        self.linear_2 = nn.Linear(max_len, 1)

        # self.register_parameter("Ablah", self.para)

    def forward(self, mels):
        # mels: batch * max_len * dimension (LJT)
        # change of the internal temperature
        max_length = mels.shape[1]
        delta_t = torch.diff(mels, axis=1)[:, :, 2]
        delta_t = torch.cat((torch.zeros(mels.shape[0], 1), delta_t), axis=1)

        # J - T  batch * dimension
        in_energy = self.para[0][0] * delta_t + 0 - torch.diff(mels)[:, :, 1] * self.para[1][0]
        to_energy = in_energy.cumsum(dim=1)

        # add a feature dimension
        in_energy = torch.unsqueeze(in_energy, 2)
        to_energy = torch.unsqueeze(to_energy, 2)

        # concatenate the features into one tensor: b * len * 2
        feature_energy = torch.cat((in_energy, to_energy), axis=2)
        # cast to float32
        feature_energy = feature_energy.to(torch.float32)

        # count the NaN elements to recover the real sequence lengths
        real_length_list = max_length - torch.isnan(feature_energy).int().sum(axis=1)[:, 1]
        feature_energy = feature_energy.nan_to_num()

        feature_energy_packed = torch.nn.utils.rnn.pack_padded_sequence(
            feature_energy, real_length_list, batch_first=True, enforce_sorted=False)

        # LSTM input: B * max_len * dimension (batch_first=True)
        # weight_ih_l (W_ii|W_if|W_ig|W_io) of shape (4*hidden_size, input_size)
        # weight_hh_l (W_hi|W_hf|W_hg|W_ho) of shape (4*hidden_size, hidden_size)
        # output shape: (N, L, D*H)
        lstem_out, (h_n, c_n) = self.lstm_layer(feature_energy_packed)
        output, lens = nn.utils.rnn.pad_packed_sequence(lstem_out, batch_first=True, total_length=max_length)
        print('pad_packed_sequence.shape', output.shape)

        # hidden_size = 64
        linear_1 = torch.nn.Linear(64, 1)
        linear_1_out = linear_1(output)  # batch * max_len * 1
        linear_1_out = torch.squeeze(linear_1_out)

        linear_2 = torch.nn.Linear(max_length, 2)
        out = linear_2(linear_1_out)

        return out

Your model is working fine if you use the initialized linear layers instead of recreating them in the forward method:

class Classifier(nn.Module):
    def __init__(self, max_len, dropout=0.1):
        super().__init__()
        self.para = nn.Parameter(torch.FloatTensor(np.random.randint(1, 100, size=(2, 1))),
                                          requires_grad=True)
    
        self.lstm_layer = torch.nn.LSTM(2, 64, 2, batch_first=True)
        self.linear_1 = nn.Linear(64, 1)
        self.linear_2 = nn.Linear(max_len, 1)
    
    def forward(self, mels):
        max_length = mels.shape[1]
        delta_t = torch.diff(mels, axis=1)[:, :, 2]
        delta_t = torch.cat((torch.zeros(mels.shape[0], 1), delta_t), axis=1)
        in_energy = self.para[0][0] * delta_t + 0 - torch.diff(mels)[:, :, 1] * self.para[1][0]
        to_energy = in_energy.cumsum(dim=1)
    
        in_energy = torch.unsqueeze(in_energy, 2)
        to_energy = torch.unsqueeze(to_energy, 2)
    
        feature_energy = torch.cat((in_energy, to_energy), axis=2)
        feature_energy = feature_energy.to(torch.float32)
    
        real_length_list = max_length - torch.isnan(feature_energy).int().sum(axis=1)[:, 1]
        feature_energy = feature_energy.nan_to_num()
    
        feature_energy_packed = torch.nn.utils.rnn.pack_padded_sequence(feature_energy, real_length_list, batch_first=True, enforce_sorted=False)
    
        lstem_out, (h_n, c_n) = self.lstm_layer(feature_energy_packed)
        output, lens = nn.utils.rnn.pad_packed_sequence(lstem_out, batch_first=True, total_length=max_length)
        
        linear_1_out = self.linear_1(output)
        linear_1_out = torch.squeeze(linear_1_out)
        out = self.linear_2(linear_1_out)
    
        return out


model = Classifier(10)

mels = torch.randn(1, 10, 10)
out = model(mels)
out.mean().backward()

for name, param in model.named_parameters():
    print(name)
    print(name, param.grad.abs().sum())

However, the training process is still wrong, and param.grad is None after loss.backward() finishes.

def model_fn(batch, model, criterion, device) -> object:
    '''
    Forward a batch through the model.
    :param batch:
    :param model:
    :param criterion:
    :param device:
    :return:
    '''
    mels, labels = batch
    mels = mels.to(device)

    labels = labels.to(device)
    labels = torch.squeeze(labels)

    outs = model(mels)
    loss = criterion(outs, labels)

    # Get the class id with the highest probability.
    preds = outs.argmax(1)
    # Compute accuracy.
    accuracy = torch.mean((preds == labels).float())

    return loss, accuracy


train_npy, train_tag_np, valid_npy, valid_tag_np = split_train_test(df_all, tag, step1_react - 2, 0.85)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"[Info]: Use {device} now!")

train_loader, valid_loader = get_dataloader(train_npy, train_tag_np,
                                            valid_npy, valid_tag_np, batch_size=12, n_workers=1)

train_iterator = iter(train_loader)
print(f"[Info]: Finish loading data!", flush=True)

maxlen = 100
model = Classifier(maxlen).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=1e-8)
# scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

print(f"[Info]: Finish creating model!", flush=True)
best_accuracy = -1.0
best_state_dict = None

pbar = tqdm(total=valid_steps, ncols=0, desc="Train", unit=" step")

for step in range(total_steps):
    # Get data
    batch = next(train_iterator)
    # except StopIteration:
    #     train_iterator = iter(train_loader)
    #     batch = next(train_iterator)

    mels, labels = batch
    loss, accuracy = model_fn(batch, model, criterion, device)

    batch_loss = loss.item()
    batch_accuracy = accuracy.item()

    # Update model
    loss.backward()
    # optimizer.step is called for every mini-batch
    optimizer.step()
    # scheduler.step would be called once per epoch
    optimizer.zero_grad()

Are you also seeing a None gradient using my code and the proposed fixes?
If not, please share a minimal, executable code snippet which would reproduce the issue.
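For example, a self-contained template of roughly this shape (random inputs, no data files; the model here is only a stand-in) is usually enough:

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))  # replace with your model
x = torch.randn(4, 10)                                                 # random stand-in batch
target = torch.randint(0, 2, (4,))

criterion = nn.CrossEntropyLoss()
loss = criterion(model(x), target)
loss.backward()

for name, param in model.named_parameters():
    print(name, None if param.grad is None else param.grad.abs().sum())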

Thank you for your response!
The problem is not solved.

When loss.backward() finishes, the gradient of self.para is still not usable; it comes out as

tensor([[nan],
        [nan]])

@ptrblck

The other parameters get gradients; only self.para does not get a valid one.

@ptrblck @wml1993


Actually, I have the same issue when training the model in steps with range(0, total_steps). I thought the issue could be solved by setting the initial iteration to 1, like:

for step in range(1, total_steps):

I guessed that loss.backward() has some special condition at step 0 that causes a None gradient for the model parameters, but I still have not figured out the primary cause of this issue. I hope this observation helps you find the reason.

Sorry about the wrong information above; changing the starting iteration does not help. The bug is still there.

No, Autograd doesn’t use a specific condition based on the step.
Could you share a minimal and executable code snippet which would reproduce the issue by wrapping it into three backticks ```, please?

Thanks for your quick reply. Since the code spans many files, can I send the whole project to your email?