Model param.grad is None, how to debug?

Hello @ptrblck,
Can you please review the following snippet for where I might be going wrong?

model = back_bone_model # A 4 Layered CONV model.
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(reduction='mean')
for iteration in range(iterations):
    train_loss, train_acc = 0, 0
    for task in range(meta_batch_size):
        # Copy model parameters. You don't want to train on model in inner loop :)
        learner = deepcopy(model) 
        batch = tasksets.train.sample() 
        loss_val, acc_val = inner_loop(learner, batch, loss_fn, 1, device) # Gradient steps: 1
        train_loss += loss_val/meta_batch_size
        train_acc += acc_val/meta_batch_size
        del learner
    train_loss.backward()
    meta_opt.step()

A few details on the inner_loop().
I am taking a few samples (a batch) and updating the parameters of learner() in the inner_loop(). I am accumulating the loss over a few iterations of these batches, and then updating the weights of model() (main model) using the accumulated loss.
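Roughly, inner_loop() is of this shape (a simplified, hypothetical sketch for illustration, not my exact code):

# Simplified sketch of the inner loop (illustrative only).
# Assumes the sampled batch is an (x, y) pair that can be split into
# support (adaptation) and query (evaluation) samples.
def inner_loop(learner, batch, loss_fn, steps, device):
    x, y = batch
    x, y = x.to(device), y.to(device)
    x_support, y_support = x[::2], y[::2]    # adaptation samples
    x_query, y_query = x[1::2], y[1::2]      # evaluation samples

    inner_opt = torch.optim.SGD(learner.parameters(), lr=1e-2)
    for _ in range(steps):
        inner_opt.zero_grad()
        support_loss = loss_fn(learner(x_support), y_support)
        support_loss.backward()
        inner_opt.step()                     # updates only the copied learner

    query_logits = learner(x_query)
    query_loss = loss_fn(query_logits, y_query)
    query_acc = (query_logits.argmax(dim=1) == y_query).float().mean()
    return query_loss, query_acc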

Now, the graph seems to be breaking somewhere in between, as the model parameters have .grad as None (for all parameters), meaning that train_loss.backward() is not updating the weights. I am not able to figure out where I might be going wrong, so I would really appreciate it if you could point out the mistake in my code. Thank you!

Based on the code snippet it seems that this line of code might be causing the issue:

        # Copy model parameters. You don't want to train on model in inner loop :)
        learner = deepcopy(model) 

which, as its own comment says, means that model won’t be trained.
This example also shows the issue:

import copy

import torch
import torch.nn as nn

# standard training
model = nn.Linear(1, 1)
x = torch.randn(1, 1)
out = model(x)
loss = out.mean()
loss.backward()

for name, param in model.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.1550]])
  bias tensor([1.])

# set .grad attributes to None
model.zero_grad(set_to_none=True)
for name, param in model.named_parameters():
    print(name, param.grad)
> weight None
  bias None

# use deepcopy
learner = copy.deepcopy(model)
out = learner(x)
loss = out.mean()
loss.backward()

for name, param in model.named_parameters():
    print(name, param.grad)
> weight None
  bias None

for name, param in learner.named_parameters():
    print(name, param.grad)
> weight tensor([[-0.1550]])
  bias tensor([1.])

Thanks for replying @ptrblck.

I am trying to implement MAML. As you suggested, deepcopy might be causing the issue, but I need it in my code. Furthermore, when I try the example you added above, the parameters of both model() and learner() get gradients if I don’t set .grad to None. I guess that is because I was not trying to use the loss computed from learner() to update model().

What I am trying to do is update the weights of the deep-copied model learner() using the training samples. I then use these updated weights to compute the loss on a new set of samples. I accumulate this loss and use it to update the weights of the original model model(). Thus, as you might have guessed, the outer loop involves a second-order derivative:
grad(theta - alpha * grad(learner)).

Would I have to store all the gradients of model() before computing the loss on learner()? Or would I somehow be able to maintain two computational graphs that share a common loss? If I am approaching the problem naively, could you point me to a rough sketch?
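Is keeping the inner update inside the graph the right direction? For example, something like this toy sketch (a single inner step on a tiny linear model, just to illustrate what I mean, not my real code):

# Toy second-order sketch: the adapted weights stay connected to theta,
# so the outer (query) loss can backprop all the way into theta.
import torch
import torch.nn as nn
import torch.nn.functional as F

theta = nn.Parameter(torch.randn(1, 1))      # meta-parameters of the toy "model"
meta_opt = torch.optim.SGD([theta], lr=1e-2)
alpha = 0.1                                  # inner-loop learning rate

x_support, y_support = torch.randn(4, 1), torch.randn(4, 1)
x_query, y_query = torch.randn(4, 1), torch.randn(4, 1)

# Inner step: compute adapted weights WITHOUT detaching them from theta.
inner_loss = F.mse_loss(x_support @ theta, y_support)
grad_theta, = torch.autograd.grad(inner_loss, theta, create_graph=True)
theta_adapted = theta - alpha * grad_theta   # still part of theta's graph

# Outer step: the query loss through the adapted weights updates theta itself.
outer_loss = F.mse_loss(x_query @ theta_adapted, y_query)
meta_opt.zero_grad()
outer_loss.backward()
meta_opt.step()
print(theta.grad)                            # now populated via the second-order path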

Once again, thanks for replying!


I feel incredibly stupid for getting stuck on this.

I’m coding an alternate residual block for a CNN that naively learns to weigh the skip connection.
Unfortunately the weight never changes during training (even though loss drops), and upon inspection weight.grad is always None.

What am I doing wrong?

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.conv0 = Conv2d()
        self.conv1 = Conv2d()
        self.conv2 = Conv2d()
        self.bn0 = nn.BatchNorm2d()
        self.bn1 = nn.BatchNorm2d()
        self.bn2 = nn.BatchNorm2d()
        
        self.weight = nn.Parameter(torch.zeros(1))

    def forward(self, inputs):
        # Convolutions
        x = relu(self.bn0(self.conv0(inputs)))
        x = relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))

        # Weighted skip connection
        w = torch.sigmoid(self.weight)
        x = w*x + (1-w)*inputs
        
        return x

Your block seems to work fine and the weight parameter also gets a valid gradient using:

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.conv0 = nn.Conv2d(1, 1, 1)
        self.conv1 = nn.Conv2d(1, 1, 1)
        self.conv2 = nn.Conv2d(1, 1, 1)
        self.bn0 = nn.BatchNorm2d(1)
        self.bn1 = nn.BatchNorm2d(1)
        self.bn2 = nn.BatchNorm2d(1)
        
        self.weight = nn.Parameter(torch.zeros(1))

    def forward(self, inputs):
        # Convolutions
        x = F.relu(self.bn0(self.conv0(inputs)))
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))

        # Weighted skip connection
        w = torch.sigmoid(self.weight)
        x = w*x + (1-w)*inputs
        
        return x


model = Block()
x = torch.randn(1, 1, 4, 4)
out = model(x)
out.mean().backward()
model.weight.grad
> tensor([0.0996])

I’d been trying to debug by sending inputs through the entire model, but your response inspired me to do some unit testing, and I soon discovered a leftover debug line in another module that was causing a detach.
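For anyone who hits the same thing, the offending pattern was roughly of this shape (a reconstructed toy example, not the actual module):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.first = nn.Linear(1, 1)
        self.second = nn.Linear(1, 1)

    def forward(self, x):
        x = self.first(x)
        x = x.detach()      # leftover debug call: the graph is cut here
        return self.second(x)

model = Net()
out = model(torch.randn(1, 1))
out.mean().backward()
print(model.first.weight.grad)   # None -> everything before the detach gets no gradient
print(model.second.weight.grad)  # a tensor -> later layers still train, so the loss drops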
Thanks!

Hi @ptrblck, I have a similar problem. My model’s gradients are always None; I have tried many approaches but cannot solve it. Here is my code:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset


class GNNM(torch.nn.Module):
    def __init__(self):
        super(GNNM, self).__init__()
        self.layer1 = torch.nn.Linear(44, 60)
        self.relu1 = torch.nn.Sigmoid()
        self.layer2 = torch.nn.Linear(60, 11)
        self.relu2 = torch.nn.Sigmoid()
        self.layer4 = torch.nn.Linear(11, 5)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.relu2(x)
        x = self.layer4(x)
        return x


class Data_reader(Dataset):
    def __init__(self, train_fea, train_lab):
        super().__init__()
        self.train_fea = train_fea
        self.train_lab = train_lab

    def __getitem__(self, index):
        fea = torch.tensor(self.train_fea[index]).to(torch.float32)
        lab = int(self.train_lab[index])
        lab = torch.tensor(lab).long() - 1
        return fea, lab

    def __len__(self):
        return int(len(self.train_fea))


for i, (trs_fea, trs_lab) in enumerate(train_loader):
    pred = model(trs_fea)

    loss = F.cross_entropy(pred, trs_lab, reduce='none')
    loss = torch.tensor(loss, requires_grad=True)

    print(loss.is_leaf)
    print('loss', loss)
    print(type(loss))

    loss.requires_grad = True
    loss.retain_grad()
    loss.backward(gradient=loss.grad)
    print('grad', loss.grad)

    for name, param in model.named_parameters():
        if param.grad is not None:
            print('grad')
            print(name, param.grad.sum())
        else:
            print('no grad')
            print(name, param.grad)

    optimizer.step()
    optimizer.zero_grad()

Here is my result:

[screenshot of the printed output]

When you recast the loss onto itself with loss = torch.tensor(loss, requires_grad=True), you break the computational graph (which is why you’re getting None as the gradient).

Try,

        loss = F.cross_entropy(pred,trs_lab,reduce='none')
        loss.backward(gradient=loss.grad) 

and see if param.grad is equal to None.
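To see why, here is a toy sketch (not your exact code): rewrapping a tensor creates a brand-new leaf with no history, so backward never reaches the model:

import torch
import torch.nn as nn

model = nn.Linear(2, 2)
loss = model(torch.randn(1, 2)).mean()
print(loss.grad_fn)          # <MeanBackward0 ...> -> still attached to the graph

rewrapped = torch.tensor(loss, requires_grad=True)  # copies the value only (and warns)
print(rewrapped.grad_fn)     # None -> a brand-new leaf with no history
rewrapped.backward()
print(model.weight.grad)     # None, because backward never reached the model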

Oh, how could I forget that! I will try it, thanks.

@ptrblck could you help me with this problem? Thanks a lot.
The input shape is batch * max_length * 3; however, after loss.backward(), param.grad is None.
The code is as follows:

import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable


class Classifier(nn.Module):
    def __init__(self, max_len, dropout=0.1):
        super().__init__()
        # ua = torch.FloatTensor([1])
        self.para = nn.Parameter(Variable(torch.FloatTensor(np.random.randint(1, 100, size=(2, 1))),
                                          requires_grad=True))

        # c = torch.nn.Parameter(torch.FloatTensor([1]))
        # self.c_var = nn.Parameter(Variable(torch.FloatTensor(np.random.randint(1, 100, size=1)), requires_grad=True))

        self.lstm_layer = torch.nn.LSTM(2, 64, 2, batch_first=True)
        self.linear_1 = nn.Linear(2, 1)
        self.linear_2 = nn.Linear(max_len, 1)

        # self.register_parameter("Ablah", self.para)

    def forward(self, mels):
        # mels: batch * max_len * dimension (LJT)
        # change of the internal temperature
        max_length = mels.shape[1]
        delta_t = torch.diff(mels, axis=1)[:, :, 2]
        delta_t = torch.cat((torch.zeros(mels.shape[0], 1), delta_t), axis=1)

        # J - T  batch * dimension
        in_energy = self.para[0][0] * delta_t + 0 - torch.diff(mels)[:, :, 1] * self.para[1][0]
        to_energy = in_energy.cumsum(dim=1)

        # add a feature dimension
        in_energy = torch.unsqueeze(in_energy, 2)
        to_energy = torch.unsqueeze(to_energy, 2)

        # concatenate the features into one tensor: b * len * 2
        feature_energy = torch.cat((in_energy, to_energy), axis=2)
        # cast to float32
        feature_energy = feature_energy.to(torch.float32)

        # count the NaN elements to recover the real sequence lengths
        real_length_list = max_length - torch.isnan(feature_energy).int().sum(axis=1)[:, 1]
        feature_energy = feature_energy.nan_to_num()

        feature_energy_packed = torch.nn.utils.rnn.pack_padded_sequence(
            feature_energy, real_length_list, batch_first=True, enforce_sorted=False)

        # LSTM input: B * max_len * dimension (batch_first=True)
        # weight_ih_l (W_ii|W_if|W_ig|W_io) of shape (4*hidden_size, input_size)
        # weight_hh_l (W_hi|W_hf|W_hg|W_ho) of shape (4*hidden_size, hidden_size)
        # output shape: (N, L, D*H)
        lstem_out, (h_n, c_n) = self.lstm_layer(feature_energy_packed)
        output, lens = nn.utils.rnn.pad_packed_sequence(lstem_out, batch_first=True, total_length=max_length)
        print('pad_packed_sequence.shape', output.shape)

        # hidden_size = 64
        linear_1 = torch.nn.Linear(64, 1)
        linear_1_out = linear_1(output)  # batch * max_len * 1
        linear_1_out = torch.squeeze(linear_1_out)

        linear_2 = torch.nn.Linear(max_length, 2)
        out = linear_2(linear_1_out)

        return out

Your model is working fine if you use the initialized linear layers instead of recreating them in the forward method:

class Classifier(nn.Module):
    def __init__(self, max_len, dropout=0.1):
        super().__init__()
        self.para = nn.Parameter(torch.FloatTensor(np.random.randint(1, 100, size=(2, 1))),
                                          requires_grad=True)
    
        self.lstm_layer = torch.nn.LSTM(2, 64, 2, batch_first=True)
        self.linear_1 = nn.Linear(64, 1)
        self.linear_2 = nn.Linear(max_len, 1)
    
    def forward(self, mels):
        max_length = mels.shape[1]
        delta_t = torch.diff(mels, axis=1)[:, :, 2]
        delta_t = torch.cat((torch.zeros(mels.shape[0], 1), delta_t), axis=1)
        in_energy = self.para[0][0] * delta_t + 0 - torch.diff(mels)[:, :, 1] * self.para[1][0]
        to_energy = in_energy.cumsum(dim=1)
    
        in_energy = torch.unsqueeze(in_energy, 2)
        to_energy = torch.unsqueeze(to_energy, 2)
    
        feature_energy = torch.cat((in_energy, to_energy), axis=2)
        feature_energy = feature_energy.to(torch.float32)
    
        real_length_list = max_length - torch.isnan(feature_energy).int().sum(axis=1)[:, 1]
        feature_energy = feature_energy.nan_to_num()
    
        feature_energy_packed = torch.nn.utils.rnn.pack_padded_sequence(feature_energy, real_length_list, batch_first=True, enforce_sorted=False)
    
        lstem_out, (h_n, c_n) = self.lstm_layer(feature_energy_packed)
        output, lens = nn.utils.rnn.pad_packed_sequence(lstem_out, batch_first=True, total_length=max_length)
        
        linear_1_out = self.linear_1(output)
        linear_1_out = torch.squeeze(linear_1_out)
        out = self.linear_2(linear_1_out)
    
        return out


model = Classifier(10)

mels = torch.randn(1, 10, 10)
out = model(mels)
out.mean().backward()

for name, param in model.named_parameters():
    print(name)
    print(name, param.grad.abs().sum())

However, the training process is still wrong, and param.grad is None after loss.backward() finishes.

def model_fn(batch, model, criterion, device) -> object:
    '''
    Forward a batch through the model.
    :param batch:
    :param model:
    :param criterion:
    :param device:
    :return:
    '''
    mels, labels = batch
    mels = mels.to(device)

    labels = labels.to(device)
    labels = torch.squeeze(labels)

    outs = model(mels)
    loss = criterion(outs, labels)

    # Get the class id with the highest probability.
    preds = outs.argmax(1)
    # Compute accuracy.
    accuracy = torch.mean((preds == labels).float())

    return loss, accuracy


train_npy, train_tag_np, valid_npy, valid_tag_np = split_train_test(df_all, tag, step1_react - 2, 0.85)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"[Info]: Use {device} now!")

train_loader, valid_loader = get_dataloader(train_npy, train_tag_np,
                                            valid_npy, valid_tag_np, batch_size=12, n_workers=1)

train_iterator = iter(train_loader)
print(f"[Info]: Finish loading data!", flush=True)

maxlen = 100
model = Classifier(maxlen).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=1e-8)
# scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

print(f"[Info]: Finish creating model!", flush=True)
best_accuracy = -1.0
best_state_dict = None

pbar = tqdm(total=valid_steps, ncols=0, desc="Train", unit=" step")

for step in range(total_steps):
    # Get data
    batch = next(train_iterator)
    # except StopIteration:
    #     train_iterator = iter(train_loader)
    #     batch = next(train_iterator)

    mels, labels = batch
    loss, accuracy = model_fn(batch, model, criterion, device)

    batch_loss = loss.item()
    batch_accuracy = accuracy.item()

    # Update model
    loss.backward()
    # optimizer.step is called for every mini-batch
    optimizer.step()
    # scheduler.step would be called once per epoch
    optimizer.zero_grad()

Are you also seeing a None gradient using my code and the proposed fixes?
If not, please share a minimal, executable code snippet which would reproduce the issue.
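For example, a self-contained template of roughly this shape (random inputs, no data files; the model here is only a stand-in) is usually enough:

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))  # replace with your model
x = torch.randn(4, 10)                                                 # random stand-in batch
target = torch.randint(0, 2, (4,))

criterion = nn.CrossEntropyLoss()
loss = criterion(model(x), target)
loss.backward()

for name, param in model.named_parameters():
    print(name, None if param.grad is None else param.grad.abs().sum())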

Thank you for your response!
The problem is not solved.

When loss.backward() finishes, the gradient of self.para is still not usable; it comes out as

tensor([[nan],
        [nan]])

@ptrblck

The other parameters get gradients; only self.para does not get a valid one.

@ptrblck @wml1993


Actually, I have the same issue when training the model in steps with range(0, total_steps). I thought the issue could be solved by setting the initial iteration to 1, like:

for step in range(1, total_steps):

I guessed that loss.backward() has some special condition at step 0 that causes a None gradient for the model parameters, but I still have not figured out the primary cause of this issue. I hope this observation helps you find the reason.

Sorry about the wrong information above; changing the starting iteration does not help. The bug is still there.

No, Autograd doesn’t use a specific condition based on the step.
Could you share a minimal and executable code snippet which would reproduce the issue by wrapping it into three backticks ```, please?

Thanks for your quick reply. Since the code spans many files, can I send the whole project to your email?