p.data.grad is None after backward

Hey,
I’m trying to implement the optimization algorithms myself and compare them. However, for some reason the backward pass doesn’t populate the gradients, and they stay None.

My optimizer :

class MyOptimizer:

  def __init__(self, parameters,lr=0.001, momentum=0.9):
    self.momentum = momentum
    self.layers_data_list=[]
    for layer_params in list(parameters):
      layer_dict = dict()
      layer_dict['params']=layer_params
      layer_dict['momentum']=momentum
      layer_dict['velocity']=None
      layer_dict['lr']=lr
      self.layers_data_list.append(layer_dict)

      
  def step_sgd(self):
    for layer_data in self.layers_data_list:
      for p in layer_data['params']:
        if p.data.grad is None: #Update : tried if p.grad is None as suggested in the comments
          print("grad is None")
          continue
        lr=layer_data['lr']
        d_p=p.grad.data
        p.data.add_(-lr,d_p)

  def zero_grad(self):
    for layer_data in self.layers_data_list:
      for p in layer_data['params']:
        if p.grad is not None:
          p.grad.zero_()

The training part isn’t unusual, and it worked in other notebooks I used:

def train_and_eval(optimizer,net,optimizer_step,GPU=False):
  loss_function = nn.CrossEntropyLoss()
  epochs=100
  train_loss_per_epoch=[]
  test_loss_per_epoch=[]
  for epoch in range(epochs) :
    print("[train]-----epoch "+str(epoch+1)+" -----")
    train_loss = 0.0
    test_loss = 0.0
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get inputs from data 
        inputs, labels = data

        if GPU:
            inputs = inputs.cuda()  # -- For GPU
            labels = labels.cuda()  # -- For GPU

        optimizer.zero_grad()
        outputs = net(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        if optimizer_step == "sgd":
          optimizer.step_sgd()

        train_loss += loss.item()

        # print statistics
        running_loss += loss.item()
        if (i + 1) % 200 == 0:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

    train_loss_per_epoch.append(train_loss / len(trainloader))
    print('[%d] train loss: %.3f' %
          (epoch + 1, train_loss / len(trainloader)))

    # test
    print("[test]-----epoch "+str(epoch+1)+" -----")
    for i, data in enumerate(testloader, 0):

        # get inputs from data 
        inputs, labels = data

        if GPU:
            inputs = inputs.cuda()  # -- For GPU
            labels = labels.cuda()  # -- For GPU

        outputs = net(inputs)
        loss = loss_function(outputs, labels)
        test_loss += loss.item()
    test_loss_per_epoch.append(test_loss / len(testloader))
    print('[%d] test loss: %.3f' %
          (epoch + 1, test_loss / len(testloader)))
  
  return train_loss_per_epoch,test_loss_per_epoch

def runOptimizerTest(optimizer_step,GPU):
  if GPU :
    net=CNN().cuda()
  else :
    net=CNN()
  optimizer=MyOptimizer(net.parameters())
  return train_and_eval(optimizer,net,optimizer_step,GPU)
 
 
optimizer_step="sgd"
train_loss_per_epoch_sgd,test_loss_per_epoch_sgd=runOptimizerTest(optimizer_step,GPU=False)

thanks

Don’t use the .data attribute, as the manipulation of the underlying data might yield some side effects.
If you want to check the gradient of a parameter, you could access it via print(param.grad).

Also, you are mixing p.data.grad and p.grad.data. :wink:

To add to the comment above, you want to do the SGD step inside a with torch.no_grad(): block instead of using .data.
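
A minimal sketch of that suggestion, assuming a plain SGD update with no momentum. The helper name sgd_step and the toy parameter w are made up for illustration and are not part of the original optimizer:

```python
import torch

def sgd_step(params, lr=0.001):
    """Plain SGD update done under no_grad instead of through .data."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:  # parameter did not take part in the loss
                continue
            p.add_(p.grad, alpha=-lr)  # in-place: p <- p - lr * grad

# Toy check on a single hypothetical parameter.
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()          # w.grad is 2 * w = [2., 4.]
sgd_step([w], lr=0.1)    # w <- [1., 2.] - 0.1 * [2., 4.]
print(w)
```

Under no_grad the in-place update is not recorded by autograd, which is the effect .data was being used for, without bypassing autograd's safety checks.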

First of all, thank you for the fast response.
So I changed my check to if p.grad is None:, but the grad is still None. I also added a print inside the if, and the result is the same.

Can you give a bit more code?
In particular, do you send the net to cuda after giving the parameters to the optimizer for example?
Do you have a complete example that we can run to check this?

I added the whole code I used.

So the problem is that layer_dict['params'] is not a list, but just a Tensor.
So when you do for p in layer_data['params']:, you actually slice the Tensor along the 0th dimension. Since this returns new Tensors, the .grad field is not set for them.

You can either do layer_dict['params']=[layer_params,] or replace for p in layer_data['params']: by p = layer_data['params'].
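
A small sketch of the behaviour described above, using a random tensor w as a stand-in for one layer's weight (the names here are illustrative only):

```python
import warnings
import torch

w = torch.randn(3, 2, requires_grad=True)   # stands in for one layer's weight
(w ** 2).sum().backward()

whole_has_grad = w.grad is not None          # the parameter itself has .grad

with warnings.catch_warnings():
    # Reading .grad on a non-leaf tensor emits a UserWarning; silence it here.
    warnings.simplefilter("ignore")
    # Iterating over w yields new tensors (rows), not the parameter itself.
    rows_have_grad = [row.grad is not None for row in w]

print(whole_has_grad)   # True
print(rows_have_grad)   # [False, False, False]

# Either fix keeps the loop variable bound to the parameter itself:
for p in [w]:            # option 1: wrap the tensor in a list
    same_object = p is w
p = w                    # option 2: drop the inner loop entirely
```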

Instead of using for p in layer_data['params'], I worked at the tensor level (p=layer_data['params']), as you suggested.

  def zero_grad(self):
    for layer_data in self.layers_data_list:
      if layer_data['params'].grad is not None:
        layer_data['params'].grad.zero_()

  def step_sgd(self):
    '''Stochastic gradient descent'''
    for layer_data in self.layers_data_list:
      p=layer_data['params']
      lr=layer_data['lr']
      d_p=p.grad.data
      p.data.add_(-lr,d_p)

It seems that I’m now overfitting. Still, I’d be happy if you could explain why I can’t iterate over the tensor and read p.grad per weight, and why this has to be done at the tensor level instead (tensor.grad).

It is because the .grad field is associated with the Tensor, not an entry in the Tensor.
So if you index the Tensor, you get a view of some of its entries, but that is a new Tensor with its own .grad field, which backward() does not populate.

But the entries inside the tensor have a .grad_fn. I thought it was better to work at the tensor level in order to benefit from GPU performance (matrix multiplication), but it seems that I’m wrong. Why then does each of the entries in the tensor have a .grad_fn?

I made the changes you suggested, and now I do see that the grads are set and not None. However, it seems that I’m now overfitting very quickly and my test loss increases :slight_smile:
[1] train loss: 2.105
[1] test loss: 2.390
[2] train loss: 1.235
[2] test loss: 5.240
[3] train loss: 0.511
[3] test loss: 7.921
[4] train loss: 0.167
[4] test loss: 10.697
[5] train loss: 0.033
[5] test loss: 11.992
and so on…

.grad_fn is very different from .grad. You have a .grad_fn because this new Tensor was obtained in a differentiable way (using the slicing operation on the original one).
You indeed want to work with the full Tensor, not the individual elements !
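
To make the distinction concrete, here is a small sketch on a toy tensor (w and row are illustrative names, not from the code in this thread):

```python
import torch

w = torch.randn(3, 2, requires_grad=True)
row = w[0]                     # indexing is itself a differentiable operation

print(w.is_leaf, w.grad_fn is None)      # True True   (created by the user)
print(row.is_leaf, row.grad_fn is None)  # False False (result of an op on w)

row.sum().backward()           # the gradient flows back through the indexing op
print(w.grad)                  # ones in row 0, zeros elsewhere
```

The .grad_fn on row only records how row was computed from w; the accumulated gradient still ends up in w.grad, on the full parameter.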

For the overfitting, you can try increasing the training set size, L2 regularization, momentum, etc. (the usual suspects :D)
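
As a rough sketch of two of those suspects, here is one way a momentum step with optional L2 weight decay could look on top of the per-layer dicts used in this thread. The function name step_momentum and the weight_decay argument are assumptions for illustration, and the update rule follows the common v = momentum * v + grad, p = p - lr * v convention rather than anything the original poster wrote:

```python
import torch

def step_momentum(layers, weight_decay=0.0):
    """SGD with momentum and optional L2 weight decay (illustrative sketch)."""
    with torch.no_grad():
        for layer in layers:
            p = layer['params']            # the full parameter tensor
            if p.grad is None:
                continue
            d_p = p.grad
            if weight_decay != 0:
                # Gradient of the L2 penalty: weight_decay * p
                d_p = d_p.add(p, alpha=weight_decay)
            if layer['velocity'] is None:
                layer['velocity'] = d_p.clone()   # first step: v = grad
            else:
                layer['velocity'].mul_(layer['momentum']).add_(d_p)
            p.add_(layer['velocity'], alpha=-layer['lr'])

# Toy check: minimise w**2 for two steps with a hypothetical single "layer".
w = torch.tensor([1.0], requires_grad=True)
layers = [{'params': w, 'lr': 0.1, 'momentum': 0.9, 'velocity': None}]
for _ in range(2):
    if w.grad is not None:
        w.grad.zero_()
    (w ** 2).sum().backward()
    step_momentum(layers)
print(w)
```

Whether either of these actually helps with the test loss above would need to be checked by experiment.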

Thank you for all the help !