Grad is always None

Hi, I need some help getting my model to pass gradients through properly.
In my model, I have a series of conv layers followed by linear layers. After the linear layers output a 2x8x8 grid, I apply torch.abs, perform a cumulative sum operation on the grid, and upsample it to a size of 2x128x128. Then I perform a grid_sample() operation on the source image using the upsampled grid, and finally the loss is computed by comparing the original image to the deformed image. In addition, I have an L1 regularization term in the loss function, applied to the initial 2x8x8 grid as originally output by the linear layers.
But when I print the gradients after each of these consecutive operations, none of them has a gradient (even though I set retain_grad to True). Does anyone have suggestions as to what the problem might be here? Thanks!

```
class Net(nn.Module):
  def __init__(self, grid_size):
    super().__init__()
    self.conv = get_conv(grid_size)
    self.flatten = nn.Flatten()
    self.linear1 = nn.Sequential(nn.Linear(80,20),nn.ReLU(),)
    self.linear2 = nn.Linear(20, 2*grid_size*grid_size)
    self.upsampler = nn.Upsample(size = [IMAGE_SIZE, IMAGE_SIZE], mode = 'bilinear')
    #self.linear2.bias = nn.Parameter(init_grid(grid_size).view(-1))
    #self.linear2.weight.data.fill_(float(0))
    self.grid_offset_x = torch.tensor(float(0), requires_grad=True)
    self.grid_offset_y = torch.tensor(float(0), requires_grad=True)
    self.grid_offset_x = nn.Parameter(self.grid_offset_x)
    self.grid_offset_y = nn.Parameter(self.grid_offset_y)
    self.grid_size = grid_size
  def forward(self, x, src_batch, checker_board=False):
    print(f'X gradient1: {get_tensor_info(x)}')
    x = self.conv(x)
    print(f'X gradient2: {get_tensor_info(x)}')
    x = self.flatten(x)
    print(f'X gradient3: {get_tensor_info(x)}')
    x = self.linear1(x)
    print(f'X gradient4: {get_tensor_info(x)}')
    x = self.linear2(x)
    print(f'X gradient5: {get_tensor_info(x)}')
    #enforce axial monotonicity using the abs operation
    x = torch.abs(x)
    print(f'X gradient after abs(): {get_tensor_info(x)}')
    batch, grid = x.shape
    x = x.view(batch, 2,self.grid_size,self.grid_size)
    #perform the cumsum operation to restore the original grid from the differential grid
    x = cumsum_2d(x, self.grid_offset_x, self.grid_offset_y)
    print(f'X gradient after cumsum(): {get_tensor_info(x)}')
    #Upsample the grid_size x grid_size warp field to image_size x image_size warp field
    x = self.upsampler(x)
    print(f'X gradient after upsampling: {get_tensor_info(x)}')
    x = x.permute(0,2,3,1)
    if checker_board:
      source_image = apply_checkerboard(src_batch, IMAGE_SIZE)
    #calculate target estimation
    x = nn.functional.grid_sample(src_batch.unsqueeze(0).permute([1,0,2,3]), x)

    return x
```

The result of this gradient check is:
X gradient1: requires_grad(False) is_leaf(True) retains_grad(None) grad_fn(None) grad(None)
X gradient2: requires_grad(True) is_leaf(False) retains_grad(None) grad_fn(<ReluBackward0 object at 0x7f5f1003d3d0>) grad(None)
X gradient3: requires_grad(True) is_leaf(False) retains_grad(None) grad_fn(<ViewBackward object at 0x7f5f123eed50>) grad(None)
X gradient4: requires_grad(True) is_leaf(False) retains_grad(None) grad_fn(<ReluBackward0 object at 0x7f5f123eed10>) grad(None)
X gradient5: requires_grad(True) is_leaf(False) retains_grad(None) grad_fn(<AddmmBackward object at 0x7f5f1004fe10>) grad(None)
X gradient after abs(): requires_grad(True) is_leaf(False) retains_grad(None) grad_fn(<AbsBackward object at 0x7f5f123eed10>) grad(None)
X gradient after cumsum(): requires_grad(True) is_leaf(False) retains_grad(None) grad_fn(<PermuteBackward object at 0x7f5f12350450>) grad(None)
X gradient after upsampling: requires_grad(True) is_leaf(False) retains_grad(None) grad_fn(<UpsampleBilinear2DBackward1 object at 0x7f5f1003d3d0>) grad(None)
Total_loss gradient: requires_grad(True) is_leaf(False) retains_grad(True) grad_fn(<MulBackward0 object at 0x7f5f10031b50>) grad(None)
total_loss gradient: requires_grad(True) is_leaf(False) retains_grad(True) grad_fn(<MulBackward0 object at 0x7f5f10031750>) grad(None)

And this is my loss function:

```
def Total_Loss (target_image, warped_image, grid_size, Lambda):
  batch,W,H = warped_image.shape
  #print(f'Warp_field gradient: {get_tensor_info(warp_field)}')

  L2_Loss_f = nn.MSELoss()
  L2_Loss = 1/2 * L2_Loss_f(target_image, warped_image)
  
  Total_loss = L2_Loss
  Total_loss.retain_grad()
  print(f'Total_loss gradient: {get_tensor_info(Total_loss)}')

  return Total_loss
```

Appreciate your help

Could you add the missing definitions to the posted code to make it executable, as well as the input shapes, so that we can try to reproduce it, please?

Sure, here is a minimized version of my model. I’ve annotated the input sizes on top of each function definition. Thanks a lot for your help.



```
#define the alignnet model
def get_conv(grid_size):
  model = nn.Sequential (
      nn.MaxPool2d (2),
      nn.Conv2d (2, 20, 5),
      nn.ReLU(),
      nn.MaxPool2d (2),
      nn.Conv2d (20, 20, 5),
      nn.ReLU(),
      nn.MaxPool2d (2),
      nn.Conv2d (20, 20, 2),
      nn.ReLU(),
      nn.MaxPool2d (2),
      nn.Conv2d (20, 20, 5),
      nn.ReLU(),
   
  )
  return model
def get_tensor_info(tensor):
  info = []
  for name in ['requires_grad', 'is_leaf', 'retains_grad', 'grad_fn', 'grad']:
    info.append(f'{name}({getattr(tensor, name, None)})')
  return ' '.join(info)
#initialize the differential grid
#the parameter learn offset will define whether or not to learn the offset values during training
def init_grid(grid_size=8):
  #spacing of the grid
  #-1 is because we have a1 = -1 (and thus there are grid_size - 1 "spacing" grids)
  delta = 2/(grid_size-1)
  np_grid = np.arange(grid_size, dtype=float)
  np_grid = np.full_like(np_grid,float(delta))
  ts_grid_x = torch.FloatTensor(np_grid).to(DEVICE)
  ts_grid_y = torch.FloatTensor(np_grid).to(DEVICE)
  diff_i_grid_y, diff_i_grid_x = torch.meshgrid(ts_grid_x,ts_grid_y)
  diff_grid = torch.stack([diff_i_grid_x, diff_i_grid_y])
  diff_grid = diff_grid.view(2*grid_size*grid_size)
  return diff_grid

#perform cumsum operation on a 2d batch of inputs
#takes in grid tensors of shape batch x 2 x grid x grid 
#return grid tensors of shape batch x 2 x grid x grid 
def cumsum_2d(grid, grid_offset_x, grid_offset_y):
  batch_size, dim, grid_1, grid_2 = grid.shape
  grid[:,0,:,0] = -1 
  grid[:,1,0,:] = -1 

  Integrated_grid_x = torch.cumsum(grid[:,0], dim = 2) + grid_offset_x
  Integrated_grid_y = torch.cumsum(grid[:,1], dim = 1) + grid_offset_y
  Integrated_grid = torch.stack([Integrated_grid_x, Integrated_grid_y])
  Integrated_grid = Integrated_grid.permute([1,0,2,3])

  return Integrated_grid

class Net(nn.Module):
  def __init__(self, grid_size):
    super().__init__()
    self.conv = get_conv(grid_size)
    self.flatten = nn.Flatten()
    self.linear1 = nn.Sequential(nn.Linear(80,20),nn.ReLU(),)
    self.linear2 = nn.Linear(20, 2*8*8)
    self.upsampler = nn.Upsample(size = [128, 128], mode = 'bilinear')
    self.linear2.bias = nn.Parameter(init_grid(8).view(-1))
    self.linear2.weight.data.fill_(float(0))
    self.grid_offset_x = torch.tensor(float(0), requires_grad=True)
    self.grid_offset_y = torch.tensor(float(0), requires_grad=True)
    self.grid_offset_x = nn.Parameter(self.grid_offset_x)
    self.grid_offset_y = nn.Parameter(self.grid_offset_y)
    self.grid_size = grid_size

#input shapes (n can be any number):
#x: nx2x128x128
#src_batch: nx128x128
  def forward(self, x, src_batch):
    print(f'X gradient1: {get_tensor_info(x)}')
    x = self.conv(x)
    print(f'X gradient2: {get_tensor_info(x)}')
    x = self.flatten(x)
    print(f'X gradient3: {get_tensor_info(x)}')
    x = self.linear1(x)
    print(f'X gradient4: {get_tensor_info(x)}')
    x = self.linear2(x)
    print(f'X gradient5: {get_tensor_info(x)}')
    #enforce axial monotonicity using the abs operation
    x = torch.abs(x)
    print(f'X gradient after abs(): {get_tensor_info(x)}')
    batch, grid = x.shape
    x = x.view(batch, 2,self.grid_size,self.grid_size)
    #perform the cumsum operation to restore the original grid from the differential grid
    x = cumsum_2d(x, self.grid_offset_x, self.grid_offset_y)
    print(f'X gradient after cumsum(): {get_tensor_info(x)}')
    #Upsample the grid_size x grid_size warp field to image_size x image_size warp field
    x = self.upsampler(x)
    print(f'X gradient after upsampling: {get_tensor_info(x)}')
    x = x.permute(0,2,3,1)

    #calculate target estimation
    x = nn.functional.grid_sample(src_batch.unsqueeze(0).permute([1,0,2,3]), x)

    return x
#target_image: n x128x128
#warped_image: n x128x128 (output of the Net model)
def Total_Loss (target_image, warped_image, grid_size=8, Lambda=1e-5):
  batch,W,H = warped_image.shape
  #print(f'Warp_field gradient: {get_tensor_info(warp_field)}')

  L2_Loss_f = nn.MSELoss()
  L2_Loss = 1/2 * L2_Loss_f(target_image, warped_image)
  
  Total_loss = L2_Loss
  Total_loss.retain_grad()
  print(f'Total_loss gradient: {get_tensor_info(Total_loss)}')

  return Total_loss
```

Thank you for the code.

Could you also please provide some driver code which does the training, and using which you can observe the no-gradient thing?

The issue could be with the way that the training is driven, and without your code which does this part, it is not clear (at least, to me) how to debug this.

Thanks for the response. I made a minimal version of my function that runs a single epoch. Please let me know if you need anything else. Thanks!

```
#aug_batch, src_batch, tar_batch input form: nx128x128
#run the model for a single epoch
#data_loader can either be a single data_loader or an iterable list of data loaders (if so, the source loader should also be a list of data loaders)
def run_epoch(model, optimizer, aug_batch, src_batch, tar_batch, grid_size=8):
  loss_list = []

  aug_batch = torch.tensor(aug_batch, dtype=torch.float32)
  tar_batch = torch.tensor(tar_batch, dtype=torch.float32)
  src_batch = torch.tensor(src_batch, dtype=torch.float32)

  #run forward propagation
  tar_est = warp(model, src_batch, aug_batch, grid_size)
  tar_est = tar_est.squeeze(dim=1)
  total_loss = Total_Loss(tar_batch, tar_est, grid_size, 1e-5)

  total_loss.backward()
  optimizer.step()
  optimizer.zero_grad()
  loss_list.append(total_loss)

  return loss_list


def warp(model, src_batch, aug_batch, grid_size):
  input_image = torch.stack([src_batch, aug_batch])
  input_image  = input_image.permute([1,0,2,3])
  #run the network
  target_est = model.forward(input_image, src_batch)

  return target_est
```

And where is the code that invokes run_epoch?

If you don’t provide (a possibly simplified version of) complete executable code, I really don’t see the point in starting to debug the rest of your code, because the bug may be hiding in the part of the code that you did not provide.

So, please, provide a complete set of executable code (ideally, minimized to remove extraneous stuff) that exhibits the behaviour that you want to address.

Thanks, here is my code that runs run_epoch. I did not include this part because it takes inputs from my custom-made dataloaders (a feature I excluded from the minimal model), and other than that all it does is loop (though I should have included it, since I just noticed the optimizer is defined here). Also, are you going to make minimal inputs yourself, or do you need the images that I used for the model? I really appreciate your help.

```
def run_model(model,src_batch, tar_batch, grid_size=8,):
  optimizer = optim.Adam(model.parameters(), lr=1e-3)
  epoch_loss = []
  for i in range(10):
    loss_list = run_epoch(model, optimizer, src_batch, src_batch, tar_batch, grid_size)  #reuse src_batch as the augmented batch in this minimal version
    avg_epoch_loss = sum(loss_list) / len(loss_list)
    print(f'Loss in Epoch {i}: {avg_epoch_loss}')
```

Thank you for the code.

If you could tell the shape of your inputs, we can construct random tensors to match that shape, and that is usually enough to flush out many bugs.

Edited to add: Wait, where is the code that invokes run_model? Perhaps the bug is in the way you construct the model?

I use another function to deal with some irregularities arising from the dataloaders, but I think this is enough for the minimal model. So run_model is equivalent to what you would usually do in train(). Thanks

```
#with src_batch and tar_batch in shape 5x128x128
model = Net(8)
run_model(model, src_batch, tar_batch)
```

Thank you.

Could you check if you observe the “Grad is None” behaviour if you do the following:

```
src_batch = torch.randn(5, 128, 128)
tar_batch = torch.randn(5, 128, 128)

model = Net(8)
run_model(model, src_batch, tar_batch)
```

Yep, still the same behavior. I ran this cell and it prints out that the grads are None. Thanks

```
import torch
import torchvision
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import skimage
from zipfile import ZipFile
import h5py 
from imgaug import augmenters as iaa
from pathlib import Path
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torchvision.utils import make_grid
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.utils.data import random_split
from torch.autograd import Variable
from torchsummary import summary

#define the alignnet model
def get_conv(grid_size):
  model = nn.Sequential (
      nn.MaxPool2d (2),
      nn.Conv2d (2, 20, 5),
      nn.ReLU(),
      nn.MaxPool2d (2),
      nn.Conv2d (20, 20, 5),
      nn.ReLU(),
      nn.MaxPool2d (2),
      nn.Conv2d (20, 20, 2),
      nn.ReLU(),
      nn.MaxPool2d (2),
      nn.Conv2d (20, 20, 5),
      nn.ReLU(),
   
  )
  return model
def get_tensor_info(tensor):
  info = []
  for name in ['requires_grad', 'is_leaf', 'retains_grad', 'grad_fn', 'grad']:
    info.append(f'{name}({getattr(tensor, name, None)})')
  return ' '.join(info)
#initialize the differential grid
#the parameter learn offset will define whether or not to learn the offset values during training
def init_grid(grid_size=8):
  #spacing of the grid
  #-1 is because we have a1 = -1 (and thus there are grid_size - 1 "spacing" grids)
  delta = 2/(grid_size-1)
  np_grid = np.arange(grid_size, dtype=float)
  np_grid = np.full_like(np_grid,float(delta))
  ts_grid_x = torch.FloatTensor(np_grid)
  ts_grid_y = torch.FloatTensor(np_grid)
  diff_i_grid_y, diff_i_grid_x = torch.meshgrid(ts_grid_x,ts_grid_y)
  diff_grid = torch.stack([diff_i_grid_x, diff_i_grid_y])
  diff_grid = diff_grid.view(2*grid_size*grid_size)
  return diff_grid

#perform cumsum operation on a 2d batch of inputs
#takes in grid tensors of shape batch x 2 x grid x grid 
#return grid tensors of shape batch x 2 x grid x grid 
def cumsum_2d(grid, grid_offset_x, grid_offset_y):
  batch_size, dim, grid_1, grid_2 = grid.shape
  grid[:,0,:,0] = -1 
  grid[:,1,0,:] = -1 

  Integrated_grid_x = torch.cumsum(grid[:,0], dim = 2) + grid_offset_x
  Integrated_grid_y = torch.cumsum(grid[:,1], dim = 1) + grid_offset_y
  Integrated_grid = torch.stack([Integrated_grid_x, Integrated_grid_y])
  Integrated_grid = Integrated_grid.permute([1,0,2,3])

  return Integrated_grid

class Net(nn.Module):
  def __init__(self, grid_size):
    super().__init__()
    self.conv = get_conv(grid_size)
    self.flatten = nn.Flatten()
    self.linear1 = nn.Sequential(nn.Linear(80,20),nn.ReLU(),)
    self.linear2 = nn.Linear(20, 2*8*8)
    self.upsampler = nn.Upsample(size = [128, 128], mode = 'bilinear')
    self.linear2.bias = nn.Parameter(init_grid(8).view(-1))
    self.linear2.weight.data.fill_(float(0))
    self.grid_offset_x = torch.tensor(float(0), requires_grad=True)
    self.grid_offset_y = torch.tensor(float(0), requires_grad=True)
    self.grid_offset_x = nn.Parameter(self.grid_offset_x)
    self.grid_offset_y = nn.Parameter(self.grid_offset_y)
    self.grid_size = grid_size

#input shapes (n can be any number):
#x: nx2x128x128
#src_batch: nx128x128
  def forward(self, x, src_batch):
    print(f'X gradient1: {get_tensor_info(x)}')
    x = self.conv(x)
    print(f'X gradient2: {get_tensor_info(x)}')
    x = self.flatten(x)
    print(f'X gradient3: {get_tensor_info(x)}')
    x = self.linear1(x)
    print(f'X gradient4: {get_tensor_info(x)}')
    x = self.linear2(x)
    print(f'X gradient5: {get_tensor_info(x)}')
    #enforce axial monotonicity using the abs operation
    x = torch.abs(x)
    print(f'X gradient after abs(): {get_tensor_info(x)}')
    batch, grid = x.shape
    x = x.view(batch, 2,self.grid_size,self.grid_size)
    #perform the cumsum operation to restore the original grid from the differential grid
    x = cumsum_2d(x, self.grid_offset_x, self.grid_offset_y)
    print(f'X gradient after cumsum(): {get_tensor_info(x)}')
    #Upsample the grid_size x grid_size warp field to image_size x image_size warp field
    x = self.upsampler(x)
    print(f'X gradient after upsampling: {get_tensor_info(x)}')
    x = x.permute(0,2,3,1)

    #calculate target estimation
    x = nn.functional.grid_sample(src_batch.unsqueeze(0).permute([1,0,2,3]), x)

    return x
#target_image: n x128x128
#warped_image: n x128x128 (output of the Net model)
def Total_Loss (target_image, warped_image, grid_size=8, Lambda=1e-5):
  batch,W,H = warped_image.shape
  #print(f'Warp_field gradient: {get_tensor_info(warp_field)}')

  L2_Loss_f = nn.MSELoss()
  L2_Loss = 1/2 * L2_Loss_f(target_image, warped_image)
  
  Total_loss = L2_Loss
  Total_loss.retain_grad()
  print(f'Total_loss gradient: {get_tensor_info(Total_loss)}')

  return Total_loss


#aug_batch, src_batch, tar_batch input form: nx128x128
#run the model for a single epoch
#data_loader can either be a single data_loader or an iterable list of data loaders (if so, the source loader should also be a list of data loaders)
def run_epoch(model, optimizer,  aug_batch, src_batch, tar_batch, grid_size=8):

  aug_batch = torch.tensor(aug_batch, dtype=torch.float32)
  tar_batch = torch.tensor(tar_batch, dtype=torch.float32)
  src_batch = torch.tensor(src_batch, dtype=torch.float32)

  #run forward propagation
  tar_est = warp(model, src_batch, aug_batch, grid_size)        
  tar_est = tar_est.squeeze(dim=1)
  total_loss = Total_Loss(tar_batch, tar_est, grid_size, 1e-5)

  total_loss.backward()
  optimizer.step()
  optimizer.zero_grad()




def warp(model, src_batch, aug_batch, grid_size):
  input_image = torch.stack([src_batch, aug_batch])
  input_image  = input_image.permute([1,0,2,3])
  #run the network
  target_est = model.forward(input_image, src_batch)

  return target_est

def run_model(model,src_batch, tar_batch, grid_size=8,):
  optimizer = optim.Adam(model.parameters(), lr=1e-3)
  epoch_loss = []
  for i in range(10):
    run_epoch(model, optimizer, src_batch, src_batch, tar_batch, grid_size)  #reuse src_batch as the augmented batch in this minimal version
   
src_batch = torch.randn(5, 128, 128)
tar_batch = torch.randn(5, 128, 128)

model = Net(8)
run_model(model, src_batch, tar_batch)
```

Why do you expect these gradients not to be None? You are checking whether the input to the model has grad values set. Why do you expect these grad values to be set by the training procedure?

Shouldn't gradients be tracked through all the forward-pass functions until they reach the loss function, so that partial derivatives can be applied to calculate the weight updates?

Gradients should be tracked for the parameters of the model, such as the weights and biases. The training step is supposed to compute these gradients so that it can adjust these parameters in the general direction (which is the negation of the gradient) that will cause the loss to decrease.

The input is not something that the training is supposed to improve or change in any manner, at least for the models that I have come across so far. So why should the training keep track of gradients for the inputs?
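
As a quick illustration (a toy linear model with arbitrary sizes, not your Net), here is what is and is not populated after a backward pass:

```
import torch
import torch.nn as nn

model = nn.Linear(4, 1)        #toy model, sizes chosen arbitrarily
x = torch.randn(2, 4)          #an ordinary input: requires_grad is False

loss = model(x).pow(2).mean()
loss.backward()

print(model.weight.grad)       #populated: parameters are leaf tensors that require grad
print(model.bias.grad)         #populated as well
print(x.grad)                  #None: the input never required a gradient
```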

Oh, thanks for pointing that out. I guess I had a misunderstanding of the gradients (it's actually my very first model :rofl:). But, for example, if we are passing inputs through something like:
L = nn.Linear(somethingsomething)
x = L(x)
x = x^2
x = log(x)

Shouldn't gradients be calculated first with respect to x = log(x) and x = x^2, and then with respect to the linear layer, rather than taking derivatives directly from whatever value the loss function outputs? Thanks!

The best way to find this out is by writing the few lines of code that will check this, and passing in random tensors of the correct shape.

Yeah, I checked with the code below, and none of the grads are set. If so, how can we check that the gradients (or, more precisely, the partial derivatives) are calculated for the functions outside of the linear layers? (I want to verify the exact same thing for my model above.)

```
def forward(model,x):
  model(x)
  print(x.grad)
  x = x**2
  print(x.grad)
  x = x/ 10
  print(x.grad)
  return x
model = nn.Sequential(nn.Linear(10,2), nn.Linear(2,1))
x = torch.randn(10, requires_grad=True)
x = forward(model,x)
loss = torch.norm(x)
loss.backward()
x = forward(model, x)
```

I am not sure I understand this question.

What is the function in the code above for which you want to compute gradients, and with respect to which variables do you want to compute the gradients of this function?

I think there is a misunderstanding of how gradients are calculated.

No, forward activations are stored if they are needed for the gradient calculation, but you need to finish the entire forward pass before calculating the gradients in the backward pass.
In your code snippets you are checking the .grad of activations in the forward method, which won’t work because:

  • backward() wasn’t called yet, so the gradients are not calculated
  • you are accessing the .grad attribute of non-leaf tensors, which are not stored by default as explained in the warning you should see:
UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

This code should work using your simple model:

```
model = nn.Sequential(nn.Linear(10,2), nn.Linear(2,1))

x = torch.randn(1, 10)
x1 = model(x)
x1.retain_grad()
x2 = x1**2
x2.retain_grad()
x3 = x2/ 10
x3.retain_grad()

loss = torch.norm(x3)
loss.backward()

print(x1.grad)
print(x2.grad)
print(x3.grad)
```
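
After loss.backward(), each print shows the gradient of the loss with respect to that intermediate activation; for example, x2.grad equals x3.grad / 10 (from x3 = x2 / 10) and x1.grad equals 2 * x1 * x2.grad (from x2 = x1 ** 2), which is just the chain rule applied backwards through the operations that follow the linear layers.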

Thanks a lot for the insight @gphilip and @ptrblck. I have one last question. Ultimately, I want to verify that the partial derivatives for my functions after the last linear layer are being passed on correctly. Currently, I am trying to check this by verifying the grads of the outputs of each function. My question is: can I verify that all grads are working properly by checking the grad of the leaf tensors (the weights)? In other words, will the computational graph break down and not yield any grads for the leaf tensors if my intermediate functions are not passing on gradients, or will it simply calculate the gradients directly with respect to the loss function (disregarding the gradients of the intermediate functions)? Thank you very much.
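
For example, would a check along these lines right after total_loss.backward() (just a sketch against the model above) be a sufficient test that the intermediate operations are passing gradients through?

```
#inspect the leaf parameters after total_loss.backward()
for name, p in model.named_parameters():
  print(name, p.grad is None, None if p.grad is None else p.grad.abs().sum().item())
```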