Gradient backpropagation for a custom tensor

mamunm · January 16, 2022, 12:42am

I have an NN architecture for molecular property prediction. The feature vector of each atom goes through its own MLP (chosen by atomic identity, i.e., atomic number), and the outputs are summed to get the molecular energy. I use a ModuleDict to keep track of those MLPs. For the energies, I initialize a zero-filled tensor with requires_grad=True and then populate it by looping over each atom. My question is: will this cause problems during backpropagation, given that PyTorch doesn’t allow in-place operations on a leaf tensor that requires grad? (I use .data to bypass the error, but I am not sure about the caveats of doing so.) Below is the part of the code that I am confused about:

atomic_energies = torch.zeros((self.batch_size, max_n_atoms),
                              device=self.device, dtype=torch.float, requires_grad=True)

for i in range(self.batch_size):
    for j in range(n_atoms):
        atomic_energies.data[i][j] = self.atomic_mlp_dict[species[i][j].item()](feature_matrix[i][j])

How should I verify that the gradients are being propagated correctly? Currently, the loss doesn’t change significantly over 10-20 epochs. How do I debug this kind of problem in PyTorch? Any help is highly appreciated.

anantguptadbl · January 16, 2022, 12:09pm

@mamunm

You can monitor the grads by registering hooks on the tensor. These hooks are triggered during the backward pass, each time a gradient is computed for that tensor.

If the grad values are not changing, it can mean that the loss is not connected to the parameter you are monitoring, or that the model has already converged.

import torch
import torch.nn as nn
import torch.optim as optim

class random_model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(100, 20), nn.BatchNorm1d(20), nn.ReLU())
        self.layer2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        return x

model = random_model()
loss = torch.nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# The hook fires during backward() whenever a gradient for this weight is computed
grad_hook = model.layer2.weight.register_hook(lambda grad: print("Grad is {0}".format(grad)))

X = torch.rand(100, 100)
y = torch.rand(100, 1)  # same shape as the model output, so MSELoss does not broadcast

for cur_epoch in range(100):
    optimizer.zero_grad()
    output = model(X)
    cur_loss = loss(output, y)
    cur_loss.backward()
    optimizer.step()
    print("Epoch {0} Loss is {1}".format(cur_epoch, cur_loss.item()))

grad_hook.remove()  # detach the hook once monitoring is done
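As a complement to hooks, one can also inspect each parameter's .grad after a single backward() call. The following is only a minimal sketch: the helper name grad_report and the small nn.Sequential model are hypothetical stand-ins, not part of the thread's code. A parameter whose gradient is still None after backward() is not connected to the loss.

```python
import torch
import torch.nn as nn

def grad_report(model):
    """Map each parameter name to its gradient norm, or None if no gradient arrived."""
    return {
        name: (None if p.grad is None else p.grad.norm().item())
        for name, p in model.named_parameters()
    }

torch.manual_seed(0)

# Hypothetical toy model and data, used only to exercise the helper.
model = nn.Sequential(nn.Linear(100, 20), nn.ReLU(), nn.Linear(20, 1))
X = torch.rand(8, 100)
y = torch.rand(8, 1)

loss = nn.functional.mse_loss(model(X), y)
loss.backward()

# Every parameter that feeds the loss should show a number here, not None.
for name, norm in grad_report(model).items():
    print(name, norm)
```

Running the same report on the real model after one training step should show quickly whether any of the per-species MLPs are missing gradients.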

googlebot · January 16, 2022, 6:09pm

Such use of .data doesn’t support gradient flow, because writes made through .data are not tracked by autograd. Normally you have to use torch.stack (or torch.cat) to merge the per-atom outputs into one tensor.
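A minimal sketch of the difference, under assumptions not taken from the original code: the two nn.Linear modules, the feature size, and the species and feature_matrix tensors are hypothetical stand-ins, and the ModuleDict is indexed with string keys because ModuleDict only accepts strings. It compares the .data pattern from the question with a torch.stack version by asking autograd for the gradients of the summed energies.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the per-species MLPs and the inputs.
n_features = 4
atomic_mlp_dict = nn.ModuleDict({
    "1": nn.Linear(n_features, 1),  # e.g. hydrogen
    "8": nn.Linear(n_features, 1),  # e.g. oxygen
})
species = torch.tensor([[8, 1, 1], [1, 1, 8]])   # (batch, atoms)
feature_matrix = torch.rand(2, 3, n_features)    # (batch, atoms, features)

def atom_energy(i, j):
    """Scalar energy of atom j in molecule i from its species-specific MLP."""
    mlp = atomic_mlp_dict[str(species[i, j].item())]
    return mlp(feature_matrix[i, j]).squeeze()

def energies_via_data():
    # Pattern from the question: the values are copied in, but the writes
    # through .data are invisible to autograd, so `out` stays a plain leaf.
    out = torch.zeros(species.shape, requires_grad=True)
    for i in range(species.shape[0]):
        for j in range(species.shape[1]):
            out.data[i][j] = atom_energy(i, j)
    return out

def energies_via_stack():
    # Collect the per-atom outputs and merge them; the graph stays intact.
    rows = [
        torch.stack([atom_energy(i, j) for j in range(species.shape[1])])
        for i in range(species.shape[0])
    ]
    return torch.stack(rows)

params = list(atomic_mlp_dict.parameters())
grads_data = torch.autograd.grad(energies_via_data().sum(), params, allow_unused=True)
grads_stack = torch.autograd.grad(energies_via_stack().sum(), params, allow_unused=True)

print("missing grads with .data:", [g is None for g in grads_data])
print("missing grads with stack:", [g is None for g in grads_stack])
```

Both functions produce the same energy values, but only the stacked version gives the MLP parameters a gradient. With padded molecules of different sizes, one option would be to stack only the real atoms and sum them per molecule rather than filling a fixed-size tensor.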

mamunm · January 16, 2022, 6:17pm

@anantguptadbl Thanks for the suggestion. I tried it and you’re right: the weight is not changing. I need to modify the code using @googlebot’s suggestion.