Gradient backpropagation for a custom tensor

I have a neural network architecture for molecular property prediction. Each atom's feature vector goes through its own MLP, chosen by atomic identity (i.e., atomic number), and the per-atom outputs are then summed to get the molecular energy. I use a ModuleDict to keep track of those MLPs. To compute the molecular energies, I initialize a zero tensor with requires_grad=True and then populate it atom by atom. My question is: will this cause problems during backpropagation? PyTorch doesn't allow in-place operations on such a tensor (I use .data to bypass the error, but I am not sure about the caveats of doing so). Below is the code snippet I am unsure about:

atomic_energies = torch.zeros((self.batch_size, max_n_atoms),
                              device=self.device, dtype=torch.float, requires_grad=True)

for i in range(self.batch_size):
    for j in range(n_atoms):
        atomic_energies.data[i][j] = self.atomic_mlp_dict[species[i][j].item()](feature_matrix[i][j])

How should I verify that the gradients are being propagated correctly? Currently, the loss doesn't change significantly over 10-20 epochs. How do I debug this kind of problem in PyTorch? Any help is highly appreciated.

@mamunm

You can monitor the gradients by registering hooks on the tensor. These hooks are triggered during the backward pass.

If the gradient values are not changing, it can mean that the loss is not connected to the parameter you are monitoring, or that the model has already converged. If the hook never fires at all, the parameter is not part of the graph that produced the loss.

import torch
import torch.nn as nn
import torch.optim as optim

class random_model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(100, 20), nn.BatchNorm1d(20), nn.ReLU())
        self.layer2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        return x

model = random_model()
loss = torch.nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# The hook is called with the gradient of layer2.weight on every backward pass.
grad_hook = model.layer2.weight.register_hook(lambda grad: print("Grad is {0}".format(grad)))

X = torch.rand(100, 100)
y = torch.rand(100, 1)  # same shape as the model output, so MSELoss does not broadcast

for cur_epoch in range(100):
    optimizer.zero_grad()
    output = model(X)
    cur_loss = loss(output, y)
    cur_loss.backward()
    optimizer.step()
    print("Epoch {0} Loss is {1}".format(cur_epoch, cur_loss.item()))
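Besides hooks, another quick check is to run a single backward pass and list the parameters whose .grad is still None. The following is only a minimal sketch: the helper name params_without_grad and the toy network are made up for illustration and are not part of the original model. It compares a loss that is connected to the network with one whose values were copied through .data.

```python
import torch
import torch.nn as nn


def params_without_grad(module):
    """Return names of trainable parameters that got no gradient in the last backward()."""
    return [name for name, p in module.named_parameters()
            if p.requires_grad and p.grad is None]


# Toy network (hypothetical), only used to exercise the helper.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.rand(16, 4)

# Connected case: the loss is computed directly from the network output.
net(x).sum().backward()
connected_missing = params_without_grad(net)

# Disconnected case: the output is copied into a buffer through .data,
# which bypasses autograd, so the loss no longer depends on the network.
net.zero_grad(set_to_none=True)
buf = torch.zeros(16, 1, requires_grad=True)
buf.data.copy_(net(x))
buf.sum().backward()
disconnected_missing = params_without_grad(net)

print("connected, params without grad:   ", connected_missing)
print("disconnected, params without grad:", disconnected_missing)
```

If the second list contains the parameters you expect to train, the loss is not connected to them.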

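Using .data this way doesn’t support gradient flow: the values are written directly into the tensor, bypassing autograd, so the MLP outputs are disconnected from the loss. Normally you have to use torch.stack (or torch.cat) for this kind of merging.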
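A rough sketch of what the torch.stack version could look like. The sizes, the species list, and the MLP layout below are assumptions made up for the example, and the ModuleDict is keyed by str(atomic_number) because ModuleDict keys must be strings. Padding up to max_n_atoms is not handled here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_features = 8            # hypothetical feature size
species_list = [1, 6, 8]  # hypothetical atomic numbers (H, C, O)

# One small MLP per species; keys are strings because ModuleDict requires it.
atomic_mlp_dict = nn.ModuleDict({
    str(z): nn.Sequential(nn.Linear(n_features, 16), nn.Tanh(), nn.Linear(16, 1))
    for z in species_list
})

batch_size, n_atoms = 2, 3
species = torch.tensor([[1, 6, 8], [1, 1, 8]])
feature_matrix = torch.rand(batch_size, n_atoms, n_features)

rows = []
for i in range(batch_size):
    row = []
    for j in range(n_atoms):
        mlp = atomic_mlp_dict[str(species[i, j].item())]
        # Keep the MLP output as a tensor in the graph (0-dim after squeeze).
        row.append(mlp(feature_matrix[i, j]).squeeze(-1))
    rows.append(torch.stack(row))            # shape: (n_atoms,)

atomic_energies = torch.stack(rows)          # shape: (batch_size, n_atoms)
molecular_energies = atomic_energies.sum(dim=1)

# Backward now reaches every MLP that was used in the forward pass.
molecular_energies.sum().backward()
```

Because atomic_energies is built out of the MLP outputs rather than filled in afterwards, it does not need requires_grad=True at creation; it picks up its graph from the outputs it stacks.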


@anantguptadbl Thanks for the suggestion. I tried it, and you’re right: the weight is not changing. I need to modify the code using @googlebot’s suggestion.