Hello. I’m currently trying to debug a model after observing that the loss remains constant throughout the entire training process. My first intuition was to check whether the gradients are flowing, and after doing some research I stumbled on the `torch.autograd.gradcheck` function.
I tried to play around with the function to get a feel for how it works with:

```python
torch.autograd.gradcheck(self.criterion, inputs=(input1, input2, input3))
```
which gave me the error:

```
*** RuntimeError: Jacobian mismatch for output 0 with respect to input 1,
numerical:tensor([[0.], [0.], [0.], ..., [0.], [0.], [0.]])
analytical:tensor([[-0.0730], [ 0.0224], [-0.0197], ..., [ 0.0574], [-0.0140], [-0.0738]])
```
I’m having a bit of trouble interpreting what this error message means. I’m aware from the documentation that the function should return `True` if the numerical and analytical gradients match. I’ve also looked at similar questions on this forum, since it seems like a fairly common error to run into, but I haven’t been able to find anything conclusive.
How should I proceed from here? Any tips are appreciated. Thanks in advance.
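For reference, here is a minimal, self-contained version of how I understand `gradcheck` is meant to be called (a toy function stands in for my actual criterion, so the names here are illustrative):

```python
import torch

# Toy stand-in for the real criterion. gradcheck estimates the Jacobian
# by finite differences, compares it against the Jacobian that autograd
# computes, and returns True only when the two agree within tolerance.
def toy_criterion(x):
    return (x * x).sum()

# Double-precision inputs with requires_grad=True are what gradcheck
# expects; the default tolerances are designed for double precision.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(toy_criterion, inputs=(x,)))  # prints True
```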
Edit 1
I’ve managed to make the function work by doing:
```python
self.criterion = self.criterion.double()
input1 = input1.double()
input2 = input2.double()
input3 = input3.double()
```
and passing these into `torch.autograd.gradcheck`, which now returns `True`. However, I believe it would still be helpful if anyone would be kind enough to shed some light on what the error message means.
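My current guess as to why the cast fixed it (this is an assumption on my part, based on the note in the `gradcheck` docs that the default tolerances are designed for double precision): in float32 the default finite-difference step is close to machine epsilon, so the numerical Jacobian is dominated by rounding error and can even come out as all zeros, much like in my error message. A toy repro:

```python
import torch

def toy_criterion(x):
    return (x * x).sum()

# In float32, the default finite-difference step (eps=1e-6) is near
# float32 machine epsilon, so the perturbed outputs round to almost the
# same values and the numerical Jacobian is mostly rounding noise.
x32 = torch.randn(5, dtype=torch.float32, requires_grad=True)
try:
    torch.autograd.gradcheck(toy_criterion, inputs=(x32,))
except RuntimeError as e:
    print("float32 fails:", type(e).__name__)

# The same check in double precision passes.
x64 = x32.detach().double().requires_grad_()
print(torch.autograd.gradcheck(toy_criterion, inputs=(x64,)))
```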
Edit 2 (Code)
The code that I’m running looks like the following. The basic framework is that I call a `train` function, which in turn calls a `train_step` function to conduct training.
```python
class Trainer():
    def __init__(self, config, model, data):
        """Simple initialization of attributes for Trainer object."""
        self.config = config
        self.model = model
        self.data = data
        self.load_data()  # Loads the data into the respective variables.

    def train(self):
        self.optimizer = optim.Adam(
            params=filter(lambda p: p.requires_grad, self.model.parameters()),
            lr=self.config.learning_rate,
            weight_decay=self.config.weight_decay)
        self.criterion = TripletLoss(margin=self.config.margin)
        for epoch in range(self.config.num_epochs):
            self.train_step(self.criterion, self.optimizer)

    def train_step(self, criterion, optimizer):
        self.model.train()
        step = 0
        total_loss = 0.0
        misc_data = [self.drug_info, self.profile_info, self.train_drug_list]
        batches = make_batches(self.config, self.data.train_data, misc_data)
        for batch in batches:
            self.optimizer.zero_grad()
            pos_drugs, pos_profiles, pos_labels, neg_drugs, neg_profiles, neg_labels = batch
            input_pos_drugs = [to_hgnn_data(self.config, x, 1) for x in pos_drugs]
            input_neg_drugs = [to_hgnn_data(self.config, x, 0) for x in neg_drugs]
            input_pos_profiles = torch.cuda.FloatTensor(pos_profiles)
            input_neg_profiles = torch.cuda.FloatTensor(neg_profiles)
            drug_embeddings, profile_embeddings, _ = self.model(input_pos_profiles, input_pos_drugs)
            neg_drug_embeddings, profile_embeddings, _ = self.model(input_neg_profiles, input_neg_drugs)
            loss = self.criterion(profile_embeddings, drug_embeddings, neg_drug_embeddings)
            loss.backward()
            total_loss += loss.item()
            self.optimizer.step()
            step += 1
```
This is basically the code that I’m running; I’ve omitted some details that aren’t directly related to this problem. The functions `make_batches` and `to_hgnn_data` are defined in a separate file.
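In case it helps anyone reproduce the check in isolation, here is a standalone version of the `gradcheck` call against a triplet-style loss. I’m using the built-in `nn.TripletMarginLoss` as a stand-in for my custom `TripletLoss`, and the margin and `(batch, dim)` shapes below are made up:

```python
import torch
import torch.nn as nn

# nn.TripletMarginLoss as a stand-in for the custom TripletLoss;
# the margin value and the (4, 8) shapes are arbitrary.
criterion = nn.TripletMarginLoss(margin=1.0).double()

anchor = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
positive = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
negative = torch.randn(4, 8, dtype=torch.double, requires_grad=True)

# Small double-precision inputs with requires_grad=True are what
# gradcheck expects; a Module works here because it is callable.
print(torch.autograd.gradcheck(criterion, inputs=(anchor, positive, negative)))
```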