Hello. I’m currently trying to debug a model after observing that the loss remains constant throughout the entire training process. My first intuition was to check whether the gradients are flowing, and after doing some research I stumbled on the `torch.autograd.gradcheck` function.
I tried to play around with the function to get a feel for how it works with:

```python
torch.autograd.gradcheck(self.criterion, inputs=(input1, input2, input3))
```
which gave me the error:

```
*** RuntimeError: Jacobian mismatch for output 0 with respect to input 1,
numerical:tensor([[0.], [0.], [0.], ..., [0.], [0.], [0.]])
analytical:tensor([[-0.0730], [ 0.0224], [-0.0197], ..., [ 0.0574], [-0.0140], [-0.0738]])
```
I’m having a bit of trouble interpreting what this error message means. I’m aware from the documentation that the function should return `True` if the numerical and analytical gradients match. I’ve also looked at similar questions on this forum, since it seems like a fairly common error to run into, but I haven’t been able to find anything conclusive.
How should I proceed from here? Any tips are appreciated. Thanks in advance.
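For reference, here is a minimal, self-contained version of how I understand `gradcheck` is meant to be called (a toy function stands in for my actual criterion, so the names here are illustrative):

```python
import torch

# Toy stand-in for the real criterion. gradcheck estimates the Jacobian
# by finite differences, compares it against the Jacobian that autograd
# computes, and returns True only when the two agree within tolerance.
def toy_criterion(x):
    return (x * x).sum()

# Double-precision inputs with requires_grad=True are what gradcheck
# expects; the default tolerances are designed for double precision.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(toy_criterion, inputs=(x,)))  # prints True
```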
Edit 1
I’ve managed to make the function work by doing:
```python
self.criterion = self.criterion.double()
input1 = input1.double()
input2 = input2.double()
input3 = input3.double()
```
and passing these into `torch.autograd.gradcheck`, which now returns `True`. However, I believe it would still be helpful if anyone would be kind enough to shed some light on what the error message means.
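My current guess as to why the cast fixed it (this is an assumption on my part, based on the note in the `gradcheck` docs that the default tolerances are designed for double precision): in float32 the default finite-difference step is close to machine epsilon, so the numerical Jacobian is dominated by rounding error and can even come out as all zeros, much like in my error message. A toy repro:

```python
import torch

def toy_criterion(x):
    return (x * x).sum()

# In float32, the default finite-difference step (eps=1e-6) is near
# float32 machine epsilon, so the perturbed outputs round to almost the
# same values and the numerical Jacobian is mostly rounding noise.
x32 = torch.randn(5, dtype=torch.float32, requires_grad=True)
try:
    torch.autograd.gradcheck(toy_criterion, inputs=(x32,))
except RuntimeError as e:
    print("float32 fails:", type(e).__name__)

# The same check in double precision passes.
x64 = x32.detach().double().requires_grad_()
print(torch.autograd.gradcheck(toy_criterion, inputs=(x64,)))
```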
Edit 2 (Code)
The code that I’m running looks like the following. The basic framework is that I call a `train` function, which in turn calls a `train_step` function to conduct training.
```python
class Trainer():
    def __init__(self, config, model, data):
        """Simple initialization of attributes for Trainer object."""
        self.config = config
        self.model = model
        self.data = data
        self.load_data()  # Loads the data into the respective variables.

    def train(self):
        self.optimizer = optim.Adam(
            params=filter(lambda p: p.requires_grad, self.model.parameters()),
            lr=self.config.learning_rate,
            weight_decay=self.config.weight_decay)
        self.criterion = TripletLoss(margin=self.config.margin)
        for epoch in range(self.config.num_epochs):
            self.train_step(self.criterion, self.optimizer)

    def train_step(self, criterion, optimizer):
        self.model.train()
        step = 0
        total_loss = 0.0
        misc_data = [self.drug_info, self.profile_info, self.train_drug_list]
        batches = make_batches(self.config, self.data.train_data, misc_data)
        for batch in batches:
            self.optimizer.zero_grad()
            pos_drugs, pos_profiles, pos_labels, neg_drugs, neg_profiles, neg_labels = batch
            input_pos_drugs = [to_hgnn_data(self.config, x, 1) for x in pos_drugs]
            input_neg_drugs = [to_hgnn_data(self.config, x, 0) for x in neg_drugs]
            input_pos_profiles = torch.cuda.FloatTensor(pos_profiles)
            input_neg_profiles = torch.cuda.FloatTensor(neg_profiles)
            drug_embeddings, profile_embeddings, _ = self.model(input_pos_profiles, input_pos_drugs)
            neg_drug_embeddings, profile_embeddings, _ = self.model(input_neg_profiles, input_neg_drugs)
            loss = self.criterion(profile_embeddings, drug_embeddings, neg_drug_embeddings)
            loss.backward()
            total_loss += loss.item()
            self.optimizer.step()
            step += 1
```
This is basically the code that I’m running; I’ve omitted some details that aren’t directly related to this problem. The functions `make_batches` and `to_hgnn_data` are defined in a separate file.
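In case it helps anyone reproduce the check in isolation, here is a standalone version of the `gradcheck` call against a triplet-style loss. I’m using the built-in `nn.TripletMarginLoss` as a stand-in for my custom `TripletLoss`, and the margin and `(batch, dim)` shapes below are made up:

```python
import torch
import torch.nn as nn

# nn.TripletMarginLoss as a stand-in for the custom TripletLoss;
# the margin value and the (4, 8) shapes are arbitrary.
criterion = nn.TripletMarginLoss(margin=1.0).double()

anchor = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
positive = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
negative = torch.randn(4, 8, dtype=torch.double, requires_grad=True)

# Small double-precision inputs with requires_grad=True are what
# gradcheck expects; a Module works here because it is callable.
print(torch.autograd.gradcheck(criterion, inputs=(anchor, positive, negative)))
```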