Autograd's gradcheck outputs RuntimeError regarding Jacobian mismatch

Hello. I’m currently trying to debug a model after observing that the loss remains constant throughout training. My first instinct was to check whether the gradients are flowing at all, and after some research I stumbled upon the torch.autograd.gradcheck function.

I tried to play around with the function to get a feel for how it works with:

torch.autograd.gradcheck(self.criterion, inputs=(input1, input2, input3))

which gave me the error:

*** RuntimeError: Jacobian mismatch for output 0 with respect to input 1,
numerical:tensor([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])
analytical:tensor([[-0.0730],
        [ 0.0224],
        [-0.0197],
        ...,
        [ 0.0574],
        [-0.0140],
        [-0.0738]])

I’m having a bit of trouble interpreting what this error message means. I’m aware from the documentation that the function should return True when the gradient check passes. I’ve also looked at similar questions on this forum, since it seemed like a rather common error to run into, but haven’t been able to find anything that helps.

How should I proceed from here? Any tips are appreciated. Thanks in advance.

Edit

I’ve managed to make the function work by doing:

self.criterion = self.criterion.double()
input1 = input1.double()
input2 = input2.double()
input3 = input3.double()

and passing these into torch.autograd.gradcheck now returns True. Still, it would be helpful if someone could shed some light on what the error message actually means.
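
For reference, here is a minimal self-contained sketch of the kind of call that ended up working for me (the linear layer and tensor shapes below are just placeholders, not my actual model):

import torch

# Placeholder module and input; gradcheck compares the analytical Jacobian
# from autograd against a numerical Jacobian computed with finite differences,
# which is why double precision and requires_grad=True matter.
layer = torch.nn.Linear(4, 2).double()
x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)

# Prints True when the analytical and numerical Jacobians match.
print(torch.autograd.gradcheck(layer, inputs=(x,)))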

Edit 2 (Code)

The code that I’m running looks like the following. The basic framework is that I’m calling a train function, which in turn calls a train_step function to conduct training.

import torch
import torch.optim as optim

from utils import make_batches, to_hgnn_data  # helper functions live in utils.py

class Trainer():
    def __init__(self, config, model, data):
        """Simple initialization of attributes for Trainer object."""
        self.config = config
        self.model = model
        self.data = data

        self.load_data()  # Loads the data into the respective attributes.

    def train(self):
        self.optimizer = optim.Adam(params=filter(lambda p: p.requires_grad, self.model.parameters()),
                                    lr=self.config.learning_rate,
                                    weight_decay=self.config.weight_decay)
        self.criterion = TripletLoss(margin=self.config.margin)  # custom triplet loss (definition omitted)

        for epoch in range(self.config.num_epochs):
            self.train_step(self.criterion, self.optimizer)

    def train_step(self, criterion, optimizer):
        self.model.train()

        step = 0
        total_loss = 0.0
        misc_data = [self.drug_info, self.profile_info, self.train_drug_list]
        batches = make_batches(self.config, self.data.train_data, misc_data)

        for batch in batches:
            self.optimizer.zero_grad()

            pos_drugs, pos_profiles, pos_labels, neg_drugs, neg_profiles, neg_labels = batch

            input_pos_drugs = [to_hgnn_data(self.config, x, 1) for x in pos_drugs]
            input_neg_drugs = [to_hgnn_data(self.config, x, 0) for x in neg_drugs]

            input_pos_profiles = torch.cuda.FloatTensor(pos_profiles)
            input_neg_profiles = torch.cuda.FloatTensor(neg_profiles)

            # Forward passes for the positive and negative samples.
            drug_embeddings, profile_embeddings, _ = self.model(input_pos_profiles, input_pos_drugs)
            neg_drug_embeddings, profile_embeddings, _ = self.model(input_neg_profiles, input_neg_drugs)

            loss = self.criterion(profile_embeddings, drug_embeddings, neg_drug_embeddings)

            loss.backward()
            total_loss += loss.item()
            self.optimizer.step()
            step += 1
This is basically the code that I’m running. I’ve omitted some details that aren’t directly related to this problem. The functions make_batches and to_hgnn_data are defined in a separate utils.py module.

If your loss stays constant throughout training, this might point to a tensor being detached during the forward pass.
Could you check the .grad_fn of the model outputs and look for detach() calls or .data usage?
If possible, could you post the code so we can have a look?
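
Something along these lines is what I mean (toy tensors only, not your model):

import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).detach()   # detached here: the graph is cut
z = x * 2              # still attached to the graph

print(y.grad_fn)       # None -> no gradient will flow back through y
print(z.grad_fn)       # <MulBackward0 ...> -> gradients can flow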

Hi, thanks for the reply. The model outputs a total of three values, and I’m only using two right now:

output1, output2, _ = self.model(input_profiles_pos, input_drugs_pos)
neg_output1, output2, _ = self.model(input_profiles_neg, input_drugs_neg)

loss = self.criterion(output2, output1, neg_output1)

Just for some random background information, this is a bioinformatics project and profiles and drugs refer to genes and chemical compounds. self.criterion is triplet loss.
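
In case it helps, my TripletLoss takes the embeddings in the same (anchor, positive, negative) order as the built-in torch.nn.TripletMarginLoss, so a rough stand-in looks like this (the shapes are made up):

import torch

# nn.TripletMarginLoss as a stand-in for my custom TripletLoss
criterion = torch.nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(8, 128, requires_grad=True)    # profile embeddings
positive = torch.randn(8, 128, requires_grad=True)  # positive drug embeddings
negative = torch.randn(8, 128, requires_grad=True)  # negative drug embeddings

loss = criterion(anchor, positive, negative)
loss.backward()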

Each of the outputs’ .grad_fn returns:

>>> output1.grad_fn
<CopyBackwards object at 0x7f2caeee7278>
>>> neg_output1.grad_fn
<CopyBackwards object at 0x7f2caeee7278>

The code that I’m using is fairly long, but I’ll edit it into the original question.

I’m also not using any .detach or .data calls in the code.

I’m not sure if this is relevant, but I also checked the .grad attribute of the outputs and noticed that:

>>> type(output1.grad)
<class 'NoneType'>
>>> type(neg_output1.grad)
<class 'NoneType'>

Perhaps this is related to the problem?

Skimming through the code I cannot find any obvious errors, so we would need your model definition to debug it.

No, the .grad attribute won’t be kept for non-leaf tensors by default.
If you want to inspect this gradient, you would have to call output1.retain_grad() before calling backward.
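
A quick toy example (again not tied to your model):

import torch

x = torch.randn(2, requires_grad=True)   # leaf tensor
y = x * 3                                # non-leaf tensor
y.retain_grad()                          # ask autograd to keep y.grad

y.sum().backward()
print(x.grad)   # populated by default, because x is a leaf
print(y.grad)   # populated only because retain_grad() was called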