CustomLoss Function: Outputting Inf as a loss in one iteration and NaNs in subsequent iterations of training

Rio1210 · September 4, 2019, 2:48am

Hello all, I am quite new to PyTorch.
I am trying out my custom loss function in a simple single-layer neural network (x = 2, h_1 = 2, output = 2). However, during one iteration of a particular epoch, my loss function first outputs “Inf”, and then “NaN”s in the subsequent iterations. My loss function contains a log; it computes distances and uses them as the argument of the log, so there might be a subtle numerical instability. An Inf arises when the distance between two columns is zero. I could fix that by adding a small epsilon inside the log argument (which I haven’t done).
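The zero-distance case can be reproduced in isolation. The sketch below is only an illustration: `log_distance_loss`, its `eps` parameter, and the example matrix are hypothetical stand-ins, not the loss from the notebook. It assumes a loss of the form "negative log of pairwise column distances".

```python
import torch

def log_distance_loss(X, eps=0.0):
    """Hypothetical stand-in loss: -sum(log(pairwise column distance + eps))."""
    cols = X.T                                   # shape (n_cols, dim)
    n = cols.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)    # every pair of distinct columns
    d = (cols[i] - cols[j]).norm(dim=1)          # Euclidean distance per pair
    return -torch.log(d + eps).sum()

# Columns 0 and 1 are identical, so one pairwise distance is exactly zero.
X = torch.tensor([[1.0, 1.0, 3.0],
                  [2.0, 2.0, 5.0]])

loss_no_eps = log_distance_loss(X)            # log(0) = -inf, so the loss is inf
loss_eps = log_distance_loss(X, eps=1e-8)     # finite, but large

print(loss_no_eps, loss_eps)
```

The value of `eps` is a judgment call: it bounds the loss for coinciding columns, but it also changes the loss slightly everywhere else.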

But the issue I found is this: when the batch X from one iteration goes through the loss function, it outputs “inf”. If I copy the same matrix and pass it through the loss function as a plain torch.tensor(X), it outputs a finite value, not “inf”. I concluded that the difference must be the grad_fn attached to the tensor during training (it prints as grad_fn=<PermuteBackward>), so the backward graph might be the reason. I do not know how that works, but without the permute backward generated inside training, the matrix alone doesn’t produce Inf.

I would be glad if someone could take a look. I’ve linked the .ipynb here.
iPy NoteBook Link

ptrblck · September 4, 2019, 11:28pm

You might have skipped the problematic batch somehow?
If I add torch.autograd.set_detect_anomaly(True) in cell 12:

xArray = []
iter = 0
# enable anomaly detection before the training loop runs
torch.autograd.set_detect_anomaly(True)

the code will raise an exception, pointing to a detected invalid value. If I run the next cells afterwards manually, I’ll get an Inf for customLoss(X.T).
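As a minimal, self-contained sketch of what anomaly detection does (this is not the notebook's code; the tensor and the `sqrt` expression are chosen only because they reliably produce a NaN gradient): with the flag enabled, `backward()` raises a `RuntimeError` naming the backward function that returned NaN, instead of silently writing NaN into the gradients.

```python
import torch

torch.autograd.set_detect_anomaly(True)

x = torch.zeros(1, requires_grad=True)
# Forward is fine (0 * sqrt(0) = 0), but the backward of sqrt at 0
# computes 0 / 0, which is NaN.
y = (torch.sqrt(x) * 0).sum()

raised = False
try:
    y.backward()
except RuntimeError as err:
    raised = True
    print("anomaly detected:", err)
finally:
    # Anomaly detection slows training, so switch it off after debugging.
    torch.autograd.set_detect_anomaly(False)
```

Anomaly mode also prints the traceback of the forward operation that created the failing node, which is usually the quickest way to locate the offending line in a loss function.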

Rio1210 · September 5, 2019, 1:33am

Thanks for taking the time to go through the notebook. I really appreciate it!
I am still a beginner with PyTorch. Could you please elaborate a bit on what you mean, especially regarding torch.autograd.set_detect_anomaly(True)?

Regarding getting the Inf in the last cell: please select and copy the numerical values of the tensor from the cell-17 output and paste them into cell 21, assigning them to the variable X_new = torch.tensor( paste here ). Afterwards, if I apply the loss function to this copy-pasted variable X_new in cell 22, I get a valid number as output. But this same matrix was giving Loss = inf in cell 18 and inside the main training code block.

If it gives Inf in the last cell again, changing the two random.seed() calls at the start once or twice should reproduce this scenario.

After copying and pasting this matrix into cell 21, passing these numbers to the loss function doesn’t generate Inf, unlike in cell 18.

What effect is grad_fn having on these numbers?

Thanks again!
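One way to check whether grad_fn matters is to compare the tensor against a detached copy. grad_fn only records which operation produced a tensor so that backward() can run; it does not change the forward values. A copy-pasted tensor, on the other hand, is built from printed values, and PyTorch prints 4 decimal places by default, so it may not be the same matrix. The sketch below uses made-up tensors (`W`, `X`, `X_pasted` are hypothetical, not the notebook's variables), and whether rounding is what happened in the notebook is an assumption, not something confirmed in this thread.

```python
import torch

torch.manual_seed(0)
W = torch.randn(2, 2, requires_grad=True)
X = (torch.randn(4, 2) @ W).T          # carries a grad_fn, like the training batch

# Same values without autograd history: forward results are identical.
X_detached = X.detach().clone()
same_values = torch.equal(X, X_detached)
print(X.grad_fn, X_detached.grad_fn, same_values)

# Simulate re-typing the printed output (rounded to 4 decimal places).
X_pasted = torch.tensor([[round(v, 4) for v in row]
                         for row in X.detach().tolist()])
max_diff = (X.detach() - X_pasted).abs().max()
print("largest difference after rounding:", max_diff.item())
```

To test with the exact values, pass `X.detach().clone()` to the loss instead of pasting, or raise the print precision with `torch.set_printoptions(precision=10)` before copying.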