Anomaly detection

I'm running into a NaN loss issue in my training, so now I'm trying to use anomaly detection in autograd for debugging. I found two APIs, torch.autograd.detect_anomaly and torch.autograd.set_detect_anomaly, but I'm getting different results with them. Method 1 gives no feedback and the training completes successfully, but method 2 always throws a RuntimeError at the same iteration and reports the NaN gradient information. Why is this happening? Am I using anomaly detection correctly?

Method 1:
for i in range(epoch):
    for batch in data_batches:
        with torch.autograd.detect_anomaly():
            output = model(batch)
            loss = calc_loss(output, label)
            loss.backward()
            optimizer.step()
    validate_performance()    
    save_model()

Method 2:
torch.autograd.set_detect_anomaly(True)
for i in range(epoch):
    for batch in data_batches:
        output = model(batch)
        loss = calc_loss(output, label)
        loss.backward()
        optimizer.step()
    validate_performance()    
    save_model()
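For reference, here is a minimal, self-contained case (the sqrt example is purely illustrative, not from my model) where the forward pass is finite but the backward produces NaN, which is exactly the situation method 2 is supposed to catch:

```python
# Minimal sketch: sqrt has an infinite gradient at 0, and multiplying by 0
# turns it into NaN during backward (0 * inf = NaN), even though the
# forward pass is perfectly finite.
import torch

torch.autograd.set_detect_anomaly(True)

x = torch.zeros(3, requires_grad=True)
y = (torch.sqrt(x) * 0.0).sum()  # forward is fine: y == 0

try:
    y.backward()
except RuntimeError as e:
    # Anomaly mode raises and points at the forward op whose backward
    # produced the NaN (here, the sqrt).
    print("caught:", e)
```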

Hi,

The two should behave the same as long as the issue happens within the context manager. If it comes from validation/saving, then method 1 might miss it.
Also, your training is most likely not deterministic, so if the NaNs appear due to the optimization behavior, they might appear at different points in the training.

The second one should allow you to find and fix the issue, no?
Also, a good sanity check is to monitor the loss and make sure it does not diverge (NaN can be caused by a loss that diverged).
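A minimal guard for that sanity check might look like this (the helper is an assumption sketched for illustration, not part of the thread's code; you would call it right after computing the loss each iteration):

```python
import torch

def assert_finite_loss(loss, step):
    # Fail fast with the step number instead of silently training on
    # NaN/inf; torch.isfinite is False for both.
    if not torch.isfinite(loss):
        raise RuntimeError(f"loss became {loss.item()} at step {step}")
```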

Thank you for the quick response.
1. The loss looks quite normal: it starts around 40 and keeps decreasing. But at some point during training it suddenly becomes NaN.
2. What do you mean by "method 1 may miss validation/saving"? I thought anomaly detection monitors gradients and finds the forward operation that created the failing backward. Since there is no backward pass in validation or model saving, how could they affect anomaly detection?
3. I fixed the random seed, so I think the experiment is deterministic, which explains why the error repeats at the same place.
4. Method 2 says that "ExpandBackward" returned NaN values in its 0th output. This relates to the following line:
logits = F.linear(F.normalize(emb), F.normalize(weight))
I guess the normalization computation has some issue. How can I interpret the error more specifically?

Besides that, I also tried using hooks to monitor the gradients.
Since method 2 suggests the issue is with:
logits = F.linear(F.normalize(emb), F.normalize(weight))
I defined a tensor hook to check whether the gradients of emb and weight contain NaN. The strange thing is that when I turn on the hook, anomaly detection gives no error, but when I turn it off, anomaly detection throws a RuntimeError at the same iteration.
Is there any conflict between hooks and anomaly detection?
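For concreteness, the kind of tensor hook I mean can be sketched like this (nan_hook and the toy tensor are illustrative names, not my actual training code):

```python
import torch

def nan_hook(name):
    def hook(grad):
        # Tensor hooks receive the gradient during backward; we only
        # inspect it and return it unchanged, so backward is unaffected.
        n = int(torch.isnan(grad).sum())
        if n:
            print(f"{name}: {n} NaN entries in gradient")
        return grad
    return hook

# Toy example: the backward of sqrt at 0 produces NaN (0 * inf), which
# the hook registered on x observes.
x = torch.zeros(3, requires_grad=True)
x.register_hook(nan_hook("x"))
(torch.sqrt(x) * 0.0).sum().backward()
```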

  1. It does only check what happens during backward. But I don't know what your functions are doing, so I just mentioned it :)
  2. It is more complex than that. Changing anything in the code will change the behavior, and anomaly mode, which adds a lot of checks, might change the ordering of some ops (not that it always will, but it is a possibility).
  3. The way I would debug this is to split all of these into separate lines to see which one is actually faulty. Then you can print the inputs/outputs and see if anything looks off. With normalization, for example, you can run into issues when all values converge to the same thing and the variance becomes 0.
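On the normalization point specifically: F.normalize divides each row by its L2 norm, so a row whose norm is (near) zero is the classic way its backward can produce NaN, since the gradient of the norm at 0 involves 0/0. A sketch of such a check (the helper name is assumed; emb and weight are the names from the thread, and you would call this on both each iteration):

```python
import torch

def zero_norm_rows(name, t, eps=1e-6):
    # Rows with ~zero L2 norm are candidates for NaN gradients under
    # F.normalize: the eps argument keeps the forward finite, but the
    # norm's own gradient at 0 can still be NaN.
    norms = t.norm(dim=1)
    bad = (norms <= eps).nonzero().flatten().tolist()
    if bad:
        print(f"{name}: rows {bad} have ~zero norm")
    return bad
```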

Is there any conflict between hooks and anomaly detection?

No, there shouldn't be. But again, since these introduce more ops, you might end up with a slightly different execution that does not exhibit the issue.

OK I see. Thanks for the detailed explanations.
By saying “split all of these into different lines”, do you mean this?

norm_emb = F.normalize(emb)
norm_w = F.normalize(weight)
logits = F.linear(norm_emb, norm_w)

I checked the values of the original and normalized emb/weights and didn't find any problem. But according to the hook, the gradients of weight are NaN (not all of them, only some rows). After that, the training still goes on successfully for several iterations, then the NaN gradients of weight appear repeatedly, eventually leading to a NaN loss.
The feedback given by the hook and by anomaly detection is different: "gradient of weights is NaN" vs. "normalization of weights has an issue". I don't know whether they refer to the same problem or which one I should believe.

do you mean this?

Yes exactly.

I don’t know if they refer to the same problem or which one should I believe.

These two things won't tell you the exact cause; they only point you in the right direction.
You will have to do the last step yourself by checking the values of your Tensors to figure out when they actually become NaN. But as I mentioned, the optimization driving values toward the same thing could be the reason here.
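That last step could be sketched like this (nan_report is an assumed helper, not from the thread): call it each iteration on emb, weight, their normalized versions, and their .grad tensors to see which one turns NaN first.

```python
import torch

def nan_report(name, t):
    # Report which rows of a 2-D tensor contain NaN, if any.
    mask = torch.isnan(t)
    if not mask.any():
        return []
    rows = mask.any(dim=1).nonzero().flatten().tolist()
    print(f"{name}: NaN in rows {rows}")
    return rows
```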

Thanks a lot, that’s very helpful!