Facing NaN values in the learning process with anomaly detection in PyTorch

The code segment is the following:

import itertools

import torch
import torch.nn.functional as F
from torch.cuda.amp import GradScaler, autocast

use_Mixed = True

with torch.autograd.detect_anomaly():
    graph = graph.to(device)
    input_feature_dim = feature.size(1)
    # print(input_feature_dim)
    net = gnn.GCN(input_feature_dim, args.dim, args.category)
    net.to(device)

    optimizer = torch.optim.Adam(itertools.chain(net.parameters()), lr=0.003, eps=.001, weight_decay=1e-2)

    scaler = GradScaler(growth_interval=20, growth_factor=100, backoff_factor=0.5)
    print(scaler.state_dict())

    nan_epoch = 0
    grads = []

    net.train()

    for epoch in range(200):

        with autocast(enabled=use_Mixed):
            logits = net(graph, feature)
            # logp = F.log_softmax(logits, 1)
            # loss = F.nll_loss(logp[train_id], train_y_label)

            logp = F.softmax(logits, 1)
            loss = F.cross_entropy(logp[train_id], train_y_label) + 1e-7

        if use_Mixed:
            print(loss.dtype)
            # print("logits: ", logits)
            print(" logp ", logp)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()

        # gradient collect
        for p in net.parameters():
            print(p.grad.view(-1))

        optimizer.zero_grad()

The loss becomes NaN, and with anomaly detection enabled, the following is shown:

UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
/home/datalab/.local/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in LogSoftmaxBackward0. Traceback of forward call that caused the error:
  File "GCN_dgl.py", line 141, in <module>
    loss = F.cross_entropy(logp[train_id], train_y_label)+ 1e-7
  File "/home/datalab/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "GCN_dgl.py", line 147, in <module>
    scaler.scale(loss).backward()
  File "/home/datalab/.local/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/datalab/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.

So, what combination of loss functions can I use instead, or should I change the GradScaler settings or add some gradient clipping?
The GradScaler was already tweaked a bit, but I later found that the problem is in the loss.
Thank you.

F.cross_entropy expects raw logits as its input while you are passing probabilities to this loss function. Also, logp is a misleading name since F.log_softmax would create log probabilities.
Note that the GradScaler can create invalid gradients if the scaling factor is too high. Once this is detected, the scaling factor will be decreased and next iterations should not create overflowing gradients anymore. Using anomaly detection in amp from the beginning of the training might thus not be a good idea.
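
As a minimal sketch of that suggestion (reusing the names from the posted loop, so net, graph, feature, train_id, train_y_label, use_Mixed, scaler, and optimizer are assumed to exist), the loss computation would pass the raw logits directly:

with autocast(enabled=use_Mixed):
    logits = net(graph, feature)
    # F.cross_entropy applies log_softmax internally, so no softmax/log_softmax here
    loss = F.cross_entropy(logits[train_id], train_y_label)

scaler.scale(loss).backward()
scaler.step(optimizer)  # the step is skipped if inf/NaN gradients are found
scaler.update()         # the scale factor is reduced after a skipped step
optimizer.zero_grad()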

How can I make adjustments?
If I use the following, I get the same NaN in the loss and then in the gradients.

logits = net(graph, feature)
logp = F.log_softmax(logits, 1)
loss = F.nll_loss(logp[train_id], train_y_label)

If the loss is already NaN, check whether the logits are also invalid. If so, narrow down which operation creates the invalid outputs inside the forward method of your model.

The logits are valid, but the loss is invalid.
Some prints (Train_Loss is the scaled loss):

Epoch 163 | Train_Loss: 5692348.000000 | loss 3.643103
torch.float32
logits:  tensor([[ 0.3500, -0.0361, -0.1291,  ...,  0.1045, -0.0760,  0.0441],
        [ 0.2364, -0.0687, -0.0700,  ...,  0.0250, -0.0508, -0.0074],
        [ 0.4655, -0.0218, -0.1786,  ...,  0.1280, -0.1019,  0.0736],
        ...,
        [ 0.8182, -0.3446,  0.8706,  ..., -0.6142, -0.1828,  0.1664],
        [ 0.3388, -0.0269, -0.1201,  ...,  0.0948, -0.0739,  0.0373],
        [ 0.3903, -0.0325, -0.1476,  ...,  0.1213, -0.0867,  0.0581]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0343, 0.0233, 0.0212,  ..., 0.0268, 0.0224, 0.0252],
        [0.0310, 0.0228, 0.0228,  ..., 0.0251, 0.0232, 0.0243],
        [0.0381, 0.0234, 0.0200,  ..., 0.0272, 0.0216, 0.0257],
        ...,
        [0.0006, 0.0002, 0.0007,  ..., 0.0002, 0.0002, 0.0003],
        [0.0340, 0.0236, 0.0215,  ..., 0.0266, 0.0225, 0.0251],
        [0.0355, 0.0233, 0.0207,  ..., 0.0271, 0.0220, 0.0255]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.103240
Epoch 164 | Train_Loss: 5692348.000000 | loss 3.643103
torch.float32
logits:  tensor([[ 0.3586, -0.0380, -0.1317,  ...,  0.1071, -0.0782,  0.0441],
        [ 0.2399, -0.0689, -0.0712,  ...,  0.0268, -0.0518, -0.0072],
        [ 0.4774, -0.0244, -0.1818,  ...,  0.1304, -0.1045,  0.0734],
        ...,
        [ 0.8164, -0.3412,  0.8585,  ..., -0.6064, -0.1850,  0.1634],
        [ 0.3458, -0.0285, -0.1219,  ...,  0.0964, -0.0756,  0.0370],
        [ 0.4001, -0.0347, -0.1504,  ...,  0.1240, -0.0890,  0.0580]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0345, 0.0232, 0.0211,  ..., 0.0268, 0.0223, 0.0252],
        [0.0311, 0.0228, 0.0228,  ..., 0.0251, 0.0232, 0.0243],
        [0.0384, 0.0233, 0.0199,  ..., 0.0272, 0.0215, 0.0257],
        ...,
        [0.0006, 0.0002, 0.0007,  ..., 0.0002, 0.0002, 0.0003],
        [0.0342, 0.0235, 0.0214,  ..., 0.0266, 0.0224, 0.0251],
        [0.0358, 0.0232, 0.0206,  ..., 0.0272, 0.0219, 0.0254]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.102102
Epoch 165 | Train_Loss: 5690994.500000 | loss 3.642236
torch.float32
logits:  tensor([[ 0.3679, -0.0400, -0.1345,  ...,  0.1098, -0.0805,  0.0441],
        [ 0.2436, -0.0692, -0.0726,  ...,  0.0286, -0.0528, -0.0071],
        [ 0.4897, -0.0272, -0.1853,  ...,  0.1328, -0.1074,  0.0732],
        ...,
        [ 0.8156, -0.3383,  0.8464,  ..., -0.5990, -0.1871,  0.1604],
        [ 0.3530, -0.0303, -0.1240,  ...,  0.0980, -0.0774,  0.0366],
        [ 0.4107, -0.0371, -0.1536,  ...,  0.1268, -0.0917,  0.0580]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0348, 0.0231, 0.0211,  ..., 0.0269, 0.0222, 0.0252],
        [0.0312, 0.0228, 0.0227,  ..., 0.0251, 0.0232, 0.0243],
        [0.0388, 0.0232, 0.0198,  ..., 0.0272, 0.0214, 0.0256],
        ...,
        [0.0006, 0.0002, 0.0007,  ..., 0.0002, 0.0002, 0.0003],
        [0.0344, 0.0235, 0.0214,  ..., 0.0267, 0.0224, 0.0251],
        [0.0361, 0.0231, 0.0205,  ..., 0.0272, 0.0218, 0.0254]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.102901
Epoch 166 | Train_Loss: 5689459.000000 | loss 3.641254
torch.float32
logits:  tensor([[ 0.3773, -0.0422, -0.1376,  ...,  0.1124, -0.0831,  0.0440],
        [ 0.2477, -0.0696, -0.0741,  ...,  0.0304, -0.0540, -0.0070],
        [ 0.5024, -0.0302, -0.1891,  ...,  0.1351, -0.1105,  0.0729],
        ...,
        [ 0.8138, -0.3359,  0.8343,  ..., -0.5911, -0.1891,  0.1574],
        [ 0.3606, -0.0321, -0.1262,  ...,  0.0996, -0.0793,  0.0361],
        [ 0.4215, -0.0397, -0.1572,  ...,  0.1297, -0.0946,  0.0578]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0351, 0.0231, 0.0210,  ..., 0.0269, 0.0221, 0.0251],
        [0.0313, 0.0228, 0.0227,  ..., 0.0252, 0.0232, 0.0243],
        [0.0392, 0.0230, 0.0197,  ..., 0.0272, 0.0213, 0.0255],
        ...,
        [0.0006, 0.0002, 0.0007,  ..., 0.0002, 0.0002, 0.0003],
        [0.0347, 0.0234, 0.0213,  ..., 0.0267, 0.0223, 0.0251],
        [0.0364, 0.0230, 0.0204,  ..., 0.0272, 0.0217, 0.0253]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.103105
Epoch 167 | Train_Loss: 5687721.000000 | loss 3.640141
torch.float32
logits:  tensor([[ 0.3874, -0.0446, -0.1411,  ...,  0.1151, -0.0859,  0.0439],
        [ 0.2521, -0.0701, -0.0758,  ...,  0.0322, -0.0553, -0.0070],
        [ 0.5159, -0.0336, -0.1933,  ...,  0.1375, -0.1139,  0.0724],
        ...,
        [ 0.8125, -0.3326,  0.8230,  ..., -0.5838, -0.1909,  0.1543],
        [ 0.3684, -0.0342, -0.1287,  ...,  0.1012, -0.0814,  0.0357],
        [ 0.4331, -0.0426, -0.1611,  ...,  0.1325, -0.0977,  0.0575]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0354, 0.0230, 0.0209,  ..., 0.0270, 0.0221, 0.0251],
        [0.0314, 0.0228, 0.0227,  ..., 0.0252, 0.0231, 0.0243],
        [0.0397, 0.0229, 0.0195,  ..., 0.0272, 0.0211, 0.0255],
        ...,
        [0.0006, 0.0002, 0.0007,  ..., 0.0002, 0.0002, 0.0003],
        [0.0349, 0.0233, 0.0212,  ..., 0.0267, 0.0223, 0.0250],
        [0.0368, 0.0229, 0.0203,  ..., 0.0272, 0.0216, 0.0253]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.102965
Epoch 168 | Train_Loss: 5685748.000000 | loss 3.638879
torch.float32
logits:  tensor([[ 0.3982, -0.0473, -0.1450,  ...,  0.1179, -0.0890,  0.0437],
        [ 0.2567, -0.0708, -0.0777,  ...,  0.0340, -0.0568, -0.0069],
        [ 0.5303, -0.0372, -0.1981,  ...,  0.1399, -0.1177,  0.0719],
        ...,
        [ 0.8111, -0.3305,  0.8112,  ..., -0.5764, -0.1931,  0.1508],
        [ 0.3768, -0.0365, -0.1314,  ...,  0.1027, -0.0838,  0.0351],
        [ 0.4454, -0.0458, -0.1654,  ...,  0.1354, -0.1012,  0.0572]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0357, 0.0229, 0.0208,  ..., 0.0270, 0.0219, 0.0251],
        [0.0316, 0.0228, 0.0226,  ..., 0.0253, 0.0231, 0.0243],
        [0.0401, 0.0227, 0.0194,  ..., 0.0272, 0.0210, 0.0254],
        ...,
        [0.0006, 0.0002, 0.0006,  ..., 0.0002, 0.0002, 0.0003],
        [0.0352, 0.0233, 0.0212,  ..., 0.0268, 0.0222, 0.0250],
        [0.0371, 0.0227, 0.0202,  ..., 0.0272, 0.0215, 0.0252]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.101241
Epoch 169 | Train_Loss: nan | loss nan
torch.float32
logits:  tensor([[ 0.3982, -0.0473, -0.1450,  ...,  0.1179, -0.0890,  0.0437],
        [ 0.2567, -0.0708, -0.0777,  ...,  0.0340, -0.0568, -0.0069],
        [ 0.5303, -0.0372, -0.1981,  ...,  0.1399, -0.1177,  0.0719],
        ...,
        [ 0.8111, -0.3305,  0.8112,  ..., -0.5764, -0.1931,  0.1508],
        [ 0.3768, -0.0365, -0.1314,  ...,  0.1027, -0.0838,  0.0351],
        [ 0.4454, -0.0458, -0.1654,  ...,  0.1354, -0.1012,  0.0572]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0357, 0.0229, 0.0208,  ..., 0.0270, 0.0219, 0.0251],
        [0.0316, 0.0228, 0.0226,  ..., 0.0253, 0.0231, 0.0243],
        [0.0401, 0.0227, 0.0194,  ..., 0.0272, 0.0210, 0.0254],
        ...,
        [0.0006, 0.0002, 0.0006,  ..., 0.0002, 0.0002, 0.0003],
        [0.0352, 0.0233, 0.0212,  ..., 0.0268, 0.0222, 0.0250],
        [0.0371, 0.0227, 0.0202,  ..., 0.0272, 0.0215, 0.0252]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.102409
Epoch 170 | Train_Loss: nan | loss nan
torch.float32
logits:  tensor([[ 0.3982, -0.0473, -0.1450,  ...,  0.1179, -0.0890,  0.0437],
        [ 0.2567, -0.0708, -0.0777,  ...,  0.0340, -0.0568, -0.0069],
        [ 0.5303, -0.0372, -0.1981,  ...,  0.1399, -0.1177,  0.0719],
        ...,
        [ 0.8111, -0.3305,  0.8112,  ..., -0.5764, -0.1931,  0.1508],
        [ 0.3768, -0.0365, -0.1314,  ...,  0.1027, -0.0838,  0.0351],
        [ 0.4454, -0.0458, -0.1654,  ...,  0.1354, -0.1012,  0.0572]],
       device='cuda:0', grad_fn=<AddBackward0>)
 logp  tensor([[0.0357, 0.0229, 0.0208,  ..., 0.0270, 0.0219, 0.0251],
        [0.0316, 0.0228, 0.0226,  ..., 0.0253, 0.0231, 0.0243],
        [0.0401, 0.0227, 0.0194,  ..., 0.0272, 0.0210, 0.0254],
        ...,
        [0.0006, 0.0002, 0.0006,  ..., 0.0002, 0.0002, 0.0003],
        [0.0352, 0.0233, 0.0212,  ..., 0.0268, 0.0222, 0.0250],
        [0.0371, 0.0227, 0.0202,  ..., 0.0272, 0.0215, 0.0252]],
       device='cuda:0', grad_fn=<SoftmaxBackward0>)
time per epoch 0:00:00.101980

You are only printing a subset of the logits, so it’s not a proper verification.
Use e.g. torch.isfinite(logits).all() as a check.
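
For example, a full-tensor check inside the training loop could look like this (a sketch, reusing the names from the posted code):

logits = net(graph, feature)
if not torch.isfinite(logits).all():
    # count how many rows actually contain NaN/Inf values
    bad_rows = (~torch.isfinite(logits)).any(dim=1).sum().item()
    print(f"Epoch {epoch}: {bad_rows} rows of the logits are non-finite")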

You are right, this is what I get:

Epoch 84 | Train_Loss: 14587713.000000 | loss 3.646928
torch.isfinite returns:  tensor(True, device='cuda:0')
 logp  tensor([[2.5320e-02, 2.0336e-02, 2.1865e-02,  ..., 3.0545e-02, 2.7066e-02,
         2.4068e-02],
        [1.8954e-02, 2.2023e-02, 1.9955e-02,  ..., 3.2436e-02, 3.0565e-02,
         2.3787e-02],
        [3.5442e-02, 2.0563e-02, 2.3970e-02,  ..., 2.8363e-02, 2.3343e-02,
         2.2812e-02],
        ...,
        [1.0605e-04, 2.5010e-05, 8.7593e-05,  ..., 5.0157e-04, 1.6975e-03,
         2.3800e-04],
        [3.2884e-02, 2.0917e-02, 2.5077e-02,  ..., 2.7369e-02, 2.4755e-02,
         2.2684e-02],
        [2.7777e-02, 1.9709e-02, 2.2268e-02,  ..., 3.0124e-02, 2.6083e-02,
         2.3988e-02]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
Epoch 85 | Train_Loss: 14587713.000000 | loss 3.646928
torch.isfinite returns:  tensor(False, device='cuda:0')
 logp  tensor([[2.5388e-02, 2.0297e-02, 2.1842e-02,  ..., 3.0585e-02, 2.7019e-02,
         2.4039e-02],
        [1.8953e-02, 2.1983e-02, 1.9940e-02,  ..., 3.2431e-02, 3.0504e-02,
         2.3766e-02],
        [3.5676e-02, 2.0516e-02, 2.3937e-02,  ..., 2.8402e-02, 2.3301e-02,
         2.2794e-02],
        ...,
        [1.2087e-04, 2.9295e-05, 9.7856e-05,  ..., 5.4582e-04, 1.8230e-03,
         2.6464e-04],
        [3.3005e-02, 2.0916e-02, 2.5033e-02,  ..., 2.7390e-02, 2.4720e-02,
         2.2674e-02],
        [2.7883e-02, 1.9672e-02, 2.2246e-02,  ..., 3.0162e-02, 2.6029e-02,
         2.3956e-02]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
Epoch 86 | Train_Loss: nan | loss nan
torch.isfinite returns:  tensor(False, device='cuda:0')
 logp  tensor([[2.5388e-02, 2.0297e-02, 2.1842e-02,  ..., 3.0585e-02, 2.7019e-02,
         2.4039e-02],
        [1.8953e-02, 2.1983e-02, 1.9940e-02,  ..., 3.2431e-02, 3.0504e-02,
         2.3766e-02],
        [3.5676e-02, 2.0516e-02, 2.3937e-02,  ..., 2.8402e-02, 2.3301e-02,
         2.2794e-02],
        ...,
        [1.2087e-04, 2.9295e-05, 9.7856e-05,  ..., 5.4582e-04, 1.8230e-03,
         2.6464e-04],
        [3.3005e-02, 2.0916e-02, 2.5033e-02,  ..., 2.7390e-02, 2.4720e-02,
         2.2674e-02],
        [2.7883e-02, 1.9672e-02, 2.2246e-02,  ..., 3.0162e-02, 2.6029e-02,
         2.3956e-02]], device='cuda:0', grad_fn=<SoftmaxBackward0>)

Thanks for confirming. In that case check where these invalid values are created in the forward method of your model.

Is it definite that the problem lies in the model? I did get good results for other (smaller) datasets, but for this case I get the problem.

Then again, the problem must lie in the model, since the logits are returned from it. Is there any specific technique to debug the model?

You could use the same workflow and add print statements to check if the intermediate tensors contain valid values to isolate the operation/layer.
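
If adding print statements inside the forward is inconvenient, one possible sketch (assuming only standard PyTorch; net, graph, and feature are the names from the posted code) is to register forward hooks that flag any module producing non-finite outputs:

def make_finite_check(name):
    def hook(module, inputs, output):
        # handle modules that return a single tensor or a tuple/list of tensors
        outs = output if isinstance(output, (tuple, list)) else (output,)
        for out in outs:
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                print(f"Non-finite output in module '{name}' ({module.__class__.__name__})")
    return hook

handles = [m.register_forward_hook(make_finite_check(name)) for name, m in net.named_modules()]
logits = net(graph, feature)  # one forward pass with the checks attached
for h in handles:
    h.remove()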
