Gradient value is nan

Hi team,
Please see the code below:

x.requires_grad = True
loss.backward()
print(x.grad)

Output:
tensor([ 1.0545e-05, 9.5438e-06, -8.3444e-06, …, nan,
nan, nan])
How can I resolve this NaN problem? I am unable to find the range of x.grad.
Please help me resolve this issue.

5 Likes

Perhaps this is due to exploding gradients? I'd recommend first trying gradient clipping and seeing how the training goes.
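For reference, a minimal gradient-clipping sketch (the tiny model, optimizer, and loss here are placeholders for your own setup):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
# rescale all gradients in place so that their global norm does not exceed max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()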

Thanks for the answer. Actually, I am trying to perform an adversarial attack, so I don't have to do any training. The strange thing is that when I calculate the gradients with respect to the original input I get tensor([0., 0., 0., …, nan, nan, nan]) as the result, but if I make very small changes to the input, the gradients turn out to be fine, in the range between tensor(-0.0501) and tensor(0.0580).
Could you please help me figure out this issue?

You could add torch.autograd.set_detect_anomaly(True) at the beginning of your script to get an error with a stack trace, which should point to the operation that created the NaNs and help with debugging the issue.
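A minimal self-contained illustration (the pow call here just stands in for whatever operation creates the NaN in your model):

import torch

torch.autograd.set_detect_anomaly(True)  # put this once at the top of the script

x = torch.tensor(-1., requires_grad=True)
out = x.pow(-0.5)  # the forward pass already produces a NaN here
out.backward()     # raises a RuntimeError; its stack trace points to the pow call above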

20 Likes

Hey, I am also having NaN issues with the gradient. I tried gradient clipping and converted my ReLU activations to LeakyReLU, but no progress. Any suggestions would be great. Thanks.

Assuming that the forward pass does not create invalid outputs, you could register hooks to the parameters of the model and print the gradients during the backward pass in order to isolate which gradient gets the first invalid value.
This could make it easier to debug the issue further and check the operations used to create this gradient (e.g. are you dividing by a small number?).
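A rough sketch of such hooks (the small model here is just a placeholder for your own):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

def make_hook(name):
    def hook(grad):
        # runs during the backward pass, right after this gradient is computed
        if not torch.isfinite(grad).all():
            print(f'invalid gradient in {name}: {grad}')
    return hook

for name, param in model.named_parameters():
    param.register_hook(make_hook(name))

loss = model(torch.randn(4, 10)).mean()
loss.backward()  # prints each parameter whose gradient contains NaN/Inf, if any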

1 Like

Hey, any idea why the gradient becomes NaN? I mean, which mathematical operation causes it?

Invalid outputs can create NaN gradients:

import torch

x = torch.randn(1, requires_grad=True)
y = x / 0.  # division by zero creates Inf
y = y / y   # Inf / Inf creates NaN
y.backward()
print(x.grad)
# tensor([nan])
1 Like

Yes, that is true, but my case is different. In my case, y is a valid output. But when I call y.backward(), one of the components of the gradient is NaN; it traces back to the input during backpropagation and most parameters become NaN.
I think that, since one component of the gradient is NaN, the partial derivative is NaN, so the resulting update produces NaN. Any idea how it can produce NaN? Maybe my conclusion is wrong :face_with_hand_over_mouth:

Use torch.autograd.detect_anomaly to check which layer is creating the invalid gradients, then check its operations and inputs.

1 Like

I’m rereading this part and am unsure how to understand it. Are you already seeing invalid values in y before calling backward? Or do you see the first invalid gradient somewhere later in the model?

Also, once you’ve narrowed down the layer or parameter where the first NaN is created, check if something could overflow (and then create NaNs somehow).
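One way to narrow this down is a forward hook that checks every module's output for non-finite values (a sketch, assuming a plain nn.Module; the model below is a placeholder):

import torch
import torch.nn as nn

def check_output(module, inputs, output):
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        print(f'non-finite output in {module.__class__.__name__}')

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
for module in model.modules():
    module.register_forward_hook(check_output)

out = model(torch.randn(4, 10))  # prints the first layer that overflows, if any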

1 Like

Sorry, I had only checked for NaNs using torch.isnan(tensor).any(). Actually, I must also take torch.isinf(tensor).any() into account. Before the loss becomes NaN, there is actually a float('infinity') in the outputs:

for images, targets in dataloader['train']:
    images, targets = images.to(device), targets.to(device)
    outputs = model(images)  # some elements are infinity
    loss = cross_entropy(outputs, targets)  # loss is NaN
    ...

Simple test:

import torch
import torch.nn as nn

input_ = torch.tensor([[1, float('infinity'), 6],
                       [3, 5, 2],
                       [10, 12, 4]])
criterion = nn.CrossEntropyLoss()
loss = criterion(input_, torch.tensor([0, 2, 1]))  # loss is NaN

If your training diverges and the output overflows, the Inf can easily become a NaN value, so you can also check that the tensors are valid via torch.isfinite.
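For example, torch.isfinite catches both cases in a single check:

import torch

outputs = torch.tensor([1.0, float('inf'), float('nan')])
print(torch.isnan(outputs).any())     # tensor(True)
print(torch.isinf(outputs).any())     # tensor(True)
print(torch.isfinite(outputs).all())  # tensor(False), flags Inf as well as NaN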

2 Likes

FYI, torch.autograd.set_detect_anomaly(True) does not seem to work with pytorch_geometric (at least I think so).

I am using pytorch_geometric and got a message that gcn_norm raises an exception.

def gcn_norm(edge_index, edge_weight=None, num_nodes=None, improved=False,
             add_self_loops=True, flow="source_to_target", dtype=None):

    fill_value = 2. if improved else 1.

    if isinstance(edge_index, SparseTensor):
        assert edge_index.size(0) == edge_index.size(1)

        adj_t = edge_index

        if not adj_t.has_value():
            adj_t = adj_t.fill_value(1., dtype=dtype)
        if add_self_loops:
            adj_t = torch_sparse.fill_diag(adj_t, fill_value)

        deg = torch_sparse.sum(adj_t, dim=1)
        deg_inv_sqrt = deg.pow_(-0.5)
        deg_inv_sqrt.masked_fill_(deg_inv_sqrt == float('inf'), 0.)
        adj_t = torch_sparse.mul(adj_t, deg_inv_sqrt.view(-1, 1))
        adj_t = torch_sparse.mul(adj_t, deg_inv_sqrt.view(1, -1))

        return adj_t

    if is_torch_sparse_tensor(edge_index):
        assert edge_index.size(0) == edge_index.size(1)

        if edge_index.layout == torch.sparse_csc:
            raise NotImplementedError("Sparse CSC matrices are not yet "
                                      "supported in 'gcn_norm'")

        adj_t = edge_index
        if add_self_loops:
            adj_t, _ = add_self_loops_fn(adj_t, None, fill_value, num_nodes)

        edge_index, value = to_edge_index(adj_t)
        col, row = edge_index[0], edge_index[1]

        deg = scatter(value, col, 0, dim_size=num_nodes, reduce='sum')
        deg_inv_sqrt = deg.pow_(-0.5)
        deg_inv_sqrt.masked_fill_(deg_inv_sqrt == float('inf'), 0)
        value = deg_inv_sqrt[row] * value * deg_inv_sqrt[col]

        return set_sparse_value(adj_t, value), None

    assert flow in ['source_to_target', 'target_to_source']
    num_nodes = maybe_num_nodes(edge_index, num_nodes)

    if add_self_loops:
        edge_index, edge_weight = add_remaining_self_loops(
            edge_index, edge_weight, fill_value, num_nodes)

    if edge_weight is None:
        edge_weight = torch.ones((edge_index.size(1), ), dtype=dtype,
                                 device=edge_index.device)

    row, col = edge_index[0], edge_index[1]
    idx = col if flow == 'source_to_target' else row
    deg = scatter(edge_weight, idx, dim=0, dim_size=num_nodes, reduce='sum')
    deg_inv_sqrt = deg.pow_(-0.5)
    deg_inv_sqrt.masked_fill_(deg_inv_sqrt == float('inf'), 0)
    edge_weight = deg_inv_sqrt[row] * edge_weight * deg_inv_sqrt[col]

    return edge_index, edge_weight

I was told that the last four lines, starting with deg_inv_sqrt = deg.pow_(-0.5), cause the NaN. However, in the gcn_norm implementation, the NaN is masked out and does not influence the training, but PyTorch still raises an exception.

I don't see where this would happen, as you are masking Inf values, not NaNs. Even if you masked the NaNs, the operation creating the NaN is still used, anomaly detection raises a proper error as expected, and the backward pass will contain invalid gradients, as seen here:

import torch

x = torch.tensor(-1., requires_grad=True)
deg = x * 2

deg_inv_sqrt = deg.pow_(-0.5)  # (-2.) ** -0.5 creates the NaN
deg_inv_sqrt = torch.nan_to_num(deg_inv_sqrt, nan=0.)

deg_inv_sqrt.mean().backward()
print(x.grad)
# tensor(nan)
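If you need to get rid of the invalid gradient in such a case, one possible workaround (my sketch, not the pytorch_geometric code) is to clamp the value before the pow, so the NaN is never created in the first place:

import torch

x = torch.tensor(-1., requires_grad=True)
deg = x * 2

# clamping before the pow avoids computing (-2.) ** -0.5 entirely
deg_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)

deg_inv_sqrt.mean().backward()
print(x.grad)
# tensor(0.)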