The following code snippet works fine in PyTorch 1.1 but raises an error in 1.2:
```python
p = F.softmax(x, dim=1)
m = y != self.ignore_index
t = F.one_hot((y * m.byte()).long(), num_classes=self.num_classes).byte().permute(0, 3, 1, 2)
i = (p * (t * m.unsqueeze(1).byte()).float()).sum((0, 2, 3))
u = ((p + t.float()) * m.unsqueeze(1).float()).sum((0, 2, 3)) - i
v = u.nonzero()
return -((i[v] / u[v]).mean()).log()
```
I get the error message below:
```
RuntimeError: range.second - range.first == t.size() INTERNAL ASSERT FAILED at /pytorch/torch/csrc/autograd/generated/Functions.cpp:55, please report a bug to PyTorch. inconsistent range for TensorList output
```
I had to remove the usage of nonzero(), as shown below, to make the code work:
```python
p = F.softmax(x, dim=1)
m = y != self.ignore_index
t = F.one_hot((y * m.byte()).long(), num_classes=self.num_classes).byte().permute(0, 3, 1, 2)
i = (p * (t * m.unsqueeze(1).byte()).float()).sum((0, 2, 3))
u = ((p + t.float()) * m.unsqueeze(1).float()).sum((0, 2, 3)) - i
return -((i / u).mean()).log()
```
I have two questions:
(1) Why does the runtime error happen in 1.2 but not in 1.1?
(2) By not using nonzero() to prevent division by zero, I have numerical instability in theory. However, in reality the output of a softmax operation should never reach zero, because the input can't really reach negative infinity, right?
About the first question, could you please provide x and y so I can reproduce the issue?
It would also be great if you could share the specs of your environment by running `python collect_env.py`.
And about the second question: in practice I think you are right; the input to softmax should never reach negative infinity in a trained network, so softmax should not output zero.
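One caveat worth noting: while softmax is strictly positive in exact arithmetic, in finite precision an extreme gap between logits can underflow to exactly zero. A standalone check (not from the original code, just an illustration of float32 underflow):

```python
import torch
import torch.nn.functional as F

# With logits 200 apart, the smaller term becomes exp(-200) ~ 1e-87,
# which is far below the smallest float32 subnormal (~1.4e-45),
# so softmax returns exactly 0.0 for that entry.
logits = torch.tensor([0.0, 200.0])
probs = F.softmax(logits, dim=0)
print(probs)  # tensor([0., 1.])
```

So a tiny clamp on the denominator is still a cheap safeguard even if such logits are unlikely in a well-behaved network.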
The code snippet comes from a loss function for a semantic segmentation network. I modified it a little so it can reproduce the same runtime error alone:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def loss_func(x, y, ignore_index=255, num_classes=19):
    p = F.softmax(x, dim=1)
    m = y != ignore_index
    t = F.one_hot((y * m.byte()).long(), num_classes=num_classes).byte().permute(0, 3, 1, 2)
    i = (p * (t * m.unsqueeze(1).byte()).float()).sum((0, 2, 3))
    u = ((p + t.float()) * m.unsqueeze(1).float()).sum((0, 2, 3)) - i
    v = u.nonzero()
    return -((i[v] / u[v]).mean()).log()

x = torch.rand(1, 19, 1024, 2048, device='cuda', requires_grad=True)
y = torch.randint(0, 18, (1, 1024, 2048), device='cuda').byte()
y[0][0][0] = 255
loss = loss_func(x, y)
loss.backward()
```
Below is the output from collect_env.py:
```
Collecting environment information...
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
```
Actually, if you do not call backward and only run the forward pass, nonzero() won't throw any errors, so the issue is related to backward semantics. I cannot really figure out what is happening at the moment. Could you create an issue on the PyTorch GitHub page?
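In the meantime, one way to sidestep both the nonzero()-based indexing and the division by zero is a masked mean with a clamped denominator. A minimal sketch — the helper name `safe_mean_ratio` and the eps value are mine, not from your code:

```python
import torch

def safe_mean_ratio(i, u, eps=1e-8):
    """Mean of i/u over entries where u > 0, without nonzero() or boolean indexing."""
    v = (u > 0).float()                 # 1.0 where u is nonzero, 0.0 elsewhere
    ratio = i / u.clamp(min=eps)        # division is now safe everywhere
    return (ratio * v).sum() / v.sum().clamp(min=1.0)

i = torch.tensor([0.5, 0.0, 2.0], requires_grad=True)
u = torch.tensor([1.0, 0.0, 4.0])      # one zero entry, as in your failing case
m = safe_mean_ratio(i, u)
m.backward()                           # backward runs without the TensorList assert
print(m.item())                        # 0.5  (mean of 0.5/1.0 and 2.0/4.0)
```

This keeps the graph free of index_select/nonzero ops in the backward path, so it should work on both 1.1 and 1.2.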
By the way, you can format code on the forum by putting ``` at the start and the end of your code section.