The following code snippet works fine in PyTorch 1.1 but raises an error in 1.2:
```python
p = F.softmax(x, dim=1)
m = y != self.ignore_index
t = F.one_hot((y * m.byte()).long(), num_classes=self.num_classes).byte().permute(0, 3, 1, 2)
i = (p * (t * m.unsqueeze(1).byte()).float()).sum((0, 2, 3))
u = ((p + t.float()) * m.unsqueeze(1).float()).sum((0, 2, 3)) - i
v = u.nonzero()
return -((i[v] / u[v]).mean()).log()
```
I get the error message below:
```
RuntimeError: range.second - range.first == t.size() INTERNAL ASSERT FAILED at /pytorch/torch/csrc/autograd/generated/Functions.cpp:55, please report a bug to PyTorch. inconsistent range for TensorList output
```
I had to remove the usage of nonzero(), as shown below, to make the code work:
```python
p = F.softmax(x, dim=1)
m = y != self.ignore_index
t = F.one_hot((y * m.byte()).long(), num_classes=self.num_classes).byte().permute(0, 3, 1, 2)
i = (p * (t * m.unsqueeze(1).byte()).float()).sum((0, 2, 3))
u = ((p + t.float()) * m.unsqueeze(1).float()).sum((0, 2, 3)) - i
return -((i / u).mean()).log()
```
I have two questions:
(1) Why does the runtime error happen in 1.2 but not in 1.1?
(2) By not using nonzero() to prevent division by zero, I have numerical instability in theory. However, in reality the output of a softmax operation should never reach zero, because the input can't really reach negative infinity, right?
About the first question, could you please provide x and y so I can reproduce the issue?
It would also be great if you could share the specs of your environment by running `python collect_env.py`.
And about the second question: in practice I think you are right; the input to softmax should never reach negative infinity in a trained network, so softmax should not output zero.
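One caveat worth noting: while softmax is strictly positive in exact arithmetic, in finite precision an extreme gap between logits can underflow to exactly zero. A standalone check (not from the original code, just an illustration of float32 underflow):

```python
import torch
import torch.nn.functional as F

# With logits 200 apart, the smaller term becomes exp(-200) ~ 1e-87,
# which is far below the smallest float32 subnormal (~1.4e-45),
# so softmax returns exactly 0.0 for that entry.
logits = torch.tensor([0.0, 200.0])
probs = F.softmax(logits, dim=0)
print(probs)  # tensor([0., 1.])
```

So a tiny clamp on the denominator is still a cheap safeguard even if such logits are unlikely in a well-behaved network.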
The code snippet comes from a loss function for a semantic segmentation network. I modified it a little so it can reproduce the same runtime error alone:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def loss_func(x, y, ignore_index=255, num_classes=19):
    p = F.softmax(x, dim=1)
    m = y != ignore_index
    t = F.one_hot((y * m.byte()).long(), num_classes=num_classes).byte().permute(0, 3, 1, 2)
    i = (p * (t * m.unsqueeze(1).byte()).float()).sum((0, 2, 3))
    u = ((p + t.float()) * m.unsqueeze(1).float()).sum((0, 2, 3)) - i
    v = u.nonzero()
    return -((i[v] / u[v]).mean()).log()

x = torch.rand(1, 19, 1024, 2048, device='cuda', requires_grad=True)
y = torch.randint(0, 18, (1, 1024, 2048), device='cuda').byte()
y[0][0][0] = 255
loss = loss_func(x, y)
loss.backward()
```
Below is the output from collect_env.py:
```
Collecting environment information...
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
```
Actually, if you do not call backward and only run the forward pass, nonzero() won't throw any errors, so the issue is related to backward semantics. I cannot really figure out what is happening at the moment. Could you create an issue on the PyTorch GitHub page?
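In the meantime, one way to sidestep both the nonzero()-based indexing and the division by zero is a masked mean with a clamped denominator. A minimal sketch — the helper name `safe_mean_ratio` and the eps value are mine, not from your code:

```python
import torch

def safe_mean_ratio(i, u, eps=1e-8):
    """Mean of i/u over entries where u > 0, without nonzero() or boolean indexing."""
    v = (u > 0).float()                 # 1.0 where u is nonzero, 0.0 elsewhere
    ratio = i / u.clamp(min=eps)        # division is now safe everywhere
    return (ratio * v).sum() / v.sum().clamp(min=1.0)

i = torch.tensor([0.5, 0.0, 2.0], requires_grad=True)
u = torch.tensor([1.0, 0.0, 4.0])      # one zero entry, as in your failing case
m = safe_mean_ratio(i, u)
m.backward()                           # backward runs without the TensorList assert
print(m.item())                        # 0.5  (mean of 0.5/1.0 and 2.0/4.0)
```

This keeps the graph free of index_select/nonzero ops in the backward path, so it should work on both 1.1 and 1.2.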
By the way, you can format code on the forum by putting ``` at the start and the end of your code section.