Categorical distribution breaking with NaN logits

Hey, I am not too sure what is going wrong with my code. I am using a categorical distribution and getting a fairly strange error, and I am uncertain why. I looked around online for a while, but nothing I found seemed to explain what exactly this issue is or how to get around it.

The error I am getting is as follows:

> ValueError: Expected parameter logits (Tensor of shape (1024, 6)) of distribution Categorical(logits: torch.Size([1024, 6])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
> tensor([[nan, nan, nan, nan, nan, nan],
>         [nan, nan, nan, nan, nan, nan],
>         [nan, nan, nan, nan, nan, nan],
>         ...,
>         [nan, nan, nan, nan, nan, nan],
>         [nan, nan, nan, nan, nan, nan],
>         [nan, nan, nan, nan, nan, nan]], device='cuda:0',
>        grad_fn=<SubBackward0>)

The code I am using to generate this is just a simple feed-forward network.

import torch
import torch.nn as nn

# layer_init is a weight-initialization helper defined elsewhere in my code;
# it returns the initialized layer.
class actor(nn.Module):
    def __init__(self, input_size, n_actions):
        super(actor, self).__init__()

        self.base = layer_init(nn.Linear(input_size, 512))
        self.actor = layer_init(nn.Linear(512, n_actions), std=0.01)

    def forward(self, x):
        x = x.clone()
        x = self.base(x)       # hidden layer
        x = torch.tanh(x)      # tanh activation
        x = self.actor(x)      # raw logits, one per action
        return x

The output of this just gets put into a Categorical distribution: probs = Categorical(logits=logits)
This seems to be where the error is occurring. The code does not break on the first run; it takes a couple hundred thousand steps before it breaks.
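For reference, the distribution step looks roughly like this (the shapes are just illustrative, matching the ones in the error message):

import torch
from torch.distributions import Categorical

# Stand-in for the actor's output: a (batch, n_actions) tensor of logits.
logits = torch.randn(1024, 6)

dist = Categorical(logits=logits)   # raises the ValueError above if logits contain NaN
action = dist.sample()              # shape (1024,)
log_prob = dist.log_prob(action)    # later used in the policy-gradient loss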

If anyone knows what the problem is and how to fix it, I would appreciate it immensely.

Based on the error message, it seems the actor is producing NaN outputs after a number of training iterations. Are you seeing the value range of its output increase during training, which could then overflow after a while?
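For example, a small check along these lines (check_logits is just a hypothetical helper name, not something from a library) could be dropped in right after the forward pass to catch the first non-finite or rapidly growing logits:

import torch

def check_logits(logits: torch.Tensor, step: int) -> None:
    # Fail fast on the first NaN/inf instead of inside Categorical.
    if not torch.isfinite(logits).all():
        raise RuntimeError(f"non-finite logits at step {step}")
    # Warn when the magnitude of the logits starts to blow up.
    max_abs = logits.abs().max().item()
    if max_abs > 1e3:
        print(f"step {step}: max |logit| = {max_abs:.1f}")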

Hi, yeah, when I was attempting to debug it yesterday I noticed that the augmentation I had added to my loss function started to give extremely large values. I changed this to clip that term if it goes outside a certain range, and it seems like the error is fixed. So I think you are correct and overflow was causing the issue.

Thank you for the assistance.

Hi, I have encountered a similar issue when training a PPO agent with discrete actions. Can you give a hint on how you set the clipping range for the loss you mentioned here? Thank you!

Hey, yeah, not a problem.

The values were going way out into the range of a couple hundred, and then slowly exploded into the thousands and beyond. Basically, as the probability of an action got really small, the log terms got out of control. So I just clipped the absolute value to the range 2 to 20. In this case, if the value you are dealing with is the log of a probability, it should always be negative, so you can probably clip just the negative side; clipping both sides is a bit more robust, though.
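In case a concrete sketch helps, this is roughly what that clipping looks like (the bounds 2 and 20 are just the ones from this thread, not a general recommendation):

import torch

# Log-probabilities of valid actions are <= 0, so clamping them to [-20, -2]
# bounds their magnitude between 2 and 20 and stops the loss term exploding
# when an action's probability gets tiny.
log_prob = torch.tensor([-0.5, -35.0, -120.0])
clipped = torch.clamp(log_prob, min=-20.0, max=-2.0)
print(clipped)  # tensor([ -2., -20., -20.])

# If positive values can show up as well, clip on the absolute value instead:
clipped_abs = torch.sign(log_prob) * torch.clamp(log_prob.abs(), min=2.0, max=20.0)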

Thank you very much for the explanation!