Masking out padding for LSTM

I have sequences that I padded to a fixed length (365 days) by inserting zeros at the missing time steps, so the padding sits at varying positions within the sequences rather than only at the end. I then feed the sequences into an LSTM network in order to classify them.
I created a mask that is True where the value is 0 (padding) and False otherwise, so that the model does not take the zeros into account. (I checked the masks and they do contain False values, i.e. non-padded time steps that the model should take into account.)
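For illustration, this is roughly how such a mask looks on a toy batch (shapes made up for the example; my real input is [16, 365, 3]):

import torch

# Toy batch: 2 sequences, 5 time steps, 1 channel, zero-padded at interior steps
x = torch.tensor([
    [[1.0], [0.0], [2.0], [0.0], [3.0]],   # padding at steps 1 and 3
    [[4.0], [5.0], [0.0], [6.0], [7.0]],   # padding at step 2
])

mask = x[:, :, 0].eq(0).unsqueeze(-1)      # True where padded, [2, 5, 1]
print(mask.squeeze(-1))
# tensor([[False,  True, False,  True, False],
#         [False, False,  True, False, False]])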
However, for some of my sequences, the output of the backbone (before applying the linear layer for classification) consists only of nan values.
Does anyone know why? Am I doing something wrong?
Is it correct to apply the masking before feeding the values through the LSTM layer? I also tried applying the masking afterwards; in that case, out contains finite float values, but once the masking is applied, all values are -inf.
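The "masking afterwards" variant looks roughly like this (a sketch, not my exact code; it reuses the same x, mask, h0 and c0 as in the full forward below, which shows the "masking before" version):

# run the raw (unmasked) input through the LSTM first ...
out, _ = self._lstm_layer(x.float(), (h0.detach(), c0.detach()))  # [16, 365, 150]
# ... then blank out the padded time steps; the [16, 365, 1] mask broadcasts
# over the hidden dimension. In my runs, out ends up entirely -inf after this.
out = out.masked_fill(mask, float("-inf"))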

import numpy as np  # needed for -np.inf below
import torch

def forward(self, x, device):
    mask = x[:, :, 0].eq(0).unsqueeze(-1)
    mask = mask.to(device)  # [batch_size, seq_len, 1] = [16, 365, 1]

    # masking out padded time steps
    x = x.masked_fill(mask.bool(), -np.inf)  # now only some values are -inf, as expected

    x = x.float()  # [batch_size, seq_len, channels] = [16, 365, 3]

    # Initialize hidden state
    h0 = (
        torch.zeros(layer_dim, x.size(0), hidden_dim)
        .requires_grad_()
        .to(device)
    )
    # Initialize cell state
    c0 = (
        torch.zeros(layer_dim, x.size(0), hidden_dim)
        .requires_grad_()
        .to(device)
    )

    # [batch_size, seq_len, hidden_dim] = [16, 365, 150]
    out, _ = self._lstm_layer(x, (h0.detach(), c0.detach()))  # now all values are nan
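
For what it's worth, the nan output can be reproduced in isolation with a toy LSTM once -inf values are present in the input (sizes made up here, not my real model):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)
x = torch.randn(1, 5, 3)
x[0, 2, :] = float("-inf")   # one "masked" time step
out, _ = lstm(x)
print(out)                   # nan from the masked time step onward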