RuntimeError: numel: integer multiplication overflow

I use this line to get the index of the first 0 value in each row of a tensor:
length = torch.LongTensor([(x[i,:,0] == 0).nonzero()[0] for i in range(x.shape[0])])
and for the following tensor with shape torch.Size([40, 382, 26]):

tensor([[[ 1.2496e+00, -2.5842e-03,  1.7675e-03,  ...,  4.5889e-01,
          -7.1389e-01,  1.6415e+00],
         [ 1.2491e+00, -1.3931e-04,  1.8480e-03,  ..., -2.6708e-01,
          -2.3991e-01, -3.1352e-01],
         [ 1.2478e+00, -3.3568e-03, -3.4667e-03,  ..., -2.5959e-01,
          -8.3522e-01,  1.6146e+00],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[ 1.2491e+00,  6.6564e-06, -1.0897e-04,  ...,  4.2065e-01,
          -8.5722e-01,  1.4956e+00],
         [ 1.2487e+00,  3.3545e-03,  1.0616e-03,  ..., -1.7322e-01,
          -5.1711e-01,  7.0258e-01],
         [ 1.2473e+00,  2.1691e-03, -3.5784e-03,  ..., -3.0112e-01,
          -9.7947e-01,  1.5638e+00],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[ 1.2494e+00,  1.0986e-03, -1.0312e-03,  ...,  3.7550e-01,
          -3.3405e-02,  1.0006e+00],
         [ 1.2489e+00,  4.7714e-03,  2.5151e-03,  ..., -6.2233e-01,
          -1.9066e-01,  5.2548e-01],
         [ 1.2476e+00,  3.6464e-03,  1.2658e-04,  ..., -6.5587e-01,
          -1.0196e+00,  3.9814e-01],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        ...,

        [[ 1.2471e+00, -3.2874e-03,  2.5414e-03,  ...,  2.1677e-01,
          -4.8340e-01,  1.4457e-02],
         [ 1.2466e+00, -6.1289e-03, -3.8176e-03,  ...,  8.6825e-01,
          -9.4528e-01,  1.5469e+00],
         [ 1.2452e+00, -8.7639e-03, -1.0271e-02,  ...,  1.7150e-01,
          -8.5414e-02, -4.8455e-01],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[ 1.2480e+00,  2.3856e-04, -4.2868e-03,  ..., -1.7246e-01,
          -9.1360e-01, -4.5913e-01],
         [ 1.2476e+00, -3.6790e-03, -1.0436e-02,  ..., -6.8468e-02,
           1.4334e-01,  8.2367e-01],
         [ 1.2463e+00, -7.7002e-03, -1.6612e-02,  ..., -2.4162e-01,
          -3.2239e-01, -1.1522e-01],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[ 1.2536e+00,  2.7262e-03, -3.5257e-04,  ..., -3.2121e-01,
          -1.6265e-01, -4.9548e-01],
         [ 1.2532e+00, -2.0610e-03, -6.2969e-03,  ...,  2.4315e-01,
           1.0951e-02,  1.4688e+00],
         [ 1.2520e+00, -6.5647e-03, -1.2356e-02,  ..., -2.0678e-01,
           8.6351e-02, -5.9951e-01],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]]], device='cuda:0', grad_fn=<CatBackward0>)

I got this error:

--> 373         length =  torch.LongTensor([(x[i,:,0] == 0).nonzero()[0] for i in range(x.shape[0])])


RuntimeError: numel: integer multiplication overflow
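
For completeness, an equivalent loop-free way to compute these lengths (a sketch that assumes the padded timesteps are exactly zero in feature 0, as in the tensor above) would be:

x_cpu = x.detach().cpu()                                  # move off the GPU to rule out a sticky CUDA assert
is_zero = (x_cpu[:, :, 0] == 0)                           # [batch, seq_len] mask marking padded timesteps
length = is_zero.long().cumsum(dim=1).eq(0).sum(dim=1)    # index of the first zero in every sequence

Unlike the list comprehension above, this does not raise an IndexError when a sequence contains no zero at all; it returns the full sequence length instead.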

I also observed that when I add print(x) inside the forward method, I get the following error just from printing, while outside of forward I don't get it:
RuntimeError: CUDA error: device-side assert triggered
In another experiment I moved the line that was causing the error out of the forward method, and the error disappeared. I then passed the sequence lengths as an additional input to forward, but got this error instead:

    380             x, length, batch_first=True, enforce_sorted=False
    381         )
--> 382         out_packed, (_, _) = self.rnn(packed, (h0, c0))
    383         y, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)
    384         y = self.dropout(y)

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
    775                               self.dropout, self.training, self.bidirectional, self.batch_first)
    776         else:
--> 777             result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
    778                               self.num_layers, self.dropout, self.training, self.bidirectional)
    779         output = result[0]

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
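
For context, the relevant part of the forward pass looks roughly like this (reconstructed from the traceback, so self.rnn, h0, c0 and self.dropout are assumed from my model):

packed = nn.utils.rnn.pack_padded_sequence(
    x, length, batch_first=True, enforce_sorted=False     # length is expected as a CPU int64 tensor or a list
)
out_packed, (_, _) = self.rnn(packed, (h0, c0))            # self.rnn is an nn.LSTM
y, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)
y = self.dropout(y)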
The batch size is equal to 5 right now.
What is the reason for this error?

Based on the seemingly random error reporting, you might be running into a sticky CUDA assert.
Could you rerun the code on the CPU or with CUDA_LAUNCH_BLOCKING=1 to see if the stacktrace improves?
If not, could you post a minimal and executable code snippet to reproduce the issue, please?
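
For example, a minimal sketch (the environment variable has to be set before CUDA is initialized, ideally at the very top of the script):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # launch kernels synchronously so the stacktrace points at the failing op

import torch                               # initialize CUDA only after setting the variable

Alternatively, move the module and the input to the CPU (model.cpu(), x.cpu()) and rerun the forward pass; the CPU implementation usually raises the underlying error, e.g. an out-of-bounds index, with a readable message.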