CUDA "invalid configuration argument" during torch.zeros

Hello everyone, I have had the following error for a few weeks now:

RuntimeError: CUDA error: invalid configuration argument

My use case is reinforcement learning with self-play.
The input to my agent is a sequence of shape [B, Seq_len, Feat_size] (the batch size can change during training), which is fed directly into an LSTM with batch_first=True.
The error occurs systematically, after several million steps, at the initialization of the hidden and cell states.
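
For context, here is a minimal sketch of the forward pass (the hidden size and overall module structure are placeholders, not my actual code):

    import torch
    import torch.nn as nn

    class Agent(nn.Module):
        def __init__(self, feat_size=18, hidden_size=64):  # hidden_size is a placeholder
            super().__init__()
            # Input of shape [B, Seq_len, Feat_size]
            self.lstm_1 = nn.LSTM(feat_size, hidden_size, batch_first=True)

        def forward(self, x):
            # No initial state is passed, so the LSTM allocates
            # zero-filled hidden and cell states internally.
            _, (h_n, c_n) = self.lstm_1(x)
            return h_n, c_n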
Here is the traceback; unfortunately I have to truncate and anonymise certain parts due to the confidentiality of my project.

  ****     *****
  File "/home/*****/anaconda3/envs/test_env_soda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/adil.zouitine/project/*****/*.py", line 25, in forward
    _, (h_n, c_n) = self.lstm_1(x)
  File "/home/*****/anaconda3/envs/test_env_soda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/*****/anaconda3/envs/test_env_soda/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 570, in forward
    zeros = torch.zeros(self.num_layers * num_directions,
RuntimeError: CUDA error: invalid configuration argument
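
For reference, the failing line is PyTorch's default hidden-state initialization inside LSTM.forward, which only runs when no (h_0, c_0) tuple is passed (paraphrased from torch/nn/modules/rnn.py in the 1.7-era releases; exact code may differ across versions):

    # Paraphrased from torch/nn/modules/rnn.py; executed when hx is None.
    num_directions = 2 if self.bidirectional else 1
    zeros = torch.zeros(self.num_layers * num_directions,
                        max_batch_size, self.hidden_size,
                        dtype=input.dtype, device=input.device)
    hx = (zeros, zeros)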

When the error occurred I was able to inspect the problematic input tensor:
SIZE OF INPUT torch.Size([8, 10, 18]) (shape OK)
Device of input: cuda:0 (device OK)
Its values are between -1 and 1 (values OK)
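
The checks were roughly of this form (a sketch of my debug prints, not the exact code):

    # Debug prints just before the failing LSTM call (sketch).
    print("SIZE OF INPUT", x.size())                   # torch.Size([8, 10, 18])
    print("Device of input:", x.device)                # cuda:0
    print("Values:", x.min().item(), x.max().item())   # within [-1, 1]
    print(torch.cuda.memory_summary(device=0))         # output below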
Memory summary:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    7459 KB |  777955 KB |   27916 GB |   27916 GB |
|       from large pool |       0 KB |   24750 KB |   11515 GB |   11515 GB |
|       from small pool |    7459 KB |  756142 KB |   16400 GB |   16400 GB |
|---------------------------------------------------------------------------|
| Active memory         |    7459 KB |  777955 KB |   27916 GB |   27916 GB |
|       from large pool |       0 KB |   24750 KB |   11515 GB |   11515 GB |
|       from small pool |    7459 KB |  756142 KB |   16400 GB |   16400 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |     780 MB |     780 MB |     780 MB |       0 B  |
|       from large pool |      40 MB |      40 MB |      40 MB |       0 B  |
|       from small pool |     740 MB |     740 MB |     740 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |   45788 KB |  713821 KB |   33551 GB |   33551 GB |
|       from large pool |       0 KB |   29630 KB |   15796 GB |   15796 GB |
|       from small pool |   45788 KB |  713821 KB |   17755 GB |   17755 GB |
|---------------------------------------------------------------------------|
| Allocations           |     428    |   20610    |  530960 K  |  530960 K  |
|       from large pool |       0    |       7    |    2815 K  |    2815 K  |
|       from small pool |     428    |   20610    |  528144 K  |  528144 K  |
|---------------------------------------------------------------------------|
| Active allocs         |     428    |   20610    |  530960 K  |  530960 K  |
|       from large pool |       0    |       7    |    2815 K  |    2815 K  |
|       from small pool |     428    |   20610    |  528144 K  |  528144 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     372    |     372    |     372    |       0    |
|       from large pool |       2    |       2    |       2    |       0    |
|       from small pool |     370    |     370    |     370    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      55    |    1454    |  280235 K  |  280235 K  |
|       from large pool |       0    |       3    |    1615 K  |    1615 K  |
|       from small pool |      55    |    1454    |  278620 K  |  278620 K  |
|===========================================================================|

I am using an NVIDIA RTX 3090 (24 GB).

Do you have any idea why this kind of error appears?
I didn’t have any problems when I replaced the LSTM with a linear layer (flattening the input first), as sketched below.
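
The working variant looks roughly like this (a sketch; the output size is a placeholder):

    import torch.nn as nn

    # Replacing the LSTM: flatten [B, Seq_len, Feat_size]
    # to [B, Seq_len * Feat_size] and apply a linear layer.
    model = nn.Sequential(
        nn.Flatten(start_dim=1),   # [B, 10, 18] -> [B, 180]
        nn.Linear(10 * 18, 64),    # 64 is a placeholder output size
    )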

This error is usually raised by an invalid kernel launch.
Could you post the output of python -m torch.utils.collect_env and update to the latest PyTorch release in case you are using an older one?
If you are already using the latest release (or a nightly), could you post an executable code snippet that reproduces the error, please?
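
Something self-contained along these lines would be ideal (the shapes are taken from your post, the rest are placeholders), even if it needs the surrounding training loop to actually trigger the error:

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=18, hidden_size=64, batch_first=True).cuda()

    # Shapes from your post: [B=8, Seq_len=10, Feat_size=18], values in [-1, 1].
    x = torch.rand(8, 10, 18, device="cuda") * 2 - 1
    _, (h_n, c_n) = lstm(x)
    print(h_n.shape, c_n.shape)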