Hey,
I have a confusing problem with PyTorch.
I do something in my code that is probably not even advisable, so I will just work around the problem. Nonetheless, it is puzzling, and I wanted to ask whether somebody has experienced it as well or knows a solution.
In my experiments, I call cuda.is_available() every time I create a new local model (which is superfluous, as I have realized by now), but under certain seeds, after a larger number of training episodes, this leads to a "RecursionError: maximum recursion depth exceeded".
The stack trace always ends as follows:
```
File "/home/user/../.venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 165, in is_available
    if _nvml_based_avail():
       ^^^^^^^^^^^^^^^^^^^
File "/home/user/../.venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 158, in _nvml_based_avail
    return os.getenv("PYTORCH_NVML_BASED_CUDA_CHECK") == "1"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen os>", line 812, in getenv
File "<frozen _collections_abc>", line 807, in get
File "<frozen os>", line 711, in __getitem__
RecursionError: maximum recursion depth exceeded
```
As far as I understand this stack trace, there is considerable recursion in the os.getenv call, which goes through get and __getitem__. This might mean that something in my process wraps getenv and does something weird with it.
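One quick way to check that hypothesis (just a diagnostic sketch, nothing authoritative) is to inspect at runtime whether os.getenv or os.environ have been replaced by something else:

```python
import os

# In an unpatched interpreter, os.getenv is the plain function defined in
# the os module, and os.environ is an os._Environ mapping. If some library
# had monkey-patched either one, these would point somewhere else.
print(os.getenv.__module__)        # "os" when unpatched
print(type(os.environ).__name__)   # "_Environ" when unpatched
```

If both look normal right before the failing call, the recursion error is probably not caused by a wrapped getenv.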
Now the puzzling part: My project does not do this. It barely even uses any function from the os module and I’m sure it does not wrap any os functions.
I'm using numpy 2.3.2, pytorch 2.7.1, tensordict 0.9.1 (but barely), torchvision 0.22.0 (only for transforms, very recently, and the bug is older), gymnasium, tqdm, minigrid, vizdoom, scipy, pandas, and colorcet 3.1.0.
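For completeness, the workaround I mean is simply evaluating the check once and reusing the result instead of calling it per model. A minimal sketch (assuming a plain torch setup; the function name is just my choice):

```python
import functools

import torch

@functools.lru_cache(maxsize=1)
def cuda_available() -> bool:
    # torch.cuda.is_available() (and the os.getenv call inside it) runs
    # only on the first invocation; later calls return the cached bool.
    return torch.cuda.is_available()

def make_device() -> torch.device:
    # Safe to call when constructing every new local model.
    return torch.device("cuda" if cuda_available() else "cpu")
```

With this, the code path from the stack trace above is only ever entered once per process, so it sidesteps the error even if I never find the root cause.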