RecursionError: When calling cuda.is_available() later on in an experiment

Hey,

I have a confusing problem with PyTorch.
I do something with my code which is probably not even advisable - so I will just do a workaround to avoid this problem. Nonetheless, it is puzzling and I wanted to ask if somebody has experienced it as well or knows a solution.

In my experiments, I call cuda.is_available() every time I create a new local model (superfluous, as I have since realized). After a larger number of training episodes under certain seeds, this sometimes leads to a "RecursionError: maximum recursion depth exceeded".

The stack trace always ends as follows:

```
File "/home/user/../.venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 165, in is_available
    if _nvml_based_avail():
       ^^^^^^^^^^^^^^^^^^^
File "/home/user/../.venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 158, in _nvml_based_avail
    return os.getenv("PYTORCH_NVML_BASED_CUDA_CHECK") == "1"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen os>", line 812, in getenv
File "<frozen _collections_abc>", line 807, in get
File "<frozen os>", line 711, in __getitem__
RecursionError: maximum recursion depth exceeded
```

As far as I understand this stack trace, there is runaway recursion inside os.getenv, which calls get and __getitem__. That might mean something in my process wraps getenv (or os.environ) and does something weird with it.

Now the puzzling part: my project does not do this. It barely uses any functions from the os module, and I'm sure it does not wrap any of them.
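For anyone curious what such a wrapper failure looks like: here is a minimal, purely hypothetical sketch of how a buggy patch on os.environ can produce exactly this RecursionError through os.getenv. Nothing here is from the actual project; the patched `__getitem__` is an invented example of a wrapper that re-enters the environment lookup it is supposed to implement.

```python
import os

# Save the real lookup so we can restore it afterwards.
_orig_getitem = type(os.environ).__getitem__

def buggy_getitem(self, key):
    # Hypothetical broken wrapper: it "falls back" to os.getenv, but
    # os.getenv -> os.environ.get -> __getitem__ re-enters this very
    # function, so the lookup never terminates.
    return os.getenv(key)

type(os.environ).__getitem__ = buggy_getitem
try:
    os.getenv("PATH")  # any key triggers the loop
except RecursionError as e:
    print("reproduced:", type(e).__name__)  # prints: reproduced: RecursionError
finally:
    # Undo the patch so the rest of the process behaves normally.
    type(os.environ).__getitem__ = _orig_getitem
```

The resulting traceback ends in the same `<frozen os>` getenv / `<frozen _collections_abc>` get / `__getitem__` frames as above, which is why a third-party patch on os.environ (by some imported library, not necessarily your own code) is a plausible suspect.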

I'm using numpy 2.3.2, pytorch 2.7.1, tensordict 0.9.1 (but barely), torchvision 0.22.0 (only for transforms, very recently, and the bug is older), gymnasium, tqdm, minigrid, vizdoom, scipy, pandas, and colorcet 3.1.0.

Thanks for your response :slight_smile:. Yeah, I think I will just follow this recommendation, call torch.cuda.is_available() once at the beginning of an experiment, and be done with it.
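For reference, the "check once, cache the result" pattern can look like this. This is just a sketch; `make_model` and the Linear layer are placeholders, not anything from the actual project.

```python
import torch

# One-time check at experiment start; every later model creation
# reuses the cached device instead of calling cuda.is_available() again.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def make_model() -> torch.nn.Module:
    # Placeholder model factory: build the model, then move it to the
    # cached device. No per-model cuda.is_available() call needed.
    model = torch.nn.Linear(8, 2)
    return model.to(DEVICE)

model = make_model()
```

Besides sidestepping the RecursionError, this also avoids the small per-call overhead of the availability check inside a training loop.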