I have a problem with (mini)conda, PyTorch and an A6000 GPU (CUDA 11).
On my VM server, I have installed PyTorch once in the base environment and once in a separate conda environment.
In base, everything works as it should.
In the conda environment, GPU memory usage is already over 42 GB. Even after torch.cuda.empty_cache() the memory stays in use, and even if I kill all processes shown by nvidia-smi, the memory is still almost full.
The error message when running a model is: “RuntimeError: CUDA error: out of memory”.
I have no idea where the error/bug is or how to fix it.
Check for dead processes which might still be allocating GPU memory. nvidia-smi is not always able to display all processes using GPU memory, depending on the permissions it has on your system.
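If it helps, you can also query the driver directly from Python via pynvml (the nvidia-ml-py bindings) instead of relying on the nvidia-smi output. A rough sketch, assuming GPU index 0:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # adjust the index for multi-GPU setups

# Total/used memory as reported by the driver for the whole device
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used {mem.used / 1024**3:.2f} GiB of {mem.total / 1024**3:.2f} GiB")

# Compute processes the driver still knows about and the memory they hold
# (usedGpuMemory can be None if the query is not permitted on your system)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = (proc.usedGpuMemory or 0) / 1024**3
    print(f"pid {proc.pid}: {used:.2f} GiB")

pynvml.nvmlShutdown()
```

If a pid shows up here that is not visible in nvidia-smi (or no longer exists), that would point to a stale process or context still holding the memory.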
At the start, the current peak value is saved: start_peak_gpu = torch.cuda.max_memory_allocated()
With base, start_peak_gpu has the value 0; with my conda env the value is ~460864000.
After the training, the following is executed:
I don’t fully understand the issue. Are you seeing 42GB of allocated memory after just activating your conda environment via conda activate env_name?
If so, then I don’t believe it’s a PyTorch issue, since you haven’t even executed anything and I haven’t seen this issue before.
If that’s not the case, could you explain in more detail what exactly you are executing and how you are measuring the memory usage?
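For example, printing PyTorch's own memory stats right after importing torch (and before running anything else) in both environments would show whether the ~42GB is memory your process actually allocated or memory that is only occupied from nvidia-smi's point of view. A minimal sketch (torch.cuda.mem_get_info is available in recent releases):

```python
import torch

torch.cuda.init()  # force creation of the CUDA context on the default device

gib = 1024**3
# Memory allocated/reserved by this process through PyTorch's caching allocator
print("allocated:", torch.cuda.memory_allocated() / gib, "GiB")
print("reserved: ", torch.cuda.memory_reserved() / gib, "GiB")
print("peak alloc:", torch.cuda.max_memory_allocated() / gib, "GiB")

# Free/total memory as seen by the driver for the whole device
# (this includes memory used by other processes)
free, total = torch.cuda.mem_get_info()
print("device free/total:", free / gib, "/", total / gib, "GiB")
```

If allocated/reserved are close to 0 but the device already reports ~42GB as used, the memory is held outside of this process (e.g. by a stale process), not by PyTorch itself.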
Yes, exactly.
When I activate the conda environment, I lose 42GB of GPU memory.
I don’t know what the cause is. I just read that there were problems with CUDA and conda in the past because conda overwrote binaries.
When I start my model, the following is executed at the beginning: