Conda env blocks GPU memory?

Hello,

I have a problem with (mini)conda, PyTorch, and an A6000 GPU (CUDA 11).

On my VM server, I installed PyTorch once in the base environment and once in a conda environment.

In base, everything works as it should.
In the conda environment, the GPU memory usage is already over 42 GB. Even after torch.cuda.empty_cache() the memory stays occupied, and even if I kill all processes shown by nvidia-smi, the memory is still almost full.

The error message when running a model is: “RuntimeError: CUDA error: out of memory”.

I have no idea where the error/bug is or how to fix it.

Check for dead processes that might still be allocating GPU memory. Depending on the permissions it has on your system, nvidia-smi is not always able to display all processes using GPU memory.
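
If it helps, here is a minimal sketch for cross-checking what the driver itself reports, assuming the pynvml package (nvidia-ml-py) is installed; device index 0 is just an example.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust the index if needed

# Total/used memory as seen by the driver, independent of PyTorch's allocator
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used {mem.used / 1024**3:.2f} GB of {mem.total / 1024**3:.2f} GB")

# Compute processes currently holding memory on this GPU (the list may be empty
# or incomplete if the account lacks the permission to see them)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print(f"pid {proc.pid}: {proc.usedGpuMemory} bytes")

pynvml.nvmlShutdown()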

I will try this later. Thanks for the tip.

Hello,

I have now trained many configurations, and all but one work. And it is not because that one configuration needs the most memory.

My goal is to measure the memory consumption during training.
Before the training, the cache is cleared:

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

The current peak value is then saved:
start_peak_gpu = torch.cuda.max_memory_allocated()
In the base environment, start_peak_gpu has the value 0. In my conda env, it is ~460864000.
After the training, the following is run:

end_peak_gpu = torch.cuda.max_memory_allocated()
diff_peak_gpu = end_peak_gpu - start_peak_gpu

This gives me the peak GPU memory use during training.
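
Put together, a minimal sketch of this measurement pattern looks roughly like the following (the linear layer and tensor sizes are just a placeholder workload):

import torch

device = torch.device("cuda")

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
start_peak_gpu = torch.cuda.max_memory_allocated()  # ~0 if nothing is allocated yet

# placeholder workload standing in for the actual training
model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(256, 4096, device=device)
loss = model(x).sum()
loss.backward()

end_peak_gpu = torch.cuda.max_memory_allocated()
diff_peak_gpu = end_peak_gpu - start_peak_gpu
print(f"Peak memory use: {diff_peak_gpu / 1024**3:.3f} GB")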

I could not find zombie processes.

$ sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root        852 F.... nvidia-persiste
                     root       1180 F...m Xorg
/dev/nvidiactl:      root        852 F.... nvidia-persiste
                     root       1180 F...m Xorg
/dev/nvidia-modeset: root        852 F.... nvidia-persiste
                     root       1180 F.... Xorg

How else would you check for zombie processes, or what else could be a possible cause?

I don’t fully understand the issue. Are you seeing 42GB of allocated memory after just activating your conda environment via conda activate env_name?
If so, then I don’t believe it’s a PyTorch issue, since you haven’t even executed anything and I haven’t seen this issue before.
If that’s not the case, could you explain in more detail what exactly you are executing and how you are measuring the memory usage?

Yes, exactly.
When I activate the conda environment, I lose 42 GB of GPU memory.
I don’t know what the cause is. I just read that there were problems with CUDA and conda in the past because conda overwrote binaries.

When I start my model the following is executed at the beginning:

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
start_peak_gpu = torch.cuda.max_memory_allocated()

I expect start_peak_gpu to have the value 0 and the GPU memory to be cleared. However, it has the value 460864000.

In my base environment start_peak_gpu has the value 0.

There is nothing else running on the server. And nvidia-smi does not show any processes.
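
To narrow down where that memory sits, a sketch of a check I could run in the conda environment: torch.cuda.max_memory_allocated() only tracks PyTorch's caching allocator in the current process, so comparing it with the device-wide numbers (torch.cuda.mem_get_info() is available in newer PyTorch releases; older versions would need pynvml instead) should show whether the 42 GB is held outside this process.

import torch

torch.cuda.init()  # make sure the CUDA context exists

# Memory tracked by PyTorch's caching allocator (this process only)
print("allocated:", torch.cuda.memory_allocated() / 1024**3, "GB")
print("reserved: ", torch.cuda.memory_reserved() / 1024**3, "GB")

# Free/total memory as reported by the driver for the whole device
free, total = torch.cuda.mem_get_info()
print("device used:", (total - free) / 1024**3, "GB of", total / 1024**3, "GB")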

Could you explain what exactly you are executing? Based on your description it seems you are running some code before checking the memory.

I am running a customized version of the wl-coref model.
In run.py I have extended the if args.mode == "train" branch with the GPU measurement code:

if args.mode == "train":
    print("\n##### Start model training - " + start_time + " (UTC) #####", flush=True)
    start_peak_gpu = ""
    if "cuda" in model.config.device:
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        start_peak_gpu = torch.cuda.max_memory_allocated()
        print("Start Peak GPU value: " + str(start_peak_gpu), flush=True)     
    if args.weights is not None or args.warm_start:
        model.load_weights(path=args.weights, map_location="cpu",
                           noexception=args.warm_start)
    with output_running_time():
        model.train()
        if "cuda" in model.config.device:
            end_peak_gpu = torch.cuda.max_memory_allocated()
            diff_peak_gpu = end_peak_gpu - start_peak_gpu
            print("End Peak GPU value: " + str(end_peak_gpu), flush=True)
            print("\nPeak memory use: " + str(round(diff_peak_gpu / 1024 ** 3, 3)) + " GB", flush=True)

Edit:
And the BERT/ELECTRA model is loaded first.
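
If the encoder is already sitting on the GPU at that point, that would explain the non-zero start value: torch.cuda.reset_peak_memory_stats() does not free anything, it only resets the peak down to what is currently allocated, so weights loaded beforehand still show up in max_memory_allocated(). A small sketch (an arbitrary layer stands in for the BERT/ELECTRA encoder):

import torch

device = torch.device("cuda")

# stand-in for loading the bert/electra encoder before the measurement starts
encoder = torch.nn.Linear(8192, 8192).to(device)

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
# the peak is reset to the currently allocated memory, not to zero,
# so the encoder's weights are still counted here
print(torch.cuda.max_memory_allocated())  # > 0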