PyTorch using both GPUs even after setting CUDA_VISIBLE_DEVICES explicitly

Hi everyone,

I’m facing a strange issue with GPU allocation in PyTorch.

Environment:

  • PyTorch version: 2.1.2+cu118
  • CUDA Version: 11.8
  • NVIDIA Driver: (recently updated to latest version)
  • 2 GPUs available on the cluster

Issue:
In my training script, I explicitly specify which GPU to use by setting the environment variable:

CUDA_VISIBLE_DEVICES=1

This should restrict PyTorch to only GPU1. However:

  • When I run lsof /dev/nvidia0, I can see that my process is also holding 730 MB of memory and showing 3% utilization on GPU0.
  • nvidia-smi and gpustat do not show any active process on GPU0.
  • If I reverse it and set CUDA_VISIBLE_DEVICES=0, I observe the mirror image: lsof now shows minor utilization on GPU1, even though I specified only GPU0.
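
To double-check what the CUDA runtime reports from inside the process (as opposed to what lsof shows at the driver level), I run roughly the following check (a minimal sketch, assuming torch is importable in the same environment):

import os
import torch

# What the CUDA runtime sees after masking
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device count =", torch.cuda.device_count())      # expected: 1 when masking works
print("current device =", torch.cuda.current_device())  # remapped index, expected: 0
print("device name =", torch.cuda.get_device_name(0))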

I have also updated my NVIDIA drivers and tested this behavior again — it still persists.

Summary:

CUDA_VISIBLE_DEVICES | lsof shows usage on | nvidia-smi / gpustat shows usage on
1                    | GPU0 (minor usage)  | Only GPU1
0                    | GPU1 (minor usage)  | Only GPU0

Question:
Why is PyTorch touching the other GPU even when I explicitly set CUDA_VISIBLE_DEVICES?
Is this just a harmless driver-level initialization, or can it affect performance in multi-user environments?
Is there a way to prevent any interaction with the unselected GPU?

Any help or explanation would be greatly appreciated!

I’m not able to reproduce this, but I am using CUDA 12.4.

One potential hypothesis is that you are setting CUDA_VISIBLE_DEVICES after CUDA has already been initialized. CUDA is initialized lazily, but if any of the libraries you are using call CUDA APIs (e.g. torch.cuda.set_device()) before the variable is set, that will already have created a context on the default device.
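
In other words, the masking only takes effect if the variable is set before the first CUDA call of the process. A rough sketch of the ordering that matters (illustrative only, not nnUNet's actual code):

import os

# Must run before the first CUDA call of the process; once a context
# exists on a device, changing CUDA_VISIBLE_DEVICES no longer affects it.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch  # importing torch alone does not initialize CUDA

# Any of these calls initializes CUDA, so the variable must already be set:
torch.cuda.set_device(0)            # index 0 is the remapped physical GPU 1
x = torch.randn(8, device="cuda")
print(torch.cuda.get_device_name(0))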

A simple reproduction script would help in tracking down the issue.
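
Something along these lines would already be useful (a bare-bones sketch; the sleep just leaves a window to inspect the /dev/nvidia* handles with lsof and nvidia-smi while the process is alive):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import time
import torch

x = torch.randn(1024, 1024, device="cuda")   # forces CUDA context creation
print("visible device count:", torch.cuda.device_count())
time.sleep(120)  # run lsof /dev/nvidia0 and nvidia-smi during this window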

I am training the official nnUNet without using DDP, so I am not sure how this is happening.