Hi everyone,
I’m facing a strange issue with GPU allocation in PyTorch.
Environment:
- PyTorch version: 2.1.2+cu118
- CUDA Version: 11.8
- NVIDIA Driver: (recently updated to latest version)
- 2 GPUs available on the cluster
Issue:
In my training script, I explicitly specify which GPU to use by setting the environment variable `CUDA_VISIBLE_DEVICES=1`.
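A minimal sketch of that setup, assuming the variable is set before `torch` initializes CUDA (the tensor work at the end is just a placeholder to force context creation):

```python
import os

# Restrict visibility before anything touches the CUDA runtime.
# "1" is the physical GPU index I want to use, as described above.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())        # expected: 1
device = torch.device("cuda:0")         # with the mask above, cuda:0 maps to physical GPU1
x = torch.randn(1024, 1024, device=device)  # placeholder work to create the CUDA context
```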
This should restrict PyTorch to only GPU1. However:
- When I run `lsof /dev/nvidia0`, I can see that my process is also holding 730 MB of memory and showing 3% utilization on GPU0, while `nvidia-smi` and `gpustat` do not show any active process on GPU0 (see the snippet after this list for how I check the open device nodes).
- If I reverse it and set `CUDA_VISIBLE_DEVICES=0`, I observe the same issue: now GPU1 shows minor usage according to `lsof`, even though I specified only GPU0.
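This is the kind of check I run from inside the training process to see which device nodes it has open; a rough Python equivalent of the `lsof` check above, assuming Linux and its `/proc/self/fd` layout:

```python
import os

# List the /dev/nvidia* device nodes the current process has open
# (roughly what `lsof /dev/nvidiaN` reports for this PID; Linux only).
open_nodes = set()
for fd in os.listdir("/proc/self/fd"):
    try:
        target = os.readlink(os.path.join("/proc/self/fd", fd))
    except OSError:
        continue  # fd was closed between listdir and readlink
    if target.startswith("/dev/nvidia"):
        open_nodes.add(target)

print(sorted(open_nodes))  # e.g. ['/dev/nvidia0', '/dev/nvidia1', '/dev/nvidiactl', ...]
```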
I have also updated my NVIDIA drivers and tested this behavior again — it still persists.
Summary:
| CUDA_VISIBLE_DEVICES | lsof shows usage on | nvidia-smi / gpustat shows usage on |
|---|---|---|
| 1 | GPU0 (minor usage) | Only GPU1 |
| 0 | GPU1 (minor usage) | Only GPU0 |
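For completeness, this is roughly how I cross-check the `nvidia-smi` / `gpustat` side programmatically; a sketch assuming the `pynvml` (nvidia-ml-py) package is installed, run as a separate monitoring snippet rather than inside the training script. NVML enumerates physical GPUs and is not affected by `CUDA_VISIBLE_DEVICES`:

```python
import pynvml

# Ask NVML which compute processes each physical GPU reports,
# to compare against what lsof shows for the same PIDs.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            mem_mb = (p.usedGpuMemory or 0) / (1024 ** 2)  # may be unreported on some setups
            print(f"GPU{i}: pid={p.pid} used_mem={mem_mb:.0f} MB")
finally:
    pynvml.nvmlShutdown()
```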
Questions:
Why is PyTorch touching the other GPU even when I explicitly set `CUDA_VISIBLE_DEVICES`?
Is this just a harmless driver-level initialization, or can it affect performance in multi-user environments?
Is there a way to prevent any interaction with the unselected GPU?
Any help or explanation would be greatly appreciated!