In PyTorch DistributedDataParallel, one process uses two GPUs (per the GPU column of the Processes table in nvidia-smi) despite calling torch.cuda.set_device(local_rank)

I am using the following code snippet to ensure that each task/process (local_rank) launched by Slurm (using #SBATCH -n 4) is assigned to a specific local GPU:

device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

Still, when I look at nvidia-smi, the “Process GPU” column shows two different GPUs (say 0 and 1) assigned to a single PID. If I tie a Slurm task to a GPU using torch.cuda.set_device(device) (where device is built from local_rank), I should not see multiple GPU allocations, right?

Not sure if this is important, but I am using two workers in my DataLoader (see the definition below). Those workers should still keep the process on the same GPU (say 0), right?

    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, num_workers=2, sampler=torch.utils.data.distributed.DistributedSampler(trainset, num_replicas=world_size, rank=world_rank))


Slurm config:

#SBATCH --nodes 2
#SBATCH -n 4
#SBATCH --gres=gpu:4


Four tasks are launched across two nodes, and each task should have one GPU allocated (because the program launched by each task/process calls torch.cuda.set_device(device)).
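For reference, the task-to-rank mapping described above can be sketched as follows. This is a minimal sketch that assumes the tasks are started with srun, so that Slurm exports SLURM_PROCID, SLURM_LOCALID and SLURM_NTASKS to each task. The SlurmRanks class and the ranks_from_slurm helper are hypothetical names for illustration, not part of PyTorch or Slurm.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SlurmRanks:
    world_rank: int  # global task index across all nodes
    local_rank: int  # task index within the current node
    world_size: int  # total number of tasks in the job step


def ranks_from_slurm(env: Mapping[str, str] = os.environ) -> SlurmRanks:
    """Read the rank layout that srun exports to each task (hypothetical helper)."""
    return SlurmRanks(
        world_rank=int(env["SLURM_PROCID"]),
        local_rank=int(env["SLURM_LOCALID"]),
        world_size=int(env["SLURM_NTASKS"]),
    )
```

Each task would then pass ranks.local_rank to torch.cuda.set_device before doing any other CUDA work, and use ranks.world_rank and ranks.world_size for the sampler.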


PID 6173, for example, is allocated to two GPUs (0 and 1). The fourth task went to the other node, so please ignore it.

GPU   PID    Type   Process name   Usage
0     6172   C      …bin/python    2657MiB
0     6173   C      …bin/python    673MiB
0     6174   C      …bin/python    673MiB
1     6173   C      …bin/python    2657MiB
2     6174   C      …bin/python    2657MiB
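To make the pattern in this listing easier to see, here is a small sketch that groups the rows by PID and reports every PID that shows up on more than one GPU. It assumes the simplified whitespace-separated layout pasted above, not the full nvidia-smi output format, and gpus_per_pid is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def gpus_per_pid(table: str) -> Dict[int, List[int]]:
    """Map each PID to the sorted list of GPU indices it appears on."""
    seen = defaultdict(set)
    for line in table.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not start with "<gpu> <pid>".
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        seen[int(fields[1])].add(int(fields[0]))
    return {pid: sorted(gpus) for pid, gpus in seen.items()}


# Process rows copied from the listing above.
TABLE = """
GPU PID Type Process name Usage
0 6172 C …bin/python 2657MiB
0 6173 C …bin/python 673MiB
0 6174 C …bin/python 673MiB
1 6173 C …bin/python 2657MiB
2 6174 C …bin/python 2657MiB
"""

# Keep only the PIDs that hold memory on more than one GPU.
multi = {pid: gpus for pid, gpus in gpus_per_pid(TABLE).items() if len(gpus) > 1}
print(multi)
```

On this listing the sketch flags PIDs 6173 and 6174. Each of them holds a smaller 673MiB allocation on GPU 0 in addition to 2657MiB on its own GPU.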

Are the dataloaders doing any GPU work? If not, that should not be relevant here.

Still, when I look at nvidia-smi, the “Process GPU” column shows two different GPUs (say 0 and 1) assigned to a single PID.

My guess is that these are just some bookkeeping allocations that every process makes on GPU 0 (e.g. initializing CUDA contexts). Could you share a minimal script to repro this, so we can figure out what might be happening here?
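One way to test this guess is to hide all but one GPU from each process, so that it cannot create a context on any other device. The following is a minimal sketch under two assumptions: the process knows its node-local rank (for example from Slurm's SLURM_LOCALID), and the mask is applied before the process initializes CUDA. restrict_to_local_gpu is a hypothetical helper, not a PyTorch API.

```python
import os
from typing import MutableMapping


def restrict_to_local_gpu(local_rank: int,
                          env: MutableMapping[str, str] = os.environ) -> str:
    """Expose only this task's GPU via CUDA_VISIBLE_DEVICES (hypothetical helper).

    Must run before the process initializes CUDA: the mask is read once at
    initialization, and later changes have no effect.
    """
    current = env.get("CUDA_VISIBLE_DEVICES")
    if current:
        # A mask such as "0,1,2,3" may already be set; pick this task's entry.
        visible = [d.strip() for d in current.split(",") if d.strip()]
        # If the task is already bound to a single GPU, keep that binding.
        chosen = visible[0] if len(visible) == 1 else visible[local_rank]
    else:
        # No mask set (or empty): assume GPUs are numbered 0..N-1 on this node.
        chosen = str(local_rank)
    env["CUDA_VISIBLE_DEVICES"] = chosen
    return chosen
```

After this call the process sees exactly one GPU, which it addresses as device 0, so the script would use torch.device("cuda", 0) rather than local_rank. If the extra 673MiB entries on GPU 0 disappear with the mask in place, that would support the stray-context explanation.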

Thanks for this. I am putting together a minimal script to repro and will get back soon.