I am using the following code snippet to ensure each task/process (local_rank) generated by slurm (using #SBATCH -n 4) is assigned to a specific local GPU:
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)
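For context, the full rank setup looks roughly like this (a sketch; the SLURM_PROCID / SLURM_LOCALID / SLURM_NTASKS variables and the env:// rendezvous are assumptions about how the ranks get into the script):

import os
import torch
import torch.distributed as dist

# Sketch: derive the ranks from the environment variables srun sets per task
# (an assumption; adjust if the ranks are passed some other way).
world_rank = int(os.environ["SLURM_PROCID"])
local_rank = int(os.environ["SLURM_LOCALID"])
world_size = int(os.environ["SLURM_NTASKS"])

# Pin this task to its local GPU before any other CUDA call, so nothing is
# created implicitly on GPU 0.
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Assumes MASTER_ADDR and MASTER_PORT are exported in the batch script.
dist.init_process_group("nccl", rank=world_rank, world_size=world_size)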
Still, when I look at nvidia-smi, in the "Process GPU" column I see two different GPUs (say 0 and 1) assigned to a single PID. If I tie a slurm task to a GPU using device (which here is built from local_rank), I should not end up with multiple GPU allocations, right?
Not sure if this is important, but I am using two workers in my DataLoader (see definition below); even so, that should keep the process on the same GPU (say 0), right?
trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=128,
    num_workers=2,
    sampler=torch.utils.data.distributed.DistributedSampler(
        dataset=trainset, num_replicas=world_size, rank=world_rank
    ),
)
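For completeness, the per-rank training step looks roughly like this (a sketch; model and optimizer stand in for my actual definitions), so the only device anything is moved to is the one pinned above:

model = model.to(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

for inputs, targets in trainloader:
    # Move each batch to this task's pinned GPU only.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()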
These are the relevant SBATCH directives:

#SBATCH --nodes 2
#SBATCH -n 4
#SBATCH --gres=gpu:4
Four tasks are to be launched across the two nodes, and each task should have one GPU allocated (because inside the program launched by each task/process I have set torch.cuda.set_device(device) as shown above).
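The batch script as a whole is roughly the following (a sketch; train.py and the MASTER_ADDR/MASTER_PORT setup are placeholders, not my exact values):

#!/bin/bash
#SBATCH --nodes 2
#SBATCH -n 4
#SBATCH --gres=gpu:4

# Rendezvous info for init_process_group's default env:// method
# (placeholder values).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one instance of the training script per task (4 in total).
srun python train.py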
In nvidia-smi, though, PID 6173 for example is allocated to two GPUs (0 and 1). The 4th task moved to the other node, so please ignore it.
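To make the comparison with nvidia-smi easier, a per-rank print right after torch.cuda.set_device (a sketch, reusing the names from the snippets above) shows which device each task believes it owns:

import socket

def report(tag):
    # If nvidia-smi lists memory for this PID on a GPU that never shows up
    # here, the extra context is created by something that runs before
    # torch.cuda.set_device(device).
    print(
        f"[{tag}] host={socket.gethostname()} rank={world_rank} "
        f"local_rank={local_rank} current_device={torch.cuda.current_device()} "
        f"allocated={torch.cuda.memory_allocated(device) / 1e6:.1f} MB",
        flush=True,
    )

report("after set_device")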