I am using the following code snippet to ensure each task/process (local_rank) generated by slurm (using #SBATCH -n 4) is assigned to a specific local GPU:
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)
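For context, the full rank setup looks roughly like this (a sketch; the SLURM_PROCID / SLURM_LOCALID / SLURM_NTASKS variables and the env:// rendezvous are assumptions about how the ranks get into the script):

import os
import torch
import torch.distributed as dist

# Sketch: derive the ranks from the environment variables srun sets per task
# (an assumption; adjust if the ranks are passed some other way).
world_rank = int(os.environ["SLURM_PROCID"])
local_rank = int(os.environ["SLURM_LOCALID"])
world_size = int(os.environ["SLURM_NTASKS"])

# Pin this task to its local GPU before any other CUDA call, so nothing is
# created implicitly on GPU 0.
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Assumes MASTER_ADDR and MASTER_PORT are exported in the batch script.
dist.init_process_group("nccl", rank=world_rank, world_size=world_size)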
Still, when I look at nvidia-smi, in the "Process GPU" column I see two different GPUs (say 0 and 1) assigned to a single PID. If I tie a slurm task to a GPU using device (which here is built from local_rank), I should not end up with multiple GPU allocations, right?
Not sure if this is important, but I am using two workers in my DataLoader (see definition below); even so, that should keep the process on the same GPU (say 0), right?
trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=128,
    num_workers=2,
    sampler=torch.utils.data.distributed.DistributedSampler(
        dataset=trainset, num_replicas=world_size, rank=world_rank
    ),
)
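For completeness, the per-rank training step looks roughly like this (a sketch; model and optimizer stand in for my actual definitions), so the only device anything is moved to is the one pinned above:

model = model.to(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

for inputs, targets in trainloader:
    # Move each batch to this task's pinned GPU only.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()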
These are the relevant SBATCH directives:

#SBATCH --nodes 2
#SBATCH -n 4
#SBATCH --gres=gpu:4
Four tasks are to be launched across the two nodes, and each task should have one GPU allocated (because inside the program launched by each task/process I have set torch.cuda.set_device(device) as shown above).
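The batch script as a whole is roughly the following (a sketch; train.py and the MASTER_ADDR/MASTER_PORT setup are placeholders, not my exact values):

#!/bin/bash
#SBATCH --nodes 2
#SBATCH -n 4
#SBATCH --gres=gpu:4

# Rendezvous info for init_process_group's default env:// method
# (placeholder values).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one instance of the training script per task (4 in total).
srun python train.py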
In nvidia-smi, though, PID 6173 for example is allocated to two GPUs (0 and 1). The 4th task moved to the other node, so please ignore it.
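To make the comparison with nvidia-smi easier, a per-rank print right after torch.cuda.set_device (a sketch, reusing the names from the snippets above) shows which device each task believes it owns:

import socket

def report(tag):
    # If nvidia-smi lists memory for this PID on a GPU that never shows up
    # here, the extra context is created by something that runs before
    # torch.cuda.set_device(device).
    print(
        f"[{tag}] host={socket.gethostname()} rank={world_rank} "
        f"local_rank={local_rank} current_device={torch.cuda.current_device()} "
        f"allocated={torch.cuda.memory_allocated(device) / 1e6:.1f} MB",
        flush=True,
    )

report("after set_device")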