I am using a GPU cluster to train my model, and I use the
torch.distributed module to drive the two GPUs on one node. I set up the DataLoader like this:
ds = CityScapes(cfg, mode='train')
dl = DataLoader(ds, batch_size=8, shuffle=False, sampler=sampler, num_workers=4, pin_memory=True, drop_last=True)
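For context, here is a minimal, self-contained sketch of how the `sampler` in that snippet is typically built with `torch.utils.data.DistributedSampler`. The `TensorDataset` stands in for `CityScapes` (which is not shown above), and `num_replicas`/`rank` are passed explicitly here only so the sketch runs outside an initialized process group; under `torch.distributed.launch` they are picked up from the environment automatically.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Hypothetical stand-in for the CityScapes dataset: 64 samples of 3x8x8 images.
images = torch.randn(64, 3, 8, 8)
labels = torch.randint(0, 10, (64,))
ds = TensorDataset(images, labels)

# DistributedSampler partitions the dataset across ranks. Outside a real
# distributed run we pass num_replicas/rank explicitly (rank 0 of 2 here);
# inside torch.distributed they default to the process group's values.
sampler = DistributedSampler(ds, num_replicas=2, rank=0, shuffle=True)

# shuffle=False because the sampler already handles shuffling;
# num_workers=4 spawns 4 loader worker processes per training process.
dl = DataLoader(ds, batch_size=8, shuffle=False, sampler=sampler,
                num_workers=4, pin_memory=True, drop_last=True)

for batch_imgs, batch_labels in dl:
    # Each rank sees half of the 64 samples: 32 / 8 = 4 batches per epoch.
    print(batch_imgs.shape)
    break
```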
This means each GPU's training process will spawn 4 worker processes to load the data. Counting the training process itself, on this node with 2 GPUs there would be 2 × (1 + 4) = 10 processes. So should I start my training like this?
srun -p gpu --gres=gpu:2 --ntasks-per-node=10 python -m torch.distributed.launch --nproc_per_node=2 train.py
Is this the correct way to make it work?