Thanks for the reply. Please correct me if I'm wrong, but distributed training is meant to be used for:
- Multi-node & multi-gpu training
- Single-node & multi-gpu training → my use case scenario
I've seen them, and according to the recommendation shown below, I launch my script for my scenario as explained above:
How to use this module:
- Single-Node multi-process distributed training
    python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
        YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
        arguments of your training script)
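
Concretely, this is how I launch it in my case (train.py here is just a placeholder for my actual training script, on a node with 2 GPUs):

    python -m torch.distributed.launch --nproc_per_node=2 train.py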
This is what I don't understand: why do I need to specify a GPU for distributed training on a single node with multiple GPUs?
Obviously I want to use all the GPUs, so do I still need --local_rank?
I thought it was used to specify GPUs in a multi-node scenario where nodes might have different numbers of GPUs.
When launching launch.py with --nproc_per_node=2, where 2 is the number of GPUs, it passes --local_rank=0. Shouldn't that be 2 instead, one for each GPU?
Even if I parse the --local_rank argument in my script and set the device to that rank (roughly the sketch below), that would still use only one of the available GPUs. I need to use all of them, and that's where the confusion in my understanding lies. How do we do that?
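
To make that concrete, this is roughly what I mean by parsing --local_rank and setting the device (a minimal sketch of the setup code only; the model and training loop are omitted, and the init_process_group call is just my understanding of the usual setup for torch.distributed.launch):

    import argparse

    import torch
    import torch.distributed as dist

    # torch.distributed.launch passes --local_rank to each process it spawns
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # pin this process to the single GPU given by its local rank
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    device = torch.device("cuda", args.local_rank)
    # ... build the model on `device` and run the training loop ...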