I’m trying to train with PyTorch on 2 GPUs.
import os
from apex.parallel import DistributedDataParallel as DDP
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
#model = nn.DataParallel(model, output_device=1)
model = DDP(model, delay_allreduce=True)
I added the init_process_group and DDP parts, but:
- it does not run without errors
- I don’t know what the backend and init_method arguments mean
Check out this doc and this tutorial.
The init_method arg tells the process group how to perform rendezvous. You can use either of the following two options:
- set it to an explicit address string. E.g., for local training, you can specify sth like tcp://localhost:23456 to tell all processes to rendezvous on that host and port.
- set it to env://, which reads the rendezvous address from the MASTER_ADDR and MASTER_PORT env vars.
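For example, here is a minimal sketch of the first option: a single-process rendezvous (world_size=1, gloo backend so it also runs on a CPU-only machine; the port number 23456 is arbitrary):

```python
import torch.distributed as dist

# Rendezvous via an explicit TCP address: every process connects to this
# host/port to find its peers. With world_size=1, one process forms the
# whole group; in real multi-GPU training you would use backend="nccl"
# and give each process its own rank.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://localhost:23456",
    rank=0,
    world_size=1,
)

print(dist.is_initialized())  # → True
dist.destroy_process_group()
```

With the env:// option, you would instead export MASTER_ADDR and MASTER_PORT before starting the processes and drop the explicit address from the call.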
BTW, for questions regarding distributed training, please add a “distributed” tag, so that the corresponding team can monitor them closely.
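Also note that apex’s DistributedDataParallel expects one process per GPU. A common way to launch such a script (train.py is a placeholder for your own entry point) is the torch.distributed.launch helper, which sets the RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT variables that init_method="env://" reads:

```shell
# Spawn 2 worker processes, one per visible GPU; each receives a
# distinct local rank and the rendezvous env vars.
python -m torch.distributed.launch --nproc_per_node=2 train.py
```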