Can you share a minimal repro of train_net.py, especially how you call init_process_group and the DistributedDataParallel constructor?
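For reference, a minimal DDP repro usually looks roughly like the sketch below (the model and training step are placeholders, not your actual train_net.py; it assumes a single-node NCCL setup launched with torchrun):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # 1) The init_process_group call.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # 2) The DistributedDataParallel ctor.
    model = nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    # Placeholder forward/backward step.
    out = ddp_model(torch.randn(4, 10, device=local_rank))
    out.sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with e.g. `torchrun --nproc_per_node=2 repro.py` (script name is just an example). If your script deviates from this pattern, those call sites are exactly what we'd like to see.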
BTW, could you please add a “distributed” tag to future torch.distributed-related posts? That way the PT distributed team can get back to you promptly.