I am trying to set up distributed training and ran into a problem with process group initialization.
Since I have a shared file system between the nodes, I chose the file:// initialization method, but I got this error:
ValueError: Error initializing torch.distributed using file:// rendezvous: rank parameter missing
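For context, this is roughly what I am running (the backend and the file path are just examples; the path points at our shared file system):

```python
import torch.distributed as dist

# This call raises the ValueError above: no rank is passed, since I
# expected the file:// rendezvous to assign ranks automatically.
dist.init_process_group(
    backend="gloo",                              # "nccl" on the GPU nodes
    init_method="file:///mnt/shared/dist_init",  # example path on the shared FS
    world_size=2,
)
```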
Then I found a note in the documentation that “automatic rank assignment is not supported anymore”, although the documentation for init_process_group implies otherwise.
Is there a way to avoid passing the rank to init_process_group explicitly? And what is the point of init_process_group if I have to pass the rank explicitly?
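For now I have worked around it by reading the rank and world size from environment variables and passing them in explicitly (this assumes my launch script, e.g. torchrun or a SLURM wrapper, exports RANK and WORLD_SIZE):

```python
import os
import torch.distributed as dist

# Workaround: take rank/world size from environment variables set by
# the launcher instead of relying on automatic rank assignment.
dist.init_process_group(
    backend="gloo",
    init_method="file:///mnt/shared/dist_init",  # example path on the shared FS
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```

This works, but it feels like it defeats the purpose of the file:// rendezvous, so I would like to know if there is a cleaner way.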