RuntimeError: flock: Function not implemented

Hi,
I wrote a distributed data parallel training script in PyTorch on Windows 10, and everything runs fine without error. When I run it on Linux, it gives me this error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/", line 19, in _wrap
fn(i, *args)
File "/scratch/users/industry/ite/philipso/m/pixelsnail_25062021/", line 121, in train
rank=args[0] )
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/", line 455, in init_process_group
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: flock: Function not implemented

This is how I coded it. I used:

for machine_rank in range(world_size):
    mp.spawn(train, args=(world_size, backend, c_param), nprocs=world_size, join=True)

followed by:

def train(*args):
    if args[3].training_type == 1:
        torch.distributed.init_process_group(
            backend=args[2],
            init_method=r"file:///" + args[3].classifier + ".log",
            world_size=args[1],
            rank=args[0],
        )

I suspect the problem is the join=True in mp.spawn.

Any help will be much appreciated.

This might be a file-system-related issue (see Run code for training got some errors · Issue #20 · wenet-e2e/wenet · GitHub for a similar report). What kind of file system are you using for the file passed to init_method?
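One way to answer that question on Linux is to look up which mount a path belongs to. This is a standalone sketch (not part of PyTorch) that scans /proc/mounts; the function name fs_type is just an illustrative choice:

```python
import os

def fs_type(path):
    """Return the file system type of the mount containing `path` (Linux only)."""
    path = os.path.realpath(path)
    best, fstype = "", "unknown"
    with open("/proc/mounts") as f:
        for line in f:
            # Each line: <device> <mountpoint> <fstype> <options> ...
            # (mount points with spaces are escaped, which this sketch ignores)
            _, mountpoint, typ = line.split()[:3]
            if path.startswith(mountpoint) and len(mountpoint) > len(best):
                best, fstype = mountpoint, typ
    return fstype

print(fs_type("/"))  # e.g. "ext4", "xfs", "lustre", ...
```

Running it on the directory holding the init_method file would show, for example, "lustre" for a Lustre mount.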

The Linux system is CentOS 6;
my laptop runs Windows 10.

Thanks, bro, for helping me :slightly_smiling_face:

The init_method I am using is a shared file system.

Thank you :slight_smile:

My shared file is located on Scratch, which is a Lustre file system :slight_smile:

It is possible that the shared file system doesn't support the flock system call. If you are training on one node, can you use the local file system instead?
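A quick way to confirm whether a given directory's file system supports flock at all is to try taking a lock there directly. This is a standalone diagnostic sketch (not part of PyTorch); supports_flock is a hypothetical helper name:

```python
import fcntl
import os
import tempfile

def supports_flock(directory):
    """Try to take and release an exclusive flock on a temp file in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # flock raises OSError (e.g. ENOSYS) on file systems that don't support it
        return False
    finally:
        os.close(fd)
        os.remove(path)

print(supports_flock("/tmp"))           # local fs: normally True
# print(supports_flock("/scratch/...")) # try the shared directory you pass to init_method
```

If this returns False for the Scratch directory, the file:// init_method cannot work there and a different init method (or a flock-enabled mount) is needed.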

If you are training with multiple nodes, you can instead use TCP initialization described here: Distributed communication package - torch.distributed — PyTorch 1.9.0 documentation.

Also, looking at Mounting a Lustre File System on Client Nodes - Lustre Wiki, it seems like there are options to mount Lustre with support for flock, that might be another option.

Hi, thanks.
I have solved all the problems except this "flock: Function not implemented". Interestingly, the shared file did get created; however, it could not be flocked. The problem is with init_process_group. What can I replace this with?

As I mentioned above, you could use TCP initialization as follows:

import torch.distributed as dist

# Use address of one of the machines
# Use the address of one of the machines, e.g. tcp://10.1.1.20:23456
dist.init_process_group(backend, init_method='tcp://10.1.1.20:23456',
                        rank=args.rank, world_size=4)