RuntimeError: flock: Function not implemented

Hi,
I wrote a distributed data parallel script in PyTorch on Windows 10, and everything runs without error. When I run it on Linux, it gives me an error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/scratch/users/industry/ite/philipso/m/pixelsnail_25062021/main.py", line 121, in train
    rank=args[0] )
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: flock: Function not implemented

This is how I coded it. I used:

    for machine_rank in range(world_size):
        mp.spawn(train, args=(world_size, backend, c_param), nprocs=world_size, join=True)

followed by:

    def train(*args):
        if args[3].training_type == 1:
            torch.distributed.init_process_group(
                backend=args[2],
                init_method=r"file:///" + args[3].classifier + ".log",
                world_size=args[1],
                rank=args[0],
            )

I suspect the problem is join=True in mp.spawn.

Any help would be much appreciated.

This might be a file system related issue (see Run code for training got some errors · Issue #20 · wenet-e2e/wenet · GitHub for a similar issue). What kind of file system are you using for the file in init_method?

The Linux system runs CentOS 6.
My laptop runs Windows 10.

Thanks for helping me, bro.

The init_method I am using is the shared file system one.

Thank you.

My shared file is located in the scratch space of a Lustre file system.

It could be possible that the shared file system doesn’t support the flock system call. If you are training on one node, can you use the local file system instead?
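For the single-node case, a minimal sketch of pointing init_method at node-local storage could look like this. It assumes the system temp directory (usually /tmp on Linux) lives on a local disk rather than on Lustre; the helper name local_file_init_method is made up for illustration and is not part of PyTorch.

```python
import os
import tempfile


def local_file_init_method(name):
    """Build a file:// init_method URL in the system temp directory.

    Hypothetical helper, not a PyTorch API. Call it once in the parent
    process before mp.spawn, so that workers do not delete a file that
    another worker has already started using.
    """
    # Assumption: the temp directory is on a local file system, not on Lustre.
    path = os.path.join(tempfile.gettempdir(), name)
    # A leftover file from an earlier run can confuse file-based rendezvous.
    if os.path.exists(path):
        os.remove(path)
    return "file://" + path


# Usage sketch (assumes the same arguments as in the question):
#   init_method = local_file_init_method("ddp_init")
#   mp.spawn(train, args=(world_size, backend, c_param, init_method), nprocs=world_size)
# and inside train:
#   torch.distributed.init_process_group(backend=backend, init_method=init_method,
#                                        world_size=world_size, rank=rank)
```

This only works when all processes run on the same machine, since other nodes cannot see a node-local file.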

If you are training with multiple nodes, you can instead use TCP initialization described here: Distributed communication package - torch.distributed — PyTorch 1.9.0 documentation.

Also, looking at Mounting a Lustre File System on Client Nodes - Lustre Wiki, it seems there are options to mount Lustre with support for flock, which might be another option.
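To check whether a given directory actually supports flock before changing anything, you could try taking a lock on a scratch file there. This is a rough sketch that assumes Linux and Python's standard fcntl module; supports_flock is a hypothetical helper name, and the directory you pass in would be wherever your init file lives.

```python
import fcntl
import os
import tempfile


def supports_flock(directory):
    """Return True if flock() succeeds on a temporary file in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Take and release an exclusive, non-blocking lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # "Function not implemented" (ENOSYS) ends up here, as do other
        # locking failures, so False means "do not rely on flock here".
        return False
    finally:
        os.close(fd)
        os.remove(path)
```

If this returns False for the scratch directory but True for a local directory such as /tmp, that would point to the mount options rather than to the training code.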

Hi, thanks.
I have solved all problems except this "flock: Function not implemented" error. Interestingly, the shared file was created; however, it could not be locked with flock. The problem is with init_process_group. What can I replace it with?

As I mentioned above, you could use TCP initialization as follows:

import torch.distributed as dist

# Use address of one of the machines
dist.init_process_group(backend, init_method='tcp://10.1.1.20:23456',
                        rank=args.rank, world_size=4)
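If everything runs on one machine with mp.spawn, the address can simply be localhost. The sketch below builds such a tcp:// URL on a port that is free at the time of the call; tcp_init_method is a hypothetical helper, not something PyTorch provides, and there is a small window in which another program could grab the port before the process group binds to it.

```python
import socket


def tcp_init_method(host="127.0.0.1"):
    """Return a tcp://host:port URL using a port the OS reports as free."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 asks the OS to pick any unused port.
        s.bind((host, 0))
        port = s.getsockname()[1]
    return "tcp://{}:{}".format(host, port)


# Usage sketch (assumed names, following the question's code):
#   init_method = tcp_init_method()   # call once in the parent process
#   mp.spawn(train, args=(world_size, backend, c_param, init_method), nprocs=world_size)
# and inside train:
#   torch.distributed.init_process_group(backend=backend, init_method=init_method,
#                                        world_size=world_size, rank=rank)
```

The URL has to be created once and passed to every worker, because all ranks must connect to the same address and port. For multiple nodes, use the reachable address of the rank 0 machine and a fixed port instead, as in the snippet above.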