RuntimeError: flock: Function not implemented

Hi,
I wrote a distributed data parallel training script in PyTorch on Windows 10, and everything runs fine without error. When I run it on Linux, it gives me this error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/", line 19, in _wrap
fn(i, *args)
File "/scratch/users/industry/ite/philipso/m/pixelsnail_25062021/", line 121, in train
rank=args[0] )
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/", line 455, in init_process_group
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: flock: Function not implemented

This is how I coded it. I used:

for machine_rank in range(world_size):
    mp.spawn(train, args=(world_size, backend, c_param), nprocs=world_size, join=True)

followed by:

def train(*args):
    if args[3].training_type == 1:
        torch.distributed.init_process_group(
            backend=args[2],
            init_method=r"file:///" + args[3].classifier + ".log",
            world_size=args[1],
            rank=args[0],
        )

I suspect the problem is the join=True in mp.spawn.

Any help will be much appreciated.

This might be a file-system-related issue (see Run code for training got some errors · Issue #20 · wenet-e2e/wenet · GitHub for a similar report). What kind of file system are you using for the file passed to init_method?
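One way to answer that question on Linux is to look up which mount a path belongs to. This is a standalone sketch (not part of PyTorch) that scans /proc/mounts; the function name fs_type is just an illustrative choice:

```python
import os

def fs_type(path):
    """Return the file system type of the mount containing `path` (Linux only)."""
    path = os.path.realpath(path)
    best, fstype = "", "unknown"
    with open("/proc/mounts") as f:
        for line in f:
            # Each line: <device> <mountpoint> <fstype> <options> ...
            # (mount points with spaces are escaped, which this sketch ignores)
            _, mountpoint, typ = line.split()[:3]
            if path.startswith(mountpoint) and len(mountpoint) > len(best):
                best, fstype = mountpoint, typ
    return fstype

print(fs_type("/"))  # e.g. "ext4", "xfs", "lustre", ...
```

Running it on the directory holding the init_method file would show, for example, "lustre" for a Lustre mount.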

The Linux system is CentOS 6;
my laptop runs Windows 10.

Thanks, bro, for helping me :slightly_smiling_face:

The init_method I am using is a shared file system.

Thank you :slight_smile:

My shared file is located on Scratch, which is a Lustre file system :slight_smile:

It is possible that the shared file system doesn't support the flock system call. If you are training on one node, can you use the local file system instead?
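A quick way to confirm whether a given directory's file system supports flock at all is to try taking a lock there directly. This is a standalone diagnostic sketch (not part of PyTorch); supports_flock is a hypothetical helper name:

```python
import fcntl
import os
import tempfile

def supports_flock(directory):
    """Try to take and release an exclusive flock on a temp file in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # flock raises OSError (e.g. ENOSYS) on file systems that don't support it
        return False
    finally:
        os.close(fd)
        os.remove(path)

print(supports_flock("/tmp"))           # local fs: normally True
# print(supports_flock("/scratch/...")) # try the shared directory you pass to init_method
```

If this returns False for the Scratch directory, the file:// init_method cannot work there and a different init method (or a flock-enabled mount) is needed.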

If you are training with multiple nodes, you can instead use TCP initialization described here: Distributed communication package - torch.distributed — PyTorch 1.9.0 documentation.

Also, looking at Mounting a Lustre File System on Client Nodes - Lustre Wiki, it seems like there are options to mount Lustre with support for flock, that might be another option.

Hi, thanks.
I have solved all the problems except this "flock: Function not implemented". Interestingly, the shared file did get created; however, it could not be flocked. The problem is with init_process_group. What can I replace this with?

As I mentioned above, you could use TCP initialization as follows:

import torch.distributed as dist

# Use address of one of the machines
# Use the address of one of the machines, e.g. tcp://10.1.1.20:23456
dist.init_process_group(backend, init_method='tcp://10.1.1.20:23456',
                        rank=args.rank, world_size=4)