I want to put a logging buffer in shared memory so I can batch my writes to the filesystem, since each write becomes a network call when I run my setup in Kubernetes.

I'm trying to use `multiprocessing.Lock` and `multiprocessing.Semaphore`, but when I create the lock in the local rank 0 process and broadcast it via `torch.distributed.broadcast_object_list`, I get a `multiprocessing` `RuntimeError` saying that mutexes must be created in the parent process. Creating the mutex from a `multiprocessing.Manager` doesn't work either; it fails some kind of authorization check.

How can I create a mutex for a shared resource using what is available in `torch.distributed`?
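
For reference, the first failure can be reproduced without a process group: `broadcast_object_list` serializes its payload with pickle, and a `multiprocessing.Lock` refuses to be pickled outside of fork/spawn inheritance. A minimal sketch (the exact error message may vary by Python version):

```python
import multiprocessing as mp
import pickle

# torch.distributed.broadcast_object_list pickles the objects it sends,
# so the same failure reproduces with pickle alone, no process group needed.
lock = mp.Lock()

try:
    # A multiprocessing.Lock may only be shared with children via inheritance.
    pickle.dumps(lock)
except RuntimeError as exc:
    print(f"RuntimeError: {exc}")
```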