Hi, does PyTorch provide routines or support for parallel writes to the same single file (e.g. a binary file) from multiple distributed processes at the same time? I am looking for functionality similar to the link below, but using only PyTorch. Tutorial — MPI for Python 3.1.4 documentation.
I heavily use torch.distributed, which has many functions for communication between distributed processes, but I couldn't find any function for writing a binary or similar data file from multiple processes simultaneously.
No, PyTorch provides no support for parallel writes to a single file.
This is not an easy task, since controlling which process may write is not portable and quite complex. It's even worse if you're doing it over a network filesystem.
You could use object collectives to negotiate which process writes to which part of the file, and then use regular file writes.
Hi @kumpera, thank you very much for your response. Can you please elaborate on your solution? What do you mean by the use of object collectives? Sorry, I'm not familiar with the term.
I am willing to change my software architecture. For now, I am writing a separate file per process, but it would be much better to write just one file for all processes.
You can use something like this: Distributed communication package - torch.distributed — PyTorch master documentation
That way, all ranks can tell each other which items they need to write and where.