DDP: exclusive writing on disk with multiple nodes

– This is no longer a PyTorch issue, but you may still find some helpful tips for PyTorch DDP below. Thanks.

Hi,
I am using DDP.
On a single machine (one node) with many GPUs, everything is fine.
But with many nodes, each with many GPUs, I run into an issue with file writing.
Assume I run a job where each GPU is handled by one process.
Two cases are fine:

Case 1: every GPU is on a different node, never two GPUs on the same node.

Case 2: all GPUs are on the same node.

Say we want to write a temporary file that every process will need later. In case 1, all processes need to write (each to its own node-local disk), and there is no worry about writing simultaneously to the same file. In case 2, only the master writes.

The problem arises when some GPUs share a node: in that mixed case, several processes could attempt to write to the same file at the same time, which is a problem.

Questions:
1. How do I properly deal with the mixed case?
2. Is there a way to know whether a process is the node master?

For the second question, which would also solve question 1, one needs to check the local rank.

This is useful when writing files such as:

  • copies of the training data
  • checkpoints

because processes on one node cannot see the local disk of the other nodes.
Within a node, only one process needs to write,
so a node master needs to be designated.

Also, a global master can be designated as well, to handle unique operations that need to be done only once, such as logging.
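A minimal sketch of how this could look (assuming a launcher such as torchrun that sets LOCAL_RANK, or SLURM which sets SLURM_LOCALID, and an already initialized process group; the names here are only illustrative):

```python
import os
import torch.distributed as dist

# Assumes torch.distributed.init_process_group() has already been called.
# LOCAL_RANK is set by torchrun; SLURM_LOCALID is set by SLURM.
local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0")))

is_node_master = (local_rank == 0)           # first process on each node
is_global_master = (dist.get_rank() == 0)    # a single process for the whole job

if is_node_master:
    # Node-local work: copy/decompress training data to this node's local disk.
    pass

if is_global_master:
    # Job-wide work done exactly once: logging, writing global outputs, ...
    pass
```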

Thanks.

If you need different processes on the same node to write to a single combined file, you can first call gather or gather_object to collect all the outputs on one rank and then write them to the file there. Alternatively, you can let all the processes collectively write a sharded output, where each shard is a separate file written by one process.
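As a rough sketch of both options (using all_gather_object, a close variant of gather_object; it assumes the process group is already initialized, and `local_result` and the file names are placeholders):

```python
import json
import torch.distributed as dist

# Assumes init_process_group() has already been called.
rank = dist.get_rank()
world_size = dist.get_world_size()
local_result = {"rank": rank}  # placeholder for whatever this process produced

# Option A: collect everything, then let rank 0 write one combined file.
gathered = [None] * world_size
dist.all_gather_object(gathered, local_result)
if rank == 0:
    with open("combined_output.json", "w") as f:
        json.dump(gathered, f)

# Option B: every rank writes its own shard; file names are disjoint by rank.
with open(f"output_shard_{rank}.json", "w") as f:
    json.dump(local_result, f)
```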

If you only want the master to write, you can specify the condition torch.distributed.get_rank() == 0. This is usually good enough for checkpointing the model, since DDP guarantees the model on each process is the same.
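A minimal sketch of that pattern (assuming `model` is the DDP-wrapped module, the process group is initialized, and "checkpoint.pt" is just a placeholder path):

```python
import torch
import torch.distributed as dist

# Assumes `model` is wrapped in DistributedDataParallel.
if dist.get_rank() == 0:
    # DDP keeps the replicas identical, so saving from one rank is enough.
    torch.save(model.module.state_dict(), "checkpoint.pt")

# Prevent other ranks from reading the file before rank 0 has finished writing.
# Note: this only helps if the path is on storage visible to all ranks.
dist.barrier()
```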

Yes, for the sync part, that is what I was doing (the model syncs itself through DDP, and for other objects I use torch.distributed.all_gather), so this part is fine.

For writing, I allow only the node master to write.
For the last point, this is fine on paper but it does not work… not because of DDP or PyTorch, but because of SLURM.

I ask every node master to copy/fetch/decompress some data from elsewhere onto the node-local disk. The masters execute all the instructions, but in the end the other processes do not find the decompressed files. I wrote to the cluster IT staff asking for an explanation. Waiting…

So I will face the same issue when writing checkpoints, or any file, from each process… later, when asked to reload the checkpoint, it will be missing.
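For reference, the staging pattern described above looks roughly like this (a sketch only; LOCAL_RANK/SLURM_LOCALID are assumed to be set by the launcher, and the archive and scratch paths are placeholders):

```python
import os
import tarfile
import torch.distributed as dist

# Placeholder paths; the process group is assumed to be initialized.
local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0")))
staging_dir = "/tmp/my_dataset"  # node-local scratch directory

if local_rank == 0:
    os.makedirs(staging_dir, exist_ok=True)
    with tarfile.open("/shared/data/train.tar.gz") as archive:
        archive.extractall(staging_dir)  # decompress onto this node's local disk

# Every rank waits until its node master has finished decompressing.
dist.barrier()
```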

Thanks.

I think the answer now is quite specific to your cluster and not really about PyTorch. At a high level, I think you need to write the output to a dedicated distributed/shared file system or directory instead of the local disk of each node. It is best to ask the cluster admins for the right output destination.

  • Yes, PyTorch is needed only to control which process does the writing.
  • The rest is cluster dependent.
  • I/O on the local disk is the fastest and is the way recommended by the admins. Writing to a network disk or other shared storage is slow, sometimes extremely slow. For multi-node jobs, they provide a way to dispatch the data across all nodes using srun in the job script, but it is not clear how to do that from inside Python code. Each process needs to do some I/O operations, and it has to be done right in this case. I wrote to the admins; no answer yet.
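As a rough illustration of that split between node-local and shared storage (the environment variable and the shared path below are placeholders and depend entirely on the cluster):

```python
import os

# Placeholder logic: many SLURM clusters export a node-local scratch directory
# (often something like SLURM_TMPDIR), but the exact variable name and the
# shared filesystem path are cluster specific.
node_local_scratch = os.environ.get("SLURM_TMPDIR", "/tmp")  # fast, per node
shared_output_dir = "/shared/project/experiments/run_001"    # visible to all nodes

# Node-local scratch: temporary copies of the training data, per-process files.
# Shared storage: checkpoints and logs that must be reloadable from any node.
```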

So, yes, this question no longer concerns PyTorch. I will edit the question's title to clarify.
Thanks.
