– this is no longer a pytorch issue, but you may find some helpful tips for pytorch ddp here. thanks
i am using ddp.
on a single machine (node=1) w/ many gpus, it is fine.
but with many nodes w/ many gpus, i run into an issue with file writing.
assuming i run a job where each gpu handles one process.
case 1: each gpu is located on a different node and never 2 gpus on the same node
case 2: all gpus are located on the same node. in both cases, there is no issue.
let’s say we want to write a tmp file that every process will need later.
in case 1, all processes need to write, and there is no worry about two of them writing to the same file simultaneously, since no two processes share a node.
in case 2, only the master can write.
but the problem is when some gpus are located on the same node and others are not.
in that case, many processes could attempt to write to the same file, which is a problem.
1- how to properly deal with the mixed case?
2- is there a way to know if the process is a node master?
for the second question, which would also solve q1, one needs to check the local rank.
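a minimal sketch of that check, assuming the job is launched with torchrun (which exports LOCAL_RANK for each process); the fallback of computing the local rank from the global rank only holds if every node has the same number of gpus and ranks are assigned node by node:

```python
import os

import torch
import torch.distributed as dist

def get_local_rank() -> int:
    # torchrun / torch.distributed.launch export LOCAL_RANK for each process
    if "LOCAL_RANK" in os.environ:
        return int(os.environ["LOCAL_RANK"])
    # fallback (assumption): homogeneous nodes, ranks filled node by node
    return dist.get_rank() % torch.cuda.device_count()

def is_node_master() -> bool:
    # exactly one process per node has local rank 0
    return get_local_rank() == 0
```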
this is useful when writing files such as:
- copying the training data to each node's local disk.
processes on one node cannot see the disks of other nodes, so within a node only one process needs to write.
a node master needs to be designated (see the sketch below).
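a sketch of how the mixed case could then be handled: only the node master writes the per-node file, and a barrier makes every process wait until the write is done. the write_fn and copy_training_data helper here are made up for illustration:

```python
import torch.distributed as dist

def prepare_node_local_file(write_fn):
    # only one process per node writes (e.g. copying training data to local disk)
    if is_node_master():
        write_fn()
    # every rank waits here, so nobody reads the file before it exists
    dist.barrier()

# usage (hypothetical helper and path):
# prepare_node_local_file(lambda: copy_training_data("/scratch/train_data"))
```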
also, a global master can be designated as well to handle unique operations that need to be done only once, such as logging.
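for the global master, a simple guard on the global rank keeps one-time operations such as logging to a single process:

```python
import torch.distributed as dist

def log_once(message: str) -> None:
    # only global rank 0 logs, so each message appears exactly once per job
    if dist.get_rank() == 0:
        print(message)
```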