Consider the scenario where we have two machines, each equipped with 8 GPUs, all of which can access the same folder through a distributed file system. I’m curious whether conflicts might arise when these two nodes attempt to save logs into the same directory. Take for instance the following torchrun commands executed on these nodes, both utilizing the same log_dir:
# node #0
torchrun --rdzv-id aaaa --master_addr bbbb --master_port cccc --nnodes 2 --node_rank 0 --nproc_per_node 8 --log_dir ./logs --redirects 3 --tee 3 train.py
# node #1
torchrun --rdzv-id aaaa --master_addr bbbb --master_port cccc --nnodes 2 --node_rank 1 --nproc_per_node 8 --log_dir ./logs --redirects 3 --tee 3 train.py
The question that arises is whether these two nodes might concurrently attempt to write to the same log file within the specified log_dir.
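If the files can in fact collide, one way to sidestep the issue regardless of how torchrun names them would be to give each node its own subdirectory. Below is a minimal sketch of that idea, not a statement about torchrun's actual behavior. NODE_RANK is a hypothetical variable set by hand on each machine (it is an assumption of this sketch), and the node_<rank> directory naming is likewise made up for illustration:

```shell
# NODE_RANK is a hypothetical variable set manually on each machine (0 or 1).
NODE_RANK="${NODE_RANK:-0}"

# Per-node subdirectory, so the two launchers never write under the same path.
LOG_DIR="./logs/node_${NODE_RANK}"
mkdir -p "$LOG_DIR"

# Print the command instead of running it; remove "echo" to actually launch.
echo torchrun --rdzv-id aaaa --master_addr bbbb --master_port cccc \
  --nnodes 2 --node_rank "$NODE_RANK" --nproc_per_node 8 \
  --log_dir "$LOG_DIR" --redirects 3 --tee 3 train.py
```

With NODE_RANK=0 on the first machine and NODE_RANK=1 on the second, the logs would land in ./logs/node_0 and ./logs/node_1 respectively, at the cost of having to look in two places afterwards.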