`torch.distributed.barrier` used in multi-node distributed data-parallel training

To put it simply, if you just want a single process (e.g. rank 0) to execute things like mkdir or downloading data while the other processes wait for it, you should do this:

import torch
import argparse


def main():
    # Initialize the distributed backend, which takes care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    local_rank = args.local_rank

    # With the NCCL backend each process must use its own GPU,
    # otherwise collectives such as barrier() can hang or error out.
    torch.cuda.set_device(local_rank)

    torch.distributed.barrier()

    if local_rank == 0:
        print(local_rank)
    
    torch.distributed.barrier()

    print("{} exit".format(local_rank))


if __name__ == "__main__":
    main()

this will print:

0
0 exit
2 exit
1 exit
3 exit
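
Applied to the actual mkdir/download case, the same pattern looks roughly like the sketch below (the "data" directory and download_dataset are placeholders for your own paths and logic):

import os
import torch


def prepare_data(local_rank):
    # Assumes init_process_group has already been called, as in the snippet above.
    # Everyone waits here until all processes have arrived.
    torch.distributed.barrier()

    if local_rank == 0:
        # Only rank 0 creates the directory / downloads the data.
        os.makedirs("data", exist_ok=True)
        # download_dataset("data")  # placeholder for your own download logic

    # The other ranks wait at this second barrier, so none of them starts
    # reading from "data" before rank 0 has finished the work above.
    torch.distributed.barrier()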

And you should not do this:

import torch
import argparse


def main():
    # Initialize the distributed backend, which takes care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    local_rank = args.local_rank

    # Again, pin each process to its own GPU for the NCCL backend.
    torch.cuda.set_device(local_rank)

    if local_rank != 0:
        torch.distributed.barrier()

    print(local_rank)
    
    if local_rank == 0:
        torch.distributed.barrier()

    print("{} exit".format(local_rank))


if __name__ == "__main__":
    main()

which will print:

0
0 exit
2
2 exit
1
3
3 exit
1 exit

A barrier is just a barrier: it requires all processes in the group to reach some barrier call, no matter where that call is placed. So the second snippet merely delays every process except 0 until rank 0 reaches its own barrier; after that, they all still execute the code in between. Unless that code becomes a no-op (effectively a return / pass) once any one process (e.g. process 0) has already executed it, you are not going to get the result you expect.
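
For example, the second pattern only makes sense for something like the following sketch, where load_or_download is a hypothetical helper that downloads on the first call and afterwards just reads a cache, so the work effectively becomes a no-op for the non-zero ranks once rank 0 has done it:

# Sketch only: assumes local_rank and init_process_group are set up as in the snippets above.
if local_rank != 0:
    # Non-zero ranks wait here while rank 0 fills the cache.
    torch.distributed.barrier()

# Hypothetical helper: downloads into cache_dir if it is empty, otherwise just
# reads the cache, so for ranks != 0 this is a cheap read once rank 0 has run it.
data = load_or_download("cache_dir")

if local_rank == 0:
    # Rank 0 releases the other ranks only after its download has finished.
    torch.distributed.barrier()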

Also, please make sure that your CUDA runtime has the same major and minor version as the CUDA version your PyTorch build was compiled with; CUDA 9 is not compatible with CUDA 10, so a mismatch is likely to cause issues when using "nccl" or CUDA tensor computations.
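
A quick way to check which CUDA version your PyTorch build was compiled against (compare it with what nvcc --version or nvidia-smi reports on your machine):

import torch

# CUDA version this PyTorch build was compiled with, e.g. "10.1" (None for CPU-only builds)
print(torch.version.cuda)
# Whether a CUDA device is actually usable at runtime
print(torch.cuda.is_available())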
