DDP barrier has no effect when more than one is used

When using dist.barrier() in multi-GPU DDP (I launch with torchrun --standalone --nnodes=1 --nproc_per_node=gpu example.py on a 6-GPU machine), I expect every barrier call to block all ranks until all of them have arrived at that barrier. However, only the first barrier blocks all processes; the second one falls through as if it did not exist. See the example:

# example.py
import time
from typing import Tuple

from torch import distributed as dist


def ddp_setup() -> Tuple[int, int]:
    dist.init_process_group("nccl")
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    return rank, world_size


def ddp_cleanup() -> None:
    dist.destroy_process_group()


def main():
    rank, world_size = ddp_setup()

    # Rank 0 simulates about 5 s of extra work, so the other ranks reach barrier1 first.
    if rank == 0:
        time.sleep(1)
        for i in range(3):
            time.sleep(1)
            print(f"[RANK {rank}] 1-{i}", flush=True)
        time.sleep(1)

    # Expected: no rank prints "after barrier1" until rank 0 has arrived.
    print(f"[RANK {rank}] before barrier1", flush=True)
    dist.barrier()
    print(f"[RANK {rank}] after barrier1", flush=True)

    time.sleep(3)

    # Same pattern again for barrier2.
    if rank == 0:
        time.sleep(1)
        for i in range(3):
            time.sleep(1)
            print(f"[RANK {rank}] 2-{i}", flush=True)
        time.sleep(1)

    # Expected: same blocking behaviour as barrier1.
    print(f"[RANK {rank}] before barrier2", flush=True)
    dist.barrier()
    print(f"[RANK {rank}] after barrier2", flush=True)

    ddp_cleanup()


if __name__ == "__main__":
    main()

output:

[RANK 5] before barrier1
[RANK 4] before barrier1
[RANK 1] before barrier1
[RANK 3] before barrier1
[RANK 2] before barrier1
[RANK 0] 1-0
[RANK 0] 1-1
[RANK 0] 1-2
[RANK 0] before barrier1
[RANK 4] after barrier1
[RANK 2] after barrier1
[RANK 5] after barrier1
[RANK 1] after barrier1
[RANK 0] after barrier1
[RANK 3] after barrier1
[RANK 4] before barrier2
[RANK 4] after barrier2
[RANK 2] before barrier2
[RANK 2] after barrier2
[RANK 5] before barrier2
[RANK 5] after barrier2
[RANK 1] before barrier2
[RANK 1] after barrier2
[RANK 3] before barrier2
[RANK 3] after barrier2
[RANK 0] 2-0
[RANK 0] 2-1
[RANK 0] 2-2
[RANK 0] before barrier2
[RANK 0] after barrier2

Notice that at barrier1 all ranks wait for rank 0 to finish its task, and every “after barrier1” line appears after “[RANK 0] before barrier1”, as expected. At barrier2, which does exactly the same thing, the “after barrier2” lines for ranks 1-5 appear before rank 0 has even started its task. (Note: adding more barriers makes no difference; only the first one works.)
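The ordering property described above can be checked mechanically instead of by eye. The following is a minimal sketch that uses only the standard library and assumes the log format printed by example.py; the names LINE_RE and barrier_violations are hypothetical helpers, not part of PyTorch. It also assumes that the printed order reflects the real order of events, which is only approximately true across processes but should hold here given the multi-second sleeps.

```python
import re
from typing import Dict, List, Set

# Matches lines such as "[RANK 3] before barrier2"; other lines are ignored.
LINE_RE = re.compile(r"\[RANK (\d+)\] (before|after) (barrier\d+)")


def barrier_violations(log: str) -> Dict[str, List[int]]:
    """Return, per barrier, the ranks that printed "after" before
    every rank seen in the log had printed "before"."""
    events = [m.groups() for m in map(LINE_RE.match, log.splitlines()) if m]
    all_ranks: Set[int] = {int(rank) for rank, _, _ in events}
    arrived: Dict[str, Set[int]] = {}
    violations: Dict[str, List[int]] = {}
    for rank, phase, name in events:
        seen = arrived.setdefault(name, set())
        if phase == "before":
            seen.add(int(rank))
        elif seen != all_ranks:
            # This rank left the barrier while some rank had not yet arrived.
            violations.setdefault(name, []).append(int(rank))
    return violations


# Shortened two-rank log in the same format, for illustration only.
sample = """\
[RANK 1] before barrier1
[RANK 0] before barrier1
[RANK 1] after barrier1
[RANK 0] after barrier1
[RANK 1] before barrier2
[RANK 1] after barrier2
[RANK 0] before barrier2
[RANK 0] after barrier2
"""
print(barrier_violations(sample))  # {'barrier2': [1]}
```

An empty dictionary means every barrier held. Run against the full output above, this should report ranks 1-5 for barrier2 and nothing for barrier1.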

Why is this happening? How can I use barriers multiple times?
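One setup detail that may be worth ruling out, offered as an assumption rather than a confirmed cause: example.py never selects a CUDA device, so each process is left on the default current device. A minimal sketch of a variant of ddp_setup() that binds each process to its own GPU before the first collective, and passes that GPU to the barrier explicitly, could look like this. The helper names resolve_local_rank and pinned_barrier are hypothetical, and ddp_setup() here returns a third value that the original does not.

```python
import os
from typing import Tuple


def resolve_local_rank() -> int:
    # torchrun exports LOCAL_RANK for every worker it spawns; fall back
    # to 0 so the helper also works in a single-process run.
    return int(os.environ.get("LOCAL_RANK", "0"))


def ddp_setup() -> Tuple[int, int, int]:
    # torch is imported here so that resolve_local_rank() stays usable
    # on machines without a CUDA build of PyTorch.
    import torch
    from torch import distributed as dist

    local_rank = resolve_local_rank()
    # Make this process's GPU the current CUDA device before any collective.
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")
    return dist.get_rank(), dist.get_world_size(), local_rank


def pinned_barrier(local_rank: int) -> None:
    from torch import distributed as dist

    # device_ids is an NCCL-only argument of dist.barrier().
    dist.barrier(device_ids=[local_rank])
```

With this variant, main() would unpack rank, world_size, local_rank = ddp_setup() and call pinned_barrier(local_rank) in place of each dist.barrier(). If the second barrier then blocks, the missing device selection was at least part of the problem; if it still falls through, the cause lies elsewhere.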

Python 3.11.10
torch 2.4.1
torchaudio 2.4.1
torchvision 0.19.1

OS: Ubuntu 18.04.1