Torch.distributed.barrier occupies additional CUDA memory

I am running a DDP training job and only want to perform validation on the main process. I use torch.distributed.barrier to make the non-main processes wait during validation so they don't hit an NCCL timeout. However, I noticed that torch.distributed.barrier itself consumes roughly an additional 4 GB of GPU memory (usage goes from about 93 GB to 97 GB). Can someone help explain why this happens?
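
For context, here is a minimal sketch of the pattern I'm using. The names `model`, `val_loader`, and `validate_on_rank0` are simplified placeholders for my actual training code, not the exact implementation:

```python
import torch
import torch.distributed as dist

def validate_on_rank0(model, val_loader):
    """Run validation only on rank 0; other ranks wait at the barrier."""
    if dist.get_rank() == 0:
        model.eval()
        with torch.no_grad():
            for batch in val_loader:
                model(batch.cuda())
        model.train()
    # Non-main processes block here so the next training step's
    # collectives stay in sync and NCCL does not time out.
    dist.barrier()
```

The memory jump appears right after the first call to dist.barrier().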