Memory used by torch.distributed

I’m trying to profile PyTorch memory usage through the CUDA API, i.e. the torch.cuda API. I’m wondering whether the memory used by torch.distributed’s collective calls will be captured by that API.
For example, suppose I use NCCL as the backend and make a call to torch.distributed.all_gather. Will the memory used by NCCL (caching, communication buffers, etc.) be recorded by the torch.cuda API?

Yes, all CUDA memory consumed by NCCL (events, streams, communication buffers, etc.) will be captured by CUDA memory APIs such as torch.cuda.memory_stats.
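
As a minimal sketch of how you might check this yourself (assuming a multi-GPU node, the NCCL backend, and launching with torchrun; tensor sizes and the stats key shown are illustrative):

```python
# Launch with: torchrun --nproc_per_node=2 check_allgather_memory.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    before = torch.cuda.memory_allocated()

    # One output tensor per rank, plus the local input tensor.
    local = torch.ones(1024, 1024, device="cuda") * rank
    gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    torch.cuda.synchronize()

    after = torch.cuda.memory_allocated()
    if rank == 0:
        print(f"allocated delta: {(after - before) / 2**20:.1f} MiB")
        # Full allocator statistics, including allocations made on behalf
        # of the collective through the caching allocator.
        print(torch.cuda.memory_stats()["allocated_bytes.all.current"])

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```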

Note, however, that the PyTorch profiler’s memory profiling features will not capture this.
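
For contrast, a hedged sketch of the profiler’s memory view (profile_memory=True), which reports allocations made by PyTorch operators rather than NCCL-internal usage; the setup mirrors the snippet above and is illustrative only:

```python
# Launch with: torchrun --nproc_per_node=2 profile_allgather.py
import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

local = torch.ones(1024, 1024, device="cuda")
gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]

# profile_memory=True records operator-level allocations,
# not memory that NCCL allocates internally.
with profile(activities=[ProfilerActivity.CUDA], profile_memory=True) as prof:
    dist.all_gather(gathered, local)
    torch.cuda.synchronize()

if dist.get_rank() == 0:
    print(prof.key_averages().table(sort_by="cuda_memory_usage"))

dist.destroy_process_group()
```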