Tracing down to the C++ code that implements GPU/CPU communication and synchronization in the DataParallel module

Hi all,

I am running a multi-GPU test (e.g., 4 GPUs) in PyTorch with the DataParallel module.
The “gather” function in the DataParallel module has to collect the data from the other GPUs. I want to know whether communication/synchronization between the GPUs and the CPU takes place after this function is called.
But I got stuck tracing down to the C++ code that implements the GPU/CPU communication/synchronization, and I cannot find the right place. So far I have traced down to "csrc/cuda/comm.cpp".
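
For context, here is a minimal sketch of the kind of setup I mean. The toy nn.Linear model, the tensor sizes, and the helper name run_data_parallel are placeholders for illustration only, not my actual code. The explicit torch.cuda.synchronize() call is added by hand to make the host wait for the GPU work; it is not something I have found inside DataParallel. On a machine without GPUs, DataParallel should simply call the wrapped module.

```python
import torch
import torch.nn as nn


def run_data_parallel(batch_size=8, in_features=16, out_features=4):
    """Run one forward pass through nn.DataParallel (hypothetical toy setup)."""
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"

    # Placeholder model; the real model would go here.
    model = nn.Linear(in_features, out_features).to(device)

    # With GPUs, DataParallel uses all visible devices by default and
    # gathers the outputs on device_ids[0]. Without GPUs it just calls
    # the wrapped module.
    model = nn.DataParallel(model)

    x = torch.randn(batch_size, in_features, device=device)

    # forward() scatters the input, replicates the module, runs the
    # replicas in parallel, then gathers the outputs on one device.
    out = model(x)

    if use_cuda:
        # Explicit host-side wait, added by hand for this sketch.
        torch.cuda.synchronize()
    return out


if __name__ == "__main__":
    result = run_data_parallel()
    print(result.shape, result.device)
```

The gather step I am asking about is the last stage of the model(x) call above.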

Thanks for any help.