How does PyTorch transfer data between GPUs

https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
This page explains how we can split a big nn model across multiple GPUs. But I don't understand what happens when an intermediate training result needs to be transferred from one GPU to another (by using `.to('cuda:1')`): does the PyTorch runtime move the data to CPU memory first and then to the other GPU's memory, or does PyTorch use some direct data transfer technique between GPUs, like SLI?

Any help would be appreciated.

Hey @Yanan

In this case, the tensor will be directly copied from device to device using cudaMemcpyAsync. The C++ implementation is linked below:
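As a minimal sketch of the call in question: `Tensor.to` with a target device performs the copy directly, and when both source and destination are CUDA devices PyTorch issues a device-to-device `cudaMemcpyAsync` rather than staging through host memory. The helper name below is hypothetical, used only for illustration.

```python
import torch

def transfer(t: torch.Tensor, device) -> torch.Tensor:
    # Tensor.to with a device argument copies the tensor to that device.
    # For a CUDA -> CUDA copy, PyTorch uses cudaMemcpyAsync device-to-device
    # under the hood; the data does not take a round trip through CPU memory.
    return t.to(device)

# With two GPUs available, this is a direct GPU-to-GPU copy:
if torch.cuda.device_count() >= 2:
    x = torch.randn(4, 4, device='cuda:0')
    y = transfer(x, 'cuda:1')
    assert y.device == torch.device('cuda:1')
```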

@mrshenli Thanks! This really helps.
I have another question: does the data copy between GPUs in the DataParallel module use the same kernel?
Sorry, I forgot to ask this before.

I have another question: does the data copy between GPUs in the DataParallel module use the same kernel?

Yes. DataParallel calls into scatter. Its C++ implementation is linked below. It's basically calling Tensor.to in a loop.
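The "Tensor.to in a loop" behavior can be sketched as follows. This is a simplified illustration of what scatter does (the function name is hypothetical, and the real implementation handles nested structures, streams, and uneven chunks): the input batch is split into chunks and each chunk is copied to its target device with `Tensor.to`, i.e., the same copy path discussed above.

```python
import torch

def scatter_sketch(tensor: torch.Tensor, devices):
    # Split the batch along dim 0 into one chunk per device, then
    # copy each chunk to its device with Tensor.to -- conceptually
    # what DataParallel's scatter does with the input batch.
    chunks = tensor.chunk(len(devices), dim=0)
    return [chunk.to(device) for chunk, device in zip(chunks, devices)]

# CPU-only demonstration of the chunk-and-copy pattern:
parts = scatter_sketch(torch.arange(8.0), ['cpu', 'cpu'])
print([p.tolist() for p in parts])
```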