How does PyTorch transfer data between GPUs

https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
This page explains how we can split a big nn model across multiple GPUs. But I don't understand what happens when an intermediate training result needs to be transferred from one GPU to another (by using `.to('cuda:1')`): does the PyTorch runtime move the data to CPU memory first and then to the other GPU's memory, or does PyTorch use some direct data transfer technique between GPUs, like SLI?

Any help would be appreciated.

Hey @Yanan

In this case, the tensor will be directly copied from device to device using cudaMemcpyAsync. The C++ implementation is linked below:
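As a minimal sketch of the call in question: `Tensor.to` with a target device performs the copy directly, and when both source and destination are CUDA devices PyTorch issues a device-to-device `cudaMemcpyAsync` rather than staging through host memory. The helper name below is hypothetical, used only for illustration.

```python
import torch

def transfer(t: torch.Tensor, device) -> torch.Tensor:
    # Tensor.to with a device argument copies the tensor to that device.
    # For a CUDA -> CUDA copy, PyTorch uses cudaMemcpyAsync device-to-device
    # under the hood; the data does not take a round trip through CPU memory.
    return t.to(device)

# With two GPUs available, this is a direct GPU-to-GPU copy:
if torch.cuda.device_count() >= 2:
    x = torch.randn(4, 4, device='cuda:0')
    y = transfer(x, 'cuda:1')
    assert y.device == torch.device('cuda:1')
```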

@mrshenli Thanks! This really helps.
I have another question: does the data copy between GPUs in the DataParallel module use the same kernel?
Sorry, I forgot to ask this before.

I have another question: does the data copy between GPUs in the DataParallel module use the same kernel?

Yes. DataParallel calls into scatter. Its C++ implementation is linked below. It's basically calling Tensor.to in a loop.
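The "Tensor.to in a loop" behavior can be sketched as follows. This is a simplified illustration of what scatter does (the function name is hypothetical, and the real implementation handles nested structures, streams, and uneven chunks): the input batch is split into chunks and each chunk is copied to its target device with `Tensor.to`, i.e., the same copy path discussed above.

```python
import torch

def scatter_sketch(tensor: torch.Tensor, devices):
    # Split the batch along dim 0 into one chunk per device, then
    # copy each chunk to its device with Tensor.to -- conceptually
    # what DataParallel's scatter does with the input batch.
    chunks = tensor.chunk(len(devices), dim=0)
    return [chunk.to(device) for chunk, device in zip(chunks, devices)]

# CPU-only demonstration of the chunk-and-copy pattern:
parts = scatter_sketch(torch.arange(8.0), ['cpu', 'cpu'])
print([p.tolist() for p in parts])
```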