Efficiency when pushing layers onto GPU after initialization

I was profiling the power draw of my GPU when I noticed that after the first time a layer is sent to the GPU using to(device=cuda_device), the time it takes to send further layers to the GPU drops by three orders of magnitude:

import timeit
import torch
from torch import nn

cuda = torch.device('cuda')

# Layer #1
def test():
    nn.ConvTranspose2d(in_channels=10, out_channels=32, kernel_size=(4, 4), stride=2, padding=0).to(device=cuda)

# Layer #2
def test2():
    nn.ConvTranspose2d(in_channels=32, out_channels=32, kernel_size=(6, 6), stride=2, padding=0).to(device=cuda)

# First time layer #1 is sent to the GPU: 3.6 s
>>> print(timeit.timeit(test, number=1))

# Second time the same layer (#1) is sent to the GPU: 0.0021 s
>>> print(timeit.timeit(test, number=1))

# A different layer (#2), with more input channels and a larger kernel, is sent to the GPU: 0.0027 s
>>> print(timeit.timeit(test2, number=1))

# After restarting the Python interpreter and re-importing everything, the same layer (#2) again takes three orders of magnitude longer to send
<Re-import all and define the test2 function>
>>> print(timeit.timeit(test2, number=1))

I suspect this is an up-front cost of CUDA initialization that PyTorch pays at the first to(device=cuda) call, but what exactly is the reason behind this difference?

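The first CUDA operation will create the CUDA context, which contains the PyTorch kernels, cuDNN, NCCL, CUDA libs etc., so it’ll take some time to load these libraries. Later CUDA operations in the same process reuse the existing context, which is why they are so much faster; a fresh interpreter has to create it again.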
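To see the one-time cost in isolation, here is a minimal sketch that triggers initialization with a tiny dummy allocation and times it twice. The helper name warm_up is made up for illustration, and the CPU fallback is only there so the snippet runs on a machine without a GPU (where both timings will be tiny). Exact numbers depend on the GPU, driver and PyTorch build.

```python
import time
import torch

# Fall back to the CPU so the sketch still runs without a GPU (assumption for portability).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def warm_up(device):
    """Hypothetical helper: time a tiny allocation on the given device.

    On a CUDA device the first call in a process also pays for CUDA initialization.
    """
    start = time.perf_counter()
    torch.empty(1, device=device)
    if device.type == 'cuda':
        # Wait for any pending GPU work before stopping the timer.
        torch.cuda.synchronize()
    return time.perf_counter() - start

init_seconds = warm_up(device)    # first call: includes the one-time setup on a GPU
repeat_seconds = warm_up(device)  # second call: the context already exists
print(f'first call: {init_seconds:.4f} s, second call: {repeat_seconds:.4f} s')
```

Calling something like this once at program start moves the initialization cost out of any later measurement.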

Also note that CUDA operations are executed asynchronously, so if you want to profile the data transfer from host to device or any other CUDA op, you have to synchronize via torch.cuda.synchronize() before starting and before stopping the timer. Otherwise the timer may stop while the GPU is still working.
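A minimal sketch of such a synchronized measurement, using the first layer from the question. The helper names sync and timed_to_device are hypothetical, and the CPU fallback is only there so the snippet runs without a GPU. In a fresh process the first call still includes the one-time initialization.

```python
import time
import torch
from torch import nn

# Fall back to the CPU so the sketch still runs without a GPU (assumption for portability).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def sync(device):
    # Hypothetical helper: only CUDA devices need an explicit synchronization.
    if device.type == 'cuda':
        torch.cuda.synchronize()

def timed_to_device(layer, device):
    """Hypothetical helper: move a layer to the device and return (layer, seconds)."""
    sync(device)                      # make sure no earlier GPU work is still pending
    start = time.perf_counter()
    layer = layer.to(device=device)
    sync(device)                      # wait until the transfer has actually finished
    return layer, time.perf_counter() - start

# Build the layer outside the timed region so only the transfer is measured.
layer = nn.ConvTranspose2d(in_channels=10, out_channels=32, kernel_size=(4, 4), stride=2, padding=0)
layer, seconds = timed_to_device(layer, device)
print(f'transfer took {seconds:.6f} s')
```

Unlike the original test functions, this constructs the layer before starting the timer, so the layer's construction and weight initialization on the CPU are not counted as transfer time.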

Thank you! That was extremely helpful