I was profiling the power draw of my GPU when I noticed that after the first time a layer is sent to the GPU using .to(device=cuda), the time it takes to send subsequent layers drops by three orders of magnitude:
import timeit
import torch
from torch import nn
cuda = torch.device('cuda')
# Layer 1
def test():
    nn.ConvTranspose2d(in_channels=10, out_channels=32, kernel_size=(4, 4),
                       stride=2, padding=0).to(device=cuda)

# Layer 2
def test2():
    nn.ConvTranspose2d(in_channels=32, out_channels=32, kernel_size=(6, 6),
                       stride=2, padding=0).to(device=cuda)
# The first time Layer 1 is sent to the GPU, it takes ~3.6 s
>>> print(timeit.timeit(test, number=1))
3.634781403990928
# The second time the same layer (Layer 1) is sent to the GPU, it takes ~0.0021 s
>>> print(timeit.timeit(test, number=1))
0.0021496260014828295
# This time a different layer (Layer 2), with more input channels and a larger kernel, is sent to the GPU; it takes ~0.0027 s
>>> print(timeit.timeit(test2, number=1))
0.0026589079934637994
# After restarting the Python interpreter and re-importing everything, sending the same layer takes three orders of magnitude longer again
>>> exit()
<re-import everything and define the test2 function>
>>> print(timeit.timeit(test2, number=1))
4.171760909986915
I have a feeling this is a one-time, up-front cost of CUDA initialization that PyTorch incurs on the first .to(device) call, but what exactly is happening during that first call that makes it so slow?
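One way to test this hunch (a minimal sketch of my own, not part of the timings above; the warmup function and the torch.cuda.synchronize() calls are my additions) is to force CUDA context creation with a trivial tensor copy before any layer is moved. If the multi-second cost shifts to that warm-up step and the first layer transfer becomes fast, the overhead would be context initialization rather than anything specific to the layer:

import timeit
import torch
from torch import nn

cuda = torch.device('cuda')

def warmup():
    # A trivial copy: should trigger CUDA context creation
    # without involving any conv layer
    torch.empty(1).to(device=cuda)
    torch.cuda.synchronize()

def send_layer():
    nn.ConvTranspose2d(in_channels=10, out_channels=32,
                       kernel_size=(4, 4), stride=2, padding=0).to(device=cuda)
    torch.cuda.synchronize()  # make sure the copy has actually finished

print(timeit.timeit(warmup, number=1))      # expectation: seconds (one-time init)
print(timeit.timeit(send_layer, number=1))  # expectation: milliseconds

(The synchronize calls are there because CUDA operations are queued asynchronously, so host-side timing without them could understate the real transfer time.)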