Moving tensors to GPU is super slow

Hi, I'm pretty new to PyTorch and I am trying to fine-tune a BERT model for my purposes.
The problem is that the .to(device) call is extremely slow: moving the transformer to the GPU takes 20 minutes.

I found some test code in the PyTorch GitHub repo:

import torch
import torch.nn as nn
import timeit

# Time the CUDA setup
t0 = timeit.default_timer()
if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True
    device = torch.device('cuda:0')
    ngpus = torch.cuda.device_count()
    print("Using {} GPU(s)...".format(ngpus))
print("Setup takes {:.2f}".format(timeit.default_timer()-t0))

# Time model construction on the CPU
t1 = timeit.default_timer()
model = nn.Sequential(
    nn.Conv2d(3, 6, 3, 1, 1),
    nn.Conv2d(6, 1, 3, 1, 1)
)
print("Model init takes {:.2f}".format(timeit.default_timer()-t1))

if torch.cuda.is_available():
    # Time the transfer of the model to the GPU
    t2 = timeit.default_timer()
    model = model.to(device)
    print("Model to device takes {:.2f}".format(timeit.default_timer()-t2))

    # Time the wait for any pending GPU work to finish
    t3 = timeit.default_timer()
    torch.cuda.synchronize()
    print("Cuda Synch takes {:.2f}".format(timeit.default_timer()-t3))


The output is:

import torch...
Using 1 GPU(s)...
Setup takes 0.00
Model init takes 0.00
Model to device takes 952.94
Cuda Synch takes 0.00

This is my environment:

PyTorch version is: 1.7.0
Cuda version is: 10.1
cuDNN version is : 7604
Arch version is : sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37

system information:
os: Windows 10
graphics card: NVIDIA GeForce RTX 3090
processor: AMD Ryzen 9 5900X 12-Core Processor, 3693 MHz
motherboard: ROG STRIX B550-F GAMING (WI-FI)
memory: 16GB

You are most likely running into JIT kernel compilation: the CUDA 10.1 binaries do not ship sm_80 or sm_86 kernels, which your RTX 3090 (compute capability 8.6) needs, so the kernels have to be compiled from PTX on first use.
Use the CUDA 11.0 binaries and the startup delay should be gone.
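
If you want to check this on your own machine, here is a rough sketch that compares the GPU's compute capability with the architectures the installed binaries were built for. The helper name has_native_kernels is made up for this example, and the rule it applies (a precompiled sm_XY kernel is usable when the major version matches and its minor version is not higher than the device's) is a simplification of CUDA's binary compatibility rules. It assumes torch.cuda.get_device_capability and torch.cuda.get_arch_list are available in your PyTorch version.

```python
def has_native_kernels(capability, arch_list):
    """Return True if arch_list contains a precompiled sm_XY entry usable on the device.

    capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3090.
    arch_list:  strings such as "sm_75" or "compute_37".
    Simplified rule (an assumption of this sketch): same major version,
    and the kernel's minor version is not higher than the device's.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # compute_XY entries are PTX and need JIT compilation
        digits = arch[3:]
        if not digits.isdigit() or len(digits) < 2:
            continue  # skip entries this sketch does not understand
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def report():
    """Print whether the first visible GPU is covered by the installed binaries."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("Device capability:", capability)
    print("Compiled architectures:", arch_list)
    if has_native_kernels(capability, arch_list):
        print("Precompiled kernels found for this GPU")
    else:
        print("No precompiled kernels for this GPU, expect JIT compilation on first use")


if __name__ == "__main__":
    report()
```

With the arch list from the question (sm_37 through sm_75 plus compute_37) and capability (8, 6), the helper returns False, which matches the slow first .to(device) call.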
