The speed of tensor.to(device)

I am running two different segmentation models. The inputs have the same size and data type, but the speed of the following code is quite different. What causes the difference?
Model 1 code

            # images:  torch.float32 cpu False torch.Size([2, 3, 480, 480])
            # targets: torch.int64 cpu False torch.Size([2, 480, 480])
            end2 = time.time()
            images = images.to(self.device)
            targets = targets.to(self.device)
            data_to_device = time.time() - end2  

The data_to_device time for this model is 0.0568695068359375 seconds.

Model 2 code

        print(images.dtype, images.device, images.requires_grad, images.size())
        print(target_mask.dtype, target_mask.device,
              target_mask.requires_grad, target_mask.size())
        end2 = time.time()
        images = images.to(device)
        target_mask = target_mask.to(device)
        data_to_device = time.time() - end2
        print('=============', data_to_device)

The output of model 2 for the first few iterations is:

2019-07-14 18:23:21,835 agfcoo.trainer INFO: Start training
torch.float32 cpu False torch.Size([2, 3, 480, 480])
torch.int64 cpu False torch.Size([2, 480, 480])
============= 0.0012090206146240234
torch.float32 cpu False torch.Size([2, 3, 480, 480])
torch.int64 cpu False torch.Size([2, 480, 480])
============= 0.5653738975524902
torch.float32 cpu False torch.Size([2, 3, 480, 480])
torch.int64 cpu False torch.Size([2, 480, 480])
============= 0.28435635566711426
torch.float32 cpu False torch.Size([2, 3, 480, 480])
torch.int64 cpu False torch.Size([2, 480, 480])
============= 0.26169490814208984
torch.float32 cpu False torch.Size([2, 3, 480, 480])
torch.int64 cpu False torch.Size([2, 480, 480])
============= 0.2824215888977051
torch.float32 cpu False torch.Size([2, 3, 480, 480])
torch.int64 cpu False torch.Size([2, 480, 480])
============= 0.2888615131378174

This computer has one RTX 2060. The PyTorch version is 1.1 and CUDA is 10.0.

What caused the difference?

Hi, I have the same question. Same data but very different speeds with PyTorch 1.0.
Have you solved it?

Since CUDA operations are executed asynchronously, you should synchronize before starting and stopping the timer via torch.cuda.synchronize().
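
Something like this rough sketch could isolate the copy time. The `device` and the dummy tensors below are only placeholders mirroring the shapes and dtypes from your printouts, not your actual data loader:

    import time
    import torch

    # Dummy tensors matching the printed shapes/dtypes (placeholder values; any
    # values work for a pure host-to-device transfer timing).
    device = torch.device('cuda')
    images = torch.rand(2, 3, 480, 480, dtype=torch.float32)
    targets = torch.randint(0, 21, (2, 480, 480), dtype=torch.int64)

    # Finish all previously queued GPU work before starting the timer; otherwise
    # the .to(device) calls can absorb the runtime of earlier asynchronous kernels
    # (e.g. the forward/backward pass of the previous iteration).
    torch.cuda.synchronize()
    end2 = time.time()

    images = images.to(device)
    targets = targets.to(device)

    # Wait for the copies themselves to complete before stopping the timer.
    torch.cuda.synchronize()
    data_to_device = time.time() - end2
    print('=============', data_to_device)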

Could you add it and profile the code again?