Why is the .to() method faster than passing device="cuda:0"?

Hello,
recently I profiled some code that trains a neural network with PyTorch and discovered that the .to() method accounts for a significant share of the running time.

Digging in a bit further, I found that I allocate tensors in the forward pass in this manner:

torch.randn((x, y)).to("cuda:0")

This didn’t seem like a problem initially; however, it presumably constructs the tensor on the CPU and then copies it to the GPU, creating a bottleneck. I tried benchmarking what would happen if I instead do:

torch.randn((x, y), device="cuda:0")

But unexpectedly, it performs worse. I reduced this to a small example that shows the same behavior:

import torch
import time

cur_time = time.time()
for i in range(5000):
    torch.randn(5, device="cuda:0")
print("device=device took: {}".format(time.time() - cur_time))
cur_time = time.time()
for i in range(5000):
    torch.randn(5).to("cuda:0")
print(".to took: {}".format(time.time() - cur_time))

Why is this the case?
How can I speed up the creation of tensors by allocating them directly on the GPU?

The example reproduces with other factory functions such as torch.zeros and torch.ones. I am using Windows 10 with Python 3.9.12 and torch 1.12.1 (GPU: NVIDIA GeForce RTX 2060).

CUDA operations are executed asynchronously, so you would need to synchronize the code before starting and stopping the timers via torch.cuda.synchronize(). Adding it gives:

device=device took: 0.023486614227294922
.to took: 0.0527348518371582

on my setup.
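
For reference, a minimal sketch of the benchmark with the synchronization added (the initial warm-up call is my addition, not part of the original snippet) could look like this:

import time

import torch

# Warm-up: the first CUDA call pays one-time costs (context creation etc.)
# and should not be included in the measurement.
torch.randn(5, device="cuda:0")
torch.cuda.synchronize()

cur_time = time.time()
for i in range(5000):
    torch.randn(5, device="cuda:0")
torch.cuda.synchronize()  # wait for all queued kernels before reading the clock
print("device=device took: {}".format(time.time() - cur_time))

cur_time = time.time()
for i in range(5000):
    torch.randn(5).to("cuda:0")
torch.cuda.synchronize()
print(".to took: {}".format(time.time() - cur_time))

Alternatively, torch.utils.benchmark.Timer can be used for this kind of micro-benchmark, since it takes care of warm-up and CUDA synchronization for you.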