Performance difference between torch.zeros((...), device=dev) and torch.zeros((...)).to(dev)?

I’ve got two questions regarding performance.

  1. Is there a performance difference between creating a tensor first and then sending it to the device with `.to()`, versus specifying the device directly at creation time?

  2. When a tensor is on the GPU and I switch its data type (e.g. from int to float), does the cast happen directly on the device, or is the tensor first moved back to the CPU?

Yes — one creates the tensor directly on the given device, while the other creates it on the CPU and then does a copy (if dev is not the CPU).
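To illustrate the two paths, here is a minimal sketch (the shape and size are arbitrary; it falls back to the CPU when no CUDA device is available):

```python
import torch

# Use a GPU if one is available, otherwise fall back to the CPU.
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Path 1: allocated directly on `dev` -- no intermediate CPU tensor.
a = torch.zeros((1000, 1000), device=dev)

# Path 2: allocated on the CPU first, then copied to `dev`.
# When `dev` is a GPU this costs an extra allocation plus a
# host-to-device transfer.
b = torch.zeros((1000, 1000)).to(dev)

# Both end up on the same device with the same contents.
assert a.device == b.device
assert torch.equal(a, b)
```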

The cast happens directly on the device; the tensor is not moved back to the CPU.
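A quick sketch of the cast staying on-device (again falling back to the CPU when no GPU is present):

```python
import torch

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t_int = torch.arange(10, device=dev)  # int64 tensor on `dev`
t_float = t_int.float()               # the cast kernel runs on `dev`

# The result stays on the same device -- no round trip through the CPU.
assert t_float.device == t_int.device
assert t_float.dtype == torch.float32
```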


Thanks for the rapid response. So to confirm, torch.zeros((…), device=dev) is the faster way?

Yes, that's correct.
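If you want to measure the difference yourself, a rough sketch with `timeit` (shape and iteration count are arbitrary; note that accurate GPU timing would also need `torch.cuda.synchronize()` around each call, since CUDA kernels launch asynchronously):

```python
import timeit
import torch

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Time both creation paths. On a CPU-only machine the gap will be
# near zero, since `.to("cpu")` on a CPU tensor is a no-op.
direct = timeit.timeit(lambda: torch.zeros((1024, 1024), device=dev), number=100)
via_cpu = timeit.timeit(lambda: torch.zeros((1024, 1024)).to(dev), number=100)

print(f"direct on device: {direct:.4f}s, via CPU copy: {via_cpu:.4f}s")
```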