Hello,
Recently I profiled some code that trains a neural network with torch and discovered that the .to method accounts for a significant share of the running time. Digging in a bit further, I found that I allocate tensors in the forward pass like this:
torch.randn((x, y)).to("cuda:0")
This didn't seem like a problem at first, but it presumably constructs the tensor on the CPU and then copies it to the GPU, which creates a bottleneck. I tried benchmarking what happens if I instead allocate directly on the device:
torch.randn((x, y), device="cuda:0")
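For context, here is a simplified sketch of how the allocation sits in my forward pass (the real model is larger; the layer and sizes here are made up purely for illustration):

import torch
import torch.nn as nn

class NoisyLayer(nn.Module):
    # Simplified stand-in for the layer that allocates a fresh tensor on every forward pass.
    def __init__(self, features):
        super().__init__()
        self.linear = nn.Linear(features, features)

    def forward(self, x):
        # Allocate the random tensor directly on the input's device
        # instead of building it on the CPU and calling .to("cuda:0").
        noise = torch.randn(x.shape, device=x.device)
        return self.linear(x) + noise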
But unexpectedly, it performs worse. I reproduced this in a small example that shows the same behavior:
import torch
import time

cur_time = time.time()
for i in range(5000):
    torch.randn(5, device="cuda:0")
print("device=device took: {}".format(time.time() - cur_time))

cur_time = time.time()
for i in range(5000):
    torch.randn(5).to("cuda:0")
print(".to took: {}".format(time.time() - cur_time))
Why is this the case?
How can I speed up the creation of tensors by allocating them directly on the GPU?
The example also reproduces with other factory functions such as torch.zeros and torch.ones (a short sketch of those variants is below). I am using Windows 10 with Python 3.9.12 and torch 1.12.1 (GPU: NVIDIA GeForce RTX 2060).
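For completeness, this is roughly the pattern I used to check the other factory functions (an abbreviated sketch, not the exact script I ran):

import torch
import time

# Same timing pattern as the randn benchmark above, swapping in other factory functions.
for factory in (torch.zeros, torch.ones):
    cur_time = time.time()
    for i in range(5000):
        factory(5, device="cuda:0")
    print("{} with device= took: {}".format(factory.__name__, time.time() - cur_time))

    cur_time = time.time()
    for i in range(5000):
        factory(5).to("cuda:0")
    print("{} with .to took: {}".format(factory.__name__, time.time() - cur_time))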