When doing large-scale tensor operations on CUDA that involve a constant, I expected that pre-allocating the constant on the GPU would accelerate the computation at least slightly. But the result is almost the opposite: the CUDA tensor constant is consistently a bit slower than a native Python float constant. Below is my code:
import torch
import time
scale = 0.5
scale_cuda = torch.tensor(0.5, device="cuda")
def test_native_float(x_cuda):
    # use native Python float
    for _ in range(1000):
        x_cuda = x_cuda * scale

def test_cuda_tensor(x_cuda):
    # use CUDA tensor
    for _ in range(1000):
        x_cuda = x_cuda * scale_cuda
x = torch.randn(10000, 10000, device="cuda")
# warm up
test_native_float(x)
test_cuda_tensor(x)
torch.cuda.synchronize()
# test time
start = time.time()
test_native_float(x)
torch.cuda.synchronize()
print("Native float time:", time.time() - start)
start = time.time()
test_cuda_tensor(x)
torch.cuda.synchronize()
print("CUDA Tensor time: ", time.time() - start)
Output:
Native float time: 1.0922577381134033
CUDA Tensor time:  1.1039364337921143
OS: Windows 10
GPU: RTX 3080 Ti
CUDA version: 12.6
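In case the wall-clock timing above is questioned, here is a minimal sketch of the same comparison using torch.cuda.Event timers instead of time.time(). The time_op helper is hypothetical (not from my original code) and falls back to time.perf_counter when no GPU is available:

```python
import time
import torch

def time_op(fn, iters=1000):
    # Hypothetical helper: time `fn` called `iters` times.
    if torch.cuda.is_available():
        # CUDA events measure time on the GPU stream itself.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / 1000.0  # elapsed_time is in ms
    # CPU fallback so the sketch still runs without a GPU.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - t0

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
scale_t = torch.tensor(0.5, device=device)

# warm up both variants before timing
time_op(lambda: x * 0.5, iters=10)
time_op(lambda: x * scale_t, iters=10)

t_float = time_op(lambda: x * 0.5)
t_tensor = time_op(lambda: x * scale_t)
print(f"float: {t_float:.4f}s  tensor: {t_tensor:.4f}s")
```

The event-based version avoids counting CPU-side Python overhead that falls outside the recorded GPU interval, so it should isolate the kernel-time difference between the two variants more cleanly.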