Why is a native float constant faster than a CUDA tensor constant?

When doing large-scale tensor operations on CUDA that involve a constant, I expected that pre-allocating the constant on the GPU would accelerate computation at least slightly. But the result is almost the opposite: the CUDA tensor constant is always slightly slower than the native float constant. Below is my code:

import torch
import time

scale = 0.5
scale_cuda = torch.tensor(0.5, device="cuda")

def test_native_float(x_cuda):
    # use native float
    for _ in range(1000):
        x_cuda = x_cuda * scale

def test_cuda_tensor(x_cuda):
    # use CUDA tensor
    for _ in range(1000):
        x_cuda = x_cuda * scale_cuda

x = torch.randn(10000, 10000, device="cuda")

# warm up
test_native_float(x)
test_cuda_tensor(x)
torch.cuda.synchronize()

# test time
start = time.time()
test_native_float(x)
torch.cuda.synchronize()
print("Native float time:", time.time() - start)

start = time.time()
test_cuda_tensor(x)
torch.cuda.synchronize()
print("CUDA Tensor time: ", time.time() - start)

Native float time: 1.0922577381134033
CUDA Tensor time: 1.1039364337921143
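For what it's worth, I also tried timing the same comparison with torch.cuda.Event instead of time.time(), to rule out host-clock noise. This is a sketch of that variant (it falls back to perf_counter on CPU so it runs anywhere; sizes and iteration counts are smaller than above):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1000, 1000, device=device)
scale = 0.5                                  # native Python float
scale_t = torch.tensor(0.5, device=device)   # pre-allocated tensor constant

def bench(fn, iters=100):
    fn()  # warm-up launch
    if device == "cuda":
        # CUDA events measure GPU-side elapsed time between two markers
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / 1000.0  # ms -> s
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - t0

t_float = bench(lambda: x * scale)
t_tensor = bench(lambda: x * scale_t)
print(f"float scalar:  {t_float:.4f} s")
print(f"tensor scalar: {t_tensor:.4f} s")
```

The gap shows up the same way with event timing, so it doesn't seem to be a measurement artifact.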

OS: Windows 10
GPU: RTX 3080 Ti
CUDA version: 12.6