Hello,
If I multiply a tensor that is on the GPU by a Python float, does this operation move the tensor to the CPU?
E.g.
scalar = 0.5
t = torch.ones(10, device='cuda')
res = t * scalar
I see that the result res is on the GPU and that t didn’t change device. However, I’m wondering whether the operation moves t to the CPU first, performs the computation there, and then moves the result back to the GPU.
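For example, checking the devices directly (both still report the CUDA device):
print(t.device)    # cuda:0
print(res.device)  # cuda:0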
No, as this would result in terrible performance: you would move potentially large data to the host, execute the operation on the slower CPU, and move the result back. Instead, the scalar is passed to the kernel.
Explicitly creating the scalar as a CUDA tensor would actually be worse, as you would trigger an explicit memcpy instead of allowing PyTorch to use a specialized TensorIterator kernel that lifts the scalar to a kernel parameter.
This code snippet shows the effect:
import torch
from torch.profiler import profile, record_function, ProfilerActivity
x = torch.randn(10, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
with record_function("scalar"):
y = x * 0.5
print(prof.key_averages().table(sort_by="cuda_time_total"))
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
# Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
# scalar 26.74% 231.000us 99.42% 859.000us 859.000us 0.000us 0.00% 1.000us 1.000us 1
# aten::mul 42.01% 363.000us 72.69% 628.000us 628.000us 1.000us 100.00% 1.000us 1.000us 1
# void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.000us 100.00% 1.000us 1.000us 1
# cudaLaunchKernel 30.67% 265.000us 30.67% 265.000us 265.000us 0.000us 0.00% 0.000us 0.000us 1
# cudaDeviceSynchronize 0.58% 5.000us 0.58% 5.000us 5.000us 0.000us 0.00% 0.000us 0.000us 1
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
with record_function("tensor"):
y = x * torch.tensor(0.5, device="cuda")
print(prof.key_averages().table(sort_by="cuda_time_total"))
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
# Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
# tensor 4.82% 82.000us 99.82% 1.700ms 1.700ms 0.000us 0.00% 2.000us 2.000us 1
# aten::mul 0.88% 15.000us 1.47% 25.000us 25.000us 2.000us 100.00% 2.000us 2.000us 1
# void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 2.000us 100.00% 2.000us 2.000us 1
# aten::empty 0.18% 3.000us 0.18% 3.000us 3.000us 0.000us 0.00% 0.000us 0.000us 1
# aten::to 76.34% 1.300ms 93.19% 1.587ms 1.587ms 0.000us 0.00% 0.000us 0.000us 1
# aten::_to_copy 0.65% 11.000us 16.85% 287.000us 287.000us 0.000us 0.00% 0.000us 0.000us 1
# aten::empty_strided 0.53% 9.000us 0.53% 9.000us 9.000us 0.000us 0.00% 0.000us 0.000us 1
# aten::copy_ 0.70% 12.000us 15.68% 267.000us 267.000us 0.000us 0.00% 0.000us 0.000us 1
# cudaMemcpyAsync 14.86% 253.000us 14.86% 253.000us 253.000us 0.000us 0.00% 0.000us 0.000us 1
# cudaStreamSynchronize 0.12% 2.000us 0.12% 2.000us 2.000us 0.000us 0.00% 0.000us 0.000us 1
# Memcpy HtoD (Pageable -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 1
# aten::lift_fresh 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 1
# aten::detach_ 0.12% 2.000us 0.18% 3.000us 3.000us 0.000us 0.00% 0.000us 0.000us 1
# detach_ 0.06% 1.000us 0.06% 1.000us 1.000us 0.000us 0.00% 0.000us 0.000us 1
# cudaLaunchKernel 0.59% 10.000us 0.59% 10.000us 10.000us 0.000us 0.00% 0.000us 0.000us 1
# cudaDeviceSynchronize 0.18% 3.000us 0.18% 3.000us 3.000us 0.000us 0.00% 0.000us 0.000us 1
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
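If you want to see the end-to-end cost difference, a quick comparison with torch.utils.benchmark works as well. This is only a minimal sketch; the absolute timings depend on your GPU and setup, but the tensor variant pays for the extra host-to-device copy on every call:
import torch
from torch.utils import benchmark

x = torch.randn(10, device="cuda")

# The Python float is lifted to a kernel parameter, so no copy is needed.
t_scalar = benchmark.Timer(stmt="x * 0.5", globals={"x": x})

# Creating the scalar as a CUDA tensor triggers an HtoD memcpy on every call.
t_tensor = benchmark.Timer(
    stmt="x * torch.tensor(0.5, device='cuda')",
    globals={"x": x, "torch": torch},
)

print(t_scalar.blocked_autorange())
print(t_tensor.blocked_autorange())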
For reference, this behavior is described in the comment for gpu_kernel_with_scalars in the PyTorch source:
// The gpu_kernel_with_scalars generates specializations that support a
// single scalar CPU argument, such as from cuda_tensor + 5. The CPU scalar
// is lifted to a kernel parameter instead of copying to device memory.
// This should be used in conjunction with TensorIterator::allow_cpu_scalars_,
// which is the default for TensorIterator::binary_op. Otherwise, all inputs
// and the output must be on the GPU.