GPU Operations with constants

Hello…

Suppose I have a tensor `t` that resides on the GPU and I would like to apply some math operations involving constants to it:

```
t = torch.FloatTensor(...).cuda()
t = 0.1 * t + 0.2
```

I wonder how Torch/Python handles the constants `0.1` and `0.2`. Are they transferred from CPU to GPU memory each time this operation is performed, or is the code optimized automatically so that the constants are kept on the GPU?
I am concerned about the efficiency.
Should I define constant CUDA tensors and use them instead (at the cost of less readable code)?
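
For concreteness, the less readable alternative I have in mind would look roughly like this (just a sketch; the names `c1` and `c2` are only illustrative, and `t` is assumed to already be on the GPU):

```
import torch

# pre-allocate the constants as CUDA tensors once, outside any hot loop
c1 = torch.tensor(0.1, device="cuda")
c2 = torch.tensor(0.2, device="cuda")

t = c1 * t + c2  # same math, but with tensor operands instead of Python scalars
```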

Thanks.

It comes down to two CUDA kernels that, strongly simplified, look like:
`mul(float* a, float* b)`
vs.
`mul(float* a, float b)`

The second one is marginally faster, as one “load from global memory pointer” operation is avoided.

But this is not related to CPU-GPU transfers. Note that the function arguments, the launch configuration, and the signal to execute the CUDA kernel itself require a data transfer anyway (I’d assume it is a single transfer that uses the faster “constant memory”, but I haven’t investigated this).

The takeaway is that it is better NOT to use tensors when tensor-scalar versions of an operator exist (they usually do, so you don’t need to wrap plain numbers in tensors). OTOH, the timing difference is negligible.
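
If you want to check this yourself, a rough micro-benchmark along these lines (just a sketch, not a rigorous measurement) should show that the two variants are within noise of each other:

```
import torch

t = torch.randn(1_000_000, device="cuda")
c1 = torch.tensor(0.1, device="cuda")
c2 = torch.tensor(0.2, device="cuda")

def with_scalars(x):
    return 0.1 * x + 0.2   # scalar operands are passed as kernel arguments

def with_tensors(x):
    return c1 * x + c2     # tensor operands are passed as pointers to global memory

for fn in (with_scalars, with_tensors):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(1000):
        fn(t)
    end.record()
    torch.cuda.synchronize()
    print(fn.__name__, start.elapsed_time(end), "ms")
```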

Thanks for the nice explanation.

This is a very common use case, and I almost always see constants written as plain scalars in the expressions rather than as constant tensors.
There is also caching; I’m not familiar with GPU caching, but assuming it is similar to CPU caching, such constants are probably cached if they are frequently accessed.

It is not about caching. What I tried to explain is that to call a kernel you transfer either a pointer to the tensor data (an 8-byte address) or a scalar (1-8 bytes); it is the same overhead, and probably unavoidable (unless you compiled specialized kernels with fewer function arguments).

I understand that, but the real cost comes from whether the value/pointer is accessed from cache or from GPU/CPU memory, the latter being much slower (especially from CPU memory).

In that sense, I think a special memory area is used to pass kernel arguments, and reading from it is at least not slower than reading from cached global memory. And I haven’t mentioned that the version with pointers requires doing address arithmetic in all threads, so it is worse than my oversimplified example suggests.