GPU Operations with constants

Hello…

Suppose I have a tensor `t` that resides on the GPU and I would like to apply some math operations involving constants to it:

```
t = torch.FloatTensor(...).cuda()
t = 0.1 * t + 0.2
```

I wonder how Torch/Python handles the constants `0.1` and `0.2`. Are they transferred from CPU to GPU memory each time this operation is performed, or is the code optimized automatically so that the constants are kept on the GPU?
I am concerned about the efficiency.
Should I define constant CUDA tensors and use them instead (at the cost of less readable code)?
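
For concreteness, the less readable alternative I have in mind would look roughly like this (just a sketch; the names `c1` and `c2` are only illustrative, and `t` is assumed to already be on the GPU):

```
import torch

# pre-allocate the constants as CUDA tensors once, outside any hot loop
c1 = torch.tensor(0.1, device="cuda")
c2 = torch.tensor(0.2, device="cuda")

t = c1 * t + c2  # same math, but with tensor operands instead of Python scalars
```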

Thanks.

It comes down to two CUDA kernels that, strongly simplified, look like:
`mul(float* a, float* b)`
vs.
`mul(float* a, float b)`

The second one is marginally faster, as one “load from global memory pointer” operation is avoided.

But this is not related to CPU-GPU transfers. Note that the function arguments, the launch configuration, and the signal to execute the CUDA kernel itself require a data transfer anyway (I’d assume it is a single transfer that uses the faster “constant memory”, but I haven’t investigated this).

The takeaway is that it is better NOT to use tensors when tensor-scalar versions of an operator exist (they usually do, so you don’t need to wrap plain numbers in tensors). OTOH, the timing difference is negligible.
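
If you want to check this yourself, a rough micro-benchmark along these lines (just a sketch, not a rigorous measurement) should show that the two variants are within noise of each other:

```
import torch

t = torch.randn(1_000_000, device="cuda")
c1 = torch.tensor(0.1, device="cuda")
c2 = torch.tensor(0.2, device="cuda")

def with_scalars(x):
    return 0.1 * x + 0.2   # scalar operands are passed as kernel arguments

def with_tensors(x):
    return c1 * x + c2     # tensor operands are passed as pointers to global memory

for fn in (with_scalars, with_tensors):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(1000):
        fn(t)
    end.record()
    torch.cuda.synchronize()
    print(fn.__name__, start.elapsed_time(end), "ms")
```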

Thanks for the nice explanation.

This is a very common use case, and I almost always see constants written as plain scalars in the expressions rather than as constant tensors.
There is also caching; I’m not familiar with GPU caching, but assuming it is similar to CPU caching, such constants are probably cached if they are frequently accessed.

It is not about caching. What I tried to explain is that to call a kernel you transfer either a pointer to the tensor data (an 8-byte address) or a scalar (1-8 bytes); it is the same overhead, and probably unavoidable (unless you compiled specialized kernels with fewer function arguments).

I understand that, but the real cost comes from whether the value/pointer is accessed from cache or from GPU/CPU memory, the latter being much slower (especially from CPU memory).

In that sense, I think a special memory area is used to pass kernel arguments, and reading from it is at least not slower than reading from cached global memory. And I haven’t mentioned that the version with pointers requires doing address arithmetic in all threads, so it is worse than my oversimplified example suggests.