Q1:
Let's say `D` is a matrix with `k` very long columns and I want to compute `D + d.ger(x)`. Also, I don't need to backpropagate through it. I think the best approach is a simple for loop:
```python
for i in range(k):
    D[:, i] += x[i] * d
```
Does `x[i]*d` create a temporary tensor?
Maybe I should use:

```python
for i in range(k):
    D[:, i].add_(d, alpha=x[i])
```
Is this the most memory-efficient version?
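For concreteness, here is the small self-check I put together (shapes shrunk just for testing). It also tries `addr_`, which, if I read the docs right, is the fused in-place rank-1 update and would sidestep the per-column question entirely:

```python
import torch

k, n = 4, 10
D = torch.randn(n, k)  # "very long" columns, kept short here
d = torch.randn(n)
x = torch.randn(k)

# Reference: out-of-place, allocates a fresh n x k result.
# torch.outer is the current name for torch.ger (ger is deprecated).
ref = D + torch.outer(d, x)

# Column-by-column in-place update.
D1 = D.clone()
for i in range(k):
    D1[:, i].add_(d, alpha=x[i].item())  # alpha expects a Python number

# Fused in-place rank-1 update: D2 += d x^T, no per-column temporaries.
D2 = D.clone()
D2.addr_(d, x)

assert torch.allclose(ref, D1) and torch.allclose(ref, D2)
```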
Q2:
What happens when I use Python scalars, as in `some_tensor * 2.4`? Is `2.4` moved from the CPU to the GPU as usual? Should I pre-initialize known constants (`T.scalar_tensor(2.4)`) as much as possible to avoid slowdowns? Even small integers such as `2` and `3`? Or maybe the transfer is asynchronous and thus irrelevant, as long as the GPU still has some work to complete?
Q3:
Is there a way to make absolutely sure a piece of code is not creating temporary buffers or doing CPU→GPU transfers? Maybe some context manager that causes the code to throw?
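In case it helps, this is the kind of guard I have in mind, as a generic sketch: a context manager that raises if some counter grew inside the block. Passing `torch.cuda.memory_allocated` as the counter is hypothetical usage on my part and needs a CUDA build; the helper name is mine:

```python
import contextlib


@contextlib.contextmanager
def assert_no_growth(measure, what="bytes"):
    # `measure` is any zero-argument counter, e.g. torch.cuda.memory_allocated
    # to flag fresh CUDA allocations inside the block.
    before = measure()
    yield
    grown = measure() - before
    if grown > 0:
        raise RuntimeError(f"block allocated {grown} extra {what}")


# Example with a plain counter standing in for memory_allocated:
state = {"n": 0}
with assert_no_growth(lambda: state["n"]):
    pass  # counter unchanged, no error
```

One caveat I'm aware of: `memory_allocated` only sees tensors that are still alive on exit, so a temporary freed inside the block wouldn't show up; `torch.cuda.reset_peak_memory_stats()` plus `torch.cuda.max_memory_allocated()` might be the better counter for that. For the transfer side, I also found `torch.cuda.set_sync_debug_mode("error")`, which, as I understand it, makes synchronizing calls raise, so it would catch many accidental device/host round trips, though not every copy.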