D is a matrix with k very long columns, and I want to compute
D + d.ger(x), i.e. a rank-1 update. I don't need to backpropagate through it. I think the best approach is a simple for loop:
for i in range(k): D[:, i] += x[i] * d
Does x[i] * d create a temporary tensor on every iteration?
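For reference, this is one way I could check (the sizes are made up, and I'm not sure memory_allocated is the right tool for this):

import torch

n, k = 1_000_000, 8                      # made-up sizes: k very long columns
D = torch.zeros(n, k, device="cuda")
d = torch.randn(n, device="cuda")
x = torch.randn(k, device="cuda")

torch.cuda.synchronize()
before = torch.cuda.memory_allocated()
tmp = x[0] * d                           # keep the reference so the buffer stays alive
torch.cuda.synchronize()
print(torch.cuda.memory_allocated() - before)  # nonzero would mean a temporary buffer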
Maybe I should use
for i in range(k): D[:, i].add_(d, alpha=x[i])
Is this the most memory-efficient version?
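For comparison, the vectorized one-liner would be something like the following. I assume addr_ is the right in-place rank-1 update here, but I don't know whether it materializes the full outer product internally, which is exactly what I'm trying to avoid:

# in-place D += outer(d, x); does this allocate an n-by-k intermediate?
D.addr_(d, x)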
What happens when I use Python scalars, as in
some_tensor * 2.4? Is 2.4 moved from the CPU to the GPU as usual? Should I pre-initialize known constants (T.scalar_tensor(2.4)) as much as possible to avoid slowdowns? Even small integers such as 2 and 3? Or is the transfer asynchronous, and thus irrelevant as long as the GPU still has some work queued?
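Concretely, I'm comparing something like the two variants below (the timing harness is just a rough sketch):

import time
import torch

t = torch.randn(1_000_000, device="cuda")
c = torch.tensor(2.4, device="cuda")    # pre-initialized constant living on the GPU

def bench(fn, iters=1000):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return time.perf_counter() - start

print("Python scalar:", bench(lambda: t * 2.4))
print("GPU constant :", bench(lambda: t * c))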
Is there a way to make absolutely sure a piece of code is not creating temporary buffers or doing CPU->GPU transfers? Maybe a context manager that makes the offending code throw?
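What I'm imagining is something along these lines. assert_no_temporaries is a helper I sketched myself, not an existing torch API, and I don't know whether the peak-memory statistics are reliable enough for this:

import contextlib
import torch

@contextlib.contextmanager
def assert_no_temporaries():
    # Hypothetical helper: raise if the wrapped block grew peak GPU memory.
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    base = torch.cuda.memory_allocated()
    yield
    torch.cuda.synchronize()
    grown = torch.cuda.max_memory_allocated() - base
    if grown > 0:
        raise RuntimeError(f"block allocated {grown} bytes of temporary buffers")

# intended usage:
with assert_no_temporaries():
    for i in range(k):
        D[:, i].add_(d, alpha=x[i])

For the transfer side I also found torch.cuda.set_sync_debug_mode("error"), but I'm not sure whether it flags host-to-device copies or only implicit synchronizations.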