Avoiding allocations and memory transfers

Let’s say D is a matrix with k very long columns, and I want to compute D + d.ger(x) (adding the outer product of the vectors d and x). I don’t need to backpropagate through it. I think the best approach is a simple for loop:

for i in range(k):
    D[:, i] += x[i] * d

Does x[i]*d create a temporary tensor?
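It does: every out-of-place op returns a freshly allocated tensor. A quick way to see this (a CPU sketch using the names n, k, d, x, D from the question; the buffer reuse via `out=` is one possible workaround, not necessarily the best) is to compare data pointers:

```python
import torch

n, k = 1000, 4
D = torch.zeros(n, k)
d = torch.randn(n)
x = torch.randn(k)

# Out-of-place multiply: a new tensor is allocated on every call.
a = x[0] * d
b = x[0] * d
assert a.data_ptr() != b.data_ptr()

# Reusing a preallocated buffer avoids the per-iteration temporary.
buf = torch.empty_like(d)
for i in range(k):
    torch.mul(d, x[i], out=buf)  # writes into buf, no new allocation
    D[:, i] += buf
```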
Maybe I should use

for i in range(k):
    D[:, i].add_(d, alpha=x[i])

Is this the most memory-efficient version?

What happens when I use Python scalars, as in some_tensor * 2.4? Is 2.4 moved from the CPU to the GPU as usual? Should I pre-initialize known constants (torch.scalar_tensor(2.4)) as much as possible to avoid slowdowns? Even small integers such as 2 and 3? Or is the transfer asynchronous, and thus irrelevant as long as the GPU still has some work to complete?
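As far as I understand, a Python scalar in some_tensor * 2.4 is not uploaded as a tensor at all: it is wrapped as a scalar argument of the kernel launch, so there is no separate host-to-device copy to pre-empt. A CPU-only sketch of both spellings (on a GPU you would create the 0-dim tensor once with device=t.device, which is my assumption about the intended usage):

```python
import torch

t = torch.arange(4, dtype=torch.float32)

# Spelling 1: plain Python scalar. The value travels with the op
# itself; no separate tensor transfer is involved.
a = t * 2.4

# Spelling 2: a pre-built 0-dim tensor, created once and reused.
c = torch.scalar_tensor(2.4)
b = t * c

assert torch.allclose(a, b)
```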

Is there a way to make absolutely sure a piece of code is not creating temporary buffers or doing CPU->GPU transfers? Maybe a context manager that makes the offending code throw?
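I am not aware of a context manager that throws on any allocation, but two checks come close: comparing torch.cuda.memory_allocated() before and after a region asserts that it allocated nothing on the device, and torch.cuda.set_sync_debug_mode("error") makes implicit device synchronizations (e.g. a hidden GPU->CPU copy from .item()) raise. A sketch of the first idea, with a hypothetical helper name; the check is only active when a GPU is present:

```python
import torch

def assert_no_new_gpu_allocations(fn):
    """Run fn and fail if it grew the CUDA caching allocator's usage.

    On a CPU-only build this simply runs fn (no check possible).
    """
    if not torch.cuda.is_available():
        return fn()
    before = torch.cuda.memory_allocated()
    out = fn()
    torch.cuda.synchronize()
    assert torch.cuda.memory_allocated() == before, "fn allocated GPU memory"
    return out

d = torch.randn(1000)
D = torch.zeros(1000, 3)

# The in-place add_ writes into D's existing storage,
# so it should not allocate a fresh buffer.
assert_no_new_gpu_allocations(lambda: D[:, 0].add_(d, alpha=0.5))
```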

A: The in-place method Tensor.addr_ is what I was looking for: D.addr_(d, x) adds the outer product of the two vectors directly into D, without allocating an intermediate matrix.
As a side note, since PyTorch stores matrices in row-major order, the column accesses in the loops above hit non-contiguous memory, so they are slower than the corresponding row accesses.
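A minimal CPU check of the addr_ route (shapes as in the question); with the default beta=1 and alpha=1 it computes D + outer(d, x) in place:

```python
import torch

n, k = 1000, 4
d = torch.randn(n)
x = torch.randn(k)
D = torch.randn(n, k)
expected = D + torch.outer(d, x)  # out-of-place reference

# In-place rank-1 update: D <- beta*D + alpha*outer(d, x),
# with defaults beta = alpha = 1. No temporary matrix is created.
D.addr_(d, x)

assert torch.allclose(D, expected)
```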