Code and memory optimization

Hi all,

I have developed a tracking system that operates on video. I also have strong embeddability constraints (my code needs to run on small boards/CPUs) and the tracker must run in real time. Thus, I spend a lot of time optimizing my code, using torch functions and doing as many operations as possible on the GPU.

But one point is not clear to me: I am not sure when data is exchanged between the CPU and the GPU, and how memory is managed exactly. For instance:

a = torch.tensor([1, 2, 3]).float().cuda() # tensor a will be on GPU
b = a + 1 # tensor b will also be on GPU
c = len(a) # tensor c will still be on GPU

A few questions about these basic operations:

  • On line 1, does the .float() function make a copy of a in memory?
  • On line 2, does the + operation copy the data of a to the CPU, compute the result, then send the data back to the GPU? Or is torch intelligent enough to do the addition directly on the GPU? Is it always better to use the torch .add() function?
  • Do functions such as len() copy the tensor to the CPU before counting?

And finally, is there a difference between these two lines:

a = torch.tensor(1.).cuda().add_(1.)
a = torch.tensor(1.).cuda().add_(torch.tensor(1.).cuda())

Thank you in advance, any help will be much appreciated! :slight_smile:

c is not a Tensor here but a Python number, and as such it lives on the CPU.

For your other questions:

  • .float() makes a full copy if the given Tensor is not already of a floating-point type; otherwise it returns the input as-is.
  • All operations on GPU Tensors happen on the GPU. + and .add() are the same thing, so you can use either.
  • len() returns a Python number, so the result will always be on the CPU (Python numbers cannot live on the GPU). It returns the size of the first dimension, which doesn’t need access to the content of the Tensor, only its metadata, which is read on the CPU side.
  • The two lines are very similar. The first one is preferable: a single plain number can be passed as an argument to the GPU kernel, which is faster than creating a full Tensor containing a 1. and sending that to the GPU.
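To make these points concrete, here is a small sketch (not from the original thread) that checks the copy and device behavior described above; it falls back to the CPU when no GPU is available, and the copy checks hold on either device:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.tensor([1, 2, 3], dtype=torch.float, device=device)

# .float() on an already-float tensor returns the input as-is: same storage.
b = a.float()
print(b.data_ptr() == a.data_ptr())  # True

# A dtype change (e.g. .int()) must allocate a new tensor.
c = a.int()
print(c.data_ptr() == a.data_ptr())  # False

# Arithmetic on a tensor stays on that tensor's device; nothing moves.
d = a + 1
print(d.device == a.device)  # True

# len() only reads metadata and returns a plain Python int.
n = len(a)
print(type(n))  # <class 'int'>

# In-place add with a Python scalar: no extra tensor is allocated.
a.add_(1.)
```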

Some other pointers for perf:

  • PyTorch will never copy the content of a Tensor between devices for you; a copy only happens when you explicitly call .cuda(), .cpu(), .to(), etc.
  • It is better to create the Tensor on the right device directly to avoid extra copies: torch.tensor([1, 2, 3]).float().cuda() should be torch.tensor([1, 2, 3], dtype=torch.float, device="cuda").
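The two creation patterns above can be compared side by side; this sketch (my own, using .to(device) so it also runs without a GPU) shows that both produce the same tensor, but the first goes through two intermediate copies while the second allocates once with the right dtype on the right device:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two copies: an int64 CPU tensor, then a float32 CPU copy, then a device copy.
slow = torch.tensor([1, 2, 3]).float().to(device)

# One allocation, directly with the right dtype on the right device.
fast = torch.tensor([1, 2, 3], dtype=torch.float, device=device)

print(torch.equal(slow, fast))  # True
print(fast.dtype)  # torch.float32
```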

That is perfectly clear. Thank you very much for your quick answer!