Time for moving data to GPU varies a lot

I am trying to move a big tensor from CPU to GPU with the .cuda() method.

But I found that the time varies a lot. For example, the first copy takes a few ms, but the second one may take a few seconds. When I use pdb to debug the code, TENSOR.cuda() is fast if I pause for a while before executing it; if I do not pause, the operation takes much longer.
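
Roughly, this is a minimal sketch of what I am doing (the 10000x10000 size is just a placeholder, not my real data):

    import time
    import torch

    x = torch.randn(10000, 10000)  # a big CPU tensor (size is arbitrary)

    t0 = time.time()
    y = x.cuda()                   # first copy: only a few ms
    print('first copy: %.3f s' % (time.time() - t0))

    t0 = time.time()
    y = x.cuda()                   # second copy: sometimes takes seconds
    print('second copy: %.3f s' % (time.time() - t0))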

When I use Ctrl+C to stop the program, it is stopped at this line:

    return new_type(self.size()).copy_(self, async)

So does anyone know what caused this?

I am not entirely sure whether I am right or not.

PyTorch seems not to run as plain Python code. The Python code is more like an instruction stream. Until we use or print the result explicitly, the result is not really in memory. So every time I call print or a CUDA operation, it takes a while to fetch the data.

I haven't tested this out in pytorch itself, but generally speaking, any time you ask the gpu to do something, it's fairly asynchronous. So, you send a request, in the mail as it were, and at some point in the future, when the gpu feels like it, has finished its breakfast, read the morning paper etc, you get the results back.

Now, ok, it doesn't really read the paper etc, but it does take a while to do stuff. If you send more stuff whilst the first stuff hasn't finished, obviously the second set of stuff will be delayed for a while.

copying data to the gpu counts as ‘doing stuff’. It may look instantaneous, because the instruction returns immediately, but it'll take a while.

Hunt for an instruction with a name like ‘sync’, or similar, and have a play with that. In fact there's an example I made earlier here:
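
For reference, in pytorch that instruction is torch.cuda.synchronize(). Here's a minimal sketch of using it to bracket the timing, so earlier queued gpu work doesn't get blamed on the copy (assumes a cuda device; the size is arbitrary):

    import time
    import torch

    x = torch.randn(10000, 10000)   # big CPU tensor, arbitrary size

    torch.cuda.synchronize()        # wait for previously queued GPU work
    t0 = time.time()
    y = x.cuda()                    # enqueue the copy
    torch.cuda.synchronize()        # wait until the copy itself is done
    print('copy took %.3f s' % (time.time() - t0))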


Thanks hughperkins. I like your explanation. It is intuitive and straightforward.

I reckon the network forward propagation is asynchronous too. Do you have any idea about that? Or do all PyTorch calls just “return immediately” instead of “return after really executing”?

correct. pretty much anything going to the gpu, forward prop, or whatever, is in general asynchronous, unless something forces otherwise. Things that cause sync points (see the sketch below the list):

  • calling ‘sync’ :slight_smile:
  • memory allocation
  • displaying values on the host
  • anything that causes values to be copied to the host
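
a small sketch of the last two points, assuming a cuda device (the sizes are arbitrary):

    import torch

    a = torch.randn(4096, 4096).cuda()   # arbitrary size
    b = a @ a                # queued on the gpu; returns immediately
    c = b.sum()              # still asynchronous, result stays on the gpu
    print(c.item())          # copying the value to the host forces a sync,
                             # so this line waits for the matmul and the sum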

Thanks for the reply :grinning:

I think I now know why .cuda() has such variable execution time. A “traffic jam” happens on the GPU :joy:
