Understanding asynchronous execution

According to https://pytorch.org/docs/master/notes/cuda.html, GPU operations are asynchronous: operations are enqueued on the device and executed in parallel. But there is also a caveat that this all happens under the hood, and users can treat the execution as if it were synchronous.

If I understand correctly, once the user demands the result of an operation, it can't be deferred any longer; the operation must actually be carried out, which strips away the chance for further optimization.

Which operations, then, force such execution? And what are some guidelines for getting the most out of this asynchronous execution?
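
To make the question concrete, here is a rough sketch (assuming a CUDA device is available; the sizes are arbitrary) of the asynchronous behavior I'm asking about:

```python
import time
import torch

# Rough sketch: timing a large matmul with and without an explicit sync.
device = torch.device("cuda")
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

start = time.time()
c = a @ b                                    # the kernel is only enqueued on the CUDA stream
print("after launch:", time.time() - start)  # returns almost immediately

torch.cuda.synchronize()                     # block until the enqueued work has finished
print("after sync:", time.time() - start)    # now reflects the real compute time
```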


Sorry, I’m a bit confused about what you’re asking. What do you mean by “user demands the result of an operation”?

For example, "print(tensor)": if the user demands this, it must block whatever expressions come after it, must it not?

Are there some other expressions of that kind?

I think the question I have is rather: how is a tensor "represented" from the Python perspective? If I create a tensor, do I just get a placeholder rather than a real array of values? Whatever I do to that placeholder, I just get another placeholder; all the operations are scheduled and optimized under the hood. Only when I demand the result in a non-PyTorch form does it block until the placeholder is resolved.


Operations that require a synchronize will block (see cudaStreamSynchronize and cudaDeviceSynchronize). In particular, device-to-host transfers require a synchronize, which is why print will block.
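
As a rough illustration (assuming a CUDA device; this is a sketch, not an exhaustive list), anything that needs the values on the host will block until the queued kernels have finished:

```python
import torch

x = torch.randn(4096, 4096, device="cuda")
y = x @ x            # enqueued asynchronously; Python gets a result tensor right away
z = y.relu()         # also just enqueued; nothing has been waited on yet

# Each of these needs the values on the host, so it blocks until the
# kernels that produce z have actually finished:
print(z[0, 0])           # printing materializes values on the CPU
val = z.sum().item()     # .item() copies a scalar to the host
z_cpu = z.cpu()          # explicit device-to-host copy

# You can also block explicitly, without transferring any data:
torch.cuda.synchronize()
```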

Tensors are backed by a Python storage object that holds a pointer to data that can live on the GPU or the CPU.
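
A quick sketch of that (again assuming CUDA): the Python-side tensor object and its metadata exist immediately, even if the kernel producing the values is still running.

```python
import torch

t = torch.randn(2048, 2048, device="cuda")
u = t @ t             # may still be running on the GPU at this point

# Metadata is available on the Python side without any synchronization:
print(u.shape)        # torch.Size([2048, 2048])
print(u.device)       # cuda:0
print(u.data_ptr())   # raw device pointer held by the tensor's storage
```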


That makes sense. Does it mean that appending a tensor to a Python list happens promptly, without the need for synchronization?

Yeah, that should be correct. (Unless there’s some device-to-host transfer going on there, but I don’t think that’s the case.)
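
Something like this sketch (assuming CUDA; the loop and sizes are just for illustration) should stay asynchronous until the final transfer:

```python
import torch

device = torch.device("cuda")
losses = []

for _ in range(100):
    x = torch.randn(1024, 1024, device=device)
    loss = (x @ x).mean()   # result stays on the GPU
    losses.append(loss)     # only stores the tensor handle; no synchronization needed

# A single synchronization happens here, when the values are copied to the host:
values = torch.stack(losses).cpu().tolist()
```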
