The CUDA semantics notes at https://pytorch.org/docs/master/notes/cuda.html say that GPU operations are asynchronous: they are enqueued on a stream and executed in parallel with the host. There is also a caveat that this all happens under the hood, so users can treat execution as if it were synchronous.
If I understand correctly, once the user demands the result of an operation, it can no longer be deferred; PyTorch must wait for the queued work to finish, giving up the chance for further overlap and optimization.
Which operations force such synchronization? And what are some guidelines for getting the most out of this asynchronous execution?
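To make my question concrete, here is a small sketch of my current understanding (it falls back to CPU if no GPU is present, and the comments are my guesses, not statements from the docs):

```python
import torch

# Use the GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(100, 100, device=device)
y = x @ x        # on CUDA this is (I believe) enqueued asynchronously and returns immediately
z = y.sum()      # still no host sync needed; the result stays on the device
value = z.item() # copying the scalar to the host must, I assume, block until the queue drains
```

Is it right that calls like `.item()`, `.cpu()`, or printing a tensor are the ones that force the wait, while explicit blocking is only needed via `torch.cuda.synchronize()` (e.g. for timing)?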