Moving data from GPU to CPU takes unreasonably long

x is a large matrix with 1 million rows and 1000 columns.

The math operations are very fast, as expected. But retrieving the result, which is a Tensor of size 1, takes more than 10 seconds. What is happening here?

Okay, this seems to be related to this post (Time for moving data to GPU varies a lot).

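Basically, in the first case the calls are all asynchronous, so they return immediately without actually finishing the work. The time only shows up later, when the result is copied to the CPU, because that copy has to wait for the queued GPU work to complete.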
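
For anyone who wants to see this for themselves, here is a rough sketch of how the timings could be separated. It is not the code from the original question: the helper name `timed`, the `(x * x).sum()` operation, and the smaller 10,000 x 1000 matrix are all assumptions chosen for illustration. It uses `torch.cuda.synchronize()` to wait for queued GPU work before stopping the clock, and falls back to the CPU (where everything is synchronous anyway) if no GPU is available.

```python
import time
import torch

def timed(fn, device, sync):
    """Run fn() and return (result, elapsed seconds).

    If sync is True and we are on a CUDA device, wait for all queued
    GPU work to finish before stopping the clock.
    """
    start = time.perf_counter()
    out = fn()
    if sync and device.type == "cuda":
        torch.cuda.synchronize(device)
    return out, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumption: smaller than the 1,000,000 x 1000 matrix in the question,
# to keep memory use modest.
x = torch.randn(10_000, 1000, device=device)

# No synchronize: on CUDA this mostly measures how long it takes to
# queue the kernels, not how long they take to run.
y, t_launch = timed(lambda: (x * x).sum(), device, sync=False)

# .item() copies the scalar to the CPU, so it has to wait for the
# pending kernels. The compute time gets charged to this step.
value, t_fetch = timed(lambda: y.item(), device, sync=False)

# With synchronize: the compute time is charged to the compute step.
y2, t_compute = timed(lambda: (x * x).sum(), device, sync=True)

print(f"launch only: {t_launch:.4f}s")
print(f"fetch result: {t_fetch:.4f}s")
print(f"compute with sync: {t_compute:.4f}s")
```

On a GPU you would expect "launch only" to look very small and "fetch result" to absorb the real cost, which matches what the question describes. The exact numbers depend on the hardware, and the first CUDA call usually includes some one-time startup overhead as well.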