Moving data from GPU to CPU takes unreasonably long

Okay, this seems to be related to this post (Time for moving data to GPU varies a lot).

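Basically, in the first case the calls are all asynchronous, so they return immediately without actually finishing the work. The copy back to the CPU then has to wait for that pending work, which is why it looks like the slow part.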
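Here is a rough sketch of how one could check this, assuming the code in question is PyTorch (the thread does not say so explicitly). The `timed` helper, the tensor size, and the matmul workload are all made up for illustration. The idea is to call `torch.cuda.synchronize()` before starting and before stopping the timer, so that each operation is charged for its own work. The script falls back to the CPU if no GPU is available, in which case the two sets of timings should look roughly the same.

```python
import time
import torch

def timed(fn, sync):
    """Hypothetical helper: run fn() and return (result, elapsed seconds).

    With sync=True and a GPU present, wait for all queued GPU work
    before starting and before stopping the clock.
    """
    use_sync = sync and torch.cuda.is_available()
    if use_sync:
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    if use_sync:
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(2048, 2048, device=device)  # arbitrary size, for illustration

# No synchronization: on a GPU the matmul call returns right after the launch,
# and the copy to the host ends up waiting for the matmul to finish.
y, t_matmul = timed(lambda: x @ x, sync=False)
y_cpu, t_copy = timed(lambda: y.cpu(), sync=False)

# With synchronization: the matmul time is attributed to the matmul itself.
z, t_matmul_sync = timed(lambda: x @ x, sync=True)
z_cpu, t_copy_sync = timed(lambda: z.cpu(), sync=True)

print(f"no sync:   matmul {t_matmul:.6f}s, copy to CPU {t_copy:.6f}s")
print(f"with sync: matmul {t_matmul_sync:.6f}s, copy to CPU {t_copy_sync:.6f}s")
```

The first GPU call in a process usually also pays some one-time initialization cost, so it is probably worth running the workload once as a warm-up before trusting any of the numbers.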