Low performance of transferring tensor to CUDA

Both to() and other copy operations, e.g. copy_(), accept the non_blocking argument. An example was posted in your other question.
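For reference, here is a minimal sketch of how both calls could be used. The helper names (to_device_async, copy_into) and the tensor shapes are made up for illustration and are not taken from your code. Note that non_blocking=True can only overlap a host-to-device copy with other work if the source tensor lives in pinned (page-locked) memory; otherwise the copy is still synchronous. The snippet falls back to the CPU if no GPU is available, in which case the flag has no effect.

```python
import torch


def to_device_async(cpu_tensor, device):
    """Hypothetical helper: move a CPU tensor to `device` with non_blocking=True."""
    if device.type == "cuda":
        # Pinned host memory is required for a truly asynchronous host-to-device copy.
        cpu_tensor = cpu_tensor.pin_memory()
    return cpu_tensor.to(device, non_blocking=True)


def copy_into(dst, src):
    """Hypothetical helper: copy `src` into a preallocated `dst` via copy_."""
    dst.copy_(src, non_blocking=True)
    return dst


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(1024, 1024)  # illustrative shape, lives on the CPU

# Variant 1: to() returns a new tensor on the target device.
y = to_device_async(x, device)

# Variant 2: copy_() fills an already allocated buffer on the target device.
buf = torch.empty_like(x, device=device)
src = x.pin_memory() if device.type == "cuda" else x
copy_into(buf, src)

if device.type == "cuda":
    # Wait for the pending copies before timing anything or reading results on the host.
    torch.cuda.synchronize()
```

If you are measuring transfer time, make sure to synchronize before stopping the timer, since the asynchronous call returns before the copy has finished.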