How to improve the slow transfer rate of tensors from GPU to CPU

How to improve the slow transfer rate of tensors from GPU to CPU in libtorch?

How did you profile this operation, and how much time does the transfer take for which tensor shape and dtype?
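In case it helps, here is a minimal sketch of how such a measurement could be set up. CUDA work is queued asynchronously, so a timer started without synchronizing first may charge earlier, still-running kernels to the copy. The helper names (`tensor_bytes`, `effective_gb_per_s`, `average_seconds`) are hypothetical and not part of libtorch. The block uses only the standard library so it compiles on its own, and the commented usage at the bottom assumes a CUDA build of libtorch and an arbitrary example shape.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bytes occupied by a dense tensor of the given shape.
inline std::size_t tensor_bytes(const std::vector<std::int64_t>& shape,
                                std::size_t element_size) {
  std::size_t count = 1;
  for (std::int64_t dim : shape) {
    assert(dim >= 0);
    count *= static_cast<std::size_t>(dim);
  }
  return count * element_size;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes).
inline double effective_gb_per_s(std::size_t bytes, double seconds) {
  assert(seconds > 0.0);
  return static_cast<double>(bytes) / seconds / 1e9;
}

// Average wall-clock seconds per call of `op`.
// `sync` must block until all queued device work has finished. It is called
// once after the warm-up runs (so pending work is not charged to `op`) and
// once after the timed runs (so the timed work has really completed).
template <typename Op, typename Sync>
double average_seconds(Op op, Sync sync, int warmup, int iters) {
  assert(iters > 0);
  for (int i = 0; i < warmup; ++i) op();
  sync();
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) op();
  sync();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count() / iters;
}

// Possible usage with libtorch (assumption: CUDA build; the shape is only an
// example; requires #include <torch/torch.h>):
//
//   torch::Tensor gpu =
//       torch::rand({64, 3, 224, 224}, torch::device(torch::kCUDA));
//   double s = average_seconds(
//       [&] { torch::Tensor cpu = gpu.to(torch::kCPU); },
//       [] { torch::cuda::synchronize(); },
//       /*warmup=*/3, /*iters=*/20);
//   double gbps = effective_gb_per_s(
//       tensor_bytes({64, 3, 224, 224}, sizeof(float)), s);
```

Reporting the shape, dtype, average time, and resulting GB/s makes it easier to tell whether the copy itself is slow or whether it is just waiting on earlier GPU work.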