Libtorch spends a lot of time moving data between CPU and GPU

Libtorch spends a lot of time on data transfer operations. Compared with running the prediction itself, moving the data takes 10-100x longer. That is very embarrassing. How can I speed this up?

Did you add synchronization points when measuring the data transfer time?
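The reason I ask: CUDA work is launched asynchronously, so the forward call can return before the GPU has finished, and the next blocking copy back to the CPU then absorbs the wait. Below is a toy model of that effect in plain C++, not libtorch code. `std::async` stands in for the GPU stream, and `launch_predict`, `copy_to_host`, and `measure` are hypothetical names made up for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::duration<double, std::milli>;

struct Timing {
    double predict_ms;  // time attributed to the forward pass
    double copy_ms;     // time attributed to the copy back to the host
};

// Toy stand-in for an asynchronous GPU launch: returns right away while the
// "kernel" (just a sleep here) keeps running on another thread.
std::future<int> launch_predict(int work_ms) {
    return std::async(std::launch::async, [work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
        return 42;  // pretend result
    });
}

// Toy stand-in for a blocking device-to-host copy: it cannot finish until
// the pending work has produced its result.
int copy_to_host(std::future<int>& result) {
    assert(result.valid());
    return result.get();
}

// Times the two phases, optionally waiting for the pending work before the
// clock is read (the role a device synchronization would play).
Timing measure(int work_ms, bool synchronize) {
    auto t0 = Clock::now();
    std::future<int> result = launch_predict(work_ms);
    if (synchronize) {
        result.wait();  // finish the "kernel" before stopping the first timer
    }
    auto t1 = Clock::now();
    copy_to_host(result);
    auto t2 = Clock::now();
    return {Ms(t1 - t0).count(), Ms(t2 - t1).count()};
}
```

Calling `measure(200, false)` charges almost all of the time to the copy, while `measure(200, true)` charges it to the predict step, even though the work done is identical. In libtorch the analogous step would be calling `torch::cuda::synchronize()` before each clock read, assuming the model runs on a CUDA device. If the transfer still dominates after that, then it is a real cost and worth optimizing.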