How to run inference asynchronously

By default, CUDA kernels are launched asynchronously: the Python call returns as soon as the kernel is queued on the GPU, not when it finishes. You need to call torch.cuda.synchronize() to block until all launched kernels are done, for example before reading a result back to the CPU or timing GPU code.
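A minimal sketch of why this matters when timing GPU code. Assumptions: PyTorch is installed and the matrix size (1024×1024) is arbitrary; the code falls back to CPU when no GPU is present, in which case the synchronize calls are skipped and the timing is already accurate.

```python
import time

import torch

def timed_matmul(device: str) -> float:
    """Time a matmul, synchronizing so the measurement covers real GPU work."""
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the setup kernels above to finish
    start = time.perf_counter()
    c = a @ b  # on CUDA this only *queues* the kernel and returns immediately
    if device == "cuda":
        torch.cuda.synchronize()  # block until the matmul actually completes
    return time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
elapsed = timed_matmul(device)
print(f"matmul on {device} took {elapsed:.6f}s")
```

Without the second synchronize, the measured time on CUDA would mostly reflect kernel launch overhead rather than the matmul itself.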