Exploiting shared memory between CPU and GPU on Jetson devices

I have a Jetson Orin Nano on which I'm running a model with TensorRT.
I've found that the most time-consuming operation is transferring a torch Tensor to the CUDA device with `.to("cuda:0")`.
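For reference, this is roughly the call I'm timing (a minimal sketch; the tensor shape here is a placeholder, not my actual input size):

```python
import torch

# Placeholder input on the host; my real tensors come from preprocessing.
x = torch.randn(1, 3, 224, 224)

# Fall back to CPU so the sketch also runs on machines without CUDA.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# This is the call that dominates my runtime: a host-to-device copy.
x_dev = x.to(device)
```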

Profiling with torch.profiler shows that 97% of the overhead comes from copying the tensor. However, the Jetson Orin has memory that is physically shared between the CPU and the GPU, so in principle I could avoid copying the tensor, since the GPU has access to the same memory.
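This is roughly how I profiled it (a sketch using a plain tensor transfer as the workload; my real workload is the input preparation for the TensorRT model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1, 3, 224, 224)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Record CUDA activity only when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    x_dev = x.to(device)

# Sorting by total CPU time shows the copy (aten::to / aten::copy_) at the top.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```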

Is it possible to change the tensor's device without copying it? If so, how?