Clearing cache on NVIDIA Xavier AGX improves speed of copy from CPU to GPU for PyTorch

SM19 · April 11, 2022, 10:19am

I am using PyTorch + TensorRT (via torch2trt) to run inference on the NVIDIA Xavier AGX development kit.

The problem is that loading data from a NumPy array on the CPU into a CUDA tensor using torch.from_numpy().float().to("cuda:0").unsqueeze_(0).permute(0, 3, 1, 2) is very slow.

This copy operation takes nearly 70% of the total time spent in one cycle of pre-processing + inference + post-processing.

I used the Python timeit module to measure the time spent in each of pre-processing, inference and post-processing.

Surprisingly, when I clear the cache using the jtop tool (from jetson-stats), the copy time drops drastically, and my full cycle time drops as a result.

I am trying to understand:

Why is the memory copy from a NumPy array to a CUDA tensor so slow? I am copying a 1024x1024x5 input. It takes about 60 ms.
Why does clearing the cache boost the speed? (After the cache clear I see < 20 ms.)
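
For scale, the size of that input can be estimated with a short calculation. This is only a sketch: it assumes the tensor is float32 at the time of the copy (because .float() is called on the CPU before .to("cuda:0")), the uint8 figure is a hypothetical comparison since the source dtype is not stated, and effective_rate_mb_per_s is a made-up helper name. The measured 60 ms may also include the CPU-side conversion, so the rate is a lower bound on the transfer itself.

```python
# Back-of-the-envelope size of one 1024x1024x5 frame.
# Assumption: the data is float32 when copied, since .float() runs on the CPU
# before .to("cuda:0") in the snippet from the post.
height, width, channels = 1024, 1024, 5
elements = height * width * channels

bytes_float32 = elements * 4  # 4 bytes per float32 element
bytes_uint8 = elements * 1    # hypothetical: if the source array were uint8


def effective_rate_mb_per_s(num_bytes, seconds):
    """Bytes moved per second, expressed in MB/s (1 MB = 1e6 bytes)."""
    return num_bytes / seconds / 1e6


print(f"float32 frame: {bytes_float32 / 2**20:.0f} MiB")
print(f"uint8 frame:   {bytes_uint8 / 2**20:.0f} MiB")
print(f"rate at 60 ms: {effective_rate_mb_per_s(bytes_float32, 0.060):.0f} MB/s")
print(f"rate at 20 ms: {effective_rate_mb_per_s(bytes_float32, 0.020):.0f} MB/s")
```

If the source array really were a narrower type than float32, converting after the transfer would move fewer bytes, but that depends on the actual input dtype.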

The entire system runs as a ROS node.

Please help me understand.

Best Regards
Sambit

ptrblck · April 12, 2022, 4:44am

Do you see different times when you properly synchronize the code before starting and stopping the timers?
If so, could you check if some kind of cache is being used to offload the data?

SM19 · April 12, 2022, 8:33am

I know that I need to call torch.cuda.synchronize(), but this only ensures that the copy is finished before moving on, right?

My problem is that the copy operation seems to take unnaturally long on the Xavier AGX devkit.

And then, even more strangely, when I call torch.cuda.empty_cache(), things speed up a lot.
Why is that?

Also, I tried pinning the host memory:
test_x = torch.from_numpy(test_x_npy).float().unsqueeze_(0).permute(0, 3, 1, 2).pin_memory()

This seems to make no difference in speed.

Please note that the function runs in a loop inside a ROS subscriber node.
So test_x gets allocated in each iteration.

Summary: How can I make the copy from a NumPy array to a CUDA tensor fast?
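
One pattern that may help with the per-iteration allocation is to create a pinned host buffer and a device buffer once and reuse them for every frame. The following is a minimal sketch, not a verified fix for the Xavier AGX: host_buffer, device_buffer and to_device are hypothetical names, the input is assumed to be float32 with shape 1024x1024x5, and the CPU fallback exists only so the sketch runs without a GPU.

```python
import numpy as np
import torch

# Assumption: fall back to the CPU so the sketch still runs without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
use_pinned = device.type == "cuda"  # pinning host memory requires CUDA

H, W, C = 1024, 1024, 5  # input shape mentioned in the thread

# Allocate both buffers once, outside the subscriber callback.
# Assumption: frames are float32; adjust the dtype to the real data.
host_buffer = torch.empty((H, W, C), dtype=torch.float32, pin_memory=use_pinned)
device_buffer = torch.empty((H, W, C), dtype=torch.float32, device=device)


def to_device(frame_npy):
    """Hypothetical helper: copy one HWC frame, return a 1xCxHxW tensor."""
    # Fill the existing host buffer instead of allocating a new tensor.
    host_buffer.copy_(torch.from_numpy(frame_npy))
    # Blocking copy into the preallocated device buffer.
    device_buffer.copy_(host_buffer)
    # unsqueeze and permute return views, so no extra copy happens here.
    return device_buffer.unsqueeze(0).permute(0, 3, 1, 2)


frame = np.random.rand(H, W, C).astype(np.float32)
test_x = to_device(frame)
print(test_x.shape, test_x.device)
```

Because the returned tensor is a view of device_buffer, it is overwritten by the next call; clone it if a frame has to outlive the following iteration.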

ptrblck · April 12, 2022, 6:38pm

Yes, and that is why the timers then measure the actual copy. If you do not synchronize the code, you might be measuring only e.g. the kernel launches, which is wrong, since it would not mean that the actual copy had already finished.
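
A synchronized measurement could look roughly like the sketch below. The names sync and time_copy are hypothetical helpers, the 1024x1024x5 float32 frame is an assumption taken from the numbers above, and the CPU fallback is only there so the snippet runs on a machine without a GPU.

```python
import time

import numpy as np
import torch

# Assumption: fall back to the CPU so the sketch still runs without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


def sync():
    # torch.cuda.synchronize() blocks until all queued GPU work has finished.
    if device.type == "cuda":
        torch.cuda.synchronize()


def time_copy(array, repeats=10):
    """Hypothetical helper: mean seconds per host-to-device copy."""
    tensor = torch.from_numpy(array)
    tensor.to(device)  # warm-up so one-time CUDA initialization is not timed
    sync()             # make sure nothing is still pending before starting
    start = time.perf_counter()
    for _ in range(repeats):
        tensor.to(device)
    sync()             # wait for the copies to finish before stopping
    return (time.perf_counter() - start) / repeats


frame = np.random.rand(1024, 1024, 5).astype(np.float32)
seconds = time_copy(frame)
print(f"mean copy time: {seconds * 1e3:.2f} ms")
```

Timing the copy this way before and after clearing the cache should show whether the difference is in the transfer itself or in something else in the loop.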