I am using PyTorch + TensorRT (via torch2trt) to run inference on the NVIDIA Xavier AGX development kit.
The problem is that loading data from a NumPy array on the CPU into a CUDA tensor using
`torch.from_numpy().float().to("cuda:0").unsqueeze_(0).permute(0, 3, 1, 2)`
is very slow.
This copy operation dominates the runtime: nearly 70% of one full pre-process + inference + post-process cycle.
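For reference, the one-liner above expands to something like the sketch below (the input shape is taken from the post; the CPU fallback and variable names are mine, so it also runs on machines without CUDA):

```python
import numpy as np
import torch

# Hypothetical 1024x1024x5 HWC frame, as described in the post.
frame = np.random.rand(1024, 1024, 5).astype(np.float32)

# Fall back to CPU so the sketch runs anywhere (assumption, not in the original).
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tensor = (
    torch.from_numpy(frame)   # zero-copy view of the NumPy buffer (still on CPU)
    .float()
    .to(device)               # the expensive host-to-device copy
    .unsqueeze_(0)            # add batch dimension: 1 x 1024 x 1024 x 5
    .permute(0, 3, 1, 2)      # HWC -> CHW: 1 x 5 x 1024 x 1024
)
assert tensor.shape == (1, 5, 1024, 1024)
```

Note that `.to(device)` is the only step here that moves data across the PCIe/host bus; the `permute` is a view and costs almost nothing by itself.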
I used Python's timeit module to measure the time spent in each of pre-processing, inference, and post-processing.
Surprisingly, when I clear the cache using the jtop tool (from jetson-stats), the copy time drops drastically, and my full cycle time shrinks as a result.
I am trying to understand:
Why is the memory copy from a NumPy array to a CUDA tensor so slow? I am copying a 1024x1024x5 input, and it takes about 60 ms.
Why does clearing the cache boost the speed? (After a cache clear I see < 20 ms.)
Are you still seeing different times when you properly synchronize the code before starting and stopping the timers?
If so, could you check whether some kind of cache is being used to offload the data?
Yes, and in that case the timers are measuring the actual copy. If you are not synchronizing, you might be measuring only the asynchronous launches, which is wrong and would not mean that the actual copy has already finished.
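A minimal sketch of what "properly synchronizing" means here, assuming the timings come from timeit as in the original post (the helper name is mine): call `torch.cuda.synchronize()` both before starting and before stopping the timer, so pending asynchronous work is drained and the measurement covers the full transfer rather than just the launch.

```python
import timeit
import numpy as np
import torch

def timed_copy_ms(frame: np.ndarray, device: torch.device) -> float:
    """Time one host-to-device copy with explicit synchronization."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # drain pending work before starting the timer
    start = timeit.default_timer()
    t = torch.from_numpy(frame).to(device)
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait until the copy has actually finished
    return (timeit.default_timer() - start) * 1e3

# CPU fallback is mine, so the sketch also runs without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
frame = np.random.rand(1024, 1024, 5).astype(np.float32)
elapsed_ms = timed_copy_ms(frame, device)
assert elapsed_ms >= 0.0
```

An alternative is `torch.cuda.Event(enable_timing=True)` with `record()` and `elapsed_time()`, which measures on the GPU timeline instead of the host clock.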