Libtorch memory options for tensors - pinned memory, zero copy memory

Hello Pytorch Team,

I have an application running using Libtorch + TorchTensorrt.
For this, I create the input by first allocating a tensor of shape BCHW on the GPU and then writing values into the pixels. However, this turns out to be very slow - slower than copying data from a NumPy array in Python.

So, I was wondering whether using pinned memory or zero-copy memory would help.

In any case, I would like to know how I can create tensors with pinned and zero-copy memory models.
Also, if I pre-allocate such an array on the GPU using the normal CUDA APIs, is it possible to later wrap or copy it into a Libtorch tensor?

Please help me understand this scenario.

Summary - What is the fastest way to create tensors on, and copy data to, the GPU in Libtorch?

Apart from this, a remark: the documentation for Torch-TensorRT seems to be heavily outdated.

The example on the main documentation page would probably not work, as the input should be of IValue type - this is shown on other pages of the documentation.

Please update the information if this is indeed the case.

Best Regards

You should be able to specify it in the options you pass to the tensor initialization:

torch::TensorOptions options = torch::TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory);

Yes, from_blob should work but you would have to make sure the actual memory doesn’t go out of scope or is released, as PyTorch will not trigger a copy (you can of course manually trigger a copy if needed).

Two small follow-up questions:

  1. Does creating a tensor on the GPU take significant time, or is it only the data transfer that is slow? In other words, if my function reuses a static zero tensor on the GPU, should I expect any speedup, given that on each call I will still build the input in a normal CPU array and copy it into that GPU tensor via from_blob?

  2. What is the C++ equivalent of torch.cuda.empty_cache()?