Libtorch memory options for tensors - pinned memory, zero-copy memory

Hello PyTorch Team,

I have an application running with Libtorch + Torch-TensorRT.
For it, I create the input by first allocating a tensor of shape BCHW on the GPU and then writing values into the individual pixels. However, this turns out to be super slow - slower than copying the data from a numpy array in Python.

So, I was wondering if using pinned memory / zero-copy memory would help?

In any case, I would like to know how I can create tensors with pinned and zero-copy memory.
Also, if I pre-allocate such an array on the GPU using the normal CUDA APIs, is it possible to later copy it into a Libtorch tensor?

Please help me understand this scenario.

Summary - What is the fastest way to create/copy data to the GPU in Libtorch?

Apart from this, a remark: the documentation for Torch-TensorRT seems to be heavily outdated.

The example shown on the main documentation page would probably not work, as the input should be of IValue type - this can be found on other pages of the documentation.

Please update the information if this is indeed the case.

Best Regards
Sambit

You should be able to specify it in the options you would pass to the tensor initialization:

```cpp
auto options = torch::TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory);
```
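For example, a minimal sketch of the usual pinned-memory pattern (the shape and dtype are placeholder assumptions): allocate a page-locked host tensor once, fill it on the CPU, then issue a non-blocking host-to-device copy. Pinned memory is what allows `non_blocking=true` to actually overlap the transfer with other work.

```cpp
#include <torch/torch.h>

int main() {
  // Pinned (page-locked) host tensor; shape/dtype are just examples.
  auto pinned_opts = torch::TensorOptions()
                         .dtype(torch::kFloat32)
                         .pinned_memory(true);
  torch::Tensor host_input = torch::empty({1, 3, 224, 224}, pinned_opts);

  // Fill the input on the CPU (fast, since this is ordinary host memory).
  host_input.fill_(0.5f);

  // Asynchronous host-to-device copy; only pinned memory makes
  // non_blocking=true a true async transfer.
  torch::Tensor gpu_input = host_input.to(torch::kCUDA, /*non_blocking=*/true);
  return 0;
}
```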

Yes, from_blob should work, but you would have to make sure the actual memory doesn’t go out of scope or get released, as PyTorch will not trigger a copy (you can of course manually trigger a copy if needed).
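As an illustration, a hedged sketch of wrapping a raw `cudaMalloc` buffer; the shape and device index here are assumptions. `from_blob` does not take ownership, so the buffer must stay alive for as long as the tensor views it, and `clone()` gives you an owning copy if needed:

```cpp
#include <torch/torch.h>
#include <cuda_runtime.h>

int main() {
  // Raw CUDA allocation for a 1x3x224x224 float tensor (example shape).
  const int64_t numel = 1 * 3 * 224 * 224;
  float* dev_ptr = nullptr;
  cudaMalloc(&dev_ptr, numel * sizeof(float));

  // Wrap the existing device memory without copying. The tensor is only a
  // view: from_blob does not take ownership of dev_ptr.
  auto opts = torch::TensorOptions()
                  .dtype(torch::kFloat32)
                  .device(torch::kCUDA, 0);
  torch::Tensor t = torch::from_blob(dev_ptr, {1, 3, 224, 224}, opts);

  // If the raw buffer may be freed before the tensor, clone() forces a real
  // copy that the tensor owns.
  torch::Tensor owned = t.clone();

  cudaFree(dev_ptr);  // t must not be used past this point; owned is fine.
  return 0;
}
```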

Two small follow-up questions:

  1. Does creating a tensor on the GPU take significant time, or is it only the data transfer that takes time? I mean, if my function creates a static zero tensor on the GPU, should I expect any speed improvement, given that I will still create the input data in a normal array and copy it into the zero GPU tensor via from_blob on each call to the function? (See the sketch after these questions.)

  2. What is the C++ equivalent of torch.cuda.empty_cache()?
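For what it’s worth, a tentative sketch tying both follow-ups together: the GPU tensor is allocated once and reused via `copy_`, so each call only pays for the host-to-device transfer. The `upload_frame` helper and the shapes are hypothetical, and `c10::cuda::CUDACachingAllocator::emptyCache()` is, as far as I know, the C++ counterpart of `torch.cuda.empty_cache()`, but please verify both against your Libtorch version.

```cpp
#include <torch/torch.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <vector>

// Hypothetical per-call input update: the GPU tensor is allocated once and
// reused, so each call only pays for the host-to-device copy.
void upload_frame(const std::vector<float>& host_data, torch::Tensor& gpu_input) {
  // Wrap the host buffer without copying (from_blob does not own the data).
  torch::Tensor host_view = torch::from_blob(
      const_cast<float*>(host_data.data()),
      gpu_input.sizes(),
      torch::TensorOptions().dtype(torch::kFloat32));

  // copy_ writes into the pre-allocated GPU tensor in place.
  gpu_input.copy_(host_view, /*non_blocking=*/true);
}

int main() {
  auto gpu_input = torch::zeros({1, 3, 224, 224},
                                torch::TensorOptions()
                                    .dtype(torch::kFloat32)
                                    .device(torch::kCUDA));
  std::vector<float> frame(gpu_input.numel(), 0.5f);
  upload_frame(frame, gpu_input);

  // Assumed C++ equivalent of torch.cuda.empty_cache():
  c10::cuda::CUDACachingAllocator::emptyCache();
  return 0;
}
```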