Using pre-allocated CUDA buffers when calling module forward

When the loaded PyTorch module returns a single `Tensor`, this works great. Here is the code:

> // allocate the CUDA output buffer (cudaMalloc takes the address of the pointer)
> float* cuda_output_ptr = nullptr;
> cudaMalloc((void**)&cuda_output_ptr, ....);
> // wrap it in a proper Torch tensor (from_blob does not copy)
> at::Tensor predTensor = torch::from_blob((void*)cuda_output_ptr, c10::IntArrayRef(myTensorShape), tensor_options);
> // call module forward and assign through std::move() on the result variable,
> // so the output is copied into the pre-allocated buffer
> std::move(predTensor) = mModule.forward(std::move(inputs)).toTensor();
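
For reference, my understanding of why the single-tensor version works: at::Tensor has an rvalue-qualified operator= that forwards to copy_(), so assigning through std::move() writes into the existing storage instead of rebinding the variable. A minimal self-contained sketch of just that mechanism (the shape {4} and the options are made up for illustration):

> #include <torch/torch.h>
> #include <cuda_runtime.h>
> #include <iostream>
>
> int main() {
>     auto opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
>     float* buf = nullptr;
>     cudaMalloc((void**)&buf, 4 * sizeof(float));
>     // wrap the raw CUDA buffer without copying
>     at::Tensor dst = torch::from_blob((void*)buf, {4}, opts);
>     at::Tensor src = torch::ones({4}, opts);
>     // rvalue-qualified operator= forwards to copy_(), so the ones are
>     // written into buf; dst still aliases the same memory
>     std::move(dst) = src;
>     std::cout << dst << "\n";  // prints the ones, read back from buf
>     cudaFree(buf);
>     return 0;
> }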

But that does not work for me when the module outputs a `List[Tensor]` / `TensorList`.
Has anyone had success with that? Here is my code:

> c10::List<at::Tensor> predTensorList;
> // wrap the pre-allocated CUDA output buffers, one per expected output
> for (int i = 0; i < xxx; ++i) {
>     at::Tensor predTensor = torch::from_blob((void*)output_ptrs[i],
>         c10::IntArrayRef(pytorchTensorShapeRefPtr), tensor_options);
>     predTensorList.push_back(std::move(predTensor));
> }
> std::move(predTensorList) = mModule.forward(std::move(inputs)).toTensorList();

The result is that the output CUDA buffers stay empty or keep their previous values. I suspect the c10::List move assignment just rebinds the list's internal pointer instead of copying the element tensors into the existing storage, but I have not confirmed that.
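
For now the only fallback I can think of is an explicit element-wise copy after forward(), which gives up part of the zero-copy benefit. A sketch of that fallback, assuming one pre-allocated buffer per output (the loop bound xxx and the pointer names mirror the snippet above):

> // keep plain at::Tensor handles that wrap the pre-allocated CUDA buffers
> std::vector<at::Tensor> preds;
> for (int i = 0; i < xxx; ++i) {
>     preds.push_back(torch::from_blob((void*)output_ptrs[i],
>         c10::IntArrayRef(pytorchTensorShapeRefPtr), tensor_options));
> }
> // run the module, then copy each returned tensor into its buffer
> auto outputs = mModule.forward(std::move(inputs)).toTensorList();
> for (size_t i = 0; i < outputs.size(); ++i) {
>     // copy_ writes into the existing storage, i.e. into the CUDA buffer;
>     // std::move(preds[i]) = outputs.get(i); would do the same
>     preds[i].copy_(outputs.get(i));
> }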