HtoD transfer of a partial tensor (or untyped_storage to tensor) without memory allocation

Hi there!

I would like to convert a CuPy function to PyTorch: the goal is to allocate pinned memory once, fill it with data (of variable size, smaller than the allocation), and transfer it to the GPU.

I tried with untyped_storage, but I failed to create a tensor pointing at the data_ptr of a pinned tensor, and view() is not possible because of shape incompatibility.
constraints:
- no more memory allocations than what the CuPy version does
- no use of deprecated functions

Here is the CuPy code:

max_nbytes = ...
h_pinned_mem = cp.cuda.alloc_pinned_memory(max_nbytes) # memory allocation 1 at the initialization (cpu)

# loop, threaded etc, whatever...
variable_buffer_nbytes = ... # from 1 to max_nbytes
...
random_buffer = np.random.bytes(variable_buffer_nbytes) # memory allocation 2 (cpu)
h_pinned_array: np.ndarray = np.frombuffer(h_pinned_mem, dtype=np.uint8, count=variable_buffer_nbytes)
np.copyto(h_pinned_array, random_buffer)

d_cp_tensor = cp.empty((variable_buffer_nbytes,), cp.uint8) # memory allocation 3 (gpu); in my software this is done once at initialization and never reallocated
d_cp_tensor.set(h_pinned_array)
d_torch_tensor: torch.Tensor = from_dlpack(d_cp_tensor.toDlpack())
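As an aside on the "no deprecated functions" constraint: CuPy has deprecated toDlpack(), and since PyTorch 1.10 torch.from_dlpack accepts any object implementing __dlpack__ directly, so the last line can skip the capsule step. A sketch (the GPU part only runs if CuPy and a device are present):

```python
import torch

try:
    import cupy as cp
    have_gpu = cp.cuda.runtime.getDeviceCount() > 0
except ImportError:
    cp = None
    have_gpu = False

if have_gpu:
    d_cp_tensor = cp.arange(8, dtype=cp.uint8)
    # torch.from_dlpack consumes __dlpack__ directly; no deprecated toDlpack().
    # This is zero-copy: the torch tensor aliases the CuPy buffer on the GPU.
    d_torch_tensor = torch.from_dlpack(d_cp_tensor)
    assert d_torch_tensor.data_ptr() == d_cp_tensor.data.ptr
```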

And here is my attempt in PyTorch:

h_pinned_tensor = torch.empty(
    max_nbytes, dtype=torch.uint8, device='cpu', requires_grad=False, pin_memory=True
)
# fill  h_pinned_tensor with random_buffer, size=variable_buffer_nbytes

untyped_storage: UntypedStorage = h_pinned_tensor.untyped_storage()
partial_untyped_storage = untyped_storage[0:variable_buffer_nbytes]

# Create a tensor pointing to partial_untyped_storage
# complete this
h_partial_pinned_tensor = .... # <- must not allocate memory, just point to untyped_storage and size must be variable_buffer_nbytes
#

d_torch_tensor: torch.Tensor = h_partial_pinned_tensor.to("cuda")
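For what it's worth, one candidate completion I'm considering is plain slicing, since basic indexing in PyTorch returns a view sharing the same storage (so no allocation on the host side). A sketch, not verified against all my constraints; pinning is skipped when no CUDA runtime is available:

```python
import torch

max_nbytes = 1 << 20
use_cuda = torch.cuda.is_available()

# Pinning requires a CUDA runtime; fall back to pageable memory otherwise.
h_pinned_tensor = torch.empty(max_nbytes, dtype=torch.uint8, pin_memory=use_cuda)

variable_buffer_nbytes = 12345
random_buffer = torch.randint(0, 256, (variable_buffer_nbytes,), dtype=torch.uint8)

# A slice is a view: it shares the same UntypedStorage, copies nothing,
# and its size is exactly variable_buffer_nbytes.
h_partial_pinned_tensor = h_pinned_tensor[:variable_buffer_nbytes]
h_partial_pinned_tensor.copy_(random_buffer)

assert h_partial_pinned_tensor.data_ptr() == h_pinned_tensor.data_ptr()

if use_cuda:
    # Mirror the CuPy version: allocate the GPU buffer once, then reuse it
    # with copy_ instead of .to("cuda"), which would allocate on every call.
    d_tensor = torch.empty(max_nbytes, dtype=torch.uint8, device="cuda")
    d_tensor[:variable_buffer_nbytes].copy_(h_partial_pinned_tensor, non_blocking=True)
```

The copy_-into-a-preallocated-buffer part is what keeps the GPU allocation count at one, matching the CuPy d_cp_tensor.set(...) pattern.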

=> Is this possible, or should I stick with CuPy?

PS: no need to ask Claude or ChatGPT, they get this wrong.
I don't need any advice about memory overflow checks, CUDA streams, CUDA synchronisation, etc. :wink: