Does PyTorch have a tool to move data between GPU and CPU automatically when GPU memory is not enough?

Hi, all,

I’m facing a tedious problem when using PyTorch’s tensor op APIs. I want to use the GPU’s compute power to accelerate my data processing, but my GPU’s memory is too small, so I have to cut my operand tensors into smaller pieces, run the op on each piece, move each partial result to CPU main memory, and, once all the pieces are done, combine them into one final result in CPU memory. This means I have to do the cutting, the moving between GPU and CPU memory, and the combining of results by hand, and I have to repeat this work again and again for every different operator in PyTorch.
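For readers hitting the same problem, the manual workflow described above can be sketched roughly like this (the helper name `chunked_op` and the chunk size are made up for illustration; it falls back to the CPU when no GPU is present):

```python
import torch

def chunked_op(x, op, chunk_size=1024, device=None):
    """Apply `op` to `x` in chunks along dim 0, accumulating the result on the CPU."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    out_chunks = []
    for chunk in torch.split(x, chunk_size, dim=0):
        y = op(chunk.to(device))      # compute this slice on the GPU (if present)
        out_chunks.append(y.cpu())    # move the partial result back to host memory
    return torch.cat(out_chunks, dim=0)  # stitch the pieces together on the CPU

x = torch.randn(5000, 16)
y = chunked_op(x, torch.sigmoid)  # same values as torch.sigmoid(x), computed piecewise
```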

My question is: is there currently any tool that does this data cutting and result combining automatically? Then I could fully utilize my GPU’s compute performance together with my CPU memory’s capacity.

Thanks in advance!


Yes, but be prepared for a potentially large performance penalty due to the additional data movement.

You could offload data to the CPU as described here, or you could try to write a custom allocator with offloading capabilities using torch.cuda.CUDAPluggableAllocator.
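A minimal sketch of the first suggestion, explicit CPU offloading: keep the master copies of the tensors in host memory and stage them onto the GPU only while the op runs, using pinned memory for faster asynchronous copies. The function name `offloaded_matmul` is illustrative, and the code degrades to a plain CPU matmul when CUDA is unavailable:

```python
import torch

def offloaded_matmul(a, b):
    """Compute a @ b on the GPU if one exists; inputs and output live on the CPU."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # pin_memory() gives page-locked host memory, which allows the
        # non_blocking host-to-device copy to overlap with other work
        a = a.pin_memory().to("cuda", non_blocking=True)
        b = b.pin_memory().to("cuda", non_blocking=True)
    out = a @ b
    return out.cpu() if use_cuda else out

a, b = torch.randn(256, 64), torch.randn(64, 32)
c = offloaded_matmul(a, b)  # result tensor is back in CPU memory
```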

@ptrblck Thanks for your reply. What I mean is that I want a unified (or similar) set of tensor operator APIs that run the computation on the GPU while the input and result tensors are stored on the CPU, so that I could use them just as I would on a pure CPU or pure GPU device. In other words, I want the framework to do the data movement between GPU and CPU for me. I’m not talking about model training yet, just the basic PyTorch tensor APIs; currently I do this data movement manually, case by case.

Thanks again.
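To make the request above concrete, the desired "unified API" could in principle be approximated by a wrapper that hides the device transfers. This is only a sketch under simple assumptions: the decorator name `gpu_backed` is hypothetical, it assumes all positional arguments are tensors, and it does not chunk inputs that are too large for GPU memory.

```python
import functools
import torch

def gpu_backed(fn):
    """Hypothetical wrapper: run `fn` on the GPU while the caller only
    ever sees CPU tensors. Does not handle oversized inputs."""
    @functools.wraps(fn)
    def wrapper(*tensors):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        moved = [t.to(device) for t in tensors]  # stage operands on the GPU
        return fn(*moved).cpu()                  # bring the result back to the CPU
    return wrapper

add = gpu_backed(torch.add)
z = add(torch.ones(3), torch.ones(3))  # operands and result all live on the CPU
```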


Both suggestions should work for your use case, where the former (CPU offloading) is more explicit than the latter (custom allocator using unified memory).