Fast write of PyTorch CUDA tensors to memory-mapped files on GPU

I saw that it is possible to use CUDA to write to memory-mapped files (reference: https://stackoverflow.com/questions/29518875/cuda-zero-copy-memory-memory-mapped-file).

I am wondering if it is somehow possible in PyTorch to write a CUDA tensor directly to a memory-mapped file stored on the GPU.

The purpose of this is to speed up writing tensors after each training step. Currently,

with torch.no_grad():
    # Copy the embedding weights from GPU to CPU, then scatter them into the memmap.
    numpyMemmap[arrayOfRandomIndexes] = u_embeddings.weight.detach().cpu().numpy()

takes 6 seconds. I think it's because the NumPy memory map is stored in CPU memory. I need something that writes in a fraction of a second, since I will be storing the tensors after every training step and there will be hundreds of thousands of training steps.
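For context, a self-contained version of that write path looks roughly like this (the shapes and file name are made up for illustration):

import numpy as np
import torch

num_rows, dim = 500_000, 128
numpyMemmap = np.memmap("embeddings.dat", dtype=np.float32,
                        mode="w+", shape=(num_rows, dim))
arrayOfRandomIndexes = np.random.permutation(num_rows)
u_embeddings = torch.nn.Embedding(num_rows, dim).cuda()

with torch.no_grad():
    # Device-to-host copy of the weights, then a fancy-indexed scatter
    # into the file-backed array.
    numpyMemmap[arrayOfRandomIndexes] = u_embeddings.weight.detach().cpu().numpy()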

How are you profiling the code?
Make sure to synchronize the code before starting and stopping the timer using torch.cuda.synchronize().
Since CUDA operations are asynchronous, the cpu() call will create a synchronization point, so you might in fact be profiling some other workload that is still executing on the GPU.
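For example, a minimal timing harness along those lines could look like this (the tensor name and shape are made up):

import time
import torch

u_weights = torch.randn(500_000, 128, device="cuda")

torch.cuda.synchronize()   # flush any pending GPU work so it isn't timed
t0 = time.perf_counter()

host_copy = u_weights.detach().cpu().numpy()

torch.cuda.synchronize()   # cpu() already blocks, but this guarantees the copy is done
t1 = time.perf_counter()
print(f"copy took {(t1 - t0) * 1000:.1f} ms")

If the measured time is still several seconds with the synchronization points in place, then the device-to-host copy itself is the bottleneck; otherwise the 6 seconds likely included other queued GPU work.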
