Prevent copying when num_workers>1

Hi, I’m loading my data using a DataLoader with num_workers>1. As far as I know, PyTorch uses multiprocessing, which copies the dataset into each worker process. Is there a way to prevent the tensors in my dataset from being copied? Would Tensor().share_memory_() work?

I wrote some dummy code using mp.Array a while ago here.
Your approach might work as well, but I haven’t tried it yet.
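To make the mp.Array idea concrete, here is a minimal sketch of a Dataset backed by a shared multiprocessing.Array. The class name, sizes, and the random fill are my own illustrative choices, not from the linked code; the key point is that the buffer is allocated once in the parent process, and each worker's torch.from_numpy view shares that buffer instead of copying it.

```python
import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset


class SharedArrayDataset(Dataset):
    """Dataset backed by a multiprocessing.Array so workers share one buffer."""

    def __init__(self, num_samples, num_features):
        # Allocate the raw shared buffer once in the parent process.
        # lock=False is fine here because the workers only read from it.
        shared_base = mp.Array(ctypes.c_float, num_samples * num_features, lock=False)
        # NumPy view onto the shared buffer; no data is copied.
        self.data = np.frombuffer(shared_base, dtype=np.float32).reshape(
            num_samples, num_features
        )
        # Fill with dummy content before any workers are spawned.
        self.data[:] = np.random.randn(num_samples, num_features)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        # torch.from_numpy shares memory with the NumPy view -> no copy here.
        return torch.from_numpy(self.data[index])


dataset = SharedArrayDataset(num_samples=8, num_features=4)
sample = dataset[0]
```

Note that the writes in __init__ happen before the DataLoader forks its workers; if workers needed to write, you would want lock=True and explicit synchronization.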

Thanks, I’ll try it.

Hi @ptrblck thanks for the code sample.

I wanted to ask a related question.

My dataset object does not contain a straightforward tensor that can be read in chunks by the DataLoader workers.

In fact, I have a custom Python object that acts as a search tree; every time a worker calls the dataset's __getitem__ method, the search tree is used to fetch the appropriate data.

In general, how can I ensure that an arbitrary Python object (neither a PyTorch tensor nor a NumPy array) is not copied when I use more than one worker in the DataLoader? I have made sure I never write to the structure while loading data, but I still see memory overhead due to copying.

Thanks for your help!

Since multiple workers use multiple processes, you would have to use some approach to share the data between the processes. The linked mp.Array object would be one way to do it.
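For the tensor case from the original question, a sketch of the share_memory_() approach might look like the following. This is an assumption on my part rather than a confirmed recipe from this thread: share_memory_() moves the tensor's storage into shared memory, so worker processes map the same pages instead of each holding a private copy. It does not help with arbitrary Python objects such as a search tree, which would first need to be serialized into a shared buffer.

```python
import torch
from torch.utils.data import Dataset


class SharedTensorDataset(Dataset):
    """Dataset whose backing tensor lives in shared memory."""

    def __init__(self, data):
        # share_memory_() moves the storage into shared memory in-place,
        # so DataLoader workers read the same pages instead of copies.
        self.data = data.share_memory_()

    def __len__(self):
        return self.data.size(0)

    def __getitem__(self, index):
        # Indexing returns a view into the shared storage; no copy here.
        return self.data[index]


dataset = SharedTensorDataset(torch.randn(1000, 16))
```

Creating the dataset (and calling share_memory_()) before constructing the DataLoader matters: the storage must already be shared when the workers are spawned.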