Prevent copying when num_workers>1

Hi, I’m loading my data using a DataLoader with num_workers>1. As far as I know, PyTorch uses multiprocessing for the workers, so each worker process ends up with its own copy of the dataset. Is there a way to prevent the tensors in my dataset from being copied? Would Tensor.share_memory_() work?
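To make it concrete, this is roughly what I have in mind (untested sketch, the shapes and batch size are just placeholders):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Move the dataset tensors into shared memory once in the main process,
# hoping the workers then map the same pages instead of copying them.
data = torch.randn(10000, 128)
targets = torch.randint(0, 10, (10000,))
data.share_memory_()       # in-place: moves the underlying storage to shared memory
targets.share_memory_()

dataset = TensorDataset(data, targets)
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch, labels in loader:
    pass  # training step would go here
```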
Thanks

I wrote some dummy code a while ago using mp.Array here.
Your approach might work as well, but I haven’t tried it yet.
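The gist of the linked code is roughly this (a minimal sketch in the same spirit, not the exact snippet; the sizes and dtype are placeholders):

```python
import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class SharedArrayDataset(Dataset):
    """Keeps the samples in an mp.Array so DataLoader workers read from
    the same shared buffer instead of holding their own copy."""

    def __init__(self, num_samples=1000, num_features=10):
        # lock=False returns a raw ctypes array; we only read from it in the workers.
        shared_base = mp.Array(ctypes.c_float, num_samples * num_features, lock=False)
        self.data = np.frombuffer(shared_base, dtype=np.float32).reshape(num_samples, num_features)
        self.data[:] = np.random.randn(num_samples, num_features)  # fill once in the main process

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        return torch.from_numpy(self.data[index])


# Assumes the default fork start method (Linux), where the workers
# inherit the shared buffer from the main process.
loader = DataLoader(SharedArrayDataset(), batch_size=32, num_workers=2)
for batch in loader:
    pass
```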

Thanks, I’ll try it.

Hi @ptrblck thanks for the code sample.

I wanted to ask a related question.

My dataset object does not contain a straightforward tensor that is read in chunks by the DataLoader workers.

Instead, I have a custom Python object that acts as a search tree, and every time a worker calls the dataset’s __getitem__ method, the search tree is queried to fetch the appropriate data.

In general, how can I ensure that an arbitrary Python object (neither a PyTorch tensor nor a NumPy array) is not copied when I use more than one worker in the DataLoader? I make sure never to write to the structure while loading data, but I still see memory overhead from copying.

Thanks for your help!

Since multiple workers run in separate processes, you would have to use some approach to share the data between processes. The linked mp.Array would be one way to do it.
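Applied to your search tree setup, a rough sketch could look like this. DummyTree and its lookup method are just placeholders for your real structure, and the assumption is that the tree itself is small while the bulky sample data can live in shared memory (again relying on the default fork start method so the workers inherit the buffer):

```python
import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class DummyTree:
    """Placeholder for the real search tree: maps a dataset index to a row."""
    def __init__(self, num_samples):
        self.mapping = {i: i for i in range(num_samples)}

    def lookup(self, index):
        return self.mapping[index]

    def __len__(self):
        return len(self.mapping)


class TreeBackedDataset(Dataset):
    def __init__(self, num_samples=1000, num_features=16):
        # If the tree is small, duplicating it per worker is cheap;
        # the bulk of the memory goes into the shared payload below.
        self.tree = DummyTree(num_samples)
        buf = mp.Array(ctypes.c_float, num_samples * num_features, lock=False)
        self.payload = np.frombuffer(buf, dtype=np.float32).reshape(num_samples, num_features)
        self.payload[:] = np.random.randn(num_samples, num_features)

    def __len__(self):
        return len(self.tree)

    def __getitem__(self, index):
        row = self.tree.lookup(index)              # query the tree for the row to load
        return torch.from_numpy(self.payload[row])


loader = DataLoader(TreeBackedDataset(), batch_size=32, num_workers=2)
for batch in loader:
    pass
```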