How can I share a dataset among processes?

coincheung · May 9, 2022, 1:51pm

Hi,

I use distributed training mode to train my model.

My dataset is not large, so I can load it into memory and avoid reading the images from disk and decoding them every time.

I found this example: pytorch_misc/shared_array.py at master · ptrblck/pytorch_misc · GitHub

It is inspiring, but it only supports single-GPU training rather than DDP. Besides, it only supports the scenario where all input images have identical sizes. What if my dataset contains images of various sizes?

ptrblck · May 17, 2022, 6:32pm

Maybe check the shared_dict example and add the tensors to a dict, which would support various shapes.
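
For illustration, here is a minimal sketch of the dict idea. It is not a copy of the linked example. `CachedImageDataset`, the `sizes` list, and `_load` are hypothetical stand-ins for a real dataset and its decoding step, and the cache is assumed to be a `multiprocessing.Manager().dict()` so that all DataLoader workers of one process see the same entries. Because the values are keyed by index, each tensor can have its own shape.

```python
from multiprocessing import Manager

import torch
from torch.utils.data import DataLoader, Dataset


class CachedImageDataset(Dataset):
    """Caches decoded samples in a dict-like object keyed by sample index."""

    def __init__(self, cache, sizes):
        self.cache = cache  # e.g. Manager().dict(), visible to all workers
        self.sizes = sizes  # hypothetical per-sample (height, width)

    def __len__(self):
        return len(self.sizes)

    def _load(self, index):
        # Stand-in for reading and decoding an image from disk.
        height, width = self.sizes[index]
        return torch.full((3, height, width), float(index))

    def __getitem__(self, index):
        # Decode only on the first access; later accesses hit the cache.
        if index not in self.cache:
            self.cache[index] = self._load(index)
        return self.cache[index]


def main():
    manager = Manager()  # keep a reference so the manager process stays alive
    dataset = CachedImageDataset(manager.dict(), [(4, 6), (8, 8), (5, 3)])
    # batch_size=None disables automatic batching, so samples of
    # different shapes are returned one at a time without collation.
    loader = DataLoader(dataset, batch_size=None, num_workers=2)
    for epoch in range(2):
        for image in loader:
            print(epoch, tuple(image.shape))


if __name__ == "__main__":
    main()
```

With batching enabled you would need a custom `collate_fn` (or padding or resizing) to combine samples of different shapes. This sketch only covers the workers of a single training process.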

coincheung · May 18, 2022, 12:00am

Hi,

Thanks for replying! Does this support sharing among different GPUs? I mean, I am not only using multiple workers for the dataloader but also using distributed training mode. Can I simply keep a single copy in memory for all processes, both the dataloader workers and the per-GPU processes?

ptrblck · May 18, 2022, 12:16am

I don’t know, as I haven’t tried this use case, but usually you would use a DistributedSampler in a DDP setting, which makes sure that each process loads only the chunk of the data used on its own device.
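
As a rough sketch of what the sampler does, the snippet below splits a toy dataset of 10 items between two ranks. The dataset, `world_size = 2`, and the helper `indices_for_rank` are assumptions made for illustration. `num_replicas` and `rank` are passed explicitly here, so no process group has to be initialized. In a real DDP run they would normally be taken from the process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # toy stand-in for the real dataset
world_size = 2  # assumed number of DDP processes


def indices_for_rank(rank):
    # shuffle=False keeps the split deterministic for this illustration.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    return list(sampler)


for rank in range(world_size):
    print(rank, indices_for_rank(rank))
# 0 [0, 2, 4, 6, 8]
# 1 [1, 3, 5, 7, 9]

# Inside one training process, the sampler is handed to the DataLoader.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=0)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)
for epoch in range(2):
    sampler.set_epoch(epoch)  # gives a different shuffle each epoch
    for (batch,) in loader:
        pass  # the training step would go here
```

Each rank only ever requests its own indices, so with a lazily filled cache each process would end up holding roughly its own share of the data as long as the split stays fixed (`shuffle=False`). With shuffling, the assignment changes every epoch.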