Hi everyone, what is the best practice to share a massive CPU tensor across multiple processes (read-only + single machine + DDP)?
I think `torch.Storage` (see the PyTorch 1.9.0 documentation) with `share=True` suits my needs, but I can't find a way to save the storage and read it back as a tensor (the same issue has been open since Oct 29, 2019).
I also tried copying the training data to /dev/shm (reference) and running DDP with 8 GPUs, but nothing changed: the memory usage with 8 GPUs is the same as before, even though in a single-process test loading the dataset occupies about 1 GB of memory. Am I missing something here?
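For the "save storage and read it back as a tensor" part, one option is to write the data out once as a raw binary file (e.g. under /dev/shm) and memory-map it in every process with `torch.from_file(..., shared=True)`, so all processes share a single set of pages. A minimal sketch, where `/dev/shm/dataset.bin` is just a placeholder path for illustration:

```python
import torch

# Write the dataset once as a raw binary file; "/dev/shm/dataset.bin"
# is a placeholder path for this sketch.
data = torch.randn(1000, dtype=torch.float32)
path = "/dev/shm/dataset.bin"
data.numpy().tofile(path)

# Each DDP process can then mmap the same file instead of loading its
# own copy: shared=True maps the file, so the pages are shared.
shared = torch.from_file(path, shared=True, size=data.numel(),
                         dtype=torch.float32)
assert torch.equal(shared, data)
```

Since the mapping is backed by one file, the ~1 GB shows up once in physical memory no matter how many DDP workers map it.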
Thanks for posting the question @siahuat0727! Did you try `torch.multiprocessing.Queue` to pass the tensor objects between processes? You can take a look at the `torch.multiprocessing` docs and see if this works for you.
You can also use `tensor.share_memory_()` for sharing a big tensor across processes.
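`Tensor.share_memory_()` moves the tensor's storage into shared memory in place, after which any process you hand the tensor to reads the same buffer. A minimal sketch (again using the `fork` start method on Linux so the forked workers simply inherit the tensor):

```python
import torch
import torch.multiprocessing as mp

def use(data):
    # each worker reads the same underlying buffer; no per-process copy
    assert data.is_shared()
    _ = data.sum()

ctx = mp.get_context("fork")   # guard-free sketch on Linux
data = torch.randn(10_000)     # stand-in for a large CPU dataset
data.share_memory_()           # moves the storage into shared memory in place
workers = [ctx.Process(target=use, args=(data,)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Note the trailing underscore: `share_memory_()` is an in-place operation, and calling it before spawning workers is what prevents each worker from holding its own ~1 GB copy.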
Hi @wanchaol, thank you very much for your reply!
`tensor.share_memory_()` should work well in a pure multiprocessing program.
I will look into how to pass the reference in pytorch-lightning DDP mode.
Finally, I found a way to make it work. For details, see here.