I would like to train several different models, but I would also like them to share the data input. I also want to train those models in parallel (maybe one model per GPU).
So I assumed DP or DDP might work, but they both synchronize the weights; DDP in particular splits the dataset across GPUs through its sampler and syncs the gradients from each replica.
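For reference, this is roughly the standard DDP recipe I mean (a minimal sketch; the tiny Linear model and random TensorDataset are placeholders, not my actual setup): the DistributedSampler gives each rank a disjoint shard of the data, and DistributedDataParallel all-reduces the gradients so every replica keeps identical weights, which is exactly the coupling I want to avoid.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def ddp_worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # The sampler partitions the dataset: each rank only sees its own shard.
    dataset = TensorDataset(torch.randn(128, 8), torch.randn(128, 1))  # placeholder data
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    # DDP all-reduces gradients, so the replicas stay in sync.
    model = DDP(torch.nn.Linear(8, 1))  # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for x, y in loader:
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(ddp_worker, args=(2,), nprocs=2)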
Thank you for the reply. Yes, that's right, but I would like them to share the data pipeline because of the large memory overhead of the data.
Also, the models share the same graph but have different initial weights. Thank you.
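To make it concrete, what I have in mind per process is roughly this (make_model is just a hypothetical factory standing in for the shared graph definition): the same architecture on every rank, seeded differently so the initial weights differ, and no weight or gradient synchronization between ranks.

import torch


def make_model():
    # Placeholder for the shared model definition (same graph on every rank).
    return torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))


def build_rank_model(rank):
    torch.manual_seed(rank)  # different initial weights per rank
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    # Each rank trains its own copy independently; nothing is synced across ranks.
    return make_model().to(device), device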
import torch
import torch.multiprocessing as mp

ITERS = 10


def worker(qs, rank):
    # Each worker reads from its own queue; the tensors it receives live in
    # shared memory, so no per-process copy of the data is made on the host.
    for it in range(ITERS):
        print(f"rank {rank}, it {it}: {qs[rank].get()}")


def main():
    num_workers = 4
    # One queue per worker so that every process sees every input.
    qs = [mp.JoinableQueue() for _ in range(num_workers)]
    processes = []
    for rank in range(num_workers):
        p = mp.Process(target=worker, args=(qs, rank))
        p.start()
        processes.append(p)
    # Produce each input once, move it into shared memory, and hand the same
    # tensor to all workers.
    for it in range(ITERS):
        inp = torch.full((1,), it).share_memory_()
        for rank in range(num_workers):
            qs[rank].put(inp, block=False)
    for p in processes:
        p.join()


if __name__ == "__main__":
    mp.set_start_method('spawn')
    main()
NOTE: It depends on what kind of overhead you want to avoid. If the overhead is in producing and storing the input data in host memory, this approach helps, because each input is produced once and shared among all processes via shared memory. However, if the overhead lies in the host-to-device copy and/or the GPU memory the input data consumes, it would not help, because you can't physically share memory across different devices.
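For example, if each rank trains on its own GPU, the worker in the script above would be replaced by something like the following (a sketch assuming one GPU per worker process): the .to(device) call is a host-to-device copy performed once per device, and the resulting tensor occupies that device's memory, regardless of the host-side sharing.

def worker(qs, rank):
    # Variant of the worker above (relies on the same script's imports and ITERS):
    # the tensor arrives in shared host memory, but each rank still pays its own
    # H2D copy and its own GPU memory for it.
    device = torch.device(f"cuda:{rank}")
    for it in range(ITERS):
        inp = qs[rank].get()   # shared host-memory tensor (no extra host copy)
        inp = inp.to(device)   # per-device copy; not shared across GPUs
        # ... this rank's own model consumes `inp` here ...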