DP or DDP only for parallel computing?

I would like to have different models, but I would also like them to share the data input.
Also, I want to train those models in parallel (ideally one model per GPU).

So I assume DP or DDP may work, but they both synchronize the weights, and DDP in particular splits the dataset across GPUs through the sampler and syncs the gradients from each model replica.

How can I do this?
Thanks in advance.

In that case, you can just kick off separate jobs on different GPUs. They don’t have to run together in a synchronized way.
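
For example, a minimal launcher along these lines (train.py and its --seed flag are hypothetical stand-ins for your own training script):

import os
import subprocess

# Launch one fully independent training job per GPU by restricting
# each child process to a single device via CUDA_VISIBLE_DEVICES.
procs = []
for gpu in range(4):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["python", "train.py", "--seed", str(gpu)], env=env))

for p in procs:
    p.wait()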

Thank you for the reply. Yes, that’s right, but I would like the jobs to share the data pipeline because of the large memory overhead of the input data.
Also, those models share the same graph but start from different initial weights. Thank you.

To share data loading, you can do something like this:

import torch
import torch.multiprocessing as mp

ITERS = 10

def worker(qs, rank):
    # Each worker consumes only from its own queue.
    for it in range(ITERS):
        print(f"rank {rank}, it {it}: {qs[rank].get()}")

def main():
    num_workers = 4
    # One queue per worker process.
    qs = [mp.JoinableQueue() for _ in range(num_workers)]
    processes = []

    for rank in range(num_workers):
        p = mp.Process(target=worker, args=(qs, rank))
        p.start()
        processes.append(p)

    for it in range(ITERS):
        # Produce the batch once, move it to shared memory, and
        # enqueue a handle to the same tensor for every worker.
        inp = torch.full((1,), it).share_memory_()
        for rank in range(num_workers):
            qs[rank].put(inp, block=False)

    for p in processes:
        p.join()

if __name__ == "__main__":
    mp.set_start_method('spawn')
    main()

NOTE: It depends on what kind of overhead you want to avoid. If the overhead is in producing and storing the input data in host memory, this approach helps, because the data is produced once and shared among all processes via shared memory. However, if the overhead lies in the host-to-device copy and/or the input data’s GPU memory consumption, it won’t help, because you can’t physically share memory across different devices.
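
For the former case, here is a minimal sketch of how each worker could consume the shared host tensor and train its own model on its own GPU. The Linear model and SGD settings are placeholders, it assumes one visible GPU per rank, and note that each rank still pays its own host-to-device copy:

import torch
import torch.multiprocessing as mp

ITERS = 10

def worker(q, rank):
    # Each rank owns one GPU and trains its own independent model.
    device = torch.device(f"cuda:{rank}")
    # Placeholder model: same architecture on every rank, but the
    # default random init gives each rank different starting weights.
    model = torch.nn.Linear(1, 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(ITERS):
        # The tensor lives in host shared memory; each rank still
        # does its own host-to-device copy here.
        inp = q.get().float().to(device)
        loss = model(inp).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

def main():
    num_workers = torch.cuda.device_count()
    qs = [mp.JoinableQueue() for _ in range(num_workers)]
    processes = [mp.Process(target=worker, args=(qs[r], r))
                 for r in range(num_workers)]
    for p in processes:
        p.start()

    for it in range(ITERS):
        # Produce once, share via host shared memory, enqueue to all ranks.
        inp = torch.full((1,), it).share_memory_()
        for q in qs:
            q.put(inp)

    for p in processes:
        p.join()

if __name__ == "__main__":
    mp.set_start_method('spawn')
    main()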