How to share parameters between different GPUs and different models?

My program has two different models that share their word embedding parameters, while all other parameters are independent. I want the word embedding parameters to be updated synchronously by both models.
I have to use multiple GPUs for training because the training dataset is large.
I use multiprocessing and torch.distributed to make use of the GPUs; each model is trained on two GPUs.

However, I don't know how to implement the shared word embedding between the two models, since they are distributed across different processes and different GPUs.
How can I guarantee that the word embedding parameters are updated synchronously?
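In a single process the sharing itself would be easy, since both models could simply hold the same nn.Embedding instance (a toy sketch, not my real models):

import torch

shared_emb = torch.nn.Embedding(1000, 64)   # toy vocabulary size / dimension
model_0 = torch.nn.Sequential(shared_emb, torch.nn.Linear(64, 10))
model_1 = torch.nn.Sequential(shared_emb, torch.nn.Linear(64, 5))

# Both models point to exactly the same weight tensor.
assert model_0[0].weight is model_1[0].weight

With the two models running in separate processes, however, they cannot hold the same object, and that is where I'm stuck.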

I wrote some draft code below; it runs successfully.

import torch


def run_0(args):
    # TCP initialization for model-0
    torch.distributed.init_process_group(
            backend=args.distributed_backend,
            init_method=args.distributed_init_method_run_0,
            world_size=args.distributed_world_size,
            rank=args.distributed_rank)
    main_0(args)
    ...


def run_1(args):
    # TCP initialization for model-1
    torch.distributed.init_process_group(
            backend=args.distributed_backend,
            init_method=args.distributed_init_method_run_1,
            world_size=args.distributed_world_size,
            rank=args.distributed_rank)
    main_1(args)
    ...


if __name__ == '__main__':  # guard is required with the 'spawn' start method
    # args comes from the surrounding training script (e.g. argparse)
    args.distributed_world_size = 2  # each model is trained on two GPUs
    mp = torch.multiprocessing.get_context('spawn')
    args.distributed_init_method_run_0 = 'tcp://localhost:00000'  # local address for model-0
    args.distributed_init_method_run_1 = 'tcp://localhost:11111'  # local address for model-1

    procs = []
    for i in range(args.distributed_world_size):
        # assign two processes and two GPUs to model-0
        args.distributed_rank = i
        args.device_id = i
        procs.append(mp.Process(target=run_0, args=(args, ), daemon=True))
        procs[i].start()

    for i in range(args.distributed_world_size):
        # assign two processes and two GPUs to model-1
        args.distributed_rank = i
        args.device_id = i + args.distributed_world_size
        procs.append(mp.Process(target=run_1, args=(args, ), daemon=True))
        procs[i + args.distributed_world_size].start()

    for p in procs:
        p.join()

What would be your approach to updating the embedding layer synchronously across the two models?
Would you reuse the same layer in both models, i.e. sum the gradients of both models for the embedding layer?
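In a single process that could look like this minimal sketch (toy models and sizes, just to illustrate the idea): both models hold the same nn.Embedding instance, so both backward passes accumulate into the same .grad and a single optimizer step updates the embedding for both models.

import torch

shared_emb = torch.nn.Embedding(1000, 64)                            # toy sizes
model_0 = torch.nn.Sequential(shared_emb, torch.nn.Linear(64, 10))
model_1 = torch.nn.Sequential(shared_emb, torch.nn.Linear(64, 5))

# Register the shared embedding only once with the optimizer.
optimizer = torch.optim.SGD(
    list(model_0.parameters()) + list(model_1[1].parameters()), lr=0.1)

tokens = torch.randint(0, 1000, (8,))
loss = model_0(tokens).sum() + model_1(tokens).sum()                 # placeholder losses

optimizer.zero_grad()
loss.backward()    # shared_emb.weight.grad now holds the sum of both models' gradients
optimizer.step()   # one update, seen by both models

The remaining question would then be how to achieve the same effect when the two models live in different processes.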

Right, I would like to reuse the same layer in both models.
However, the two models are distributed across different processes and different GPUs.

What should I do?
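One possible direction (a sketch under a few assumptions, not code from this thread): instead of initializing two separate worlds, put all four GPUs into one global process group, create one subgroup per model for DDP, broadcast the embedding weights once so every rank starts from the same values, and after each backward all-reduce the embedding gradients over the global group before the optimizer step. The models, sizes, loss and the address/port below are placeholders.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size=4):
    # One global world covering all four GPUs (placeholder address/port).
    dist.init_process_group(backend='nccl',
                            init_method='tcp://localhost:23456',
                            world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)

    # Every rank has to create both subgroups, even the one it does not belong to.
    group_model_0 = dist.new_group(ranks=[0, 1])
    group_model_1 = dist.new_group(ranks=[2, 3])
    my_group = group_model_0 if rank < 2 else group_model_1

    # Placeholder models: a shared embedding plus a model-specific head.
    embedding = torch.nn.Embedding(1000, 64).cuda()
    head = torch.nn.Linear(64, 10) if rank < 2 else torch.nn.Linear(64, 5)
    model = torch.nn.Sequential(embedding, head.cuda())

    # Start all four ranks from identical embedding weights.
    dist.broadcast(embedding.weight.data, src=0)

    # DDP only synchronizes gradients inside each model's subgroup.
    model = DDP(model, device_ids=[rank], process_group=my_group)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):
        tokens = torch.randint(0, 1000, (8,), device='cuda')
        loss = model(tokens).sum()          # placeholder loss
        optimizer.zero_grad()
        loss.backward()                     # DDP averages grads inside the subgroup

        # Sum the embedding gradients over *all* ranks and average them, so every
        # rank applies exactly the same update to the shared embedding.
        dist.all_reduce(embedding.weight.grad, op=dist.ReduceOp.SUM)
        embedding.weight.grad.div_(world_size)

        optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    torch.multiprocessing.spawn(run, nprocs=4)

Within each subgroup DDP averages the gradients as usual; the extra all-reduce makes the embedding gradient, and therefore the embedding update, identical on all four ranks, so the shared embedding stays in sync across both models.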