Sync gradients for two tiny MLP models training on the same GPU

How can I efficiently sync the gradients/parameters between two tiny MLP models (3 layers each, with 256 hidden dimensions)? Note that these two MLP models are trained separately by two parallel processes.
I have currently come up with two potential solutions:

  1. Using the torch.distributed library with the gloo backend for CPU-based parameter sync (see the sketch after this list). This seems very slow due to copying tensors back and forth between CPU and GPU.
  2. Using shared global memory for GPU-based parameter sync. However, this seems hard to achieve with the current PyTorch version.
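
For reference, here is a minimal, hypothetical sketch of option 1: two processes share one GPU, and gradients are averaged through a gloo (CPU) process group before each optimizer step. The helper names (`make_mlp`, `sync_grads`, `worker`), port number, and hyperparameters are illustrative, not from the original post.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def make_mlp():
    # 3-layer MLP with 256 hidden units, as described in the question
    return nn.Sequential(
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256),
    )

def sync_grads(model, world_size):
    # Copy each gradient to CPU, average it across processes over gloo,
    # then copy the averaged result back into the GPU gradient tensor.
    for p in model.parameters():
        if p.grad is None:
            continue
        cpu_grad = p.grad.detach().cpu()
        dist.all_reduce(cpu_grad, op=dist.ReduceOp.SUM)
        p.grad.copy_(cpu_grad / world_size)

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    device = torch.device("cuda:0")  # both processes share the same GPU
    model = make_mlp().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 256, device=device)  # dummy data for illustration
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        sync_grads(model, world_size)  # the costly CPU round trip
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```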

@DanielWang Thanks for posting the question. Could you elaborate a bit more on your use case? From the description, you have two MLP models with exactly the same architecture, trained separately by two parallel processes. It looks to me like you could just wrap the MLP model in DDP, which will periodically (full sync or async) sync the model parameters' gradients under the hood. Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.9.1+cu102 documentation
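
For context, here is a minimal sketch of the standard DDP setup this reply refers to, assuming each process can be pinned to its own GPU (the usual DDP arrangement); the function name `run_ddp`, port, and hyperparameters are illustrative only.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_ddp(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}")  # one GPU per process
    model = nn.Sequential(
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256),
    ).to(device)
    ddp_model = DDP(model, device_ids=[rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 256, device=device)  # dummy data for illustration
        loss = ddp_model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients during the backward pass
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run_ddp, args=(2,), nprocs=2)
```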

DDP is usually for models running on two GPUs; for two models sharing the same GPU it probably won't work.

I see, so you are using one GPU but running two trainings in parallel. We can't run the all_reduce collective manually on the GPU since there's only one GPU. If we were on CPU, we could leverage tensor.share_memory_(), but since we are on GPU, that is not an option. So the only option I can think of is the first one you mentioned: periodically move the two models back to CPU, sync the params/grads there, then move them back to the GPU and resume training.
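
As an illustration of that suggestion, here is a hedged sketch of a parameter-averaging helper, assuming both processes have already joined a gloo process group as in the earlier sketch; `sync_params` and `sync_every` are made-up names for illustration, not an established API.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def sync_params(model, world_size):
    # Pull each parameter to CPU, average it across processes over the gloo
    # backend, then copy the averaged value back into the GPU parameter.
    for p in model.parameters():
        cpu_param = p.detach().cpu()
        dist.all_reduce(cpu_param, op=dist.ReduceOp.SUM)
        p.copy_(cpu_param / world_size)

# Inside each process's training loop (assuming the gloo group is initialized
# and `model`, `step`, and `sync_every` are defined by the caller):
#     if step % sync_every == 0:
#         sync_params(model, world_size=2)
```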