How can I efficiently sync the gradients/parameters between two tiny MLP models (3 layers each, with a hidden dimension of 256)? Note that the two MLPs are trained separately by two parallel processes.
I have currently come up with two potential solutions:
- Using the `torch.distributed` library with the `gloo` backend for CPU-based parameter sync. However, this seems to be very slow because tensors are copied back and forth between CPU and GPU (see the first sketch below).
- Using shared GPU global memory for GPU-based parameter sync. However, this seems hard to achieve with the current PyTorch version (see the second sketch below).
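
For reference, here is a minimal sketch of what I mean by the first option. The helper names (`build_mlp`, `sync_params`) and the `init_method` address are just placeholders, not my actual training code:

```python
# Option 1 sketch: two processes, each training its own MLP, averaging
# parameters through the gloo backend (CPU tensors only).
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def build_mlp(dim: int = 256) -> nn.Module:
    # 3-layer MLP with 256 hidden units, as described above
    return nn.Sequential(
        nn.Linear(dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, dim),
    )

@torch.no_grad()
def sync_params(model: nn.Module, world_size: int) -> None:
    # gloo reduces CPU tensors, so each GPU parameter is copied to CPU,
    # summed across processes, averaged, and copied back to the GPU --
    # this round trip is the part that seems slow.
    for p in model.parameters():
        buf = p.detach().cpu()
        dist.all_reduce(buf, op=dist.ReduceOp.SUM)
        buf /= world_size
        p.copy_(buf.to(p.device))

def worker(rank: int, world_size: int = 2) -> None:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",  # placeholder address
        rank=rank,
        world_size=world_size,
    )
    if torch.cuda.is_available():
        device = torch.device("cuda", rank % torch.cuda.device_count())
    else:
        device = torch.device("cpu")
    model = build_mlp().to(device)
    # ... one or more local training steps would go here ...
    sync_params(model, world_size)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)
```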
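
And a rough sketch of what I have in mind for the second option: a flat CUDA buffer created in the parent process and mapped into both workers via `torch.multiprocessing` (CUDA IPC). The buffer size and the barrier-based coordination are only illustrative; the coordination around reads and writes is exactly the part that seems hard to get right:

```python
# Option 2 sketch: one shared CUDA buffer visible to both worker processes.
import torch
import torch.multiprocessing as mp

def worker(rank: int, shared_flat: torch.Tensor, barrier) -> None:
    # In the real setup this would be the flattened model parameters;
    # a constant vector stands in for rank-specific parameters here.
    local = torch.full_like(shared_flat, float(rank + 1))
    if rank == 0:
        shared_flat.copy_(local)   # rank 0 publishes its parameters
        torch.cuda.synchronize()   # make sure the write has landed
    barrier.wait()                 # rank 1 must not read before rank 0 wrote
    if rank == 1:
        print("rank 1 sees:", shared_flat[:3])

if __name__ == "__main__":
    # Roughly the parameter count of a 3-layer, 256-hidden-unit MLP
    shared_flat = torch.zeros(200_000, device="cuda:0")
    barrier = mp.get_context("spawn").Barrier(2)
    mp.spawn(worker, args=(shared_flat, barrier), nprocs=2)
```

Is something along these lines the right direction, or is there a better-supported way to do this GPU-side?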