Adding Distributed Model Parallelism to PyTorch

Thank you, Dylan, for the response. Yes, I've seen the two questions.

I've been quiet on this thread mainly because I've been trying to dig into the PyTorch code. If I understood correctly, what you mentioned at the beginning is about splitting the layers across multiple machines. What I was thinking of was splitting each layer across multiple machines. The reason is that if you split across layers, the machine handling layer i would be idle until layer i-1 is finished anyway, so there's not much gain from parallelizing the model. On the other hand, if we split a layer across multiple machines, we can exploit the parallelism better. Then, once all the computations of a layer are done, we can sync (allgather) the outputs before starting the next layer, as in the sketch below.
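To make the idea concrete, here is a minimal sketch of that intra-layer split, assuming a torch.distributed process group has already been initialized on each machine. The class name ColumnParallelLinear is just illustrative, not an existing PyTorch API: each rank holds a shard of the layer's output features, computes its partial result, and the shards are allgathered before the next layer runs.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ColumnParallelLinear(nn.Module):
    """Linear layer whose output features are sharded across ranks (illustrative)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Each machine only stores and computes its slice of the layer.
        self.local = nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        # Every rank computes its partial output in parallel.
        local_out = self.local(x)
        # Sync (allgather) the shards so the next layer sees the full output.
        shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_out)
        return torch.cat(shards, dim=-1)
```

This is only meant to show the forward-pass communication pattern; a real implementation would also have to route gradients back through the allgather in the backward pass, which the sketch above doesn't handle.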

I am new to PyTorch internals, so I would appreciate any help figuring out the code. I was looking at this post ("The pytorch blog 'A Tour of PyTorch Internals' is out-of-date. How to know more about the pytorch internal"). Is there anything else you can recommend?