Thank you, Dylan, for the response. Yes, I've seen the two questions.
I've been quiet on this thread mainly because I've been trying to dig into the PyTorch code. If I understood correctly, what you mentioned at the beginning is about splitting the layers across multiple machines. What I was thinking of is splitting each layer itself across multiple machines. The reason is that if you split across layers, the machine handling layer i sits idle until layer i-1 is finished, so there isn't much gain from parallelizing the model. On the other hand, if we split a layer across multiple machines, we can exploit the parallelism better. Then, once all the computations of a layer are done, we can sync (allgather) the outputs before starting the next layer. There's a rough sketch of what I mean below.
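To make that concrete, here is a minimal sketch of the kind of intra-layer split I have in mind, assuming torch.distributed is already initialized (e.g. via torchrun) and each process holds one shard of a layer's weight. `ShardedLinear` is just a name I made up for illustration, not a PyTorch API:

```python
import torch
import torch.distributed as dist

class ShardedLinear(torch.nn.Module):
    """Each rank stores only a slice of the output features (column sharding)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.local_out = out_features // world_size
        # This rank's shard of the full (out_features, in_features) weight.
        self.weight = torch.nn.Parameter(
            torch.randn(self.local_out, in_features) * 0.01
        )

    def forward(self, x):
        # Compute the partial output corresponding to this rank's shard.
        local_y = torch.nn.functional.linear(x, self.weight)
        # Sync point: all-gather the shards so every rank has the full
        # activation before the next layer starts.
        gathered = [torch.empty_like(local_y) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_y)
        return torch.cat(gathered, dim=-1)
```

One caveat I'm aware of: plain `dist.all_gather` doesn't propagate gradients through the gathered tensors, so for training I'd presumably need the autograd-aware collectives (e.g. `torch.distributed.nn.functional.all_gather`) or something along those lines. That's part of what I'm hoping to understand from the internals.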
I am new to PyTorch internals, so I would appreciate any help in figuring out the code. I was looking at this post (The pytorch blog "A Tour of PyTorch Internals" is out-of-date. How to know more about the pytorch internal). Is there anything else you can recommend?