Combine Tensor Parallelism and Data Parallelism


I have a question that I am very much hoping is an easy answer, although I myself cannot find it documented anywhere to save my life. Thank you in advance for any help.

I need to combine the capabilities of tensor parallelism and data parallelism in order to train a model with a very large layer in a reasonable amount of time. I would have a cluster of PyTorch workers/nodes, each with multiple GPUs available, onto which I would distribute partitions of the training set; within each worker, the model parameters would then be sharded across the multiple GPUs before fitting.
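To make the intended layout concrete, here is a small hypothetical sketch (names and numbers are illustrative, not from any library): ranks are laid out on a (nodes × GPUs-per-node) grid, with training-data partitions spread along the outer axis and parameter shards along the inner axis.

```python
def rank_to_coords(rank: int, gpus_per_node: int) -> tuple[int, int]:
    # Outer index = data-parallel replica (which node / data partition),
    # inner index = tensor-parallel shard (which GPU within the node).
    return rank // gpus_per_node, rank % gpus_per_node

# e.g. 2 nodes x 4 GPUs each: global rank 5 lands on replica 1, shard 1
print(rank_to_coords(5, 4))  # -> (1, 1)
```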

I cannot find an example, or even confirmation, that this is a supported pattern. If anyone from the dev team could let me know whether this is something I completely missed, or perhaps something planned soon, I would very much appreciate it. Thank you.

You could check Megatron-LM, which implements several parallelization strategies (tensor, pipeline, and data parallelism) and serves as the basis for many implementations from other teams.