Hi,

First of all, thanks for the great effort in `torch.distributed`. I found it very useful for my project.

Is there any plan to support `gatherv`, `scatterv`, `igather`, `all_gatherv`, etc.?

We plan to add collectives as they are needed in many projects. What is your need for the `gatherv` and `scatterv` routines?

I am exploring whether PyTorch can be used as a quick way to write portable code involving distributed matrix multiplications.

For example, consider the case where the world size is 4 (`torch.distributed.get_world_size()` returns 4). A 1000 x 10 matrix can then be represented as a 250 x 10 matrix on each rank. I needed `reduce`, `all_gather`, and `all_reduce` for various types of matrix multiplication involving such "thin and tall" matrices. However, if the number of rows is not divisible by the world size, e.g. when the data matrix is 1001 x 10, I need `all_gatherv` in place of `all_gather`, since the per-rank chunks are no longer equal in size.
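To make the use case concrete, here is a minimal sketch of the uneven row partition that motivates `all_gatherv`. The helper `split_rows` is hypothetical, written just for illustration; it is not part of `torch.distributed`. It computes the per-rank row counts that a `gatherv`-style collective would take as its counts argument:

```python
def split_rows(n_rows, world_size):
    """Per-rank row counts when n_rows may not divide evenly.

    These counts are exactly what a gatherv-style collective needs;
    with plain all_gather every rank must contribute the same shape.
    (Hypothetical helper for illustration only.)
    """
    base, rem = divmod(n_rows, world_size)
    # The first `rem` ranks hold one extra row each.
    return [base + (1 if r < rem else 0) for r in range(world_size)]

print(split_rows(1000, 4))  # [250, 250, 250, 250] -- equal chunks, all_gather works
print(split_rows(1001, 4))  # [251, 250, 250, 250] -- unequal, needs all_gatherv
```

A common workaround with plain `all_gather` is to pad every rank's chunk up to the maximum count, gather, and then trim the padding on each receiver, but a native `all_gatherv` would avoid the extra copy and the bookkeeping.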