As far as I know, DDP (DistributedDataParallel) only works when all parameters of the given module participate in computing the loss.
In the case of nn.Embedding, some parameters of the module may not be used in the forward pass, since only the rows corresponding to the indices in the batch are looked up.
However, Transformers such as BERT work well with DDP.
I don't understand how nn.Embedding can work with DDP.
Even if not all indices of the embedding weight matrix are used, the parameter itself would still be used and would thus get a valid gradient (zeros for the unused indices), so DDP shouldn’t complain about it.
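A minimal single-process sketch (no actual DDP involved, just autograd) illustrating the point above: even when only some embedding rows are looked up, `emb.weight.grad` is a full dense tensor of the weight's shape, with zeros in the rows for unused indices, so DDP always has a complete gradient to all-reduce.

```python
import torch
import torch.nn as nn

# Sketch: nn.Embedding's weight receives one gradient tensor covering
# every row; rows whose indices never appeared in the batch are zero.
emb = nn.Embedding(num_embeddings=5, embedding_dim=3)

# Forward pass uses only indices 0 and 2; rows 1, 3, 4 are never looked up.
idx = torch.tensor([0, 2])
loss = emb(idx).sum()
loss.backward()

grad = emb.weight.grad
print(grad.shape)                      # full shape: torch.Size([5, 3])
print(torch.all(grad[1] == 0).item())  # unused row -> zero gradient
print(torch.all(grad[0] == 1).item())  # used row -> d(sum)/dw = 1 per element
```

Note this assumes the default `sparse=False`; with `sparse=True` the gradient is a sparse tensor instead, which DDP does not support out of the box.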