For a DataParallel model, where is the reduce done during loss.backward()?

BruceDai003 · July 12, 2021, 12:45pm

As documented in DataParallel (torch/nn/parallel/data_parallel.py), during the backward pass, gradients from each replica are summed into the original module. So AFAIK, this is a reduce op, right?
Now I’m trying to debug the C++ source code behind loss.backward(), but I can’t find the part that performs this “reduce” operation. Could you please tell me where it is done?

rvarm1 · July 12, 2021, 8:50pm

DataParallel works via scatter/gather, so the gathering of the scattered gradients is implemented in the backward functions of these operators, which are defined in Python here: pytorch/_functions.py at master · pytorch/pytorch · GitHub
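
To make the idea concrete, here is a minimal CPU-only sketch of how a "reduce" can live in the backward of a custom autograd function. `ToyBroadcast` is a hypothetical name made up for this illustration and is not PyTorch's actual implementation; as far as I can tell, the real counterparts in that file are `Broadcast` (whose backward calls `ReduceAddCoalesced`), and they operate across GPUs rather than on CPU clones. The forward pass copies one parameter to several replicas, and the backward pass sums the per-replica gradients back into a single gradient.

```python
import torch
from torch.autograd import Function


class ToyBroadcast(Function):
    """Hypothetical stand-in for a broadcast op: copy one tensor to N replicas."""

    @staticmethod
    def forward(ctx, num_replicas, param):
        # Each replica gets its own copy of the parameter.
        return tuple(param.clone() for _ in range(num_replicas))

    @staticmethod
    def backward(ctx, *replica_grads):
        # The "reduce": sum the gradients coming back from every replica.
        total = torch.stack(replica_grads).sum(dim=0)
        # No gradient for the integer argument, one gradient for param.
        return None, total


# One shared parameter, two per-replica inputs (assumed toy values).
w = torch.tensor([1.0, 2.0], requires_grad=True)
inputs = [torch.tensor([1.0, 1.0]), torch.tensor([2.0, 3.0])]

replicas = ToyBroadcast.apply(2, w)
outputs = [(r * x).sum() for r, x in zip(replicas, inputs)]
loss = torch.stack(outputs).sum()
loss.backward()

# Each replica contributes its own input as the gradient; backward sums them.
print(w.grad)  # tensor([3., 4.])
```

Because the summation happens inside an autograd `backward` defined in Python, stepping through the C++ engine for `loss.backward()` will not show a dedicated reduce step; the engine simply invokes this backward like any other node in the graph.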