For a DataParallel model, where is the reduce done during loss.backward()?

As documented in DataParallel (torch/nn/parallel/data_parallel.py), during the backward pass, gradients from each replica are summed into the original module. So AFAIK, this is a reduce op, right?
I'm trying to debug the C++ source code behind loss.backward(), but I can't find the part that performs this "reduce" operation. Could you please tell me where it is done?

DataParallel works via scatter/gather, so the gathering of the scattered gradients is implemented in the backward functions of these operators, which are defined in Python rather than in the C++ autograd engine. See here: pytorch/_functions.py at master · pytorch/pytorch · GitHub
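To make the "summed into the original module" part concrete, here is a minimal CPU-only sketch. It does not use DataParallel itself. It emulates two replicas with deepcopy (an assumption made so it runs without GPUs) and checks that adding up the per-replica gradients gives the same result as the gradient from the full batch. The names model, replicas, and replica_grads are just illustrative.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 2)
x = torch.randn(8, 4)

# Emulated replicas: independent copies of the module, one per "device".
# (Real DataParallel replicates onto GPUs; this is only a CPU stand-in.)
replicas = [copy.deepcopy(model) for _ in range(2)]

# Reference: gradient of the loss computed on the full batch.
model(x).sum().backward()
full_grad = model.weight.grad.clone()

# "Scatter" the batch along dim 0 and run backward on each replica.
chunks = x.chunk(2, dim=0)
replica_grads = []
for replica, chunk in zip(replicas, chunks):
    replica(chunk).sum().backward()
    replica_grads.append(replica.weight.grad)

# The "reduce" step: sum the per-replica gradients.
summed_grad = torch.stack(replica_grads).sum(dim=0)

grads_match = torch.allclose(full_grad, summed_grad, atol=1e-5)
print("summed replica grads match full-batch grad:", grads_match)
```

In real DataParallel you don't write this sum yourself; it happens as part of the backward functions in the file linked above.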