Why are the forward results of all parallel computations aggregated on GPU1?

I read an article which said:

"The Picture explain the action of DataParallel:

In the forth step of forward, all parallel computations are aggretated on GPU1 which is a good thing for many classification problems."
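If I understand correctly, the article is describing the usual usage pattern, something like this (just my own toy sketch, the model and sizes are made up, not from the article):

```python
import torch
import torch.nn as nn

# Toy model wrapped in DataParallel; names and sizes are placeholders.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

criterion = nn.CrossEntropyLoss()

inputs = torch.randn(32, 128).cuda()            # batch gets scattered across the GPUs
targets = torch.randint(0, 10, (32,)).cuda()

outputs = model(inputs)                # forward runs in parallel, outputs are gathered
                                       # on the output device (device_ids[0] by default)
loss = criterion(outputs, targets)     # so the loss is computed on that single GPU
loss.backward()
```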

I don't understand why it is designed this way. Why not compute the loss and run backward on each GPU, and then aggregate the final gradients on GPU1 to update the parameters, roughly like the sketch below? I also don't understand why the gathering is a good thing for classification.
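Here is a rough sketch of the alternative I have in mind (the `LossWrapper` name and the model are mine, not from the article): put the loss inside the wrapped module so that every replica computes its own loss, and only the per-GPU loss values are gathered.

```python
import torch
import torch.nn as nn

class LossWrapper(nn.Module):
    """Wrap a model and its criterion so each DataParallel replica computes its own loss."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # Return a per-replica loss (1-element tensor) instead of the full output tensor.
        return self.criterion(outputs, targets).unsqueeze(0)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
wrapped = nn.DataParallel(LossWrapper(model, nn.CrossEntropyLoss()),
                          device_ids=[0, 1]).cuda()

inputs = torch.randn(32, 128).cuda()
targets = torch.randint(0, 10, (32,)).cuda()

loss = wrapped(inputs, targets).mean()   # only the scalar losses are gathered, then averaged
loss.backward()                          # gradients are still reduced onto the original parameters
```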

Can anyone answer my question? Thanks a lot!