I am broadcasting variables to different GPUs using the `Broadcast` function from https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/_functions.py#L6. Will the gradients be automatically aggregated during the backward pass? I am getting errors when calling backward, and I am not sure whether the broadcasting is being handled incorrectly. Thanks in advance!
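For what it's worth, `Broadcast`'s backward reduces (sums) the gradients from each replica back onto the source device, which is the same accumulation ordinary autograd performs whenever one tensor feeds several branches. A minimal CPU sketch of that accumulation behavior (it does not call `Broadcast` itself, since that requires multiple CUDA devices):

```python
import torch

# When one tensor is used in several branches of the graph, autograd
# sums the gradient contributions from each branch during backward --
# the same reduction Broadcast's backward applies across GPU replicas.
w = torch.ones(3, requires_grad=True)
y1 = (w * 2).sum()   # branch 1: d(y1)/dw = 2
y2 = (w * 3).sum()   # branch 2: d(y2)/dw = 3
(y1 + y2).backward()
print(w.grad)        # gradients summed: tensor([5., 5., 5.])
```

So if the backward itself errors out, the cause is more likely a device mismatch somewhere in the graph than missing gradient aggregation.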
It works
I made sure all inputs to the forward function are CUDA tensors. However, I still get this error …