I am trying to keep track of an average gradient over time. With a single GPU this works perfectly: I use register_backward_hook with a function that references self variables. When I try to run the same code on multiple GPUs I get:
RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:0 and input b is on cuda:1
From https://github.com/pytorch/pytorch/issues/8637 it seems that referencing self variables inside the hook can be the problem, since under nn.DataParallel the hooks fire on replicas living on different devices. Is there a way to work around this?
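For reference, here is a minimal sketch of the kind of thing I'm doing (the `GradAverager` helper and its names are hypothetical, not my actual code), adjusted with the one workaround I've tried: explicitly moving each incoming gradient onto the accumulator's device before the in-place update, so the binary op never mixes cuda:0 and cuda:1 tensors. It also uses a tensor hook on the original parameter (param.register_hook) rather than a module backward hook, since my understanding is that the parameter's gradient is reduced back onto the source device before that hook fires:

```python
import torch
import torch.nn as nn

class GradAverager:
    """Keeps a running average of one parameter's gradient.

    Hypothetical helper for illustration. The accumulator stays on a
    fixed device (the original parameter's device); the hook moves each
    incoming gradient there, which avoids the cross-device binary_op
    error when hooks fire on DataParallel replicas.
    """
    def __init__(self, param):
        # State pinned to the original parameter's device.
        self.avg = torch.zeros_like(param)
        self.count = 0
        # Tensor hook: called each backward pass with this parameter's
        # gradient, before it is accumulated into param.grad.
        param.register_hook(self._update)

    def _update(self, grad):
        # grad may live on another GPU; .to() makes the add device-safe.
        g = grad.detach().to(self.avg.device)
        self.count += 1
        # Incremental running mean.
        self.avg += (g - self.avg) / self.count
        return grad  # leave the gradient itself unchanged

model = nn.Linear(4, 1)
tracker = GradAverager(model.weight)

for _ in range(3):
    model.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
```

On a single GPU (or CPU) this tracks the average as expected; I'm unsure whether pinning the state like this is the intended fix for the multi-GPU case or whether there is a cleaner pattern.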