Register_backward_hook with two GPUs to save average gradient

I am trying to keep track of an average gradient over time. With a single GPU this works perfectly using register_backward_hook attached to a method that references self variables. When trying to run this on multiple GPUs I am getting:

RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:0 and input b is on cuda:1
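
For context, my setup looks roughly like the sketch below (the class, layer sizes, and names like `avg_grad` / `_grad_hook` are just illustrative, the real model is bigger, but the hook and the self variables are the same idea):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Simplified stand-in for my model; names are illustrative."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)
        self.avg_grad = None   # running average of the gradient over time
        self.steps = 0
        # hook references self variables, as in my real code
        self.fc.register_backward_hook(self._grad_hook)

    def _grad_hook(self, module, grad_input, grad_output):
        # grad_output is a tuple; take the gradient w.r.t. the layer's output
        g = grad_output[0].detach()
        self.steps += 1
        if self.avg_grad is None:
            self.avg_grad = g.mean(dim=0)
        else:
            # incremental update of the running average
            self.avg_grad += (g.mean(dim=0) - self.avg_grad) / self.steps

    def forward(self, x):
        return self.fc(x)

# single GPU: this works fine
model = Net().cuda()
model(torch.randn(32, 128).cuda()).sum().backward()

# two GPUs: this is roughly where I hit the RuntimeError above
model = nn.DataParallel(Net().cuda())
model(torch.randn(32, 128).cuda()).sum().backward()
```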

From https://github.com/pytorch/pytorch/issues/8637 it seems like referencing self variables in the hook can be the problem. Is there a way to work around this?
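
For example, would something like the following be a sane workaround (moving the incoming gradient onto the device of the stored average inside the hook), or does DataParallel's replication break this anyway? Just a sketch of what I mean, as a drop-in replacement for `_grad_hook` above:

```python
def _grad_hook(self, module, grad_input, grad_output):
    g = grad_output[0].detach()
    if self.avg_grad is not None:
        # move the incoming gradient onto the device the running average lives on
        g = g.to(self.avg_grad.device)
    self.steps += 1
    if self.avg_grad is None:
        self.avg_grad = g.mean(dim=0)
    else:
        self.avg_grad += (g.mean(dim=0) - self.avg_grad) / self.steps
```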