I want to save the gradients of internal variables through register_hook() or retain_grad().
When I run the model on a single GPU, it works.
But when I run the model on multiple GPUs by wrapping it in nn.DataParallel, it no longer works.
Can anyone help me?
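For context, here is a minimal sketch of the single-GPU case that works for me (the model and names are just an illustration, not my real code):

```python
import torch
import torch.nn as nn

# Toy model for illustration only
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        h = self.fc1(x)
        h.retain_grad()      # keep the gradient of this intermediate tensor
        self.hidden = h      # stash it so we can read h.grad after backward
        return self.fc2(h)

model = Net()
out = model(torch.randn(3, 4))
out.sum().backward()
print(model.hidden.grad.shape)  # gradient of the internal variable is available
```

Once I wrap `model` in `nn.DataParallel`, `self.hidden` on the base module has no `.grad` after backward, because the forward ran on replicas.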
Based on the note in the `nn.DataParallel` documentation:

“In each forward, `module` is replicated on each device, so any updates to the running module in `forward` will be lost. For example, if `module` has a counter attribute that is incremented in each `forward`, it will always stay at the initial value because the update is done on the replicas, which are destroyed after `forward`. However, `DataParallel` guarantees that the replica on `device[0]` will have its parameters and buffers sharing storage with the base parallelized `module`. So in-place updates to the parameters or buffers on `device[0]` will be recorded.”
That means gradients of internal variables cannot be captured on the per-device replicas; only the replica on device[0] shares storage with the base module, so only its in-place updates are recorded. If you want state kept in sync across devices, you can try the DistributedDataParallel package instead, where each process owns a persistent copy of the module.
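As a sketch of that suggestion: under DistributedDataParallel the module is not re-replicated on every forward, so hooks and retained gradients persist. The single-process, CPU, `gloo`-backend setup below is only for illustration; in practice you would launch one process per GPU (e.g. via `torchrun`) and use the `nccl` backend. The address/port values are placeholders I chose.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-process setup just to make the example runnable on CPU
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(4, 2)   # toy model; substitute your own
ddp_model = DDP(model)    # the module is NOT destroyed/re-created each forward

h = ddp_model(torch.randn(3, 4))
h.retain_grad()           # works: this process owns its module and its graph
h.sum().backward()
print(h.grad.shape)

dist.destroy_process_group()
```

Each rank computes gradients on its own data shard and DDP all-reduces the parameter gradients, so your retained intermediate gradients stay local to each process.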