I’m using DDP to train Neural Architecture Search networks, which contain a controller and a model network. During training, my controller predicts a model architecture that maximizes reward. The call looks like this:
# both model and controller are torch.nn.DistributedDataParallel
arch = controller.forward(conditions)
model.module.set_arch(arch) # modifies the model's internal architecture
output = model.forward(input)...
However, in DDP docs I noticed the following:
.. warning::
You should never try to change your model’s parameters after wrapping
up your model with DistributedDataParallel. In other words, when
wrapping up your model with DistributedDataParallel, the constructor of
DistributedDataParallel will register the additional gradient
reduction functions on all the parameters of the model itself at the
time of construction. If you change the model’s parameters after
the DistributedDataParallel construction, this is not supported and
unexpected behaviors can happen, since some parameters’ gradient
reduction functions might not get called.
So I’m just wondering: what is the correct way to do this? Or is NAS simply not compatible with DDP?
model.module.set_arch(arch) # modified model internal architecture.
By doing the above, are you removing parameters from the model or adding new parameters into the model? If so, then it won’t work with DDP, as DDP creates communication buckets at construction time using the parameters returned by model.parameters(). Hence, if model.parameters() later returns a different set of parameters, DDP won’t adapt to it.
To make it work, you can create a new DDP instance from the modified model whenever the model is updated. But all DDP processes need to do the same thing at the same time, using the same model.
If it just changes the value of those parameters, it should be fine.
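The re-wrapping approach can be sketched as follows. This is a minimal single-process illustration using the gloo backend; the Linear/Sequential models and the port number are stand-ins, and in a real job every rank would execute the same re-wrap step at the same point:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process process group, just for demonstration
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)

# architecture changed: the parameter set is different now, so re-wrap,
# letting the DDP ctor rebuild its buckets and hooks for the new parameters
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
ddp_model = DDP(model)

out = ddp_model(torch.randn(3, 4))
dist.destroy_process_group()
```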
I believe this is replacing the tensor. You can use self._arch.copy_(arch) to overwrite the value in place. See the code below.
import torch
x = torch.zeros(2, 2)
y = torch.ones(2, 2)
print("x storage: ", x.data_ptr())
print("y storage: ", y.data_ptr())
x = y
print("x storage: ", x.data_ptr())
z = torch.zeros(2, 2) + 2
print("z storage: ", z.data_ptr())
x.copy_(z)
print("x storage: ", x.data_ptr())
print(x)
The outputs are:
x storage: 94191491020800
y storage: 94191523992320
x storage: 94191523992320
z storage: 94191523994816
x storage: 94191523992320
tensor([[2., 2.],
[2., 2.]])
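Applied to the NAS setting above, set_arch can copy into the existing tensor instead of rebinding the attribute, so any pointer held at DDP construction time stays valid. A sketch; the NASModel class, the 4-element arch vector, and storing _arch as a buffer are all made-up details for illustration:

```python
import torch

class NASModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # arch settings stored as a buffer; the tensor object is created once
        self.register_buffer("_arch", torch.zeros(4))
        self.fc = torch.nn.Linear(4, 2)

    def set_arch(self, arch):
        # in-place copy: same storage, so anything holding a pointer
        # to this tensor still sees the updated values
        with torch.no_grad():
            self._arch.copy_(arch)

model = NASModel()
before = model._arch.data_ptr()
model.set_arch(torch.ones(4))
# model._arch.data_ptr() is still `before`; only the values changed
```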
This might be it. If the DDP wrapper kept a pointer to my arch settings, it will not see the new value, since the new value lives at a different pointer.
So does that mean that DDP.module params is a stale copy of our model??
I believe so, as DDP remembers the variables at construction time.
And there might be more to it than that. DDP might not be able to read that value at all, because DDP registers a backward hook on each parameter and relies on those hooks to be notified of when and what to read. Those hooks are installed at DDP construction time as well, so if you create a new variable and assign it to self._arch, that hook is lost.
cc @albanD is the above statement on variable hook correct?
IIUC, that will still remove DDP autograd hooks on self._arch.
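The hook-loss behavior is easy to demonstrate on a plain tensor with register_hook (a small standalone sketch, not DDP itself): the hook lives on the tensor object, so rebinding the name to a fresh tensor silently drops it.

```python
import torch

calls = []
x = torch.ones(2, requires_grad=True)
x.register_hook(lambda g: calls.append("fired"))
x.sum().backward()   # hook fires once

# "replacing" x: a brand-new tensor; the hook stays on the old object
x = torch.ones(2, requires_grad=True)
x.sum().backward()   # hook does not fire
```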
Question: do you need the backward pass to compute gradients for self._arch? If not, you can explicitly set self._arch.requires_grad = False before passing the model to the DDP constructor, to tell DDP to ignore self._arch. Then the above assignment would work.
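That could look like the sketch below (the NASModel class and the 4-element arch parameter are invented for illustration; the DDP wrap is commented out since it requires an initialized process group). Note that nn.Module requires the replacement value for a parameter attribute to be a Parameter as well:

```python
import torch

class NASModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._arch = torch.nn.Parameter(torch.zeros(4))
        self.fc = torch.nn.Linear(4, 2)

model = NASModel()
# mark _arch before wrapping, so the DDP ctor skips it entirely
# (no gradient bucket, no autograd hook registered for it)
model._arch.requires_grad = False
# ddp_model = torch.nn.parallel.DistributedDataParallel(model)

# replacing the whole parameter is now safe, since DDP never tracks it
model._arch = torch.nn.Parameter(torch.ones(4), requires_grad=False)
```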