Can PyTorch use stateful optimizers for a dynamic number of weights?


I would like to implement the following project in PyTorch:

I already started to implement some simple networks (which is quite elegant in PyTorch).

The linked project uses a dynamic number of layers and weights: if required, the model adds new layers. Is it possible in PyTorch to optimize such a network with stateful optimizers (e.g. Adam, Adagrad) without losing the advantages of the internal state?

Thank you very much :)

Looking at the source for Adam, it loops over all parameter sets in all parameter groups, and does its calculations separately for each parameter set.

    for group in self.param_groups:
        for p in group['params']:
            state = self.state[p]

            # State initialization (lazy: runs the first time p is seen)
            if len(state) == 0:
                ...  # create this parameter's state (elided here)

            # do calculations
            ...

So when you add a new parameter group to the optimiser with optimizer.add_param_group({'params': new_module.parameters()}), the old parameters keep their existing optimiser state, and the new parameters' optimiser state is initialised correctly on the first step after they are added. Other per-group keys (for example 'lr') are optional; any you leave out fall back to the optimiser's defaults.

Other optimisers should work similarly.
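As a rough illustration of the above, here is a minimal sketch that grows a toy model by one layer partway through training. The layer sizes, the names layer1 and layer2, and the dummy squared-output loss are assumptions made up for this example; only torch.optim.Adam and add_param_group come from the discussion above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy setup: the model starts with a single layer.
layer1 = nn.Linear(4, 4)
optimizer = torch.optim.Adam(layer1.parameters(), lr=1e-3)
x = torch.randn(8, 4)

# Train for a few steps so Adam accumulates state for layer1.
for _ in range(3):
    optimizer.zero_grad()
    layer1(x).pow(2).mean().backward()  # dummy loss, for illustration only
    optimizer.step()

# int() copes with "step" being stored as either an int or a tensor.
step_before = int(optimizer.state[layer1.weight]["step"])

# Grow the network and register the new parameters as a second group.
layer2 = nn.Linear(4, 2)
optimizer.add_param_group({"params": layer2.parameters()})

# The new parameters have no optimiser state until their first step.
new_param_has_state = layer2.weight in optimizer.state

# Keep training through both layers.
for _ in range(2):
    optimizer.zero_grad()
    layer2(layer1(x)).pow(2).mean().backward()
    optimizer.step()

step_old = int(optimizer.state[layer1.weight]["step"])
step_new = int(optimizer.state[layer2.weight]["step"])
print(step_before, new_param_has_state, step_old, step_new)
```

The step counter of the old layer carries on from where it was, while the new layer's counter starts from its first update, so the moment estimates already collected for the old weights are not thrown away. Creating a fresh optimiser after each growth step would instead reset the state of every parameter.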

This paper reminds me of a couple of others.

Forward Thinking: Building and Training Neural Networks One Layer at a Time
Learning Infinite Layer Networks Without the Kernel Trick

Thanks for the great answer :)
