I would like to implement the following project in PyTorch:
I have already started implementing some simple networks (which is quite elegant in PyTorch).
The linked project uses a dynamic number of layers and weights: the model adds new layers as required. Is it possible to optimize such a network with a dynamic number of layers / weights using stateful optimizers (e.g. Adam, Adagrad, etc.) in PyTorch without losing the advantages of the internal state?
Looking at the source for Adam, you can see that it loops over all parameters in all parameter groups and does its calculations separately for each parameter:
for group in self.param_groups:
    for p in group['params']:
        ...
        # per-parameter state, keyed by the parameter tensor itself
        state = self.state[p]
        # State initialization: happens lazily, the first time
        # this parameter is seen by step()
        if len(state) == 0:
            ...
        # do calculations
        ...
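
Since the state is keyed per parameter (self.state[p]) and only initialized the first time that parameter shows up in step(), you should be able to hand newly created parameters to a running optimizer via add_param_group while the previously registered parameters keep their accumulated state. A minimal sketch of that idea, assuming Adam; the model, layer sizes, and the 'grown' module name are made up for illustration:

    import torch
    import torch.nn as nn

    # Hypothetical toy model that grows at runtime; the sizes are arbitrary.
    model = nn.Sequential(nn.Linear(10, 10))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # ... train for a while; Adam accumulates its moment estimates
    # (exp_avg, exp_avg_sq) for the existing parameters ...

    # The model decides it needs a new layer.
    new_layer = nn.Linear(10, 10)
    model.add_module('grown', new_layer)

    # Register the new parameters with the running optimizer.
    optimizer.add_param_group({'params': new_layer.parameters()})

    # On the next optimizer.step(), the `len(state) == 0` branch above
    # creates fresh state for the new parameters only; all previously
    # registered parameters keep their internal state untouched.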