Changing the set of trainable params in the optimizer during training?


I have an idea I would like to try: randomly “dropping out” the training of entire layers’ weights during different training steps. For example, in epoch one train all layers but the first; in epoch two drop layers three and five; in some other epoch train all layers again; and so on.

The idea is that this might add a kind of regularization to the model and, if implemented correctly, also save some computation, because the gradients of the parameters that are “turned off” during an epoch do not need to be computed.

I don’t know whether this is a common practice, and I am interested in hearing opinions on this.

What would be the typical way to achieve this?


Hello :wave:

I can understand this in two ways:

  1. You’re talking about the dropout technique, applied to all weights of an entire layer.
  2. You want the weights of a layer to be excluded from the weight update.

The first option can be achieved by applying dropout with a drop probability equal to 1.0, but that doesn’t make sense to me: after zeroing an entire layer, nothing downstream of it, including the output, would carry any information from the input.
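To make that concrete, here is a minimal sketch, assuming PyTorch and a made-up two-module toy (the names `drop` and `head` are only for illustration). It shows that a dropout probability of 1.0 zeroes everything passing through in training mode, so the layer after it produces the same output for any input.

```python
import torch
from torch import nn

torch.manual_seed(0)

drop = nn.Dropout(p=1.0)   # drop every activation
head = nn.Linear(3, 2)     # hypothetical layer that follows the dropout
drop.train()               # dropout is only active in training mode

x1 = torch.randn(4, 3)
x2 = torch.randn(4, 3)

# Everything that passes through the dropout is zeroed
all_zero = bool((drop(x1) == 0).all())

# The following layer therefore sees only zeros and returns its bias,
# regardless of what the input was
same_output = torch.equal(head(drop(x1)), head(drop(x2)))
```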

The second option can probably be achieved by freezing the layer weights dynamically: set the requires_grad attribute of the parameters of a single layer to False, then revert it back to True later.
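A minimal sketch of that freezing step, assuming a small `nn.Sequential` toy model; the helper name `set_layer_trainable` is made up for illustration and is not a PyTorch API.

```python
import torch
from torch import nn


def set_layer_trainable(layer: nn.Module, trainable: bool) -> None:
    """Toggle requires_grad on every parameter of one layer (hypothetical helper)."""
    for p in layer.parameters():
        p.requires_grad_(trainable)


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer, then run one backward pass
set_layer_trainable(model[0], False)
loss = model(torch.randn(3, 4)).sum()
loss.backward()

# Autograd skips the frozen parameters, so their .grad stays None
frozen_has_grad = model[0].weight.grad is not None
trainable_has_grad = model[2].weight.grad is not None

# Revert the layer to trainable afterwards
set_layer_trainable(model[0], True)
```

Note that the optimizer does not need to be rebuilt: PyTorch optimizers skip parameters whose `.grad` is `None`.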

Best Regards,
Rafael Macedo.

Hello once more, Rafael, :slight_smile:

Yes, I meant the second option, but I was not sure how to implement it. I probably should not have mixed dropout terminology into my explanation; I realize now that it is a bit confusing.

So I will simply pick layers at random and set, e.g., model.layer1.weight.requires_grad = False and model.layer1.bias.requires_grad = False, train like this for one epoch, and after that set requires_grad = True again. This should introduce some interesting form of regularization, no?
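Roughly, the per-epoch loop could look like the sketch below. It is only an illustration under assumptions: the toy model, the random data, plain SGD, and the name `freeze_prob` are all placeholders, and at least one layer is always kept trainable because `backward()` raises an error if nothing requires a gradient.

```python
import random

import torch
from torch import nn

random.seed(0)
torch.manual_seed(0)

# Toy model and data; all sizes are arbitrary placeholders
model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
layers = [m for m in model if isinstance(m, nn.Linear)]
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 4), torch.randn(32, 1)

freeze_prob = 0.3  # hypothetical per-layer freezing probability
frozen_stayed_fixed = True
trainable_changed = False

for epoch in range(5):
    # Pick the layers to freeze this epoch, keeping at least one trainable
    frozen = {i for i in range(len(layers)) if random.random() < freeze_prob}
    if len(frozen) == len(layers):
        frozen.discard(random.choice(sorted(frozen)))
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad_(i not in frozen)

    before = [layer.weight.detach().clone() for layer in layers]
    for step in range(3):  # stands in for one pass over a DataLoader
        # set_to_none=True clears stale grads so frozen params are skipped
        optimizer.zero_grad(set_to_none=True)
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()

    # Bookkeeping only: check which weights moved during this epoch
    for i, layer in enumerate(layers):
        same = torch.equal(before[i], layer.weight.detach())
        if i in frozen:
            frozen_stayed_fixed = frozen_stayed_fixed and same
        else:
            trainable_changed = trainable_changed or not same

# Re-enable all parameters once training is done
for p in model.parameters():
    p.requires_grad_(True)
```

Two caveats: if the gradients are zeroed instead of set to `None`, an optimizer with momentum or weight decay may still move the frozen weights; and the compute savings are limited, since the backward pass still has to run through a frozen layer whenever an earlier layer is trainable.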

Best, JZ


Yeah, I think what you described will do what you want.

As for whether it introduces an interesting form of regularization, I have absolutely no idea hahaha
Deep Learning is an experimental research area; hardly anyone can predict how even a minor change to a model will affect it, and since I’ve never seen anyone use this technique, I won’t try to guess.

So go on, try it, and if possible, come back and share your conclusions here in the forum :upside_down_face:
See ya!


Just wanted to give some quick feedback on this for people interested in the topic. I recently discovered that the method I proposed is close to a regularization method called Zoneout. Find the paper here:

Best, JZ