I’ve really been enjoying going through the recently released Kafnet paper and would eventually like to take a stab at implementing it. In the paper, the authors have a section where they go over some other activations, one of which is PReLU. I’ve had great success with PReLU in some of my networks, literally doubling performance.
In the PyTorch documentation, there is a note regarding the use of PReLU:
Note: weight decay should not be used when learning “α” for good performance.
Here alpha is the learned ‘leaky’ slope that negative activations are scaled by, in order to avoid dying ReLUs and, hopefully, to renormalize the activation towards zero. In the Kafnet paper, they further explain:
Importantly, in the case of l_p regularization (p=1, 2), the user has to be careful not to regularize the “α” parameters, which would bias the optimization process towards classical ReLU / leaky ReLU activation functions.
In PyTorch, weight decay (an L2 penalty) is defined on the optimizer directly. If I wanted to create a new activation layer whose parameters are learned but not regularized, how can I go about doing that, considering nn.Parameter only exposes requires_grad?
In fact, forget about creating a new activation; how should one even go about properly using weight decay with the existing PReLU activation? I’d like to regularize my network; I just want to exclude my activation parameters from that.
As weight decay is defined at the param_group level, you could pass the αs in a separate param group and specify weight_decay=0 for them (or use a separate optimizer).
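A minimal sketch of the param-group approach, in case it helps. The toy model, learning rate, and decay value here are made up for illustration; the only assumption about your real network is that the αs live in nn.PReLU modules, so they can be picked out by module type.

```python
import torch
import torch.nn as nn

# Toy model, purely illustrative
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.PReLU(),
    nn.Linear(20, 1),
)

# Collect the parameters owned by PReLU modules (the alphas)
prelu_params = [
    p
    for m in model.modules()
    if isinstance(m, nn.PReLU)
    for p in m.parameters()
]
prelu_ids = {id(p) for p in prelu_params}

# Everything else gets the optimizer's default weight decay
other_params = [p for p in model.parameters() if id(p) not in prelu_ids]

optimizer = torch.optim.SGD(
    [
        {"params": other_params},                       # uses weight_decay below
        {"params": prelu_params, "weight_decay": 0.0},  # per-group override
    ],
    lr=0.01,
    weight_decay=1e-4,
)
```

Options set inside a group dict override the keyword defaults passed to the optimizer, so only the second group is exempt from decay. The same filtering should work for a custom activation module if you swap nn.PReLU for your own class.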
Alternatively (and I don’t think the overhead is that terrible), you could manually add the weight decay to your loss and not use the optimizer-provided version.
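A rough sketch of the manual version, under the same made-up toy model. The helper name l2_penalty and the coefficient are hypothetical; the 0.5 factor is there so that the gradient of the penalty is coeff * p, which matches what SGD's weight_decay adds to the gradient.

```python
import torch
import torch.nn as nn

# Toy model, purely illustrative
model = nn.Sequential(nn.Linear(10, 20), nn.PReLU(), nn.Linear(20, 1))


def l2_penalty(model, coeff):
    """L2 penalty over all parameters except those owned by PReLU modules.

    Hypothetical helper; assumes all parameters live on the CPU.
    """
    penalty = torch.zeros(())
    for module in model.modules():
        if isinstance(module, nn.PReLU):
            continue  # leave the alphas unregularized
        # recurse=False so each parameter is visited once, via its owner
        for p in module.parameters(recurse=False):
            penalty = penalty + p.pow(2).sum()
    return 0.5 * coeff * penalty


# No optimizer-side weight decay; the penalty is part of the loss instead
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0)

x = torch.randn(4, 10)  # dummy batch
y = torch.randn(4, 1)

loss = nn.functional.mse_loss(model(x), y) + l2_penalty(model, 1e-4)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The cost is one extra pass over the parameters per step, and you get full control over which parameters are penalized and how.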
Thanks for the tips, guys. Your suggestions, and looking at the .step() method of an optimizer, provide some key insight. Here’s a compact example for reference purposes, in case anyone else has this need: