I've really been enjoying going through the recently released Kafnet paper and would eventually like to take a stab at implementing it. In the paper, the authors have a section where they go over some other activations, one of which is PReLU. I've had great success with PReLU in some of my networks, literally doubling performance.
In the PyTorch documentation, there is a note regarding the use of PReLU:
Note: weight decay should not be used when learning "α" for good performance.
Here α is the learned 'leaky' slope that negative activations are scaled by, in order to avoid dying ReLUs and, hopefully, to pull the mean activation back towards zero. The Kafnet paper explains further:
Importantly, in the case of l_p regularization (p=1, 2), the user has to be careful not to regularize the "α" parameters, which would bias the optimization process towards classical ReLU / leaky ReLU activation functions.
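If I read that correctly, the argument can be sketched as follows for the p=2 case. The symbols λ (penalty strength), η (learning rate) and L_data (the unregularized loss) are my own notation, not taken from the paper or the PyTorch docs.

```latex
% PReLU with learned negative slope \alpha
f(x) = \max(0, x) + \alpha \, \min(0, x)

% Adding an l_2 penalty on \alpha to the loss
L_{\text{total}} = L_{\text{data}} + \frac{\lambda}{2} \, \alpha^2

% Gradient step on \alpha: the extra \lambda \alpha term shrinks \alpha towards 0
\alpha \leftarrow \alpha - \eta \left( \frac{\partial L_{\text{data}}}{\partial \alpha} + \lambda \, \alpha \right)

% At \alpha = 0 the activation is a plain ReLU
f(x) \big|_{\alpha = 0} = \max(0, x)
```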
In PyTorch, weight decay is set on the optimizer directly, through its weight_decay argument (documented as an L2 penalty). If I wanted to create a new activation layer whose parameters are learned but not regularized, how can I go about doing that, considering nn.Parameter only exposes
In fact, forget about creating a new activation; how should one even go about properly using weight decay with the existing PReLU activation? I'd like to regularize my network--I just want to exclude my activation parameters from that.
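For reference, here is a minimal sketch of the kind of setup I am after, assuming that the optimizer's per-parameter option groups are the right mechanism for this. The toy model, the learning rate, and the 1e-4 decay value are made up for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical toy model, used only to illustrate the grouping.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.PReLU(num_parameters=32),
    nn.Linear(32, 4),
)

# Split parameters by the type of module that owns them.
decay, no_decay = [], []
for module in model.modules():
    # recurse=False yields only the parameters owned directly by this module,
    # so each parameter is visited exactly once.
    for param in module.parameters(recurse=False):
        if isinstance(module, nn.PReLU):
            no_decay.append(param)  # the learned alpha values
        else:
            decay.append(param)     # everything else (weights and biases here)

# Each dict is one parameter group with its own weight_decay setting.
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},    # illustrative value
        {"params": no_decay, "weight_decay": 0.0},  # no decay on alpha
    ],
    lr=0.01,
    momentum=0.9,
)
```

The same isinstance check could presumably be extended to a custom activation class. Note that this sketch still applies decay to the Linear biases, which may or may not be what one wants.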