Hello,
I am pretty new to PyTorch and am trying to implement the trainable layer proposed in this paper: https://arxiv.org/pdf/1607.05666.pdf
Here is my code:

Q. The eps parameter is only there to avoid division by zero, so I don’t need to use expand_as with it, right?

Q. The forward pass seems to be working, but are there any obvious errors? Do I need to use register_buffer for the parameters?

Q. The paper says that, to ensure parameter positivity, they do gradient updates on the log values of the parameters and then take exponentials. How can I go about doing this?

I have a couple more questions, but I will save them for now.

Yes, that usually is just the regularisation. I’d even leave it as a python float.
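To illustrate the point about eps: a plain Python float broadcasts against a tensor of any shape, so no expand_as is needed. This is only a small sketch; the tensor names and shapes are made up for the example.

```python
import torch

x = torch.rand(2, 3, 4)      # dummy energies (illustrative shape)
m = torch.zeros(2, 3, 4)     # worst case: a smoothed energy that is exactly zero
eps = 1e-6                   # plain Python float, no tensor or expand_as needed

out = x / (m + eps)          # eps is broadcast to every element of m
print(out.shape, torch.isfinite(out).all())
```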

I think something is up with the indentation, but that is likely only the quoting, I have not checked in great detail.

You could use self.log_alpha, log_delta, log_r as the parameters (but ideally init them to something close to 0 instead of 1, too) and then do alpha = self.log_alpha.exp().expand_as(x).
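A tiny sketch of why the log parameterisation keeps things positive, assuming a single scalar parameter for brevity (the names and the size of the update are only illustrative): whatever value the optimizer pushes log_alpha to, its exponential stays above zero.

```python
import torch
import torch.nn as nn

log_alpha = nn.Parameter(torch.zeros(1))   # exp(0) = 1, so alpha starts at 1
x = torch.rand(4, 5)

alpha = log_alpha.exp().expand_as(x)       # same shape as x, all ones here

# Simulate a large update in the negative direction on the log value.
with torch.no_grad():
    log_alpha -= 10.0

alpha_after = log_alpha.exp().expand_as(x) # tiny, but still strictly positive
print(alpha_after.min())
```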

Yeah, the indentation of the forward function got messed up while I was pasting the code.

“You could use self.log_alpha, log_delta, log_r as the parameters (but ideally init them to something close to 0 instead of 1, too) and then do alpha = self.log_alpha.exp().expand_as(x).”

I am confused as to what exactly you mean. Let's say I init them properly; in the paper they initialize with a normal distribution with mean 1 and std 0.1.
Q. When exactly would I take the log?

I thought I could do something like the following for a simple version of SGD, though it would be nice to use PyTorch's optimizers:

    for p in pcen.parameters():
        p_log = torch.log(p)
        p_log.data.add_(-learning_rate, p.grad.data)  # or p_log.grad.data?
        # and then somehow copy the exponentiated log parameters back to p

Q. Or is all of this unnecessary with the approach you proposed?

Apologies for being less clear.
I’d do something like the following (the probability of the log going wrong is not that large, given that the mean is 10 standard deviations from 0):
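A minimal sketch of that idea, assuming input of shape (batch, time, channels), a fixed smoothing coefficient s, and the PCEN formula from the paper. The class name, argument names and default values are illustrative assumptions, not code taken from the paper:

```python
import torch
import torch.nn as nn


class PCEN(nn.Module):
    """Sketch of a PCEN layer whose trainable parameters are stored as logs."""

    def __init__(self, num_channels, s=0.025, eps=1e-6):
        super().__init__()

        def init_log():
            # Draw from N(1, 0.1) and store the log; a negative draw is
            # about 10 standard deviations away, so the log is safe in practice.
            return torch.log(1.0 + 0.1 * torch.randn(num_channels))

        self.log_alpha = nn.Parameter(init_log())
        self.log_delta = nn.Parameter(init_log())
        self.log_r = nn.Parameter(init_log())
        self.s = s        # smoothing coefficient, kept fixed here (assumption)
        self.eps = eps    # plain Python float

    def forward(self, x):
        # x: (batch, time, channels), non-negative energies
        alpha = self.log_alpha.exp()   # exponentials are always positive
        delta = self.log_delta.exp()
        r = self.log_r.exp()

        # First-order smoother along time: M[t] = (1 - s) * M[t-1] + s * x[t]
        m = x[:, 0]
        smoothed = [m]
        for t in range(1, x.size(1)):
            m = (1 - self.s) * m + self.s * x[:, t]
            smoothed.append(m)
        M = torch.stack(smoothed, dim=1)

        # Per-channel parameters broadcast over the last dimension.
        return (x / (M + self.eps) ** alpha + delta) ** r - delta ** r


torch.manual_seed(0)
pcen = PCEN(num_channels=40)
x = torch.randn(8, 100, 40).exp_()     # positive dummy energies
y = pcen(x)

# Any standard optimizer can now update the log parameters directly.
optimizer = torch.optim.SGD(pcen.parameters(), lr=0.01)
loss = y.pow(2).mean()                 # placeholder loss for the example
loss.backward()
optimizer.step()
```

The optimizer only ever sees log_alpha, log_delta and log_r, so there is no need to copy exponentiated values back into the parameters by hand.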

This way, the backprop will just compute correct adjustments to the log parameters. My understanding is that they did it similarly.
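To make the backprop claim concrete, here is a small check under a toy loss (the values and the loss are made up for illustration). No log is taken during backward; the exp in the forward pass is simply differentiated, so the gradient with respect to the log parameter is the gradient with respect to alpha times alpha itself.

```python
import torch

log_alpha = torch.tensor([-0.5, 0.0, 0.3], requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

alpha = log_alpha.exp()          # forward: alpha = exp(log_alpha)
loss = (alpha * x).sum()         # toy loss, d(loss)/d(alpha) = x
loss.backward()

# Chain rule: d(loss)/d(log_alpha) = d(loss)/d(alpha) * d(alpha)/d(log_alpha)
#                                  = x * exp(log_alpha) = x * alpha
expected = x * alpha.detach()
print(torch.allclose(log_alpha.grad, expected))
```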

I also took the liberty of generating the dummy data in PyTorch directly and making it positive with exp_. The fractional powers don’t really mix well with negative numbers (that is why your code got NaNs) and we all prefer positive energy.
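A quick illustration of the NaN issue, with made-up values: raising a negative number to a fractional power is undefined for real tensors, while data passed through exp_ is positive and safe.

```python
import torch

x = torch.tensor([-2.0, 0.5, 2.0])
frac = x ** 0.5                        # the negative entry becomes nan
print(frac)

x_pos = torch.randn(4, 5).exp_()       # in-place exp makes every entry positive
frac_pos = x_pos ** 0.5                # no nan here
print(torch.isnan(frac_pos).any())
```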

I hope you still remember this post (not to mention see this one).
I have been experimenting with this model, and so far it does OK, but it still performs worse than my baseline. I still have a few optimization tricks to try.

You said:
This way, the backprop will just compute correct adjustments to the log parameters. My understanding is that they did it similarly.

Does this mean that when I do my loss.backward(), an in-place log will be taken of the associated parameters before their gradients are computed?
I am just trying to tie up any possible loose ends, though since I do get sensible results, I think it's more of an optimization issue.