It seems ax was never used in this or any previous implementation (it was returned at one point, but I don’t know whether or how it was used further).
CC @vincentqb, do you know if and where this state is used?
Judging from the ASGD implementation, it has the same effect as SGD, and state["ax"] is not used for updating the parameters. I suggest modifying this in the next version.
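To make the observation above concrete, here is a minimal pure-Python sketch, not the PyTorch code itself. It assumes a toy objective f(x) = (x - 3)^2 and a plain running mean for ax; the function names and the constant learning rate are illustrative assumptions. It shows that when ax is only written and never read in the step, the parameter trajectory is identical to plain SGD.

```python
def grad(x):
    # Gradient of the toy objective f(x) = (x - 3)^2 (an assumption for illustration).
    return 2.0 * (x - 3.0)


def sgd(x0, lr, steps):
    # Plain gradient descent on the toy objective.
    x = x0
    iterates = []
    for _ in range(steps):
        x = x - lr * grad(x)
        iterates.append(x)
    return iterates


def averaged_sgd(x0, lr, steps):
    # Same parameter step as sgd(); ax is a running mean that is kept on the side.
    x = x0
    ax = 0.0
    iterates = []
    for t in range(1, steps + 1):
        x = x - lr * grad(x)    # ax is not read here, so the step matches SGD
        ax = ax + (x - ax) / t  # running mean of the iterates seen so far
        iterates.append(x)
    return iterates, ax


plain = sgd(0.0, 0.1, 50)
averaged, ax = averaged_sgd(0.0, 0.1, 50)

print(plain == averaged)  # the two trajectories coincide
print(ax)                 # the averaged value, available only if the caller reads it
```

In this sketch the average only matters if something outside the update loop reads it, for example to evaluate with the averaged value instead of the last iterate.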
I don’t quite understand the explanation. Where should ax be used then? Do you mean it doesn’t need to be calculated and is just there for “debugging” purposes?
I am not sure. I cannot download the corresponding paper, so I don’t know how ASGD is supposed to update the parameters. Generally, I would expect state['ax'] to be used for modifying p.grad (the gradient), as Adagrad or Adam do with their state.
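For comparison, this is roughly what "state that feeds back into the update" looks like in an Adagrad-style step. It is a simplified scalar sketch on the same kind of toy objective, not the PyTorch implementation; the names `adagrad`, `state_sum`, and `eps` are chosen here for illustration.

```python
import math


def grad(x):
    # Gradient of the toy objective f(x) = (x - 3)^2 (an assumption for illustration).
    return 2.0 * (x - 3.0)


def adagrad(x0, lr, steps, eps=1e-10):
    # Adagrad-style update: the accumulated squared gradients scale every step.
    x = x0
    state_sum = 0.0
    history = []
    for _ in range(steps):
        g = grad(x)
        state_sum += g * g                             # state is written...
        x = x - lr * g / (math.sqrt(state_sum) + eps)  # ...and read in the same step
        history.append((x, state_sum))
    return history


history = adagrad(0.0, 0.5, 3)
for x, s in history:
    print(x, s)
```

Here the optimizer state changes the size of each step, whereas a running average such as ax can be maintained without ever influencing the trajectory.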