I read that I should be initializing the weights myself instead of letting torch do it, and I want to initialize my linear layers with Kaiming initialization. I have a few questions about specifying its nonlinearity argument.
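For context, this is roughly the kind of initialization I mean (the dimensions and ReLU are just placeholders, not my actual model):

```python
import torch.nn as nn

# A single linear layer initialized with Kaiming (He) initialization.
layer = nn.Linear(512, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```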
- There’s no nonlinearity option for softmax, but since softmax also ranges from 0 to 1, I assume it’s fine to pass nonlinearity='sigmoid'?
- If I’m using a low-rank linear layer, I’ll have two linear layers before the activation (see the sketch after this list). Should both use the nonlinearity of the activation function, or just the second (last) one?
- Should I still use the nonlinearity of the activation function if there is a layer norm right after the activation? If not, what should I use instead, maybe tanh? Normalization alters the range, but it’s not a nonlinear operation, or is it?
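To make the last two questions concrete, here is a minimal sketch of the kind of block I’m asking about (names, dimensions, and the choice of ReLU are hypothetical placeholders):

```python
import torch.nn as nn

class LowRankBlock(nn.Module):
    """Low-rank linear layer: two linear factors, then activation, then LayerNorm."""

    def __init__(self, d_in=512, rank=64, d_out=512):
        super().__init__()
        self.down = nn.Linear(d_in, rank)   # first factor of the low-rank layer
        self.up = nn.Linear(rank, d_out)    # second factor of the low-rank layer
        self.act = nn.ReLU()
        self.norm = nn.LayerNorm(d_out)

        # Question 2: should both factors get nonlinearity='relu',
        # or only self.up, the last linear before the activation?
        nn.init.kaiming_normal_(self.down.weight, nonlinearity='relu')
        nn.init.kaiming_normal_(self.up.weight, nonlinearity='relu')
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # Question 3: does the LayerNorm right after the activation change
        # which nonlinearity I should pass to the init calls above?
        return self.norm(self.act(self.up(self.down(x))))
```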