Specifying the Kaiming initialization nonlinearity with softmax, low-rank (stacked) linear layers, and layer norm

I read that I should initialize the weights myself instead of letting torch do it, and I want to initialize my linear layers with Kaiming initialization. I have a few questions about specifying the nonlinearity argument.
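
For context, this is roughly what I have in mind, as a minimal sketch (the layer sizes and the `nonlinearity` value are just placeholders, which is exactly the part I'm unsure about):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# Kaiming (He) initialization; the question is what to pass as `nonlinearity`
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```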

  1. There’s no nonlinearity option for softmax, but since softmax also outputs values in (0, 1), I assume it’s fine to use `nonlinearity='sigmoid'`? (See the first sketch after this list.)
  2. If I’m using a low-rank linear layer, I’ll have two linear layers in a row before the activation. Should both be initialized with the nonlinearity of the activation function, or just the second (last) one? (Second sketch below.)
  3. Should I still use the nonlinearity of the activation function if layer norm comes right after the activation? If not, what should I use instead, maybe tanh? Normalization changes the range of the values, but it’s not a nonlinear operation, or is it? (Third sketch below.)
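
For question 1, this is the setup I mean, sketched with made-up sizes: a final linear layer whose output goes through softmax, with `'sigmoid'` as the stand-in gain I'm asking about.

```python
import torch
import torch.nn as nn

head = nn.Linear(128, 10)  # logits layer, followed by softmax
# calculate_gain has no 'softmax' option, so would 'sigmoid' be an acceptable stand-in?
nn.init.kaiming_normal_(head.weight, nonlinearity='sigmoid')
nn.init.zeros_(head.bias)

probs = torch.softmax(head(torch.randn(4, 128)), dim=-1)
```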
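For question 2, the low-rank layer I mean looks like this (again a sketch with placeholder sizes; the rank-`r` factorization replaces one big linear layer with two smaller ones):

```python
import torch.nn as nn

in_features, out_features, rank = 512, 512, 32

# Low-rank factorization: x -> down-projection -> up-projection -> ReLU
low_rank = nn.Sequential(
    nn.Linear(in_features, rank, bias=False),  # first factor
    nn.Linear(rank, out_features),             # second factor
    nn.ReLU(),
)

# The question: do both linear layers get nonlinearity='relu', or only the last one?
nn.init.kaiming_normal_(low_rank[0].weight, nonlinearity='relu')  # both?
nn.init.kaiming_normal_(low_rank[1].weight, nonlinearity='relu')  # or just this one?
nn.init.zeros_(low_rank[1].bias)
```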
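And for question 3, the ordering I have in mind is the activation followed immediately by LayerNorm (sizes are placeholders):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.LayerNorm(256),  # normalization right after the activation
)

# Still nonlinearity='relu' here, or something else (e.g. 'tanh') because of the norm?
nn.init.kaiming_normal_(block[0].weight, nonlinearity='relu')
nn.init.zeros_(block[0].bias)
```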