I realize this is probably a simple question. I didn't think it mattered where a computation lives, but the LLMs told me to move some computation from my model into the loss function, and I can't tell whether that advice is a hallucination, so I'm hoping for some clarification.
For concreteness, I was doing ordinal threshold prediction with 6 classes. The last layer of my model was a linear layer, followed by softplus, followed by torch.cumsum, then divided by the row sum to squash the values into the 0-1 range. Essentially I was trying to output 5 threshold probabilities that were monotonically increasing. The LLMs claimed this was constraining the optimization space and suggested moving the softplus, cumsum, and division into the loss function.
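Roughly what the head looks like (simplified; the class and argument names here are mine, and I'm reading "the sum of the rows" as the total of the softplus increments, which equals the last cumulative value):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalHead(nn.Module):
    """Backbone features -> 5 monotonically increasing values in (0, 1]."""
    def __init__(self, in_features: int, num_classes: int = 6):
        super().__init__()
        # K classes -> K - 1 ordinal thresholds
        self.linear = nn.Linear(in_features, num_classes - 1)

    def forward(self, x):
        inc = F.softplus(self.linear(x))   # strictly positive increments
        cum = torch.cumsum(inc, dim=-1)    # monotonically increasing
        # normalize by the total of the increments (= last cumulative value);
        # use cum.sum(dim=-1, keepdim=True) instead if "sum of the rows" meant
        # the sum of the cumulative values
        return cum / cum[..., -1:].clamp_min(1e-12)
```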
From my understanding this shouldn't matter: the composite function is differentiable either way, so whether the transform sits in the model's last layer or inside the loss function, the gradients with respect to the weights should be identical. Is that accurate?
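To sanity-check that, I compared the two arrangements on a toy setup (the squared-sum loss below is just a placeholder, not my actual ordinal loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_thresholds(logits):
    """softplus -> cumsum -> normalize, applied either in the model or in the loss."""
    inc = F.softplus(logits)
    cum = torch.cumsum(inc, dim=-1)
    return cum / cum[..., -1:].clamp_min(1e-12)

torch.manual_seed(0)
x = torch.randn(8, 4)
lin_a = nn.Linear(4, 5)
lin_b = nn.Linear(4, 5)
lin_b.load_state_dict(lin_a.state_dict())  # identical initial weights

# Variant A: transform is the model's last "layer"
loss_a = to_thresholds(lin_a(x)).pow(2).sum()
loss_a.backward()

# Variant B: model outputs raw logits, transform happens inside the loss
raw = lin_b(x)
loss_b = to_thresholds(raw).pow(2).sum()
loss_b.backward()

print(torch.allclose(lin_a.weight.grad, lin_b.weight.grad))  # True
```

That prints True, which is why I think the placement shouldn't matter.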
Is there any merit to what the LLM is saying? Maybe it matters from a weight-initialization point of view?