Does it matter whether computation happens in the loss function or in the model?

I realize this is a simple question, but I didn’t think it mattered. The LLMs told me to move some computation I had in the model into the loss function, and I can’t tell whether that answer was hallucinated, so I’m hoping for some clarification.

For concreteness, I was doing ordinal threshold prediction with 6 classes. The last layer of my model was a linear layer, followed by softplus, followed by torch.cumsum, with each row divided by its sum to get values between 0 and 1. Essentially I was trying to output 5 monotonically increasing threshold probabilities. The LLMs suggested that I was constraining the optimization space and recommended moving the softplus, cumsum, and division into the loss function.
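Roughly, the head looks something like this (the hidden size and names here are just placeholders for illustration, not my exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalHead(nn.Module):
    # Sketch of the setup described above; hidden_dim is a placeholder.
    def __init__(self, hidden_dim: int = 16, num_classes: int = 6):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes - 1)  # 6 classes -> 5 thresholds

    def forward(self, x):
        raw = self.linear(x)                        # unconstrained logits, shape (batch, 5)
        pos = F.softplus(raw)                       # strictly positive increments
        cum = torch.cumsum(pos, dim=-1)             # monotonically increasing
        return cum / pos.sum(dim=-1, keepdim=True)  # each row scaled into (0, 1]
```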

From my understanding this shouldn’t matter, because the whole thing is differentiable regardless of whether it sits in the model’s last layer or in the loss function. Is this accurate?

Is there any merit to what the LLM is saying? Maybe it matters from a weight-initialization point of view?

I’m not too familiar with ordinal threshold regression, but from what I understand the functions you apply at the end (softplus, cumsum, and division) behave like a custom activation. Moving them into the loss function would, as far as I know, not make a difference, since the gradients would be unaffected. However, for the sake of usability I would recommend keeping your current setup: otherwise your model has to be evaluated with an extra module just to map the output into the preferred range [0, 1]. Also, if you want to export the model as .pt or .pth for other use cases, it’s best to have the correct last-layer activation inside the model.
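Here’s a quick sanity check of the gradient claim; the dimensions and the MSE loss are toy placeholders rather than your actual ordinal loss:

```python
import torch
import torch.nn.functional as F

def ordinal_transform(raw):
    # softplus -> cumsum -> divide by the row sum, as in your head
    pos = F.softplus(raw)
    cum = torch.cumsum(pos, dim=-1)
    return cum / pos.sum(dim=-1, keepdim=True)

torch.manual_seed(0)
x = torch.randn(8, 16)        # toy batch of features
target = torch.rand(8, 5)     # toy targets in [0, 1]
w = torch.randn(16, 5)        # stand-in for the linear layer's weights

# Variant A: transform lives in the model's forward pass
wa = w.clone().requires_grad_(True)
F.mse_loss(ordinal_transform(x @ wa), target).backward()

# Variant B: model emits raw logits, transform is applied inside the loss fn
def loss_fn(raw_logits, target):
    return F.mse_loss(ordinal_transform(raw_logits), target)

wb = w.clone().requires_grad_(True)
loss_fn(x @ wb, target).backward()

print(torch.allclose(wa.grad, wb.grad))  # True: same graph, same gradients
```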
Hope this made sense
