Is it good/neutral/bad practice to specify loss and optimizer inside the model class?

I don’t know if this question is even worth asking in this forum, but I was wondering whether it is good, neutral, or bad practice to specify the loss as well as the optimizer inside the model class, for example:

class Net(nn.Module):
    def __init__(self, in_features, n_classes, lr):
        super().__init__()
        self.fc_1 = nn.Linear(in_features, 256)
        self.fc_2 = nn.Linear(256, n_classes)

        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        return self.fc_2(F.relu(self.fc_1(x)))

As of right now I’ve seen only one tutorial that uses this kind of approach. At first it looked kinda sketchy to me, but now I think it is much cleaner than creating the criterion and optimizer outside of the class.
But I have a question: does this approach have any implications or side effects that I don’t understand or haven’t encountered, which could harm my model’s performance or something like that?

Hi Arturas!

Including your loss and/or optimizer in your model class won’t break
anything or cause inefficiency. So from a mechanical or technical
perspective, it’s okay.

However, I would say that doing so is bad practice from a stylistic
or design perspective.

The main idea is “separation of concerns.” To my mind the purpose
of a model is to take an input and produce a prediction. At inference
time (in contrast to training), you won’t be using a loss function nor
optimizer, so they don’t really belong in the model class. In a similar
vein, when performing evaluation, you might choose to use a different
“loss” function (perhaps a non-differentiable accuracy function) than
the one you use for training. Or you might use more than one accuracy
metric.
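To illustrate that point, here is a minimal sketch (the `accuracy` helper is illustrative, not a library function): the differentiable loss used for training and a non-differentiable accuracy used for evaluation are naturally separate objects, neither of which needs to live inside the model class.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # differentiable, drives the optimizer

def accuracy(logits, target):
    # non-differentiable evaluation metric
    return (logits.argmax(dim=1) == target).float().mean().item()

logits = torch.tensor([[2.0, 0.1], [0.2, 1.5], [1.0, 0.9]])
target = torch.tensor([0, 1, 1])

loss = criterion(logits, target)  # used for backpropagation
acc = accuracy(logits, target)    # reported during evaluation (here 2/3)
```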

It is also true that an important purpose of a model is to be able to be
trained – that’s why models have Parameters that have grads and
layers that have backward() functions. It seems to me to be a sound
stylistic / design choice to have these be part of the internals of the
model class.

But the details of how to train the model (e.g., what optimization
algorithm) can be separated from the infrastructure that makes a
model able to be trained (e.g., tensors that have a grad property),
and the guideline of “separation of concerns” suggests that when
things can be conceptually and straightforwardly separated, they
should be separated and placed in separate classes.

So “clean design” (whatever that means) suggests that the loss and
optimizer aren’t “naturally” part of the model, so they shouldn’t be
part of the model class.
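A minimal sketch of the separated setup this suggests, using the `Net` from the question with the loss and optimizer moved outside (the dummy data and hyperparameters are just for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The model class contains only the network itself.
class Net(nn.Module):
    def __init__(self, in_features, n_classes):
        super().__init__()
        self.fc_1 = nn.Linear(in_features, 256)
        self.fc_2 = nn.Linear(256, n_classes)

    def forward(self, x):
        return self.fc_2(F.relu(self.fc_1(x)))

model = Net(in_features=10, n_classes=3)
criterion = nn.CrossEntropyLoss()                           # training concern
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)   # training concern

# One training step on dummy data.
x = torch.randn(4, 10)
target = torch.randint(0, 3, (4,))
optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
optimizer.step()
```

With this layout, swapping the optimizer or the loss for a different experiment touches only the training script, never the model class.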


K. Frank


This is where I had doubts, because I thought that they are not part of the model, but at the same time the example I gave made me feel that it is OK to do it like that. Now that I’ve heard another person’s opinion, I have to agree with you: the loss and optimizer are not part of the model, and they shouldn’t be specified inside it.
But still, it would be interesting to see what more people have to say.

I agree with @KFrank and also think that the model, criterion, and optimizer are separate units, which can be independently used and changed.
E.g. reusing your “model” to fine-tune it in another scenario would also recreate your optimizer, which would not be needed. Restoring a checkpoint would also need an additional check to see where the optimizer is located and used. In case you are using utilities such as mixed-precision training via torch.cuda.amp, you would also need to manipulate the forward method of your model to add e.g. the GradScaler, and would need to add the “amp logic” to the model itself, which I would also like to avoid.
