Is it bad practice to couple model and optimizer?

Hi

In my case, I have two separate networks, each has its own optimizer.
I noticed the optimizer is decoupled from the model (nn.Module), and was wondering why that is.

My use case is going to be (see the sketch after this list):

  1. predict
  2. calculate loss
  3. backward
  4. optimizer [zero grad and] step
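
For reference, a minimal sketch of that workflow in plain PyTorch (the model, loss, and optimizer here are just placeholders):

```python
import torch
import torch.nn as nn

# placeholder model, loss, and optimizer
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 10)               # dummy batch
target = torch.randn(32, 1)

prediction = model(x)                 # 1. predict
loss = criterion(prediction, target)  # 2. calculate loss
optimizer.zero_grad()                 # 4a. zero grad
loss.backward()                       # 3. backward
optimizer.step()                      # 4b. step
```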

I was wondering why not hide the optimizer from the network’s user, and just give an API that looks like

  1. predict
  2. learn(prediction, expected value)

The network should know how to calculate its own loss, and how to learn using its own internal state (optimizer and so on).

Is my second approach bad practice?
If not, why is it not common practice?

The way I viewed it was:
A network defines the following API (a rough sketch follows the list):

  1. init(lambda_fn_for_loss, learning rate, optimizer)
  2. forward() (outputs the value predicted by the net)
  3. loss(expected) (private; outputs the loss tensor, calculated by the lambda on the expected value)
  4. learn(expected) (uses the lambda given at init to calculate loss(expected), then calls loss.backward() and optimizer.step())
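
Something like this minimal sketch is what I mean (the class and method names are just illustrative, not an existing API):

```python
import torch
import torch.nn as nn

class SelfLearningNet(nn.Module):
    """Illustrative wrapper: the network owns its loss function and optimizer."""

    def __init__(self, loss_fn, lr, optimizer_cls=torch.optim.SGD):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
        self.loss_fn = loss_fn                               # lambda_fn_for_loss
        self.optimizer = optimizer_cls(self.parameters(), lr=lr)
        self._last_prediction = None

    def forward(self, x):
        # outputs the value predicted by the net
        self._last_prediction = self.layers(x)
        return self._last_prediction

    def _loss(self, expected):
        # private: loss tensor computed by the lambda on the expected value
        return self.loss_fn(self._last_prediction, expected)

    def learn(self, expected):
        # compute loss, backward, and step using the internal optimizer
        loss = self._loss(expected)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()


# usage
net = SelfLearningNet(nn.MSELoss(), lr=0.01)
pred = net(torch.randn(8, 10))
net.learn(torch.randn(8, 1))
```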

The workflow you’ve described would, in my opinion, be suitable for a high-level API like Ignite.
The reason this isn’t the standard in plain PyTorch is that you have more flexibility when using all modules separately.
E.g. if you would like to create an artificially larger batch size because you are running out of memory, you can just execute multiple backward calls, accumulating the gradients, and finally call optimizer.step().
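
A sketch of that gradient-accumulation pattern (the model, loader, and accum_steps value below are just placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

accum_steps = 4  # effective batch size = 4 * 8 = 32

optimizer.zero_grad()
for i, (x, target) in enumerate(loader):
    loss = criterion(model(x), target) / accum_steps  # scale so accumulated grads average out
    loss.backward()                                   # gradients accumulate in param.grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                              # update only every accum_steps batches
        optimizer.zero_grad()
```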

This is the API I adopted for my current project, and it works fine.

Though the primary motivator was that I didn’t want to assume the models were plain PyTorch; I wanted to let models be black boxes that could use any sort of learning, optimizer, or extra operations. There is then a Trainer which handles feeding data into anything implementing the interface, no matter what kind of model it is.

There have been times I have slightly regretted it, as it makes things like using a different learning rate for different datasets somewhat more annoying. Also, extra complexity was quickly needed on the interface: letting the model know when training started, whether it should be in eval or train mode, methods to move in and out of CUDA, etc.
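
To give a rough idea only (hypothetical names, not from any library, and not exactly what I wrote), the shape of the interface and Trainer ended up something like:

```python
from abc import ABC, abstractmethod

class LearnerInterface(ABC):
    """Hypothetical black-box model interface consumed by a Trainer."""

    @abstractmethod
    def predict(self, x):
        ...

    @abstractmethod
    def learn(self, prediction, expected):
        """Update internal state (optimizer, etc.) from one batch."""
        ...

    # the "extra complexity" mentioned above
    @abstractmethod
    def train_mode(self):
        ...

    @abstractmethod
    def eval_mode(self):
        ...

    @abstractmethod
    def to_device(self, device):
        ...


class Trainer:
    """Feeds data into anything implementing LearnerInterface."""

    def __init__(self, learner, loader):
        self.learner = learner
        self.loader = loader

    def fit(self, epochs):
        self.learner.train_mode()
        for _ in range(epochs):
            for x, expected in self.loader:
                prediction = self.learner.predict(x)
                self.learner.learn(prediction, expected)
```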

So whether this extra complexity and extra wrapper is worth it really depends on your use case and on things like how many models you plan to have. In base PyTorch it does make sense that they are not coupled, as coupling them is a higher-level and fairly opinionated choice. If you are just trying to train one model for one task, you don’t really need all that abstraction.

Like others said, your proposed approach sounds fine, but I think it would be better suited to external PyTorch wrapper libraries, as coupling these would make research much harder if you lose control over them. It would also be a bit limiting in many practical scenarios, e.g. if you have multiple networks and want to optimize different parts of a network (or module) at separate times.
