Hi Benja!
I assume you mean that you want the outputs of your network to be positive,
rather than the weight and bias parameters of your network to be positive.
(If you want your parameters to be positive, similar comments will apply, but
the details will be different.)
One simple approach would be to add a penalty to your loss function (where
`pred` is the output of your network):

```python
loss = loss_fn (pred, target)
penalty = (torch.nn.functional.relu (-pred)**2).sum()   # for example
loss_with_penalty = loss + alpha * penalty
```
Note that `penalty` will not force the elements of `pred` to be positive, but
it will encourage them to be positive. However, by increasing the value of the
penalty weight, `alpha`, you can push `pred` harder and harder not to be
negative.
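As a minimal runnable sketch of this penalty approach (the toy network, data,
optimizer, and `alpha` value below are illustrative assumptions, not something
from your post):

```python
import torch

torch.manual_seed(0)

# Toy setup: a small network whose outputs we *encourage* to be non-negative.
# All sizes, data, and hyperparameters here are illustrative choices.
model = torch.nn.Linear(4, 3)
x = torch.randn(32, 4)
target = torch.rand(32, 3)          # non-negative targets
loss_fn = torch.nn.MSELoss()
alpha = 10.0                        # penalty weight -- larger pushes harder

opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(1000):
    opt.zero_grad()
    pred = model(x)
    loss = loss_fn(pred, target)
    # penalize the negative elements of pred (zero contribution where pred >= 0)
    penalty = (torch.nn.functional.relu(-pred) ** 2).sum()
    loss_with_penalty = loss + alpha * penalty
    loss_with_penalty.backward()
    opt.step()

pred = model(x)
print(pred.min().item())   # close to, but not necessarily exactly, >= 0
```

After training, the most-negative element of `pred` should be close to zero,
but the penalty alone does not guarantee exact non-negativity.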
To perform an “official” constrained optimization that requires `pred` to be
non-negative (an inequality constraint), you can add slack variables for the
elements of `pred`:

```python
pred_with_slack = pred - slack**2
```
and constrain (with an equality constraint) `pred_with_slack` to be zero
element by element (where `slack` has the same shape as `pred`). Because
`slack**2` can be positive, `pred` is free to become positive, but because
`slack**2` can never be negative, the constraint on `pred_with_slack`
prevents `pred` from becoming negative.
You can use Lagrange multipliers to perform such an optimization where
`pred_with_slack` is constrained to be zero. However, because the optimum
of the Lagrange-multiplier optimization occurs at a saddle point (rather than
at a minimum), you can’t use gradient descent to perform the optimization
without tweaking it so that you use, in effect, gradient ascent on the
Lagrange multiplier.
This is explained by @tom, here, and a sound approach to implementing such a
mixed gradient-descent / gradient-ascent optimization using pytorch (with its
gradient-descent-based optimizers) is given by @t_naumenko, here.
Note that using the Lagrange-multiplier technique will not force `pred` to be
non-negative during the training process; only after the optimization has
converged to its (saddle-point) optimum will `pred` be non-negative.
(You can think of the Lagrange multiplier as being like the `alpha`
penalty weight in the `loss_with_penalty` optimization approach, except
that the optimization process tunes the penalty weight automatically.)
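Putting the slack-variable and multiplier pieces together, a rough sketch of
such a descent / ascent loop might look like the following. This is a
simplified illustration, not @t_naumenko’s implementation; the toy model,
data, learning rates, and iteration count are all assumptions:

```python
import torch

torch.manual_seed(0)

# Toy problem -- all sizes, data, and learning rates are illustrative.
model = torch.nn.Linear(4, 3)
x = torch.randn(16, 4)
target = torch.rand(16, 3)
loss_fn = torch.nn.MSELoss()

slack = torch.zeros(16, 3, requires_grad=True)   # same shape as pred
lam = torch.zeros(16, 3)                         # one multiplier per element

# gradient descent on the "primal" variables (model parameters and slack)
primal_opt = torch.optim.Adam(list(model.parameters()) + [slack], lr=0.01)
lr_dual = 0.01

for _ in range(2000):
    primal_opt.zero_grad()
    pred = model(x)
    constraint = pred - slack**2          # pred_with_slack; want this == 0
    lagrangian = loss_fn(pred, target) + (lam * constraint).sum()
    lagrangian.backward()
    primal_opt.step()                     # descent on model and slack
    with torch.no_grad():
        # gradient ascent on the multiplier; the gradient of the
        # Lagrangian with respect to lam is simply the constraint value
        lam += lr_dual * constraint.detach()
```

Plain descent / ascent like this can oscillate before settling near the
saddle point; in practice, augmented-Lagrangian variants (which add a
quadratic penalty on the constraint) are often used to stabilize such loops.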
As an aside, this naturally raises the question: why not simply make the
outputs positive by construction? Depending on your use case, it may make
perfect sense to pass the output of your network through something like
`relu()` or, perhaps better, `exp()` to ensure positive values.
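For example (a minimal sketch; the layer sizes and input are arbitrary
illustrative choices):

```python
import torch

torch.manual_seed(0)

# Make outputs non-negative (or strictly positive) by construction.
# The layer sizes and input here are arbitrary illustrative choices.
net = torch.nn.Linear(4, 3)
x = torch.randn(8, 4)

pred_relu = torch.nn.functional.relu(net(x))   # non-negative; can be exactly zero
pred_exp = torch.exp(net(x))                   # strictly positive, never zero
```

Note the trade-off: `relu()` can produce exact zeros (and has zero gradient
there), while `exp()` guarantees strictly positive values.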
Good luck!
K. Frank