# Pytorch: Neural Network for classification - Constrain some weights to be chosen from a finite set

When training a neural network for classification in Pytorch, is it possible to put constraints on the weights in the output layer such that they are chosen from a specific finite feasible set?

For example, let’s say `W` is the weight in the output layer. Is it possible to put constraints on `W` such that the optimal `W` is selected from the set `S = {W_1, W_2, ..., W_n}`, where each `W_i` is a given feasible value for `W`? I.e., I will give the values of `W_1, ..., W_n` to the model.

If this is not possible in Pytorch, are there any other ways to achieve this?

Thanks!

Hi Zhen!

This is not a particularly natural thing to do in the context of neural
networks and gradient-descent optimization (and I am not aware of
anything built in to pytorch that will do this).

A general note outside of the context of neural networks and gradient
descent: The kind of constraints you contemplate lead to a so-called
integer-programming problem – a known “hard” problem.

Back to pytorch: Let’s say your feasible weights are the integers 1
through 10, `S = {1, 2, ..., 10}`. Suppose – based on the learning
rate and the backpropagated gradient – your `opt.step()` wants to
set `W = 1.23`. What do you do? You could round `W` to `1`, but that is
not necessarily satisfactory.

Suppose `W` happens to be `3` at a certain stage of training, but you
and I magically know that it should end up being `5`. But let’s say that
the learning rate is small so that your optimizer is only taking small
steps, say no larger than 0.01. How is `W` supposed to get from `3` to `5`
when it can’t take a step of `2` or even `1`?

The best I can imagine is to add a penalty to your loss function that is
zero when `W = W_i`, and greater than zero otherwise. For example:

```
penalty = ((W - W_1) * (W - W_2) * ... * (W - W_n))**2
loss_with_penalty = regular_loss + alpha * penalty
loss_with_penalty.backward()
opt.step()
```

Here `alpha` is an adjustable parameter that you start out at zero
and then increase as you train. When you make `alpha` very large,
you will force `W` to be close to one of the `W_i`, but while `alpha` is
still zero or small, the training is not prevented from moving from one
`W_i` to another. You will have to experiment in order to determine
how to schedule the increase of `alpha` throughout training.

And at the end, when `W` is close to one of the `W_i`, you can always
round it to the nearest `W_i` if it has to be equal to one of the specified
values.
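In pytorch, a minimal sketch of this penalty might look like the following. (Here I assume a single scalar weight `W` and a made-up feasible set and task loss, just for illustration; `torch.prod` over the squared differences is equivalent to the squared product written above.)

```python
import torch

# the constrained weight (a single scalar here, for illustration)
W = torch.tensor(2.7, requires_grad=True)
# the feasible set S = {W_1, ..., W_n} (example values, assumed)
feasible = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])

opt = torch.optim.SGD([W], lr=0.01)
alpha = 10.0  # penalty strength; in practice, scheduled from zero upward

regular_loss = (W - 4.6) ** 2              # stand-in for the real task loss
penalty = torch.prod((W - feasible) ** 2)  # zero exactly when W equals some W_i
loss_with_penalty = regular_loss + alpha * penalty

opt.zero_grad()
loss_with_penalty.backward()
opt.step()
```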

Good luck.

K. Frank

Thank you so much! The penalty approach makes a lot of sense!

I figure out that what I want on the weights `W` can be stated in the following way:

Suppose the weight or kernel matrix is a 3-by-4 matrix `W`, and its elements are `W_ij`.

I would like that, **for each column j, there is one and only one nonzero `W_ij`, and that `W_ij = 1`**.

What would be a good way to implement this requirement?

One possible solution I can think of is to put the following constraints using the penalty approach you mentioned:

`W_1j + W_2j + W_3j = 1 for all j = 1,2,3,4`

and

`W_ij * (1-W_ij) = 0, for all i, j`

But is it possible to implement these constraints directly? Or is there any better way to require the property (in bold) that I want?

If this is not possible in Pytorch, can TensorFlow do it?

Hi Zhen!

Again, this is an integer-programming problem – not a natural fit for
neural networks and gradient descent.
But why do you want to do this? Suppose you had weights that didn’t
have this constrained form, but gave a better-performing network, e.g.,
more accurate predictions. Why would you not prefer those?

I haven’t thought about it carefully, but these constraints do look like
they are consistent with what you described above, and can be turned
into penalties.
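For what it’s worth, a sketch of turning your two constraints into penalty terms (assuming `W` is a 3-by-4 parameter; the coefficient and task loss are omitted):

```python
import torch

W = torch.randn(3, 4, requires_grad=True)

# first constraint: each column sums to one
col_sum_penalty = ((W.sum(dim=0) - 1.0) ** 2).sum()

# second constraint: each entry is 0 or 1, i.e. W_ij * (1 - W_ij) = 0
# (squared so that the penalty is nonnegative)
binary_penalty = ((W * (1.0 - W)) ** 2).sum()

penalty = col_sum_penalty + binary_penalty
penalty.backward()  # in practice you would add alpha * penalty to the task loss
```

Both terms are zero exactly when `W` has a single `1` in each column and zeros elsewhere.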

I know less about tensorflow, but I’m not aware of anything built into
it that would handle such constraints for you. Tensorflow is also a
neural-network framework that is based on backpropagation and
gradient descent, so it is also not the natural tool for this part of your
use case.

A comment: If your weight matrix, `W`, really is as described above, you
have 81 (`3**4`) choices for `W`. You could freeze `W` to be each one of
those choices in succession, and train the rest of your network using
gradient descent while holding the specific choice of `W` constant. You
will now have 81 trained networks. Use the network, with its specific
`W`, that performs the best. This is, of course, the “exhaustive search”
algorithm for the integer-programming problem, but with only 81 points
in your search space, it is perfectly doable. (But it will scale horribly
should `W` grow in size beyond 3 by 4.)
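Enumerating the 81 candidates could look like this (a sketch; each choice of nonzero row per column gives one candidate matrix):

```python
import itertools
import torch

# each candidate W places a single 1 in each of the four columns;
# choosing the row of that 1 independently per column gives 3**4 = 81 candidates
candidates = []
for rows in itertools.product(range(3), repeat=4):
    W = torch.zeros(3, 4)
    for col, row in enumerate(rows):
        W[row, col] = 1.0
    candidates.append(W)

print(len(candidates))  # 81
```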

Good luck.

K. Frank

Thanks again for the nice comments!

May I ask one more quick question? How do I freeze only a subset of the variables during the training process?

For example, if I have a layer with weight matrix `W` and bias vector `b`, how do I only freeze `b` and still update the `W` during the training process?
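For concreteness, here is my guess at how this might be done (assuming that setting `requires_grad = False` on the bias and excluding it from the optimizer is the right approach):

```python
import torch

layer = torch.nn.Linear(4, 3)      # weight W is 3x4, bias b has 3 elements
layer.bias.requires_grad = False   # freeze b; no gradient is computed for it

# only pass the parameters that still require gradients to the optimizer
opt = torch.optim.SGD(
    [p for p in layer.parameters() if p.requires_grad], lr=0.01
)

x = torch.randn(8, 4)
loss = layer(x).pow(2).mean()
loss.backward()
opt.step()

print(layer.bias.grad)  # None -- the frozen bias accumulated no gradient
```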