PyTorch: Neural Network for classification - Constrain some weights to be chosen from a finite set

When training a neural network for classification in PyTorch, is it possible to put constraints on the weights in the output layer such that they are chosen from a specific finite feasible set?

For example, let’s say W is a weight in the output layer. Is it possible to constrain W such that the optimal W is selected from the set S = {W_1, W_2, …, W_n}, where each W_i is a given feasible value for W? That is, I will supply the values W_1, …, W_n to the model.

If this is not possible in PyTorch, are there other ways to achieve this?

Thanks!

Hi Zhen!

This is not a particularly natural thing to do in the context of neural
networks and gradient-descent optimization (and I am not aware of
anything built into pytorch that will do this).

A general note outside of the context of neural networks and gradient
descent: The kind of constraints you contemplate lead to a so-called
integer-programming problem – a known “hard” problem.

Back to pytorch: Let’s say your feasible weights are the integers 1
through 10, S = {1, 2, ..., 10}. Suppose – based on the learning
rate and the backpropagated gradient – your opt.step() wants to
set W = 1.23. What do you do? You could round W to 1, but that is
not necessarily satisfactory.

Suppose W happens to be 3 at a certain stage of training, but you
and I magically know that it should end up being 5. But let’s say that
the learning rate is small so that your optimizer is only taking small
steps, say no larger than 0.01. How is W supposed to get from 3 to 5
when it can’t take a step of 2, or even 1?

The best I can imagine is to add a penalty to your loss function that is
zero when W = W_i, and greater than zero otherwise. For example:

# product over the feasible set: (W - W_1) * (W - W_2) * ... * (W - W_n)
penalty = ((W - feasible_values).prod())**2
loss_with_penalty = regular_loss + alpha * penalty
opt.zero_grad()
loss_with_penalty.backward()
opt.step()

Here alpha is an adjustable parameter that you start out at zero
and then increase as you train. When you make alpha very large,
you will force W to be close to one of the W_i, but while alpha is
still zero or small, the training is not prevented from moving from
one W_i to another. But you will have to experiment in order to
determine how to schedule alpha’s increase throughout training.
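For concreteness, here is a minimal, self-contained sketch of such a schedule on a single scalar weight. The feasible set, target value, learning rate, alpha_max, and epoch count are all made-up illustration values, not a recommendation:

```python
import torch

# hypothetical scalar weight and feasible set, for illustration only
feasible_values = torch.tensor([1.0, 2.0, 3.0])
W = torch.tensor(1.7, requires_grad=True)
opt = torch.optim.SGD([W], lr=0.01)

alpha_max, n_epochs = 10.0, 1000
for epoch in range(n_epochs):
    alpha = alpha_max * epoch / (n_epochs - 1)   # ramp alpha linearly up from 0
    regular_loss = (W - 2.4) ** 2                # stand-in for the real loss
    penalty = ((W - feasible_values).prod()) ** 2
    loss_with_penalty = regular_loss + alpha * penalty
    opt.zero_grad()
    loss_with_penalty.backward()
    opt.step()
```

As alpha grows, the penalty pulls W toward the nearest feasible value while the regular loss still has a say, so in this toy run W ends up close to (but not exactly at) 2.0.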

And at the end, when W is close to one of the W_i, you can always
round it to the nearest W_i if it has to be equal to one of the specified
values.
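That final rounding step is just a nearest-neighbor lookup. As a sketch (feasible_values and the trained value of W below are hypothetical):

```python
import torch

feasible_values = torch.tensor([1.0, 2.0, 3.0])  # hypothetical W_1, ..., W_n
W = torch.tensor(2.1)                            # trained value, close to a W_i

# snap W to the nearest feasible value
W_rounded = feasible_values[(feasible_values - W).abs().argmin()]
```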

Good luck.

K. Frank

Thank you so much! The penalty approach makes a lot of sense!

I figured out that what I want for the weights W can be stated in the following way:

Suppose the weight (kernel) matrix is a 3 by 4 matrix W with elements W_ij.

I would like that, for each column j, there is one and only one nonzero W_ij, and that this W_ij = 1.

What would be a good way to implement this requirement?

One possible solution I can think of is to put the following constraints using the penalty approach you mentioned:

W_1j + W_2j + W_3j = 1 for all j = 1,2,3,4

and

W_ij * (1-W_ij) = 0, for all i, j

But is it possible to implement these constraints directly? Or is there a better way to require the property above?

If this is not possible in PyTorch, can TensorFlow do it?

Hi Zhen!

Again, this is an integer-programming problem – not a natural fit for
gradient descent.

But why do you want to do this? Suppose you had weights that didn’t
have this constrained form, but gave a better-performing network, e.g.,
more accurate predictions. Why would you not prefer those?

I haven’t thought about it carefully, but these constraints do look like
they are consistent with what you described above, and can be turned
into penalties.
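As a sketch of that translation (the matrix below is a hand-built example that happens to satisfy both constraints, so both penalties come out to zero):

```python
import torch

# hypothetical 3-by-4 weight matrix with exactly one 1 in each column
W = torch.zeros(3, 4)
W[0, :] = 1.0

# penalty for "each column sums to 1":  sum_j (sum_i W_ij - 1)^2
col_sum_penalty = ((W.sum(dim=0) - 1.0) ** 2).sum()

# penalty for "each entry is 0 or 1":  sum_ij (W_ij * (1 - W_ij))^2
binary_penalty = ((W * (1.0 - W)) ** 2).sum()

penalty = col_sum_penalty + binary_penalty   # zero only when both constraints hold
```

Any W violating either constraint makes the corresponding term strictly positive, so these can be added to the loss with an alpha schedule as above.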

I know less about tensorflow, but I’m not aware of anything built into
it that would handle such constraints for you. Tensorflow is also a
neural-network framework that is based on backpropagation and
gradient descent, so it is also not the natural tool for this part of your
use case.

A comment: If your weight matrix, W, really is as described above, you
have 81 (3**4) choices for W. You could freeze W to be each one of
those choices in succession, and train the rest of your network using
gradient descent while holding the specific choice of W constant. You
will now have 81 trained networks. Use the network, with its specific
W, that performs the best. This is, of course, the “exhaustive search”
algorithm for the integer-programming problem, but with only 81 points
in your search space, it is perfectly doable. (But it will scale horribly
should W grow in size beyond 3 by 4.)
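The enumeration itself is straightforward; as a sketch, itertools.product picks, for each of the 4 columns, which of the 3 rows holds the single 1:

```python
import itertools
import torch

rows, cols = 3, 4
candidates = []
for positions in itertools.product(range(rows), repeat=cols):
    W = torch.zeros(rows, cols)
    for j, i in enumerate(positions):
        W[i, j] = 1.0          # row i holds the single 1 in column j
    candidates.append(W)

# len(candidates) == 81, i.e. the 3**4 choices for W
```

Each candidate would then be frozen in turn while the rest of the network trains.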

Good luck.

K. Frank

Thanks again for the nice comments!

May I ask one more quick question? How do I freeze only a subset of the variables during the training process?

For example, if I have a layer with weight matrix W and bias vector b, how do I freeze only b while still updating W during training?
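For reference, a common way to do this in PyTorch is to set requires_grad = False on the parameter you want frozen, and pass only the still-trainable parameters to the optimizer. A minimal sketch (the layer shape and learning rate are just for illustration):

```python
import torch

layer = torch.nn.Linear(4, 3)        # weight W is 3-by-4, bias b has 3 entries
layer.bias.requires_grad = False     # freeze b

# give the optimizer only the parameters that should still train
opt = torch.optim.SGD(
    [p for p in layer.parameters() if p.requires_grad], lr=0.1
)

w_before = layer.weight.clone()
b_before = layer.bias.clone()

loss = layer(torch.randn(8, 4)).pow(2).sum()
opt.zero_grad()
loss.backward()
opt.step()
# layer.bias is unchanged; layer.weight has been updated
```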