Simple categorical cross entropy model not learning

Hey, I have some training data in the format:
X = [[0.1, 0.2], [0.1, 0.3], …] - (x,y) input data
y = [[1,0,0], [0,1,0], … ] - one-hot encoded output

I have defined the model and training process as follows:

import numpy as np
import torch

# y is assumed to already be a torch tensor, so that y * out works in the loss below
w0 = torch.tensor(np.random.randn(2, 64), requires_grad=True)
w1 = torch.tensor(np.random.randn(64, 3), requires_grad=True)

optim = torch.optim.Adam([w0, w1])

for epoch in range(1000):
        out = torch.tensor(X, requires_grad=True).matmul(w0).relu().matmul(w1)
        out = torch.nn.functional.log_softmax(out, dim=1)

        y_pred = out.detach().numpy().argmax(axis=1)
        acc = np.mean(y_pred == y_true) # y_true is the categorical representation

        # Categorical cross-entropy loss
        loss = -(y * out).sum(dim=1).mean()

        optim.zero_grad()
        loss.backward()
        optim.step()

        print(f"loss: {loss.data}, acc: {acc}")

The accuracy is stuck at 40% when it should be around 90%.

Hi Kevin!

The short story is that I don’t see anything wrong with what you are doing.

Some further comments, below:

Note, you don’t need or want X to have requires_grad=True. You’re
not optimizing the values of X (and you haven’t added X to optim).
However, I don’t think this actually hurts anything, so I don’t think it’s
a problem. Also, if X is already a pytorch tensor, you don’t need to
wrap it in a newly-constructed tensor (not even for the purpose of turning
on requires_grad=True).
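For example (a minimal sketch, assuming X stays fixed during training),
you can convert X to a tensor once, before the loop:

X_t = torch.tensor(X)   # no requires_grad; dtype matches w0 and w1 if X is a float64 numpy array

for epoch in range(1000):
    out = X_t.matmul(w0).relu().matmul(w1)
    ...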

You drop down into numpy when you compute y_pred and acc. This only
affects acc, which is not part of the loss you optimize, so it is not
an issue. However, I find it good practice to make a point of
performing such calculations using pytorch tensor operations, rather
than numpy, wherever possible. If this were part of loss, the numpy
calculations would “break the computation graph,” and your training
would fail.
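For example (a minimal sketch, where y_true_t stands in for a tensor
version of your y_true labels), you can compute the accuracy entirely
with tensor operations:

with torch.no_grad():   # acc is bookkeeping, not part of the graph
    y_pred = out.argmax(dim=1)                        # predicted class indices
    acc = (y_pred == y_true_t).float().mean().item()  # fraction correct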

Your loss computation (log_softmax followed by the mean negative
log-likelihood) looks correct (although I haven’t tried your code).

Is there anything perniciously non-linear about the relationship between
X and y? It can be hard to fit some functions with a one-hidden-layer
network.

Some suggestions to try: Play around with the learning rate – lower,
higher, or maybe use a learning-rate schedule (which you can easily
do by hand, if you choose).
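For example (a sketch; the initial learning rate of 1e-2 and the
halve-every-250-epochs step schedule are placeholders, not
recommendations), a by-hand schedule just modifies optim.param_groups
inside the training loop:

optim = torch.optim.Adam([w0, w1], lr=1e-2)

for epoch in range(1000):
    if epoch > 0 and epoch % 250 == 0:
        for group in optim.param_groups:
            group["lr"] *= 0.5   # decay the learning rate in steps
    ...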

I also like to try plain-vanilla SGD (with and without momentum). Even
though things like Adam often train faster, they can be less robust.
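For example (the learning rates here are just starting points to
experiment with, not recommendations):

optim = torch.optim.SGD([w0, w1], lr=0.1)   # plain-vanilla SGD
# or, with momentum:
optim = torch.optim.SGD([w0, w1], lr=0.1, momentum=0.9)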

It might be fun to re-implement your training using a Model with two
Linear layers (without biases, to match what your current code is doing).
Convert your one-hot targets to integer categorical labels using argmax()
(you can do this on-the-fly, if you want), and use CrossEntropyLoss. This
should be the same as what you are already doing, but if it works, then
it means you have a bug somewhere that we haven’t caught, and it gives
you a cross-check that might help you track down the bug.
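Here is a sketch of that cross-check (X_t is assumed to be a float32
tensor of your inputs, and y_t a tensor of your one-hot targets):

model = torch.nn.Sequential(
    torch.nn.Linear(2, 64, bias=False),   # plays the role of w0
    torch.nn.ReLU(),
    torch.nn.Linear(64, 3, bias=False),   # plays the role of w1
)
criterion = torch.nn.CrossEntropyLoss()   # applies log_softmax internally
optim = torch.optim.Adam(model.parameters())

for epoch in range(1000):
    out = model(X_t)                          # raw logits, no log_softmax
    loss = criterion(out, y_t.argmax(dim=1))  # one-hot -> integer labels
    optim.zero_grad()
    loss.backward()
    optim.step()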

If you could tell us where your training data, X and y, come from or how
they are generated, we may have some thoughts on whether your training
task is unusually difficult for some reason.

Good luck.

K. Frank

Thank you for the extensive answer. My problem was the missing bias units. Rewriting the forward pass as follows made the model learn the non-linear patterns:

...
b0 = torch.tensor(np.random.randn(64), requires_grad=True)   # biases need gradients
b1 = torch.tensor(np.random.randn(3), requires_grad=True)
# (b0 and b1 also need to be passed to the optimizer, e.g.
# optim = torch.optim.Adam([w0, w1, b0, b1]))
...
out = torch.tensor(X, requires_grad=True).matmul(w0).add(b0).relu().matmul(w1).add(b1)
out = torch.nn.functional.log_softmax(out, dim=1)