# Simple categorical cross entropy model not learning

Hey, I have some training data in the following format:
X = [[0.1, 0.2], [0.1, 0.3], …] - (x,y) input data
y = [[1,0,0], [0,1,0], … ] - one-hot encoded output

I have defined the model and training process as the following:

```
import numpy as np
import torch

# X: (N, 2) numpy array of inputs, y: (N, 3) numpy array of one-hot targets,
# y_true: numpy array of integer class labels
X = torch.tensor(X, requires_grad=True)
y = torch.tensor(y)

w0 = torch.tensor(np.random.randn(2, 64), requires_grad=True)
w1 = torch.tensor(np.random.randn(64, 3), requires_grad=True)  # output-layer weights (shape assumed; not shown in the snippet)

optim = torch.optim.Adam([w0, w1])  # optimizer (choice not shown in the snippet; Adam assumed)

for epoch in range(1000):
    optim.zero_grad()

    # forward pass: one hidden layer of 64 units, no biases (ReLU assumed)
    out = torch.relu(X @ w0) @ w1
    out = torch.nn.functional.log_softmax(out, dim=1)

    y_pred = out.detach().numpy().argmax(axis=1)
    acc = np.mean(y_pred == y_true)  # y_true is the categorical (integer-label) representation

    # Categorical cross-entropy loss from log-probabilities and one-hot targets
    loss = -(y * out).sum(dim=1).mean()

    loss.backward()
    optim.step()

    print(f"loss: {loss.data}, acc: {acc}")
```

The accuracy is stuck at 40% when it should be around 90%.

Hi Kevin!

The short story is that I don’t see anything wrong with what you are doing.

Note that you don’t need or want `X` to have `requires_grad=True`. You’re
not optimizing the values of `X` (and you haven’t added `X` to `optim`).
However, I don’t think this actually hurts anything, so I don’t think it’s
a problem. Also, if `X` is already a PyTorch tensor, you don’t need to
wrap it in a newly-constructed tensor (not even for the purpose of turning
on `requires_grad=True`).
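
For example, if `X` and `y` start out as numpy arrays (call them `X_np` and `y_np` here, just for illustration), plain tensors are all you need:

```
X = torch.as_tensor(X_np, dtype=torch.float64)  # input data; no requires_grad needed
y = torch.as_tensor(y_np, dtype=torch.float64)  # one-hot targets
```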

The numpy `argmax()` / accuracy computation only involves calculating `acc`,
which is not part of the `loss` you optimize, so this is not an issue.
However, I find it good practice to make a point of performing such
calculations using PyTorch tensor operations, rather than numpy, wherever
possible. If this were part of `loss`, the numpy calculations would “break
the computation graph,” and your training would fail.
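
For instance, the accuracy can be computed entirely with tensor operations (a sketch, assuming `y_true` is, or has been converted to, a tensor of integer class labels):

```
with torch.no_grad():
    y_pred = out.argmax(dim=1)                      # predicted class per sample
    acc = (y_pred == y_true).float().mean().item()  # fraction of correct predictions
```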

Your loss expression, `-(y * out).sum(dim=1).mean()`, looks correct (although I haven’t tried your code).

Is there anything perniciously non-linear about the relationship between
`X` and `y`? It can be hard to fit some functions with a one-hidden-layer
network.

Some suggestions to try: Play around with the learning rate – lower,
higher, or maybe use a learning-rate schedule (which you can easily
do by hand, if you choose).
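
Doing the schedule by hand can be as simple as updating the optimizer’s learning rate inside the training loop (a minimal sketch; the decay factor and interval are arbitrary):

```
# halve the learning rate every 200 epochs (numbers chosen only for illustration)
if epoch > 0 and epoch % 200 == 0:
    for group in optim.param_groups:
        group["lr"] *= 0.5
```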

I also like to try plain-vanilla `SGD` (with and without momentum). Even
though things like `Adam` often train faster, they can be less robust.
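
Swapping in plain `SGD` is a one-line change (just a sketch; the learning-rate and momentum values are placeholders):

```
optim = torch.optim.SGD([w0, w1], lr=0.1)                  # plain-vanilla SGD
# optim = torch.optim.SGD([w0, w1], lr=0.1, momentum=0.9)  # SGD with momentum
```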

It might be fun to re-implement your training using a `Module` with two
`Linear` layers (without biases, to match what your current code is doing).
Convert your one-hot targets to integer categorical labels using `argmax()`
(you can do this on-the-fly, if you want), and use `CrossEntropyLoss`. This
should be the same as what you are already doing, but if it works, then it
means you have a bug somewhere that we haven’t caught, and it gives you a
known-good version to compare against.
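
A sketch of what that re-implementation might look like (the layer sizes match your current code, the hidden nonlinearity is assumed to be ReLU, and `X` and `y` are assumed to be tensors):

```
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        # bias=False to match the original hand-rolled weight matrices
        self.fc1 = nn.Linear(2, 64, bias=False)
        self.fc2 = nn.Linear(64, 3, bias=False)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TwoLayerNet()
criterion = nn.CrossEntropyLoss()  # applies log_softmax internally
optim = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(1000):
    optim.zero_grad()
    logits = model(X.float())
    target = y.argmax(dim=1)       # one-hot -> integer class labels, on the fly
    loss = criterion(logits, target)
    loss.backward()
    optim.step()
```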

If you could tell us where your training data, `X` and `y`, come from or how
they are generated, we may have some thoughts on whether your training
task is unusually difficult for some reason.

Good luck.

K. Frank


Thank you for the extensive answer. My problem was the missing bias terms. Rewriting the forward pass as follows made the model learn non-linear patterns:

```
...
b0 = torch.tensor(np.random.randn(64), requires_grad=True)  # hidden-layer bias
b1 = torch.tensor(np.random.randn(3), requires_grad=True)   # output-layer bias
...
```
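
For reference, the corresponding forward pass might look like this (a sketch only; the elided parts are filled in assuming a ReLU hidden layer and the shapes of `w0`, `w1`, `b0`, and `b1` above):

```
# forward pass with bias terms added to both layers
hidden = torch.relu(X @ w0 + b0)  # (N, 64)
logits = hidden @ w1 + b1         # (N, 3)
out = torch.nn.functional.log_softmax(logits, dim=1)
```

Note that the new bias tensors also need to be passed to the optimizer’s parameter list so that they are updated during training.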