The short story is that I don’t see anything wrong with what you are doing.
Some further comments, below:
Note, you don’t need or want X to have requires_grad=True. You’re
not optimizing the values of X (and you haven’t added X to optim).
However, I don’t think this actually hurts anything, so I don’t think it’s
a problem. Also, if X is already a pytorch tensor, you don’t need to
wrap it in a newly-constructed tensor (not even for the purpose of turning
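To illustrate the point about X, here is a minimal sketch. The shapes and the weight `W` are made up for the example and are not taken from your code. The input is used directly, with requires_grad left at its default of False, and gradients still reach the trainable weight:

```python
import torch

# Hypothetical stand-ins: X plays the role of your input data,
# W the role of a trainable weight. Shapes are assumptions.
X = torch.randn(8, 4)                      # requires_grad defaults to False
W = torch.randn(4, 3, requires_grad=True)  # the thing you actually optimize

# X is already a tensor, so use it as-is instead of torch.tensor(X).
out = X @ W
out.sum().backward()

# Gradients are populated for W only; X.grad stays None.
print(X.requires_grad, X.grad is None)
print(W.grad.shape)
```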
This only involves calculating acc, which is not part of the loss you
optimize, so this is not an issue. However, I find it good practice to make
a point of performing such calculations using pytorch tensor operations,
rather than numpy, wherever possible. If this were part of the loss, the
numpy calculations would “break the computation graph,” so gradients could
not flow back through them, and your training would fail.
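As a rough sketch of what I mean by staying in pytorch for acc, assuming your network outputs one score per class and your targets are one-hot (the tensors below are made-up examples, not your data):

```python
import torch

# Hypothetical network outputs and one-hot targets (4 samples, 3 classes).
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.2, 0.4],
                       [0.1, 0.2, 3.0]])
y_onehot = torch.tensor([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0],
                         [0.0, 0.0, 1.0]])

with torch.no_grad():  # acc is only a metric, so no gradients are needed
    pred = logits.argmax(dim=1)       # predicted class per sample
    target = y_onehot.argmax(dim=1)   # true class per sample
    acc = (pred == target).float().mean()

print(acc.item())
```

Note that argmax() is not usefully differentiable, so this is fine for a metric but would not work as part of a loss.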
This looks correct (although I haven’t tried your code).
Is there anything perniciously non-linear about the relationship between X and y? It can be hard to fit some functions with a one-hidden-layer network.
Some suggestions to try: Play around with the learning rate – lower,
higher, or maybe use a learning-rate schedule (which you can easily
do by hand, if you choose).
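A by-hand schedule can be as simple as overwriting the learning rate in the optimizer's param_groups each epoch. The sketch below assumes a step decay (halve the rate every 10 epochs). The helper `set_lr`, the placeholder parameters, and the numbers are all made up for illustration:

```python
import torch

# Placeholder parameters; in your code these would be your weights.
params = [torch.zeros(3, requires_grad=True)]
opt = torch.optim.SGD(params, lr=0.1)

def set_lr(optimizer, lr):
    """Hypothetical helper: overwrite the learning rate of every param group."""
    for group in optimizer.param_groups:
        group["lr"] = lr

base_lr, decay, step_every = 0.1, 0.5, 10  # assumed values, tune to taste
lrs = []
for epoch in range(30):
    set_lr(opt, base_lr * decay ** (epoch // step_every))
    lrs.append(opt.param_groups[0]["lr"])
    # ... forward pass, loss.backward(), opt.step(), opt.zero_grad() ...
```

If you would rather not do it by hand, torch.optim.lr_scheduler.StepLR implements the same kind of step decay.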
I also like to try plain-vanilla SGD (with and without momentum). Even
though things like Adam often train faster, they can be less robust.
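Swapping optimizers is a one-line change, so it is cheap to compare them. Here is a small sketch on a toy linear-regression problem. The data, learning rate, and step count are assumptions chosen only to make the example run, not recommendations for your task:

```python
import torch

def train(make_optimizer, steps=200):
    """Fit a toy least-squares problem and return the last recorded loss."""
    torch.manual_seed(0)
    X = torch.randn(64, 4)                               # toy inputs
    true_w = torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
    y = X @ true_w                                       # toy targets
    w = torch.zeros(4, 1, requires_grad=True)
    opt = make_optimizer([w])
    for _ in range(steps):
        opt.zero_grad()
        loss = ((X @ w - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

results = {
    "sgd": train(lambda p: torch.optim.SGD(p, lr=0.05)),
    "sgd+momentum": train(lambda p: torch.optim.SGD(p, lr=0.05, momentum=0.9)),
    "adam": train(lambda p: torch.optim.Adam(p, lr=0.05)),
}
for name, final_loss in results.items():
    print(f"{name}: {final_loss:.3e}")
```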
It might be fun to re-implement your training using a Model with two Linear layers (without biases, to match what your current code is doing).
Convert your one-hot targets to integer categorical labels using argmax()
(you can do this on-the-fly, if you want), and use CrossEntropyLoss. This
should be equivalent to what you are already doing, but if it trains
properly where your current code does not, then you have a bug somewhere
that we haven’t caught, and you have a cross-check that might help you
track it down.
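Here is a rough sketch of that cross-check. I don't know your layer sizes or activation, so the dimensions, the Tanh non-linearity, the random data, and the learning rate below are all assumptions. Substitute whatever your current code uses. Note that CrossEntropyLoss expects raw scores (it applies log-softmax internally), so the model should not end with a softmax:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes and random data standing in for your X and one-hot y.
n, n_in, n_hidden, n_classes = 32, 4, 8, 3
X = torch.randn(n, n_in)
y_onehot = torch.eye(n_classes)[torch.randint(0, n_classes, (n,))]

model = nn.Sequential(
    nn.Linear(n_in, n_hidden, bias=False),
    nn.Tanh(),                                   # assumed activation
    nn.Linear(n_hidden, n_classes, bias=False),  # outputs raw scores (logits)
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

labels = y_onehot.argmax(dim=1)  # one-hot -> integer class labels

losses = []
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), labels)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```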
If you could tell us where your training data, X and y, come from or how
they are generated, we may have some thoughts on whether your training
task is unusually difficult for some reason.