I would recommend scaling down the problem a bit and trying to overfit a small data subset (e.g. just 10 samples) to make sure your training code doesn't have any obvious errors.
If that’s not working out of the box, you could play around with some hyperparameters.
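To illustrate the idea, here is a minimal, framework-agnostic sketch of that sanity check in plain NumPy (the model, data, and hyperparameters are all illustrative, not from your notebooks): train a tiny logistic-regression "network" on 10 samples with an easily learnable target. If the training loop is correct, the loss should drop close to zero and train accuracy should hit 100%.

```python
import numpy as np

# Overfitting sanity check: 10 samples, one linear layer + sigmoid.
# If the training loop is implemented correctly, loss -> ~0 on this subset.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))        # 10 samples, 5 features
y = (X[:, 0] > 0).astype(float)     # a trivially learnable binary target

w = np.zeros(5)
b = 0.0
lr = 0.5

def forward(X):
    # Sigmoid over a single linear layer
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

for step in range(500):
    p = forward(X)
    # Binary cross-entropy loss (small epsilon for numerical safety)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad = p - y                    # dL/dlogits for sigmoid + BCE
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

acc = ((forward(X) > 0.5) == y).mean()
print(f"final loss {loss:.4f}, train accuracy {acc:.0%}")
```

The same pattern applies in any framework: take a fixed batch of ~10 samples, train only on it, and confirm the loss collapses. If it doesn't, the bug is in the training loop (loss, gradients, optimizer step), not in the data or model capacity.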
Here I have all of my notebooks updated and working. What I realized is that when I try a higher LR, my model seems to stop learning (the plots show that situation).
Do you have any resources where I can learn about the following topics?
How many layers and neurons should I use in a certain context?
Which activation function should I use in each layer?
What is the rationale for using a certain activation function on a layer?