I tried out your code and reproduced your non-convergence. I then upped the hidden dimension to 20, and I got convergence 100% of the time. Can you confirm that you see the same thing before we try to figure out why a bigger one-layer hidden dimension converges? (I think it has to do with saddle points, but first I want to confirm you also get convergence with a bigger net.)
Also, you don’t need to try out so many losses and activations. For this problem, BCELoss() and ReLU will work just fine.
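For reference, here's a minimal sketch of the kind of setup I mean. The XOR-style synthetic data, input dimension, and Adam settings are just stand-ins for your own; the point is the single hidden layer widened to 20 units, ReLU, and BCELoss():

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data: 2-D inputs with a non-linearly-separable (XOR-style) target.
# Swap in your own X / y here.
X = torch.randn(512, 2)
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).float().unsqueeze(1)

# One hidden layer, widened to 20 units, ReLU activation,
# sigmoid output so BCELoss gets probabilities in (0, 1).
model = nn.Sequential(
    nn.Linear(2, 20),
    nn.ReLU(),
    nn.Linear(20, 1),
    nn.Sigmoid(),
)

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Plain full-batch training loop.
for epoch in range(2000):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    acc = ((model(X) > 0.5).float() == y).float().mean()
print(f"final loss {loss.item():.4f}, train accuracy {acc.item():.3f}")
```

If you shrink that hidden layer back down (say to 2 or 3 units) you should be able to reproduce the runs that get stuck.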