Why does this convnet with SGD (batch size 1) fail?

I’d like to understand why this script:

which tries to learn MNIST with a convnet architecture, fails badly (fails to learn) with a batch size of 1.

What kind of failure? Error msg?

Sorry, I meant failing in the learning sense… the model's predictions are equivalent to random guessing.

Of course batch_size=1 won't work well, because it's very difficult to get a smooth gradient estimate for SGD from a single sample per step. Try a larger batch size, such as the default of 64.
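
Concretely, that only means changing the `batch_size` passed to the DataLoader. A minimal sketch, assuming the standard torchvision MNIST loading used in the example (the normalization constants below are the commonly used ones, not taken from your script):

```python
import torch
from torchvision import datasets, transforms

# Standard MNIST preprocessing; the mean/std values are the usual ones
# from the PyTorch MNIST example and are an assumption here.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

train_dataset = datasets.MNIST('../data', train=True, download=True,
                               transform=transform)

# batch_size=64 averages the gradient over 64 samples per step;
# batch_size=1 makes every step a single-sample, very noisy estimate.
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64,
                                           shuffle=True)
```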

I see, but then I cannot explain why batch size 1 seems to work with TensorFlow via Keras: https://raw.githubusercontent.com/keras-team/keras/master/examples/mnist_cnn.py

I have never tested other frameworks, but it looks like the script you linked uses a batch size of 128.

Well, as you can see, the network architecture in Keras is quite different from the PyTorch example. Anyway, lower the learning rate to 0.001 and you should be fine with batch_size=1.
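
If you keep batch_size=1, the only change needed is the `lr` passed to the optimizer. A minimal sketch, assuming plain SGD as in the example; the small model below is just a runnable stand-in for the actual convnet:

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model standing in for the MNIST convnet from the example,
# included only so the optimizer line below is runnable.
model = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(10 * 12 * 12, 10),
)

# Lowering lr from 0.01 to 0.001 keeps the very noisy single-sample
# gradients of batch_size=1 from pushing the weights too far per step.
optimizer = optim.SGD(model.parameters(), lr=0.001)
```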

I was aware of the difference, but I'm not sure which difference would make one work and not the other. I thought the default learning rate of 0.01 would have made it learn at least something; that was also the SGD default that worked for Keras. Thanks for your suggestion.

I haven’t tried it, but maybe the default learning rate will also work in PyTorch with the Adadelta optimizer. That is at least what your Keras reference uses. :wink:
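
If you want to try that in PyTorch, the swap is a one-liner; a sketch, with a placeholder model only so the line runs:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(784, 10)  # placeholder for the actual convnet

# torch.optim.Adadelta (defaults: lr=1.0, rho=0.9) adapts the effective
# step size per parameter, so it tends to be less sensitive to the
# gradient noise of batch_size=1 than plain SGD with a fixed learning rate.
optimizer = optim.Adadelta(model.parameters())
```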

In Keras I also tried vanilla SGD and it seems to work… so I suppose the difference also comes just from the network architecture… I don't know whether it's the number of parameters or the sequence of layers; the types of layers seem to be the same.