Thank you for the LSTM threads; I'm learning so much from them!
(This one and the more recent one, but I felt this fit better here.)
A few observations that may or may not be interesting regarding the PyTorch example (in particular with full-batch training):
- At least with single precision (on CUDA), a lower loss does not necessarily mean nicer-looking predictions: at a loss around ~1e-4 I see both good and bad fits.
- I would suspect single precision is part of the issue, given that the example is done in double precision.
- After switching from LBFGS to Adam, training seems to converge similarly.
- I have not been entirely successful running double precision on CUDA. (A sketch of the variations I tried is below.)
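For reference, here is roughly the kind of variation I have been testing. This is only a minimal sketch, not the example verbatim: the `Seq` model, the toy sine data, the learning rate, and the step count are stand-ins of my own; the relevant bits are the `dtype`/`device` settings and the LBFGS-to-Adam swap (PyTorch's `optim.LBFGS` requires a closure, `Adam` does not).

```python
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float32  # the example uses double precision; switch to torch.float64 to compare

class Seq(nn.Module):
    """Stand-in for the example's LSTMCell-based sequence model."""
    def __init__(self, hidden=51):
        super().__init__()
        self.hidden = hidden
        self.lstm = nn.LSTMCell(1, hidden)
        self.linear = nn.Linear(hidden, 1)

    def forward(self, x):  # x: (batch, time)
        h = torch.zeros(x.size(0), self.hidden, device=x.device, dtype=x.dtype)
        c = torch.zeros_like(h)
        outs = []
        for step in x.split(1, dim=1):  # step: (batch, 1)
            h, c = self.lstm(step, (h, c))
            outs.append(self.linear(h))
        return torch.cat(outs, dim=1)

# Toy sine data standing in for the example's dataset: predict the next sample.
t = torch.linspace(0, 20, 200)
x = torch.sin(t).repeat(8, 1).to(device=device, dtype=dtype)
inp, target = x[:, :-1], x[:, 1:]

model = Seq().to(device=device, dtype=dtype)
criterion = nn.MSELoss()

# The LBFGS -> Adam swap (learning rate is my own guess, not from the example).
optimizer = optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(500):  # full-batch steps, as in the example
    optimizer.zero_grad()
    loss = criterion(model(inp), target)
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.2e}")
```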
Does that match your experience? What would you conclude, particularly regarding the first point?