The model and training code look good. I think there's something weird in your get_batch
function or related batching logic: turning down batch_size
unexpectedly makes training slower but gives better results:
Epoch 11 -- train loss = 0.005119644441312117 -- val loss = 0.01055418627721996
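One guess (I can't see get_batch, so this is a hypothetical sketch): if the number of steps per epoch is derived from `len(data) // batch_size`, then a smaller batch_size means more optimizer updates per epoch on the same data, which would explain both the slower epochs and the lower loss. The `num_steps_per_epoch` helper below is mine, not from your code:

```python
# Hypothetical illustration of why smaller batch_size can mean
# slower epochs but better loss: epoch length is often computed as
# len(data) // batch_size (drop-last batching), so a smaller batch
# yields MORE gradient steps per epoch on the same data.

def num_steps_per_epoch(n_examples: int, batch_size: int) -> int:
    # Drop-last-style batching: the final partial batch is discarded.
    return n_examples // batch_size

steps_large = num_steps_per_epoch(10_000, batch_size=256)
steps_small = num_steps_per_epoch(10_000, batch_size=32)
print(steps_large, steps_small)  # the smaller batch does ~8x more updates
```

If that's what's happening, it isn't a bug so much as an unfair comparison: to isolate the effect of batch_size you'd want to hold the total number of optimizer steps (or tokens seen) constant across runs.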