Retrain_variables in the loss function

Your linear layer does not do the same thing as a TimeDistributedDense in Keras, which applies the same dense layer at every time step. You are only using the last time step and discarding everything else.
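To make the difference concrete, here is a minimal sketch. It assumes a batch-first LSTM output of shape (batch, steps, hidden); the sizes and variable names are made up for illustration and are not taken from your code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not from the original post).
batch, steps, features, hidden, out = 4, 7, 3, 5, 2

rnn = nn.LSTM(features, hidden, batch_first=True)
linear = nn.Linear(hidden, out)

x = torch.randn(batch, steps, features)
rnn_out, _ = rnn(x)  # (batch, steps, hidden)

# Last step only: every earlier time step is thrown away.
last_only = linear(rnn_out[:, -1, :])  # (batch, out)

# Time-distributed: fold the time axis into the batch axis,
# apply the same linear layer to every step, then unfold again.
flat = rnn_out.reshape(batch * steps, hidden)
per_step = linear(flat).reshape(batch, steps, out)  # (batch, steps, out)
```

The last slice of `per_step` matches `last_only`, but `per_step` also keeps an output for every other step. In recent PyTorch versions `nn.Linear` accepts extra leading dimensions, so `linear(rnn_out)` should give the same result without the reshape.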

Have a look at my TimeDistributed wrapper here: