Momentum and such are handled by the optimizer itself, but as far as I know, weight decay (L1 or L2 regularization) can be implemented as a separate step after the optimizer step.
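So it seems like you could just grab the parameter Tensors/Variables for your LSTM and, for L2, subtract a small fraction of each weight from itself (the gradient of the squared L2 norm is proportional to the weight, not to the norm). For L1, you would instead subtract a small constant times the sign of each weight.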
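Here is a rough sketch of what that separate step might look like. It assumes a recent PyTorch version (with `torch.no_grad()` rather than the old `Variable` API). The helper name `apply_weight_decay`, the toy loss, and the `lr`/`l2`/`l1` values are made up for illustration, so treat it as one possible way to do it rather than the canonical one.

```python
import torch
import torch.nn as nn


def apply_weight_decay(params, lr, l2=0.0, l1=0.0):
    """Hypothetical helper: decay weights in place, meant to run after optimizer.step()."""
    with torch.no_grad():  # the decay itself should not be tracked by autograd
        for p in params:
            if l2:
                # L2: subtract lr * l2 * w, i.e. shrink each weight by a fraction of itself.
                p.mul_(1.0 - lr * l2)
            if l1:
                # L1: subtract lr * l1 * sign(w), a constant-size step towards zero.
                p.sub_(lr * l1 * torch.sign(p))


# Toy setup; the sizes and the loss are arbitrary placeholders.
lstm = nn.LSTM(input_size=4, hidden_size=8)
optimizer = torch.optim.SGD(lstm.parameters(), lr=0.1)

x = torch.randn(5, 3, 4)  # (seq_len, batch, input_size)
out, _ = lstm(x)
loss = out.pow(2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()  # momentum etc. handled here
apply_weight_decay(lstm.parameters(), lr=0.1, l2=1e-2)  # decay as a separate step
```

With plain SGD this should match adding the L2 penalty to the loss. With momentum or adaptive optimizers it behaves as a decoupled decay instead, which is not quite the same thing. If you only need L2, the built-in optimizers also accept a `weight_decay` argument, which may be simpler.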