As for different numbers of layers, I guess you could just initialize the layers in the decoder. But I haven’t come across an implementation that uses a different number of layers in the encoder and the decoder. Why do you need this?
Edit: I’ve found two approaches to address this:
1. Concatenate the hidden states of the forward and backward passes.
2. Use the hidden states from either the forward or the backward pass.
I don’t know what effect either option will have on model performance. You could try each one and see what works best for your model.
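Here is a rough sketch of the two options, assuming PyTorch and a bidirectional nn.GRU encoder (the framework, the layer type, and all sizes are my assumptions, not something from your post). It relies on the documented layout of h_n, which has shape (num_layers * 2, batch, hidden_size) for a bidirectional RNN and can be viewed as (num_layers, 2, batch, hidden_size) to separate the two directions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, src_len, tgt_len = 4, 7, 5
input_size, hidden_size, num_layers = 10, 16, 2

encoder = nn.GRU(input_size, hidden_size, num_layers=num_layers,
                 bidirectional=True, batch_first=True)
src = torch.randn(batch, src_len, input_size)
_, h_n = encoder(src)  # h_n: (num_layers * 2, batch, hidden_size)

# Split out the direction axis: (num_layers, 2, batch, hidden_size).
h_n = h_n.view(num_layers, 2, batch, hidden_size)
h_fwd, h_bwd = h_n[:, 0], h_n[:, 1]

# Option 1: concatenate forward and backward states.
# The decoder's hidden size must then be 2 * hidden_size.
h_cat = torch.cat([h_fwd, h_bwd], dim=-1)  # (num_layers, batch, 2 * hidden_size)
decoder_cat = nn.GRU(input_size, 2 * hidden_size,
                     num_layers=num_layers, batch_first=True)

# Option 2: keep only one direction (the forward one here).
# The decoder keeps the encoder's hidden size.
h_one = h_fwd.contiguous()  # (num_layers, batch, hidden_size)
decoder_one = nn.GRU(input_size, hidden_size,
                     num_layers=num_layers, batch_first=True)

# Dummy decoder inputs, just to check that the shapes line up.
tgt = torch.randn(batch, tgt_len, input_size)
out_cat, _ = decoder_cat(tgt, h_cat)
out_one, _ = decoder_one(tgt, h_one)
```

With an nn.LSTM you would apply the same reshaping to both h_n and c_n. Note that option 1 doubles the decoder's hidden size, while option 2 throws away one direction's final state, so the two decoders are not the same size.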