I think what you want is what is done here in the original Torch implementation of an NTM.
You initialize your memory matrix from a plain, fixed tensor (either sampled from a normal distribution or filled with a constant value).
Then you pass it through a linear layer. That way you learn a layer that produces the initial hidden state, instead of hard-coding it.
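
Here is a minimal sketch of that idea in PyTorch. It is not the code from the original implementation, just an illustration. The class name `LearnedMemoryInit` and the parameters `num_slots`, `slot_width` and `seed_dim` are names I made up for this example.

```python
import torch
import torch.nn as nn


class LearnedMemoryInit(nn.Module):
    """Hypothetical helper: maps a fixed seed tensor to an initial memory matrix."""

    def __init__(self, num_slots, slot_width, seed_dim=1, constant_seed=True):
        super().__init__()
        self.num_slots = num_slots
        self.slot_width = slot_width
        # Fixed seed: all ones, or one draw from a normal distribution.
        # Stored as a buffer so it is saved with the module but never trained.
        seed = torch.ones(1, seed_dim) if constant_seed else torch.randn(1, seed_dim)
        self.register_buffer("seed", seed)
        # The linear layer is the only trainable part of the initialization.
        self.init_layer = nn.Linear(seed_dim, num_slots * slot_width)

    def forward(self, batch_size):
        flat = self.init_layer(self.seed)  # shape: (1, num_slots * slot_width)
        memory = flat.view(1, self.num_slots, self.slot_width)
        # Every sequence in the batch starts from the same learned memory.
        return memory.expand(batch_size, -1, -1).contiguous()
```

At the start of each sequence you would call something like `memory = init(batch_size)` and feed the result into the rest of the model. Gradients then flow back into `init_layer`, so the initial memory is learned together with everything else.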