Question about nn.LSTMCell

Recently, I decided to implement a dialogue system, but I'm confused about how to implement the following formulas in PyTorch.

At time $t$, the encoder hidden states of the first layer $l_1$ and the second layer $l_2$ are defined as:

$h_{t}^{l_1} = \mathrm{LSTM}(x_t, h_{t-1}^{l_1})$  (2)
$h_{t}^{l_2} = \mathrm{LSTM}(\text{<pad>}, h_{t}^{l_1}, h_{t-1}^{l_2})$  (3)
where $x_t$ is the input word embedding. The special symbol $\text{<pad>}$ denotes padding with zeros. $h_{0}^{l_1}$ and $h_{0}^{l_2}$ are initialized as zero vectors.

So how should I handle an LSTM step that needs multiple inputs? I think this calls for `nn.LSTMCell`, because layer 1 and layer 2 take different inputs, but LSTMCell only accepts `input`, `h`, and `c`. How can I express formulas (2) and (3) with it? Or should I concatenate $y_{t-1}$ and $s_{t-1}^{l_2}$ as the input?
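For concreteness, here is a rough, untested sketch of the concatenation idea I have in mind (the sizes `embed_dim`, `hidden_size`, `batch_size` and the sequence length are just placeholders, and I treat the $\text{<pad>}$ input as a zero vector). I'm not sure this is the right reading of formulas (2) and (3):

```python
import torch
import torch.nn as nn

embed_dim, hidden_size, batch_size = 128, 256, 32  # placeholder sizes

cell_l1 = nn.LSTMCell(embed_dim, hidden_size)
# layer 2 would see [<pad>; h_t^{l1}], i.e. embed_dim zeros concatenated with hidden_size
cell_l2 = nn.LSTMCell(embed_dim + hidden_size, hidden_size)

# h_0 and c_0 initialized with zero vectors
h1 = torch.zeros(batch_size, hidden_size)
c1 = torch.zeros(batch_size, hidden_size)
h2 = torch.zeros(batch_size, hidden_size)
c2 = torch.zeros(batch_size, hidden_size)

pad = torch.zeros(batch_size, embed_dim)  # <pad> as a zero embedding

inputs = torch.randn(10, batch_size, embed_dim)  # (seq_len, batch, embed_dim)
for x_t in inputs:
    h1, c1 = cell_l1(x_t, (h1, c1))                            # formula (2)
    h2, c2 = cell_l2(torch.cat([pad, h1], dim=1), (h2, c2))    # formula (3)?
```

Is concatenating along the feature dimension like this the usual way to feed "multiple inputs" into an LSTMCell, or is there a better way?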