Puzzled by implementation of LSTM

In many papers and blog posts (LSTM), the input of the forget gate is often described as in this picture.

The input looks like the result of a concatenate operation on xt and ht-1. But in the doc of nn.LSTM (nn.LSTM), there is an example like this:

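import torch
import torch.nn as nn

rnn = nn.LSTM(10, 20, 2)      # input_size=10, hidden_size=20, num_layers=2
x1 = torch.randn(5, 3, 10)    # (seq_len, batch, input_size)
h0 = torch.randn(2, 3, 20)    # (num_layers, batch, hidden_size)
c0 = torch.randn(2, 3, 20)    # (num_layers, batch, hidden_size)
output, (hn, cn) = rnn(x1, (h0, c0))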

I notice that the shapes of x1 and h0 do not match, so they cannot be concatenated. But the code runs fine. I want to know why.

I think they are not concatenated in the literal sense; the picture is just simplifying the notation. According to the docs, the forget gate is computed like so:

sigmoid(bfi + sum(U_t * X_t) + sum(W_t * h_t-1) + bfh)

Here U_t is the input weights at time t and W_t is the recurrent weights at time t, so two weight matrices are used in each cell. I am definitely not an expert, so maybe wait for a more accurate response.
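To make the "two weight matrices" point concrete, here is a minimal sketch that recomputes one LSTM step by hand from the parameters of a single-layer nn.LSTM and compares it with the module's own output. The sizes are the ones from the doc example; the single-layer setup and the variable names (W_ih, W_hh, x_t, h_prev, c_prev, matches) are my own choices for illustration. It relies on the nn.LSTM docs' statement that the four gates are stacked in the order input, forget, cell, output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 10, 20
rnn = nn.LSTM(input_size, hidden_size, num_layers=1)

# Layer-0 parameters; the gates (i, f, g, o) are stacked along dim 0.
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0   # (80, 10) and (80, 20)
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0       # (80,) each

x_t = torch.randn(3, input_size)       # one time step, batch of 3
h_prev = torch.randn(3, hidden_size)   # previous hidden state
c_prev = torch.randn(3, hidden_size)   # previous cell state

with torch.no_grad():
    # x_t has 10 features and h_prev has 20, but each product has
    # 4 * hidden_size = 80 columns, so the two terms can be added.
    gates = x_t @ W_ih.T + b_ih + h_prev @ W_hh.T + b_hh   # (3, 80)
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_t = f * c_prev + i * g
    h_t = o * torch.tanh(c_t)

    # Same step through the module (add seq_len and num_layers dims of 1).
    out, (h_n, c_n) = rnn(x_t.unsqueeze(0),
                          (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))

matches = torch.allclose(out[0], h_t, atol=1e-5)
print(W_ih.shape, W_hh.shape, matches)
```

The two weight matrices have different numbers of columns (10 and 20) but the same number of rows, which is why xt and ht-1 do not need the same feature size.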

P.S.: Could someone please tell me how to use LaTeX here? :)


Hi! I think (I may be wrong) that these are just two ways of saying the same thing:

Suppose we have a vector XY that is the concatenation of two other vectors, X and Y. Let's say that X has shape n x 1 and Y has shape m x 1. Then XY has shape (n+m) x 1. Let's also say that X and Y each have their own weight matrix, and that both matrices map to the same dimension k. So Mx (the matrix for X) has shape k x n and My has shape k x m. You can also say that there is a matrix Mxy that is the concatenation of Mx and My along the columns (with shape k x (n+m)), and

Mx X + My Y = Mxy XY
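A quick numerical check of this identity, as a minimal sketch with arbitrarily chosen sizes (n=10, m=20, k=5 are assumptions for illustration, as are the variable names):

```python
import torch

torch.manual_seed(0)
n, m, k = 10, 20, 5

X, Y = torch.randn(n, 1), torch.randn(m, 1)
Mx, My = torch.randn(k, n), torch.randn(k, m)

XY = torch.cat([X, Y], dim=0)      # stack the vectors:   (n+m, 1)
Mxy = torch.cat([Mx, My], dim=1)   # matrices side by side: (k, n+m)

separate = Mx @ X + My @ Y         # two products, then a sum
joint = Mxy @ XY                   # one product on the concatenation

same = torch.allclose(separate, joint, atol=1e-4)
print(XY.shape, Mxy.shape, same)
```

Note that X and Y have different lengths here; only the output dimension k has to agree.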

If I just said bull*, I'm sorry

Thanks for your reply! The doc has no sum operation like the one in your expression, and the forget gate expression seems to indicate that xt and ht-1 should have the same number of features, because Ux and Wh can be summed. There is still something I don't understand.

Thanks for your reply! But in my opinion, a concatenate operation requires the features of the two inputs to be uniform, rather than concatenating along the features themselves. So there are still some questions I have not made sense of.
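For what it's worth, here is a small sketch of what concatenating "along the features" would look like with the shapes from the doc example. It assumes the picture refers to a single time step and a single layer, so the slices x_t = x1[0] and h_prev = h0[0] (my names, not from the docs) are what would be joined, not the full x1 and h0 tensors.

```python
import torch

x1 = torch.randn(5, 3, 10)   # (seq_len, batch, input_size)
h0 = torch.randn(2, 3, 20)   # (num_layers, batch, hidden_size)

x_t = x1[0]      # input at the first time step:       (3, 10)
h_prev = h0[0]   # initial hidden state of layer 0:    (3, 20)

# Along the feature dimension only the batch size has to match.
xh = torch.cat([x_t, h_prev], dim=1)   # (3, 30)

# Along the batch dimension the feature sizes would have to match.
try:
    torch.cat([x_t, h_prev], dim=0)
    batch_cat_ok = True
except RuntimeError:
    batch_cat_ok = False

print(xh.shape, batch_cat_ok)
```

So the feature sizes only need to be equal when concatenating along some other dimension; along the feature dimension itself, 10 and 20 simply become 30.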