Puzzled by implementation of LSTM

In many papers and blog posts about LSTMs, the input to the forget gate is often described as in the picture below.


The input looks like the result of a concatenation of x_t and h_{t-1}. But in the docs for nn.LSTM there is an example like this:

import torch
import torch.nn as nn

rnn = nn.LSTM(10, 20, 2)              # input_size=10, hidden_size=20, num_layers=2
x1 = torch.randn(5, 3, 10)            # (seq_len, batch, input_size)
h0 = torch.randn(2, 3, 20)            # (num_layers, batch, hidden_size)
c0 = torch.randn(2, 3, 20)            # (num_layers, batch, hidden_size)
output, (hn, cn) = rnn(x1, (h0, c0))

I notice that the shapes of x1 and h0 do not match, so they cannot be concatenated. But the code runs fine. I want to know why.

I think they are not concatenated in the literal sense; the concatenation is just to simplify notation. According to the docs, the forget gate is computed like so:

f_t = sigmoid(b_fi + sum(U_f * x_t) + sum(W_f * h_{t-1}) + b_fh)

Here U_f is the input weight matrix for the forget gate and W_f is the recurrent (hidden-state) weight matrix; the same weights are reused at every time step. So there are two separate weight matrices used in each cell. I am definitely not an expert, so maybe wait for a more accurate response.
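That said, you can see the two matrices directly on the nn.LSTM module. Here is a minimal sketch (the W_if / W_hf / f_t names are mine; it reuses rnn, x1 and h0 from the example above and relies on the documented parameter layout, where weight_ih_l0 / weight_hh_l0 stack the input, forget, cell and output gate rows):

import torch
import torch.nn as nn

rnn = nn.LSTM(10, 20, 2)
x1 = torch.randn(5, 3, 10)
h0 = torch.randn(2, 3, 20)

H = rnn.hidden_size                        # 20
print(rnn.weight_ih_l0.shape)              # torch.Size([80, 10]) -> multiplies x_t
print(rnn.weight_hh_l0.shape)              # torch.Size([80, 20]) -> multiplies h_{t-1}

# Rows H:2H of the stacked parameters belong to the forget gate.
W_if = rnn.weight_ih_l0[H:2*H]             # (20, 10)
W_hf = rnn.weight_hh_l0[H:2*H]             # (20, 20)
b_if = rnn.bias_ih_l0[H:2*H]
b_hf = rnn.bias_hh_l0[H:2*H]

# Forget gate of layer 0 at the first time step: x_t and h_{t-1} have
# different sizes (10 vs 20), but each has its own matrix, so nothing
# needs to be concatenated.
x_t = x1[0]                                # (batch, input_size)  = (3, 10)
h_prev = h0[0]                             # (batch, hidden_size) = (3, 20)
f_t = torch.sigmoid(x_t @ W_if.T + b_if + h_prev @ W_hf.T + b_hf)
print(f_t.shape)                           # torch.Size([3, 20])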

P.S: Someone plz tell me how to use latex in here :slight_smile:


Hi! I think (I may be wrong) that these are just two ways of saying the same thing:

Suppose we have a vector XY that is the concatenation of two other vectors, X and Y. Let's say X has shape n x 1 and Y has shape m x 1; then XY has shape (n + m) x 1. Let's also say that X has its own weight matrix and Y has its own, and both map to the same output dimension k. So Mx (the matrix for X) has shape k x n and My has shape k x m. You can also form a matrix Mxy that is the concatenation of Mx and My (with shape k x (n + m)), and

Mx X + My Y = Mxy XY
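Here is a quick numerical check of that identity (shapes picked arbitrarily, names are mine):

import torch

n, m, k = 10, 20, 5
X = torch.randn(n, 1)
Y = torch.randn(m, 1)
Mx = torch.randn(k, n)
My = torch.randn(k, m)

XY = torch.cat([X, Y], dim=0)              # (n + m, 1)
Mxy = torch.cat([Mx, My], dim=1)           # (k, n + m)

separate = Mx @ X + My @ Y                 # two matrices, no concatenation
together = Mxy @ XY                        # one matrix on the concatenated vector
print(torch.allclose(separate, together))  # True

So the "concatenated" picture in the diagrams and the two separate matrices in the nn.LSTM implementation compute the same thing.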

If I just said bull*, I'm sorry.

Thanks for your reply! In the docs there is no sum operation as in your expression, and the expression for the forget gate seems to indicate that x_t and h_{t-1} should have the same number of features, because U x and W h can be summed. There is still something I am not clear about.

Thanks for your reply! But in my opinion, when we do a concatenation we have to keep the features uniform, rather than concatenating the features themselves. So there are some questions I still haven't made sense of.