LSTM with doc2vec word embedding

Lei_Gong · December 10, 2023, 3:06am

Hello guys,

I am trying to use doc2vec to embed each of my sentences and then feed each sentence vector to an LSTM model for a text classification task. How should I set the LSTM's input_size? Each batch_text has shape (96, 120), where 96 is the batch size and 120 is the size of the vector returned by doc2vec.infer_vector(sentence).

vdw · December 10, 2023, 6:37am

What exactly is the sequence in your setup here?

If doc2vec gives you a single vector for a sentence, and you treat sentences individually, you no longer have a sequence. So there’s no need for an LSTM/GRU.

Or do you have whole paragraphs, so that the sequence is the list of sentences?

Lei_Gong · December 10, 2023, 7:22am

# Define the LSTM model
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, dropout_rate, output_size,):
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.embedding(x)
        print("Embedding?")
        _, (h_n, _) = self.lstm(x)
        print('LSTM?')
        h_n = self.dropout(h_n[-1, :, :])  # Apply dropout before the fully connected layer
        output = self.fc(h_n)
        return output

Here is my LSTM model.

# Instantiate the model, define loss function, and optimizer
token_size = tokenizer.get_vocab_size()  # Adjust based on your vocabulary size
embedding_dim = 100  # Adjust based on your preference
hidden_size = 64
output_size = 5  # Number of classes

model = LSTMClassifier(120, embedding_dim, hidden_size, 0.45, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.001)

model.to(device)

Here are my parameters; 120 is the length of each input vector.
The error occurred in the training loop:

for batch in trn_loader:
    texts, labels = batch['text'], batch['label']
    texts, labels = texts.to(device), labels.to(device)

    # Zero the gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(texts)  # This line raises the error.
    # texts has shape [8 (batch size), 120 (vector size)]; the error is raised
    # inside the forward function, on the way to the LSTM layer.

Lei_Gong · December 10, 2023, 7:23am

I did not set up any sequence here. I have a list of sentences as the training data. I used the tokenizer to pad them into token lists of the same length, then used the doc2vec model to convert these token lists into numeric vectors of the same length, and then passed the numeric vectors as input to the LSTM model. But it seems that my data has no seq_length, so I cannot feed it into the LSTM model. Is this right?

vdw · December 10, 2023, 7:31am

Right, doc2vec gives you a single vector for a sentence, as opposed to a sequence of vectors, one for each word in the sentence. Since you now have a fixed-size vector for each sentence, there’s no meaningful reason to use an LSTM anymore, and you can just use a feed-forward network (FFNN).
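To make the FFNN suggestion concrete, here is a minimal sketch, not the poster's actual model. The class name Doc2VecClassifier is made up for illustration, the layer sizes are borrowed from the parameters posted above (120-dimensional vectors, hidden size 64, dropout 0.45, 5 classes), and the random tensor only stands in for a real batch of doc2vec vectors.

```python
import torch
import torch.nn as nn


class Doc2VecClassifier(nn.Module):
    """Feed-forward classifier over precomputed sentence vectors (hypothetical name)."""

    def __init__(self, input_dim=120, hidden_size=64, dropout_rate=0.45, output_size=5):
        super().__init__()
        # No nn.Embedding and no nn.LSTM: each sample is already one dense vector.
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        # x: float tensor of shape (batch_size, input_dim)
        return self.net(x)


model = Doc2VecClassifier()
batch = torch.randn(96, 120)  # stand-in for 96 doc2vec vectors of size 120
logits = model(batch)         # shape (96, 5), ready for nn.CrossEntropyLoss
print(logits.shape)
```

The logits can go straight into nn.CrossEntropyLoss with integer class labels, the same way as in the training loop posted earlier.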

Lei_Gong · December 10, 2023, 7:35am

Actually, all of the sentences will be converted to a fixed length. But it seems it is not meaningful to do so.

vdw · December 10, 2023, 7:42am

Why? In some sense this is what an LSTM/GRU does: it takes a sequence and generates a fixed-size vector representation, which is then typically passed through some additional linear layers. Here you just replace the LSTM layer with doc2vec.

Admittedly, I don’t really know how doc2vec is set up and trained, but the underlying goal is always the same: convert a sequence of words (a sentence or paragraph) into a fixed-size vector that hopefully captures the meaning of the sentence/paragraph.

Lei_Gong · December 10, 2023, 4:21pm

Hello Chris, do you mean that I should use the doc2vec to replace the embedding layer of LSTM model in the structure of the neural network?

Lei_Gong · December 10, 2023, 5:30pm

Besides, I think my error occurred in the embedding layer of the LSTM model. The LSTM takes three-dimensional input, but my batch[‘text’] is always two-dimensional, and I do not know how to set it up. My batch size is 96 and the length of each text vector is 120. How should I set up the LSTM layer?

Lei_Gong · December 10, 2023, 6:49pm

I think the issue is that Doc2Vec returns vectors with continuous values instead of integers, which may be causing the problem.
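That guess can be checked in isolation. The sketch below is an illustration under assumed shapes (random tensors instead of the real data): nn.Embedding is a lookup table indexed by integer token ids, so it accepts an integer tensor but rejects a float tensor such as a batch of doc2vec vectors.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=120, embedding_dim=100)

# Integer token ids work: each id selects one row of the lookup table.
token_ids = torch.randint(0, 120, (8, 20))  # (batch, seq_len), dtype int64
embedded = embedding(token_ids)             # (batch, seq_len, embedding_dim)
print(embedded.shape)

# Float vectors, which is what doc2vec produces, are rejected.
doc_vectors = torch.randn(8, 120)           # stand-in for doc2vec output
try:
    embedding(doc_vectors)
    float_input_rejected = False
except (RuntimeError, TypeError) as err:
    float_input_rejected = True
    print("nn.Embedding rejected float input:", type(err).__name__)
```

Since the doc2vec vectors are already embeddings, the embedding layer is not needed for them at all.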

J_Johnson · December 11, 2023, 3:15am

An LSTM takes a sequence. To get meaningful use out of the LSTM, you need to pass in a sequence of length greater than 1. If you’re condensing the entirety of your sequence into a single vector, then you could probably just use a fully connected network (linear layers and activations) to accomplish your task.

But if you have sequences of sentences, that are each being encoded to a vector, you can pass those in order into an LSTM either all at once or one after another. See here for an example with code: LSTM on Time series with CrossEntropyLoss is unstable - #9 by J_Johnson
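As a rough sketch of that second case (separate from the linked example), assume each sample is a paragraph of 10 sentences and each sentence is a 120-dimensional doc2vec vector. The class name ParagraphLSTMClassifier and the sentence count are assumptions for illustration. The LSTM's input_size is the vector size (120), and the input tensor has shape (batch, num_sentences, 120).

```python
import torch
import torch.nn as nn


class ParagraphLSTMClassifier(nn.Module):
    """LSTM over a sequence of precomputed sentence vectors (hypothetical name)."""

    def __init__(self, input_size=120, hidden_size=64, dropout_rate=0.45, output_size=5):
        super().__init__()
        # No nn.Embedding: the inputs are already dense float vectors.
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (batch, num_sentences, input_size)
        _, (h_n, _) = self.lstm(x)
        # h_n: (num_layers, batch, hidden_size); keep the last layer's final state.
        return self.fc(self.dropout(h_n[-1]))


model = ParagraphLSTMClassifier()

paragraphs = torch.randn(96, 10, 120)  # 96 paragraphs, 10 sentence vectors each
paragraph_logits = model(paragraphs)   # (96, 5)

# A single vector per sample can be given a length-1 sequence dimension,
# which runs but gives the LSTM nothing sequential to model.
single_vectors = torch.randn(96, 120).unsqueeze(1)  # (96, 1, 120)
single_logits = model(single_vectors)               # (96, 5)
print(paragraph_logits.shape, single_logits.shape)
```

The length-1 variant runs without a shape error, but with only one time step a plain fully connected network is the simpler choice.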

Lei_Gong · December 11, 2023, 3:27am

sure! Thank you for your advice Johnson!

Lei_Gong · December 11, 2023, 3:40am

Hello Johnson, if all of my text vectors have length 120 and the batch size is 96, how should I set up the input shape of the LSTM layer?

J_Johnson · December 11, 2023, 4:43am

I’d like to draw a careful distinction here. A vector, that is, the output of an embedding layer or some other sentence2vec, doc2vec, is not a sequence. The way we can know whether something is sequentially related or not is whether the order additionally conveys some meaning vs. the order being arbitrary.

Examples of sequential information would be sound waves, pictures (2D), daily closing prices, etc.

Examples of non-sequential information would be how red something is, temperature, brightness, etc. The order you decide on at the start to put them into the network is irrelevant. So we call these features or channels.

You could have sequentially related information on one or more dims and have it be non-sequential on another dim, as is the case with RGB images, where you have channels, height, and width. The height and width dims are sequentially related while the channels are not. If we shuffled all of the pixels on the height and width dims, you’d likely no longer be able to identify an image, while shuffling the red, green, and blue channels makes no difference.
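One way to see this distinction numerically is the small experiment below. It is an illustration with random tensors, not anything from the thread. Reordering the features, with the layer's weights reordered to match, leaves a linear layer's output unchanged. Reversing the time steps of a sequence changes what an LSTM produces.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Feature order is arbitrary: permute the input features and the matching
# weight columns together, and a linear layer computes the same output.
x = torch.randn(4, 120)
linear = nn.Linear(120, 5)
perm = torch.randperm(120)
out = linear(x)
out_perm = x[:, perm] @ linear.weight[:, perm].T + linear.bias
features_equivalent = torch.allclose(out, out_perm, atol=1e-4)

# Sequence order carries meaning: feeding the same steps in reverse order
# gives the LSTM a different final hidden state.
lstm = nn.LSTM(input_size=120, hidden_size=64, batch_first=True)
seq = torch.randn(4, 10, 120)              # (batch, seq_len, features)
_, (h_fwd, _) = lstm(seq)
_, (h_rev, _) = lstm(seq.flip(dims=[1]))   # reverse the time dimension
order_matters = not torch.allclose(h_fwd, h_rev, atol=1e-4)

print(features_equivalent, order_matters)
```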

So if your text features are 120 in length with a batch size of 96, my second question would be: are the text features sequentially related in the batch dim? For example, sentence 1, sentence 2, etc.