I am trying to use doc2vec to embed each of my sentences and then feed each sentence to an LSTM model for a text classification task. How should I initialize the LSTM's input_size? Each batch_text has shape (96, 120), where 96 is the batch size and 120 is the vector size of each sentence after doc2vec.infer_vector(sentence).
If doc2vec gives you a single vector for a sentence, and you treat sentences individually, you no longer have a sequence. So there’s no need for an LSTM/GRU.
Or do you have whole paragraphs, so that the sequence is the list of sentences?
# Instantiate the model, define loss function, and optimizer
token_size = tokenizer.get_vocab_size() # Adjust based on your vocabulary size
embedding_dim = 100 # Adjust based on your preference
hidden_size = 64
output_size = 5 # Number of classes
model = LSTMClassifier(120, embedding_dim, hidden_size, 0.45, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.001)
model.to(device)
Here are my parameters; 120 is the length of each input vector.
The error occurred in this loop:
for batch in trn_loader:
    texts, labels = batch['text'], batch['label']
    texts, labels = texts.to(device), labels.to(device)
    # Zero the gradients
    optimizer.zero_grad()
    # Forward pass
    outputs = model(texts)  # This line raises the error.
    # texts has shape [8 (batch size), 120 (vector size)]; the error is raised
    # in the forward function when texts is fed to the LSTM layer.
I did not set up any sequence here. I have a list of sentences as the training data. I used the tokenizer to pad them into token lists of the same length, then used the doc2vec model to convert these token lists into numeric vectors of the same length. I then take the numeric vectors as input to the LSTM model. But it seems that my data has no seq_length, so I cannot feed it into the LSTM model. Is this right?
Right, doc2vec gives you a single vector for a sentence, as opposed to a sequence of vectors, one for each word in the sentence. Since you now have a fixed-size vector for each sentence, there’s no meaningful point in using an LSTM anymore, and you can just use a feed-forward network (FFNN).
Why? In some sense this is what an LSTM/GRU is doing: it takes a sequence and generates a fixed-size vector representation, which is then typically pumped through some additional linear layers. Here you just replace the LSTM layer with doc2vec.
Admittedly, I don’t really know how doc2vec is set up or trained, but the underlying goal is always the same: convert a sequence of words (a sentence or paragraph) into a fixed-size vector that hopefully captures the meaning of the sentence or paragraph.
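As a rough illustration of the FFNN suggestion, here is a minimal sketch, assuming PyTorch and reusing the sizes from the question (120-dimensional vectors, hidden size 64, 5 classes, dropout 0.45). The class name Doc2VecClassifier is hypothetical, and the random tensors only stand in for real doc2vec vectors and labels.

```python
import torch
import torch.nn as nn


class Doc2VecClassifier(nn.Module):  # hypothetical name, not from the thread
    def __init__(self, input_size=120, hidden_size=64, output_size=5, dropout=0.45):
        super().__init__()
        # No embedding layer and no LSTM: doc2vec already produced the sentence vector.
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        # x: (batch_size, input_size) -> logits: (batch_size, output_size)
        return self.net(x)


model = Doc2VecClassifier()
texts = torch.randn(96, 120)         # stand-in for a batch of doc2vec vectors
labels = torch.randint(0, 5, (96,))  # stand-in class labels
logits = model(texts)
loss = nn.CrossEntropyLoss()(logits, labels)
```

The 2-D batch of shape (96, 120) goes straight into the first linear layer, so no sequence dimension is needed.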
Besides, I think my error occurred in the embedding layer of the LSTM model. An LSTM always takes three-dimensional input, but my batch['text'] is always two-dimensional, and I do not know how to set it up. My batch size is 96 and the length of each text vector is 120. How should I set up the LSTM layer?
An LSTM takes a sequence. To get meaningful use out of the LSTM, you need to pass in a sequence of length greater than 1. If you’re condensing the entirety of your sequence into a single vector, then you could probably just use a fully connected network (linear layers and activations) to accomplish your task.
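To make the shape question concrete, here is a minimal sketch, assuming PyTorch's nn.LSTM with batch_first=True and random stand-in data. input_size is the number of features per time step, so it would be 120 here, and the missing sequence dimension can be added with unsqueeze. This only makes the shapes line up; with a sequence length of 1 the LSTM has nothing sequential to model.

```python
import torch
import torch.nn as nn

batch_size, vector_size, hidden_size = 96, 120, 64
texts = torch.randn(batch_size, vector_size)  # stand-in for doc2vec vectors

# input_size is the feature count per time step, here the doc2vec vector size.
lstm = nn.LSTM(input_size=vector_size, hidden_size=hidden_size, batch_first=True)

# Add a sequence dimension of length 1: (96, 120) -> (96, 1, 120)
seq = texts.unsqueeze(1)
output, (h_n, c_n) = lstm(seq)

print(output.shape)  # torch.Size([96, 1, 64])
print(h_n.shape)     # torch.Size([1, 96, 64])
```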
I’d like to draw a careful distinction here. A vector, that is, the output of an embedding layer or of some sentence2vec or doc2vec model, is not a sequence. The way to tell whether something is sequentially related is to ask whether the order conveys additional meaning or is arbitrary.
Examples of sequential information would be sound waves, pictures (2D), daily closing prices, etc.
Examples of non-sequential information would be how red something is, temperature, brightness, etc. The order in which you decide at the start to put them into the network is irrelevant. So we call these features or channels.
You could have sequentially related information on one or more dims and non-sequential information on another dim, as is the case with RGB images, where you have channels, height, and width. The height and width dims are sequentially related while the channels are not. If we shuffled all of the pixels along the height and width dims, you’d likely no longer be able to identify the image, whereas shuffling the order of the red, green, and blue channels makes no difference.
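One way to read the distinction above, sketched with PyTorch and random stand-in data: for features, any fixed ordering works, because a linear layer with correspondingly permuted weights computes exactly the same thing. For a sequence, reordering the time steps changes what an LSTM ends up with. The sizes and variable names here are illustrative assumptions, not anything from the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Features: permuting the input columns is harmless if the weights are permuted the same way.
x = torch.randn(4, 120)
linear = nn.Linear(120, 5)
perm = torch.randperm(120)

permuted = nn.Linear(120, 5)
with torch.no_grad():
    permuted.weight.copy_(linear.weight[:, perm])
    permuted.bias.copy_(linear.bias)

same = torch.allclose(linear(x), permuted(x[:, perm]), atol=1e-5)

# Sequence: reversing the time steps changes the final hidden state of the LSTM.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
seq = torch.randn(4, 10, 8)
_, (h_fwd, _) = lstm(seq)
_, (h_rev, _) = lstm(seq.flip(dims=[1]))
differs = not torch.allclose(h_fwd, h_rev)
```

Here same holds because the sum over features does not care about their order, while differs reflects that the recurrence processes time steps one after another.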
So if your text features are 120 in length with a batch size of 96, my second question would be: are the text features sequentially related along the batch dim? For example, sentence 1, sentence 2, etc.
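If the answer is yes, the sentence vectors themselves can form the sequence. Here is a minimal sketch, assuming PyTorch and assuming (purely for illustration) that the 96 sentences are consecutive and split evenly into 8 paragraphs of 12 sentences, with one label per paragraph. The class name ParagraphLSTMClassifier is hypothetical.

```python
import torch
import torch.nn as nn


class ParagraphLSTMClassifier(nn.Module):  # hypothetical name
    def __init__(self, input_size=120, hidden_size=64, output_size=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (num_paragraphs, sentences_per_paragraph, input_size)
        _, (h_n, _) = self.lstm(x)
        # Use the last layer's final hidden state as the paragraph representation.
        return self.fc(h_n[-1])


sentence_vectors = torch.randn(96, 120)         # stand-in for 96 consecutive doc2vec vectors
paragraphs = sentence_vectors.view(8, 12, 120)  # assumed grouping: 8 paragraphs x 12 sentences
logits = ParagraphLSTMClassifier()(paragraphs)
print(logits.shape)  # torch.Size([8, 5])
```

In this setup the LSTM reads one sentence vector per time step, so input_size stays 120 and the sequence length is the number of sentences per paragraph.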