Build Auto Tagging System

Hi Community,

I’m building an auto-tagging system. I have two features in my dataframe: one is an embedding of each sentence with dimension (386,), and the other is the tags. I used sklearn’s MultiLabelBinarizer to set up the multi-label classification. I’m interested in using an RNN, but I’m unsure what the input size for the network should be. It may sound like a silly question, but using input_size=1 would mean treating the vector as 386 timesteps and passing the hidden state of the 386th timestep to the next layer. Can anyone tell me whether I’m going in the right direction?

If your input is a single embedding vector, you don’t need an RNN. Or is your input a sequence of vectors, i.e., a sequence of sentences?

For example, one row of my df has length 386 and a tag. I’m interested in treating every number from those 386 as a separate input, so each one becomes a timestep and in the end we have 386 timesteps. Then I want to take the last hidden state and pass it to the next layer. Say we have [1, 2, 3, …], treating every value as a scalar and a timestep.
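In shape terms, the idea described above might look like this minimal PyTorch sketch (the batch size and hidden size are made-up example values):

```python
import torch
import torch.nn as nn

batch_size = 4
emb_dim = 386  # length of one dataframe row

x = torch.randn(batch_size, emb_dim)  # one embedding vector per row
x_seq = x.unsqueeze(-1)               # (batch, 386, 1): 386 timesteps of scalars

# input_size=1 because each timestep is a single scalar
rnn = nn.RNN(input_size=1, hidden_size=64, batch_first=True)
out, h_n = rnn(x_seq)                 # out: (batch, 386, 64)

last_hidden = out[:, -1, :]           # hidden state of the 386th timestep
```

Here `last_hidden` has shape `(4, 64)` and would be what gets passed to the next layer.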

Does it make sense to treat the entire vector as a sequence like that? Considering every value as one step of a sequence for predicting the tag makes sense, right? If not, then we can use a normal ANN instead of the RNN, right?

It depends where the 386 values come from :).

In your initial post you said that these 386 values represent an embedding vector of a sentence. The general idea of such embeddings is to capture the semantics of a sentence. This means the embedding vector does not represent any sequence but is a “stand-alone” feature vector. So, yes, an ANN will be suitable for that.


Thanks for clarifying. So there’s no use for an RNN here, right? But people solving auto-tagging problems often use an embedding method that starts with random numbers and gets updated during training. So if I want to use an RNN, how can I reformulate my strategy? I mean, how should I convert my sentences into a numerical representation other than TF-IDF, bag-of-words, etc.?
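For what it’s worth, the usual way to combine an RNN with learned embeddings is to keep each sentence as a sequence of token ids and let an `nn.Embedding` layer (randomly initialised, updated by training) produce one vector per token. A hedged sketch, where the vocabulary size, dimensions, and tag count are all made-up example values:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_tags = 10000, 128, 64, 20

class TokenRNNTagger(nn.Module):
    def __init__(self):
        super().__init__()
        # learned token embeddings: start random, updated during training
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)  # one logit per tag

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        _, h_n = self.rnn(emb)             # h_n: (1, batch, hidden_dim)
        return self.out(h_n.squeeze(0))    # (batch, num_tags) logits

model = TokenRNNTagger()
logits = model(torch.randint(0, vocab_size, (4, 12)))  # 4 sentences, 12 tokens each
```

Here the sequence dimension is the tokens of the sentence, which is meaningful, unlike the 386 components of a single precomputed sentence vector.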

Can you quickly describe what you’re trying to do? I’m not sure what you mean by an auto tagging system. Can you give a concrete example of what an input and the predicted output would be?

I have text as input and tags as output; basically it’s multi-label classification. I did all the necessary text preprocessing and used the sentence-transformers all-MiniLM-L6-v2 model to convert the preprocessed text into embeddings. Then I converted my labels using scikit-learn’s MultiLabelBinarizer. Now I’m interested in an RNN model for predicting the labels. For every sentence my embedding length was 386. So if I want to use an RNN, one layer should have 386 timesteps, where the hidden state of the last timestep (386) in layer 1 is passed to the next layer, right? I agree that embeddings capture semantics, but I still want to experiment with the RNN.

No, the sentence transformer embeds the whole sentence into a 386-dim vector. This vector does not represent a sequence; there is no meaningful order between the vector elements. You can and should feed this vector directly into a stack of linear layers.
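A minimal sketch of that setup, assuming a made-up tag count and using the multi-hot labels that MultiLabelBinarizer produces (sigmoid per tag via `BCEWithLogitsLoss`, since the labels are independent in a multi-label problem):

```python
import torch
import torch.nn as nn

emb_dim, num_tags = 386, 20  # num_tags is an example value

# plain feed-forward stack over the sentence embedding
model = nn.Sequential(
    nn.Linear(emb_dim, 128),
    nn.ReLU(),
    nn.Linear(128, num_tags),  # one logit per tag
)

x = torch.randn(8, emb_dim)                      # batch of 8 sentence embeddings
y = torch.randint(0, 2, (8, num_tags)).float()   # multi-hot labels from MultiLabelBinarizer

logits = model(x)
loss = nn.BCEWithLogitsLoss()(logits, y)         # independent sigmoid per tag
preds = (torch.sigmoid(logits) > 0.5).int()      # threshold to recover a tag set
```

At inference time you would pass `preds` back through `MultiLabelBinarizer.inverse_transform` to get the tag names.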