NLP basic array shape

Tank · October 11, 2019, 9:22pm

I am looking into building an RNN where the inputs are words. As you know, not every word is the same length, so when I one-hot encode each letter I end up having to pad the end of most words with zero vectors. Is there a better way to do this? Does a tensor even work if the array is not consistent, where consistent is defined as having the same dimensions?

In other words I am currently doing the following:
number_of_words * len(longest_word) * number_of_chars

since I do not believe the following is possible:
number_of_words * len(word_i) * number_of_chars

vdw · October 12, 2019, 2:25am

I’m not sure what you’re trying to do. It seems to me that you’re mixing words and characters. What is your “unit of interest” w.r.t. each time step of the RNN? I see just two alternatives:

Characters, i.e., at each time step, you feed the next character to the RNN. Here, you can simply one-hot encode each character. Word lengths are in this case not important.
Words, i.e., at each time step, you feed the next word to the RNN. Just use an Embedding layer to map each word to a vector representation of a fixed size. For example, each word is mapped to a 100-dim vector. Again, the actual lengths of the words are not important.
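
A minimal sketch of the two alternatives, in case it helps. The alphabet, vocabulary, and example strings are made up for illustration; only the 100-dim embedding size comes from the example above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy data, chosen only for illustration.
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {ch: i for i, ch in enumerate(alphabet)}

# Alternative 1: characters are the time steps, each one-hot encoded.
word = "tensor"
char_ids = torch.tensor([char_to_idx[ch] for ch in word])
char_steps = F.one_hot(char_ids, num_classes=len(alphabet)).float()
# char_steps has shape (len(word), len(alphabet))

# Alternative 2: words are the time steps, each mapped to a fixed-size vector.
vocab = {"the": 0, "cat": 1, "sat": 2}
sentence = ["the", "cat", "sat"]
word_ids = torch.tensor([vocab[w] for w in sentence])
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=100)
word_steps = embedding(word_ids)
# word_steps has shape (len(sentence), 100)
```

In both cases the per-step feature size is fixed (alphabet size or embedding size), so the length of an individual word never changes the shape of a time step.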

dhpollack · October 12, 2019, 8:49am

You need to create word vectors with something like word2vec or fastText, or create character vectors in a similar fashion. Then each of your input sequences will be a tensor of shape num_words x word_vector_size, or more generically length x feature_dim. You can see examples of this in the PyTorch tutorials:

https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

But the general idea is that you want to avoid using one-hot encodings and instead use dense vector representations.
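
On the original question about words of different lengths: a regular tensor does need the same size along each dimension, so some padding is usually still involved when batching. A rough sketch of one common way to handle it, assuming a character-level model with a learned embedding instead of one-hot vectors. The alphabet, the example words, and the layer sizes (16 and 32) are arbitrary choices for illustration, not from this thread.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

alphabet = "abcdefghijklmnopqrstuvwxyz"
# Index 0 is reserved for padding, so letters start at 1.
char_to_idx = {ch: i + 1 for i, ch in enumerate(alphabet)}

words = ["rnn", "tensor", "pad"]  # hypothetical batch of words
seqs = [torch.tensor([char_to_idx[c] for c in w]) for w in words]
lengths = torch.tensor([len(s) for s in seqs])

# Pad the index sequences to the longest word: shape (num_words, max_len).
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

# Dense character vectors; padding_idx keeps the padding vector at zero.
embedding = nn.Embedding(len(alphabet) + 1, 16, padding_idx=0)
rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

# Packing tells the RNN the true lengths so it skips the padded steps.
packed = pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
_, h_n = rnn(packed)
word_repr = h_n[-1]  # one vector per word, shape (num_words, 32)
```

Padding integer indices is much cheaper than padding one-hot arrays, and the packed sequence means the final hidden state for each word comes from its last real character rather than from padding.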