NLP basic array shape

I am looking into building an RNN where the inputs are words. Words vary in length, so when I one-hot encode each letter I end up having to pad the end of most words with zero arrays. Is there a better way to do this? Does a tensor even work if the array is not consistent, where consistent means every entry has the same dimensions?

In other words I am currently doing the following:
number_of_words * len(longest_word) * number_of_chars

since I do not believe the following is possible:
number_of_words * len(word_i) * number_of_chars
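To make the padded layout concrete, here is a minimal sketch in plain Python of the first shape above. The alphabet, the example words, and the helper names (`one_hot`, `encode_padded`) are assumptions made up for illustration; they are not from any library.

```python
# Hypothetical toy setup: lowercase ASCII letters only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    """Return a one-hot list for a single character."""
    vec = [0] * len(ALPHABET)
    vec[CHAR_TO_INDEX[ch]] = 1
    return vec


def encode_padded(words):
    """One-hot encode each word and pad with all-zero rows to the longest word.

    Returns the padded batch and the true length of each word, so the
    padding can be told apart from real characters later.
    """
    max_len = max(len(w) for w in words)
    batch = []
    for w in words:
        rows = [one_hot(ch) for ch in w]
        # Append one all-zero row per missing position.
        rows += [[0] * len(ALPHABET) for _ in range(max_len - len(w))]
        batch.append(rows)
    lengths = [len(w) for w in words]
    return batch, lengths


words = ["cat", "horse", "ox"]
batch, lengths = encode_padded(words)

# number_of_words * len(longest_word) * number_of_chars
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)    # (3, 5, 26)
print(lengths)  # [3, 5, 2]
```

Keeping `lengths` next to the padded batch is the usual way to let later code ignore the zero rows; in PyTorch, for instance, `torch.nn.utils.rnn.pack_padded_sequence` takes such a list of lengths.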

I’m not sure what you’re trying to do. It seems to me that you’re mixing words and characters. What is your “unit of interest” w.r.t. each time step of the RNN? I see just two alternatives:

  • Characters, i.e., at each time step, you feed the next character to the RNN. Here, you can simply one-hot encode each character. Word lengths are in this case not important.
  • Words, i.e., at each time step, you feed the next word to the RNN. Just use an Embedding layer to map each word to a vector representation of a fixed size. For example, each word is mapped to a 100-dim vector. Again, the actual lengths of the words are not important.
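The two alternatives can be sketched in PyTorch roughly as follows. This is only an illustration under assumptions: the alphabet, the tiny vocabulary, and the `<unk>` entry are invented here, and the embedding weights are randomly initialised rather than trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Alternative 1: characters. One one-hot vector per time step.
alphabet = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet
char_ids = torch.tensor([alphabet.index(c) for c in "horse"])
char_steps = F.one_hot(char_ids, num_classes=len(alphabet)).float()
print(char_steps.shape)  # torch.Size([5, 26])

# Alternative 2: words. One fixed-size dense vector per time step.
vocab = {"<unk>": 0, "the": 1, "horse": 2, "runs": 3}  # hypothetical vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=100)

sentence = ["the", "horse", "runs"]
# Words missing from the vocabulary fall back to the <unk> index.
word_ids = torch.tensor([vocab.get(w, vocab["<unk>"]) for w in sentence])
word_steps = embedding(word_ids)
print(word_steps.shape)  # torch.Size([3, 100])
```

In both cases the sequence length is the number of time steps (5 characters, 3 words) and the last dimension is fixed, so the number of letters in a word never has to match across words at the word level.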

You need to create word vectors with something like word2vec or fastText, or create character vectors in a similar fashion. Then each of your input sequences will be a matrix of shape num_words x word_vector_size, or more generically length x feature_dim. You can see examples in the PyTorch tutorials for this:

https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

But the general idea is that you want to avoid using one-hot encodings and instead use dense vector representations.