I am using torchtext to feed characters into a simple CNN.
I am not allowed to use any embedding layers yet.
The input shape to the CNN has to be (128, 1, 100, 256): 128 is the batch size, 1 is the number of channels, 100 is the length of the document, and 256 is the one-hot vector for each of the 100 characters.
I arrived at this shape because the first convolution layer needs a filter size of (20, 256).
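For reference, a minimal sketch of the kind of first layer I mean (the out_channels value of 32 here is just a placeholder): a (20, 256) filter slides over 20-character windows and spans the full one-hot width, which is why the input must be 4-D.

```
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(20, 256))

x = torch.randn(128, 1, 100, 256)  # (batch, channels, doc length, vocab)
out = conv1(x)
print(out.shape)  # torch.Size([128, 32, 81, 1])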
When I use torchtext, I tokenize my strings one character at a time:
char_tokenize = lambda x: list(x)
ENTITY = data.Field(sequential=True, fix_length=100, tokenize=char_tokenize)
ENTITY.build_vocab(train)
train_iter = data.BucketIterator(train, batch_size=128, sort_key=lambda x: len(x.text))
After building the vocab, the tensor the iterator gives me has shape (100, 128).
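As a sanity check (assuming the dataset's field is named text, which the sort_key above suggests):

```
batch = next(iter(train_iter))
print(batch.text.shape)  # torch.Size([100, 128]) -> (fix_length, batch_size)
```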
How do I get from (100, 128) to (128, 1, 100, 256)?
I tried using unsqueeze and manually creating the one-hot vectors for each character, but it seems really convoluted and complex. Is there a simpler and more efficient way that I might have missed?
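Spelled out, the transformation I'm after is something like this sketch (it assumes the vocab size is exactly 256; in practice it would be len(ENTITY.vocab)):

```
import torch.nn.functional as F

# batch.text: LongTensor of shape (100, 128) holding vocab indices
one_hot = F.one_hot(batch.text, num_classes=256)  # (100, 128, 256)
x = one_hot.permute(1, 0, 2)                      # (128, 100, 256)
x = x.unsqueeze(1).float()                        # (128, 1, 100, 256)
```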