I am using torchtext to feed characters into a simple CNN.
I am not allowed to use any embedding layers yet.
The input shape to the CNN has to be (128, 1, 100, 256): 128 is the batch size, 1 is the number of channels, 100 is the length of the document, and 256 is the one-hot vector for each of the 100 characters.
I arrived at this shape because the first convolution layer needs a filter size of (20, 256).
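For reference, a minimal sketch of the kind of first layer I mean (the out_channels value of 32 here is just a placeholder): a (20, 256) filter slides over 20-character windows and spans the full one-hot width, which is why the input must be 4-D.

```
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(20, 256))

x = torch.randn(128, 1, 100, 256)  # (batch, channels, doc length, vocab)
out = conv1(x)
print(out.shape)  # torch.Size([128, 32, 81, 1])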
When I use torchtext, I tokenize my strings one character at a time:
char_tokenize = lambda x: list(x)
ENTITY = data.Field(sequential=True, fix_length=100, tokenize=char_tokenize)
ENTITY.build_vocab(train)
train_iter = data.BucketIterator(train, batch_size=128, sort_key=lambda x: len(x.text))
After building the vocab, the tensor the iterator gives me has shape (100, 128).
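As a sanity check (assuming the dataset's field is named text, which the sort_key above suggests):

```
batch = next(iter(train_iter))
print(batch.text.shape)  # torch.Size([100, 128]) -> (fix_length, batch_size)
```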
How do I get from (100, 128) to (128, 1, 100, 256)?
I tried using unsqueeze and manually creating the one-hot vectors for each character, but it seems really convoluted and complex. Is there a simpler and more efficient way that I might have missed?
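Spelled out, the transformation I'm after is something like this sketch (it assumes the vocab size is exactly 256; in practice it would be len(ENTITY.vocab)):

```
import torch.nn.functional as F

# batch.text: LongTensor of shape (100, 128) holding vocab indices
one_hot = F.one_hot(batch.text, num_classes=256)  # (100, 128, 256)
x = one_hot.permute(1, 0, 2)                      # (128, 100, 256)
x = x.unsqueeze(1).float()                        # (128, 1, 100, 256)
```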