Hi, I am trying to use torchtext with a pre-trained ELMo model as the word embeddings. My attempt is as follows:
I have a CSV file where one column contains the training strings.
The pretrained ELMo model exposes a simple function, elmo(), which takes a sentence as input, tokenizes it automatically, and returns a list of torch Tensors, one per word in the sentence.
So I wrote a simple tokenizer:

```python
def myTokenizer(sent):
    return elmo(sent)
```
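For a quick sanity check, this is roughly the behavior I see (the sentence is just an illustration; DIMS is the ELMo embedding size and depends on which pretrained weights are used):

```python
toks = myTokenizer("the quick brown fox")
print(len(toks))      # 4 -- one entry per word
print(toks[0].shape)  # torch.Size([DIMS]) -- one embedding vector per word
```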
I defined the field for the column as:

```python
elmo_field = Field(
    sequential=True,
    use_vocab=False,
    batch_first=True,
    pad_token=torch.zeros(DIMS, dtype=torch.float64),
    dtype=torch.float64,
    tokenize=myTokenizer,
    fix_length=100,
)
```
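For completeness, this is roughly how I wire up the dataset and iterator (the file name, column names, and batch size here are placeholders, not my exact code):

```python
import torch
from torchtext.data import Field, TabularDataset, BucketIterator

# Placeholder label field alongside the ELMo text field.
label_field = Field(sequential=False, use_vocab=False, dtype=torch.long)

dataset = TabularDataset(
    path="train.csv",
    format="csv",
    fields=[("series", elmo_field), ("label", label_field)],
    skip_header=True,
)
valid_loader = BucketIterator(dataset, batch_size=32, sort=False)
```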
But when I iterate over this using a BucketIterator, I get this error:

```
for series, _ in valid_loader:
  File "xxx/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "xxx/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "xxx/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "xxx/lib/python3.7/site-packages/torchtext/data/field.py", line 359, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
ValueError: only one element tensors can be converted to Python scalars
```
I have no clue what might be going wrong. I tried writing a custom Field whose numericalize simply returns its input argument, but that causes issues in later parts of the training code.
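In case it helps, this is the direction I experimented with, sketched out (a hypothetical subclass; I am not sure the shapes are right). Instead of just returning the input, one variant stacks the padded per-word tensors into a single batch tensor:

```python
import torch
from torchtext.data import Field

class ElmoField(Field):
    def numericalize(self, arr, device=None):
        # arr is a list (the batch) of lists of per-word DIMS-dim tensors,
        # already padded to fix_length by Field.pad using the zero pad_token
        # (assumes include_lengths=False, as in my field definition above).
        sents = [torch.stack(sent) for sent in arr]  # each: [fix_length, DIMS]
        return torch.stack(sents).to(device)         # [batch, fix_length, DIMS]
```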
Any help or pointers would be really appreciated; I have spent a lot of time on this but couldn't figure anything out.
If any more information is required, please let me know.
Versions:
torch 1.4.0
torchtext 0.6.0