How do I use torchtext with contextual word embeddings?

Hi, I am trying to use torchtext with pre-trained ELMo word embeddings. My attempt is as follows:

I have a CSV file where one column contains the training sentences as strings.

The pre-trained ELMo model provides a simple function `elmo()` which, when called with a sentence as input, automatically tokenizes it and returns a list of torch tensors, where the number of elements in the list equals the number of words in the sentence.
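
For illustration, with a hypothetical `DIMS` of 1024, the behaviour I described would look like:

```python
vecs = elmo("the cat sat")  # elmo() tokenizes the sentence itself
len(vecs)      # 3 -- one tensor per word
vecs[0].shape  # torch.Size([1024]), i.e. (DIMS,)
```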

So I wrote a simple tokenizer:

```python
def myTokenizer(sent):
    return elmo(sent)
```

I defined the field for the column as:

```python
import torch
from torchtext.data import Field

elmo_field = Field(
    sequential=True,
    use_vocab=False,
    batch_first=True,
    pad_token=torch.zeros(DIMS, dtype=torch.float64),
    dtype=torch.float64,
    tokenize=myTokenizer,
    fix_length=100,
)
```
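
For context, the dataset and iterator are wired up roughly like this (the file name, column name, and batch size below are placeholders, not my exact code):

```python
from torchtext.data import BucketIterator, TabularDataset

# Placeholder names: the real CSV has more columns than shown here.
valid_ds = TabularDataset(
    path="valid.csv", format="csv",
    fields=[("series", elmo_field)],
    skip_header=True,
)
valid_loader = BucketIterator(valid_ds, batch_size=32, sort=False)
```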

But when I iterate over this using `BucketIterator`, I get this error:

```python
for series, _ in valid_loader:
```

```
File "xxx/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
File "xxx/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
File "xxx/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
File "xxx/lib/python3.7/site-packages/torchtext/data/field.py", line 359, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
ValueError: only one element tensors can be converted to Python scalars
```

I have no clue what might be going wrong. I tried writing a custom field whose `numericalize` function simply returns the input argument, but that causes issues in later parts of the training code.

Any help or pointers would be really appreciated; I have spent a lot of time on this but couldn't figure anything out.

If any more information is required, please let me know.
Versions:
- torch 1.4.0
- torchtext 0.6.0

I’m not sure where your code is exactly failing, but the error message is raised if you try to convert a tensor with more than a single element to a Python scalar value:

```python
torch.randn(2).item()
# ValueError: only one element tensors can be converted to Python scalars

torch.randn(1).item()  # works
```
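
Given your traceback, `numericalize` ends up calling `torch.tensor(arr, ...)` where `arr` is (after padding) a nested list of multi-element word tensors, and `torch.tensor` cannot build a tensor from those, which would raise exactly this error. A minimal sketch of the likely failure, and `torch.stack` as one way to combine such a list instead:

```python
import torch

words = [torch.randn(5), torch.randn(5)]  # list of per-word embedding tensors
torch.tensor(words)
# ValueError: only one element tensors can be converted to Python scalars

torch.stack(words).shape  # torch.Size([2, 5]) -- stacking works
```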

PS: you can post code snippets by wrapping them in three backticks ```, which makes debugging easier.

That example helps. I will try looking in that direction.

A related question: if I iterate over the data multiple times (e.g., over multiple epochs), will `myTokenizer` and the processing functions run every time I iterate over the whole dataset, or only on the first iteration? If all the tokenizing and processing is performed every time, wouldn't it be better to first store all of the processed data in a matrix and then perform multiple iterations over that?

I don’t know where the tokenizing is performed, but the __getitem__ method of your Dataset will be called to load and process each sample.
If the tokenizing (and other processing steps) are performed there, they will be executed in each iteration for all epochs, so you might want to preprocess the data beforehand.
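
As a rough, untested sketch of that idea (`elmo`, the sentence list, and the labels are placeholders based on your description):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

MAX_LEN = 100  # matches fix_length in the Field above

def embed_sentence(sent):
    # Run the expensive elmo() call once per sentence, then pad/truncate
    # to a fixed length so all examples stack into a single tensor.
    vecs = torch.stack(elmo(sent))[:MAX_LEN]
    pad = torch.zeros(MAX_LEN - vecs.size(0), vecs.size(1), dtype=vecs.dtype)
    return torch.cat([vecs, pad], dim=0)

# sentences and labels stand in for the columns of your CSV.
features = torch.stack([embed_sentence(s) for s in sentences])
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_features, batch_labels in loader:
    ...  # no tokenizing or embedding happens here anymore
```

This trades memory for speed: the embeddings are computed once, and every epoch afterwards just reads plain tensors.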