Torchtext: Setting field with preprocessing

usakira · February 7, 2019, 2:44am

Hi, I’m working with my own data and struggling to create a dataset with torchtext.

The data consist of tokenized sentences,
e.g. [["He", "plays", "piano"], ["He", "plays", "guitar", "well"], …]

I would like to choose one of them at random for training,
so I set up the Field as follows.

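import random
from torchtext.data import Field

# SpecialToken is defined elsewhere (not shown).
n_samples_field = Field(use_vocab=True,
                        eos_token=SpecialToken.EOS.value,
                        pad_token=SpecialToken.Padding.value,
                        unk_token=SpecialToken.Unknown.value,
                        # Pick one sentence at random; use [""] when there are none.
                        preprocessing=lambda sen: (random.choice(sen)
                                                   if sen != [] else [""]),
                        include_lengths=True)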

But it fails with the error below.
I thought I needed to convert each sentence to integers after preprocessing, but how?

  File "/home/ubuntu/test/train.py", line 132, in run
    for batch in X:
  File "/home/ubuntu/anaconda3/envs/test_env/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/home/ubuntu/anaconda3/envs/test_env/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/home/ubuntu/anaconda3/envs/test_env/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/home/ubuntu/anaconda3/envs/test_env/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in numericalize
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/ubuntu/anaconda3/envs/test_env/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/ubuntu/anaconda3/envs/test_env/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
AttributeError: 'Field' object has no attribute 'vocab'
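
The last frame of the traceback looks up every token in self.vocab.stoi, so the error means the field had no vocab attribute when the batch was numericalized. In the legacy torchtext API that attribute is normally created by calling build_vocab on the field before iterating. The following is a torchtext-free sketch of that lookup; TinyField and its methods are hypothetical stand-ins for illustration, not the library's classes, and the special-token strings are assumptions.

```python
from collections import Counter


class TinyField:
    """Hypothetical stand-in for a text field; not torchtext's Field."""

    def __init__(self, unk_token="<unk>", pad_token="<pad>", eos_token="<eos>"):
        self.unk_token = unk_token
        self.specials = [unk_token, pad_token, eos_token]
        # Note: no `vocab` attribute exists until build_vocab() is called.

    def build_vocab(self, sentences):
        # Count tokens, then assign ids: special tokens first, then sorted tokens.
        counts = Counter(tok for sen in sentences for tok in sen)
        itos = self.specials + sorted(counts)
        self.vocab = {tok: i for i, tok in enumerate(itos)}

    def numericalize(self, batch):
        # Accessing self.vocab raises AttributeError if build_vocab() was skipped.
        unk_id = self.vocab[self.unk_token]
        return [[self.vocab.get(tok, unk_id) for tok in sen] for sen in batch]


field = TinyField()
field.build_vocab([["He", "plays", "piano"], ["He", "plays", "guitar", "well"]])
ids = field.numericalize([["He", "plays", "drums"]])  # "drums" maps to the unknown id
```

Calling numericalize on a fresh TinyField without build_vocab fails with the same kind of AttributeError as in the traceback, which is why the vocabulary step has to come first.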

usakira · February 7, 2019, 5:07am

I see now how preprocessing works, according to the docs quoted below.
So I need to pick one sentence before applying the Field.

preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
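
Picking one sentence before the data reaches the field could be sketched in plain Python as below. pick_one_sentence is a hypothetical helper name, the sample data follows the shape given in the first post, and the [""] fallback mirrors the original lambda; this is only one possible way to do it.

```python
import random


def pick_one_sentence(candidates, rng=random):
    """Return one tokenized sentence from a list of candidate sentences.

    Falls back to [""] for an empty list, mirroring the lambda in the
    original post. The function name is hypothetical.
    """
    return rng.choice(candidates) if candidates else [""]


# Each raw example holds several tokenized sentences.
raw_examples = [
    [["He", "plays", "piano"], ["He", "plays", "guitar", "well"]],
    [],  # an example with no sentences
]

rng = random.Random(0)  # seeded only so the sketch is repeatable
chosen = [pick_one_sentence(ex, rng) for ex in raw_examples]
# `chosen` now holds exactly one token list per example.
```

In this sketch the choice is made once up front, so every epoch would see the same sentence for a given example; re-sampling per epoch would need the selection to happen at batch time instead.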

Bridget_Murphy · January 29, 2024, 11:13am

Good luck on your project.