torchtext.data.iterator.BucketIterator
I am writing some sentiment analysis code with torchtext's BucketIterator and was surprised by how the iterators get built from the datasets.
For example, suppose we have:
import torch
from torchtext import data
from torchtext.data import TabularDataset

# preprocessor is defined elsewhere (not shown here)
TEXT = data.Field(tokenize='spacy', include_lengths=True,
                  preprocessing=lambda x: preprocessor(x), lower=True)
LABEL = data.LabelField(dtype=torch.long)
INDEX = data.RawField()
INDEX.is_target = False
train_data = TabularDataset('./data/train.tsv',
                            format='tsv',
                            skip_header=True,
                            fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT), ('Sentiment', LABEL)])
test_data = TabularDataset('./data/test.tsv',
                           format='tsv',
                           skip_header=True,
                           fields=[('PhraseId', INDEX), ('SentenceId', None), ('Phrase', TEXT)])
TEXT.build_vocab(train_data,
                 #max_size = MAX_VOCAB_SIZE,
                 vectors='glove.6B.100d',
                 unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)
The setup above is probably not very interesting, but the surprising part happens below:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
test_iterator = data.BucketIterator.splits(
    test_data,
    sort=True,
    sort_within_batch=True,
    sort_key=lambda x: len(x.Phrase),
    batch_size=BATCH_SIZE,
    device=device)
With the iterator built from test_data alone, as above, the following code shows:
vars(test_iterator[107].dataset)
Out[47]:
{'Phrase': ['movie', 'becomes', 'heady', 'experience'], 'PhraseId': '156168'}
But when I build both iterators together, the same lookup fails:
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    sort=True,
    sort_within_batch=True,
    sort_key=lambda x: len(x.Phrase),
    batch_size=BATCH_SIZE,
    device=device)
vars(test_iterator[107].dataset)
This throws an error (TypeError: 'BucketIterator' object does not support indexing).
I don't know why, but the only difference is whether the iterators are constructed from train_data and test_data together or from test_data alone. Just wrapping test_data in parentheses, like
test_iterator = data.BucketIterator.splits(
    (test_data),
    ...
makes no difference compared to passing test_data directly (i.e., without parentheses).
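My best guess at what is going on, written as a small sketch that does not import torchtext at all. FakeDataset and FakeIterator are hypothetical stand-ins, and the body of splits is only my assumption about how the real classmethod treats its first argument (indexing into it and building one iterator per element). If that assumption holds, a bare dataset is walked example by example, (test_data) is the same object as test_data because parentheses without a trailing comma do not create a tuple, and a real tuple of two datasets yields two iterators that cannot be indexed.

```python
# Stand-in classes (NOT torchtext). They only imitate the assumed behaviour.

class FakeDataset:
    """Hypothetical stand-in for a Dataset: a sized, indexable list of examples."""

    def __init__(self, examples):
        self.examples = list(examples)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]


class FakeIterator:
    """Hypothetical stand-in for BucketIterator; it defines no __getitem__."""

    def __init__(self, dataset, batch_size, train=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.train = train

    @classmethod
    def splits(cls, datasets, batch_size):
        # Assumption: index into `datasets` and build one iterator per element.
        return tuple(
            cls(datasets[i], batch_size, train=(i == 0))
            for i in range(len(datasets))
        )


train_data = FakeDataset(["a b", "c d e"])
test_data = FakeDataset(["x", "y z", "w"])

# Parentheses alone do nothing; only the trailing comma makes a tuple.
assert (test_data) is test_data
assert isinstance((test_data,), tuple)

# Bare dataset: one "iterator" per example, returned as a tuple.
per_example = FakeIterator.splits(test_data, batch_size=64)
print(type(per_example).__name__, len(per_example))  # tuple 3
print(per_example[1].dataset)                        # y z

# One-element tuple: a single iterator over the whole dataset.
(single_iterator,) = FakeIterator.splits((test_data,), batch_size=64)
print(single_iterator.dataset is test_data)          # True

# Two datasets: two iterators, and indexing either one raises TypeError.
train_iterator, test_iterator = FakeIterator.splits(
    (train_data, test_data), batch_size=64)
try:
    test_iterator[0]
except TypeError as err:
    print("TypeError:", err)
```

If the real splits works like this, then test_iterator[107] in the first snippet would be an iterator wrapping a single example, which would match the vars(...) output shown earlier. Passing (test_data,) with a trailing comma, or calling data.BucketIterator(test_data, ...) directly instead of splits, should then give one iterator over the whole test set, though I have not confirmed this against the library source.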