BucketIterator shows different behavior

torchtext.data.iterator.BucketIterator

I am writing some sentiment analysis code using torchtext's BucketIterator and was surprised by its behavior depending on how the dataset is passed in.

for example if we have

import torch
from torchtext import data
from torchtext.data import TabularDataset


TEXT = data.Field(tokenize = 'spacy', include_lengths = True, preprocessing= lambda x: preprocessor(x), lower=True)
LABEL = data.LabelField(dtype = torch.long)
INDEX = data.RawField()
INDEX.is_target = False

train_data = TabularDataset('./data/train.tsv', 
                            format='tsv',
                            skip_header=True,
                            fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT), ('Sentiment', LABEL)])
test_data = TabularDataset('./data/test.tsv', 
                            format='tsv',
                            skip_header=True,
                            fields=[('PhraseId', INDEX), ('SentenceId', None), ('Phrase', TEXT)])
TEXT.build_vocab(train_data,
                #max_size = MAX_VOCAB_SIZE,
                vectors = 'glove.6B.100d',
                unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

The code above is unremarkable; the surprising part is below:

BATCH_SIZE= 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

test_iterator = data.BucketIterator.splits(
    test_data,
    sort = True,
    sort_within_batch=True,
    sort_key = lambda x: len(x.Phrase),
    batch_size = BATCH_SIZE,
    device = device)

With data like the above, the following code

vars(test_iterator[107].dataset)

Out[47]:

{'Phrase': ['movie', 'becomes', 'heady', 'experience'], 'PhraseId': '156168'}

But below shows

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    sort = True,
    sort_within_batch=True,
    sort_key = lambda x: len(x.Phrase),
    batch_size = BATCH_SIZE,
    device = device)

vars(test_iterator[107].dataset)

it throws an error: TypeError: 'BucketIterator' object does not support indexing
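The difference is explainable if splits() builds one iterator per element of whatever is passed as its datasets argument: a torchtext Dataset is itself iterable over its Examples, so passing it bare yields a tuple with one iterator per example, which is indexable; a tuple of datasets yields one iterator per dataset, and a single BucketIterator is not indexable. Below is a plain-Python sketch of that logic; FakeIterator and fake_splits are invented stand-ins, not torchtext's actual classes:

```python
# Invented stand-in for BucketIterator: just records its dataset.
class FakeIterator:
    def __init__(self, dataset):
        self.dataset = dataset

def fake_splits(datasets):
    # Mimics the assumed splits() behavior: one iterator per element of `datasets`.
    return tuple(FakeIterator(d) for d in datasets)

# Stands in for a Dataset, which iterates over its Examples.
examples = [f"example-{i}" for i in range(200)]

# Passing the "dataset" bare: splits iterates over the examples themselves,
# producing an indexable tuple of 200 iterators -- so [107] "works".
iters = fake_splits(examples)
print(iters[107].dataset)  # -> example-107

# Passing a proper tuple of datasets produces one iterator per dataset;
# indexing that single iterator with [107] would raise TypeError.
(test_iter,) = fake_splits((examples,))
print(type(test_iter).__name__)  # -> FakeIterator
```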

I don't know why, but the only difference is whether you construct the iterators from train_data and test_data together or from test_data alone. Merely wrapping test_data in parentheses, like

 test_iterator = data.BucketIterator.splits(
    (test_data),
...

makes no difference compared to passing test_data bare (i.e., without parentheses), since in Python (test_data) is not a tuple; only (test_data,) with a trailing comma is.
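The parentheses observation comes down to a basic Python fact, shown here with a plain list standing in for the dataset (no torchtext needed):

```python
data_list = ["a", "b", "c"]  # stands in for test_data

# Parentheses alone do not create a tuple; (x) is just x.
print((data_list) is data_list)         # -> True

# A trailing comma is what makes a one-element tuple.
print(isinstance((data_list,), tuple))  # -> True
```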

You should use BucketIterator.splits() only when you actually have multiple splits. If you want a BucketIterator for a single split, e.g. train or test, use BucketIterator directly. So your case above, where you only pass test_data, should be changed to:

test_iterator = data.BucketIterator(
    test_data,
    sort = True,
    sort_within_batch=True,
    sort_key = lambda x: len(x.Phrase),
    batch_size = BATCH_SIZE,
    device = device)