Can't Iterate Validation set using BucketIterator

Hello PyTorch experts,

I am facing an issue while using BucketIterator. Below is a code snippet that splits the data into train and validation sets. I can iterate over the training data, but I get an error on the validation set. I have tried different seeds to check whether there is an issue in the data, but couldn't resolve it.

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device=device)

for batch in valid_iterator:
    text, text_lengths = batch.text
    print(text)

Based on the error message, it seems a specific sample in the validation dataset cannot be loaded due to a KeyError.
You could check the index of the failing sample via:

for idx, batch in enumerate(valid_iterator):
    print(idx)
    text, text_lengths = batch.text

and based on this index check why this sample is failing to load.
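
If the loop raises before anything is printed, a try/except variant (a rough sketch continuing from the snippet above; note that BucketIterator buckets and reorders samples, so the batch index won't map directly back to a row in your file) can at least surface the missing key:

idx = -1
try:
    for idx, batch in enumerate(valid_iterator):
        text, text_lengths = batch.text
except KeyError as e:
    # idx + 1 is the batch whose construction failed; e is the missing vocab key
    print(f"batch {idx + 1} failed with missing vocab key: {e!r}")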

Hello @ptrblck, thank you for responding.
I have tried your snippet.
I don't think it's a data issue, as I have changed seeds and the split_ratio.
The error is always in the validation set.
If there were an issue in the data, I would expect train_iterator to throw an error as well, but it runs perfectly well.

Full code snippet for reference:

import random

import torch
from torchtext import data

TEXT = data.Field(tokenize='spacy', batch_first=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.float, batch_first=True)

fields = [('question_text', TEXT), ('target', LABEL)]
training_data = data.TabularDataset(path='quora.csv', format='csv', fields=fields, skip_header=True)

SEED = 20
# random.seed() returns None; this idiom works because it seeds the global
# RNG state that split() falls back on when random_state is None
train_data, valid_data = training_data.split(split_ratio=0.7, random_state=random.seed(SEED))

TEXT.build_vocab(train_data, min_freq=3, vectors="glove.6B.100d")
LABEL.build_vocab(train_data)

# set batch size
BATCH_SIZE = 64

# load the iterators
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.question_text),
    sort_within_batch=True, sort=False)

for idx, batch in enumerate(valid_iterator):
    print(idx)
    text, text_lengths = batch.question_text

Can you try building the vocab using train_data and valid_data combined?
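Field.build_vocab accepts multiple datasets, so (a sketch based on your snippet above) it would look like:

TEXT.build_vocab(train_data, valid_data, min_freq=3, vectors="glove.6B.100d")
LABEL.build_vocab(train_data, valid_data)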

This one actually worked.
It seems that when build_vocab is run on train_data alone, anything that appears only in the validation split gets no index in the vocab; the GloVe vectors don't change that, since they only attach embeddings to tokens already in the vocab rather than extending it.
I was getting various KeyErrors and then realized this was the issue.
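
For anyone who lands here later: my understanding is that LabelField is built without an <unk> token, so while unseen text tokens simply map to <unk>, a label value that only occurs in the validation split has no entry in LABEL.vocab.stoi and raises a KeyError during numericalization. A quick check along these lines (reusing the names from the snippet above) shows whether that is the case:

# label values present in valid_data but absent from the vocab built on train_data;
# any entry here will raise a KeyError when its batch is numericalized
missing_labels = {ex.target for ex in valid_data} - set(LABEL.vocab.stoi)
print(missing_labels)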