DataPipe warning: Is this a problem?

Hi all, I am building a vocabulary from the Multi30k dataset using build_vocab_from_iterator and the spaCy tokenizer. However, I get a warning saying that "some child DataPipes are not exhausted." Is this an issue with my implementation? Here is the snippet of code where the warning first appears (it sporadically re-appears while running epochs):

import torch
from torchtext.datasets import Multi30k
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_data, validation_data, test_data = Multi30k()  # datapipes yielding (German, English) sentence pairs

tokenizer_en = get_tokenizer('spacy', language="en_core_web_sm")
tokenizer_de = get_tokenizer('spacy', language="de_core_news_sm")

def yield_tokens_en(data_iter):
    # English is the second element of each (de, en) pair
    for _, text in data_iter:
        yield tokenizer_en(text)

def yield_tokens_de(data_iter):
    # German is the first element of each (de, en) pair
    for text, _ in data_iter:
        yield tokenizer_de(text)

vocab_en = build_vocab_from_iterator(yield_tokens_en(train_data), specials=["<pad>", "<bos>", "<eos>", "<unk>"])
vocab_de = build_vocab_from_iterator(yield_tokens_de(train_data), specials=["<pad>", "<bos>", "<eos>", "<unk>"])

And the corresponding output:

UserWarning: Some child DataPipes are not exhausted when __iter__ is called. We are resetting the buffer and each child DataPipe will read from the start again.
  warnings.warn("Some child DataPipes are not exhausted when __iter__ is called. We are resetting "

Relevant library versions:

pytorch==1.12.0
torchtext==0.13.0
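
For reference, I can reproduce the same warning outside of torchtext with a bare forked DataPipe by re-iterating one child while its sibling is not exhausted. This is just a minimal sketch of the warning mechanism; whether Multi30k uses fork/demux-style DataPipes internally in exactly this way is an assumption on my part:

import warnings
from torch.utils.data.datapipes.iter import IterableWrapper

warnings.simplefilter("always")  # make sure the UserWarning is not suppressed after the first occurrence

dp = IterableWrapper(range(10))
child_a, child_b = dp.fork(num_instances=2)  # two children sharing one buffer

list(child_a)  # consume child_a completely; child_b is left untouched
list(child_a)  # re-iterating child_a while child_b is not exhausted
               # -> "Some child DataPipes are not exhausted ..." and the shared buffer resets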


What I have noticed is that when this warning appears, iterating through the corresponding dataloader gives me the same batch for several consecutive iterations:

for sample in dataloader:
    # each sample unpacks into six elements; only the source batch is inspected here
    source_batch, _, _, _, _, _ = sample

When I compare the source_batch from the first iteration with the source_batch from roughly the next five iterations, they are exactly the same. I wonder if this is related to the warning above.
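
To compare the batches directly, something like the following can be used (a small sketch; it assumes, as in my loop above, that the source batch is the first element of each sample):

import torch

# Collect the source tensor from the first few batches
collected = []
for i, sample in enumerate(dataloader):
    source_batch = sample[0]          # first of the six elements returned per batch
    collected.append(source_batch.clone())
    if i == 5:
        break

# Compare each later batch against the first one
for i, batch in enumerate(collected[1:], start=1):
    print(f"batch {i} identical to batch 0:", torch.equal(collected[0], batch))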