Hi all, I am constructing a vocabulary from the Multi30k dataset using build_vocab_from_iterator and the spaCy tokenizer. However, I get a warning saying that "some child DataPipes are not exhausted." Is this an issue with my implementation? Here is the snippet where the warning first appears (it also sporadically reappears while running epochs):
import torch
from torchtext.datasets import Multi30k
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
train_data, validation_data, test_data = Multi30k()
tokenizer_en = get_tokenizer('spacy', language="en_core_web_sm")
tokenizer_de = get_tokenizer('spacy', language="de_core_news_sm")
def yield_tokens_en(data_iter):
    # Multi30k yields (de, en) pairs by default, so English is the second element
    for _, text in data_iter:
        yield tokenizer_en(text)

def yield_tokens_de(data_iter):
    # ...and German is the first element
    for text, _ in data_iter:
        yield tokenizer_de(text)
vocab_en = build_vocab_from_iterator(yield_tokens_en(train_data), specials=["<pad>", "<bos>", "<eos>", "<unk>"])
vocab_de = build_vocab_from_iterator(yield_tokens_de(train_data), specials=["<pad>", "<bos>", "<eos>", "<unk>"])
And the corresponding warning output:
UserWarning: Some child DataPipes are not exhausted when __iter__ is called. We are resetting the buffer and each child DataPipe will read from the start again.
warnings.warn("Some child DataPipes are not exhausted when __iter__ is called. We are resetting "
Relevant library versions:
pytorch==1.12.0
torchtext==0.13.0
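My guess is that the two build_vocab_from_iterator calls each start a fresh pass over the same datapipe, and the second pass begins before the first pipe's children are fully drained, which triggers the reset warning. A plain Python iterator shows the underlying two-pass problem, and caching the pairs in a list is the workaround I'm considering (the toy pairs below are made up, just standing in for Multi30k):

```python
# Toy (de, en) pairs standing in for the Multi30k datapipe (hypothetical data).
pairs_iter = iter([("zwei Katzen", "two cats"), ("ein Hund", "one dog")])

first_pass = [en for _, en in pairs_iter]   # consumes the iterator
second_pass = [de for de, _ in pairs_iter]  # already exhausted -> []

# Materializing into a list lets both vocab builds read the same data:
cached = [("zwei Katzen", "two cats"), ("ein Hund", "one dog")]
en_texts = [en for _, en in cached]  # first "vocab" pass
de_texts = [de for de, _ in cached]  # second pass still sees everything
```

So in my real code I would do `train_pairs = list(train_data)` once and pass `train_pairs` to both yield_tokens_en and yield_tokens_de. Is that the right fix here, or does it defeat the purpose of the datapipe-based datasets?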