torchtext.data.BucketIterator.splits for subset not working

awsgcp · February 1, 2020, 10:13am

For text data, torchtext.data.BucketIterator.splits works well with a Dataset. However, to split the dataset into training and validation sets I use torch.utils.data.random_split, which returns Subset objects rather than Dataset objects.

This causes a problem: if we use torch.utils.data.random_split to split a dataset into training and validation sets, how can we still use torchtext.data.BucketIterator.splits to generate train_loader and validate_loader from the subsets?

Thank you very much.

awsgcp · February 1, 2020, 12:27pm

AttributeError: 'Subset' object has no attribute 'sort_key'

The error above appears when torchtext.data.BucketIterator.splits is applied to a Subset. Since a Subset is not a Dataset, how can we easily generate a loader from a Subset? Thank you very much.
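A minimal sketch of why the attribute goes missing, using only torch. ToyTextDataset is a hypothetical stand-in for a dataset that carries a sort_key attribute (it is not the torchtext class). The point is only that random_split wraps the dataset in a Subset, and Subset does not forward attributes of the dataset it wraps.

```python
from torch.utils.data import Dataset, Subset, random_split


class ToyTextDataset(Dataset):
    """Hypothetical stand-in for a dataset that carries a sort_key attribute."""

    def __init__(self, sentences):
        self.sentences = sentences
        # An attribute like this is what the iterator fails to find on a Subset.
        self.sort_key = lambda sentence: len(sentence)

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        return self.sentences[index]


full = ToyTextDataset([["a"], ["b", "c"], ["d", "e", "f"], ["g"], ["h", "i"]])
train, valid = random_split(full, [4, 1])

# random_split returns Subset wrappers, not instances of the original class.
print(type(train).__name__)

# The attribute lives on the original dataset, not on the wrapper.
print(hasattr(full, "sort_key"))
print(hasattr(train, "sort_key"))

# The wrapped dataset stays reachable through the .dataset attribute.
print(hasattr(train.dataset, "sort_key"))
```

So the wrapped dataset and the chosen indices are still available as train.dataset and train.indices, but anything that reads sort_key directly from the object it is given will fail on a Subset.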

awsgcp · February 1, 2020, 1:05pm

Found a solution: the torchtext Dataset class itself has a method called split, which splits the dataset by ratio. That solves the problem of splitting a dataset into training and validation sets.

The split method of the Dataset class returns Dataset objects, not Subset objects, unlike torch.utils.data.random_split.

However, this raises a question: when should torch.utils.data.random_split be used, and what is Subset for?

Hope the solution helps people with similar problems. Thanks.
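As far as I can tell, part of the answer is that a Subset is meant to be passed to torch.utils.data.DataLoader, which only needs indexing and a length. A minimal sketch with made-up tensors follows; the sizes, the seed, and the variable names are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy data: 10 samples with 2 features each, plus one integer label per sample.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
full = TensorDataset(features, labels)

# A seeded generator makes the random split reproducible.
generator = torch.Generator().manual_seed(0)
train, valid = random_split(full, [8, 2], generator=generator)

# DataLoader accepts the Subset objects directly.
train_loader = DataLoader(train, batch_size=4, shuffle=True)
valid_loader = DataLoader(valid, batch_size=2)

for batch_features, batch_labels in train_loader:
    print(batch_features.shape, batch_labels.shape)
```

So random_split seems to fit the torch.utils.data pipeline, while the torchtext iterators expect the torchtext Dataset class.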

zhangguanheng66 · February 3, 2020, 4:02pm

Thanks @awsgcp. Indeed, I think torchtext has some duplicate code, which should be retired. See an issue post here where we discuss a new abstraction, which is more compatible with torch.utils.data. In that case, we don’t need to maintain those duplicate functions anymore.