Using splits on Custom dataset

On a pre-existing dataset, I can do:

import torch
from torchtext import datasets
from torchtext import data
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

But in case I define a custom dataset, it doesn’t seem possible.

In: type(dataset)
Out: dataset.CustomDataset
train_data, test_data = dataset.splits(TEXT, LABEL)

AttributeError: 'CustomDataset' object has no attribute 'splits'

What is the usual workflow? If I check the type of datasets.IMDB, it gives type, which is confusing.

In: type(datasets.IMDB)
Out: type
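
The `Out: type` result is expected: `datasets.IMDB` is a class rather than an instance, and `splits` is a classmethod called on the class itself. A custom class only has `splits` if it defines or inherits one. Below is a minimal plain-Python sketch of that pattern; `ToyDataset`, `test_fraction` and `seed` are hypothetical names for illustration, not torchtext's actual implementation.

```python
import random


class ToyDataset:
    """Hypothetical stand-in for a dataset class; not torchtext code."""

    def __init__(self, examples):
        self.examples = list(examples)

    def __len__(self):
        return len(self.examples)

    @classmethod
    def splits(cls, examples, test_fraction=0.2, seed=0):
        # Called on the class: shuffles a copy of the examples and
        # returns two new instances (train, test).
        shuffled = list(examples)
        random.Random(seed).shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return cls(shuffled[n_test:]), cls(shuffled[:n_test])


# A class object is itself an instance of `type`.
print(type(ToyDataset))

train_data, test_data = ToyDataset.splits(range(10))
print(len(train_data), len(test_data))
```

So for a custom dataset, one option is to write a `splits`-style classmethod like this yourself.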

This is not NLP, but vision datasets can also be custom and split. You can read up on that here:

def main():
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    # our dataset has two classes only - background and person
    num_classes = 2
    # use our dataset and defined transformations
    dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
    dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

    # split the dataset in train and test set
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

    # define training and validation data loaders
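    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=1, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)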

    # get the model using our helper function
    model = get_model_instance_segmentation(num_classes)

Hi emcap,
Thanks for the reply! In my case, I also want to apply the tokenize function to the data. I am not sure that will be possible with the approach you described.

As I understand it, those are two very different goals.

  1. Splitting a dataset into train/test/validation sets
    As alluded to in the earlier post, you can follow the approach from the vision example.

  2. Tokenization
    This is an earlier step in the process, in which you create your dataset of tokens. Once you have the dataset of tokens, split it up randomly.


If you want to use tokenization as the means of splitting, just replace the part of my example where I split the index array with your tokenization strategy.
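
To make the order of the two steps concrete, here is a rough plain-Python sketch: tokenize everything first, then split by shuffled indices, as in the vision example. The `tokenize` function is only a placeholder (lowercase plus whitespace split) standing in for spaCy or whatever tokenizer you use, and `tokenize_then_split`, `test_fraction` and `seed` are names made up for this illustration.

```python
import random


def tokenize(text):
    # Placeholder tokenizer: lowercase and split on whitespace.
    # Swap in spaCy or another tokenizer here.
    return text.lower().split()


def tokenize_then_split(texts, labels, test_fraction=0.25, seed=0):
    # Step 1: tokenize every example up front.
    examples = [(tokenize(t), y) for t, y in zip(texts, labels)]
    # Step 2: shuffle the indices and cut them into test and train parts.
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test = [examples[i] for i in indices[:n_test]]
    train = [examples[i] for i in indices[n_test:]]
    return train, test


texts = ["A great film", "Terrible plot", "Loved it", "Not my thing"]
labels = [1.0, 0.0, 1.0, 0.0]
train_data, test_data = tokenize_then_split(texts, labels)
print(len(train_data), len(test_data))
```

The split only ever sees already-tokenized examples, so the two steps stay independent of each other.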

Hope I am understanding you,