On a pre-existing dataset, I can do:
import torch
from torchtext import datasets
from torchtext import data

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
But if I define a custom dataset, this doesn't seem to be possible:
train_data, test_data = dataset.splits(TEXT, LABEL)
AttributeError: 'CustomDataset' object has no attribute 'splits'
What is the usual workflow here? Also, if I check the type of datasets.IMDB, what it gives is confusing.
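For context, here is a minimal sketch of the kind of custom dataset I mean (the class body and all names here are illustrative, not my exact code):

from torchtext import data

class CustomDataset(data.Dataset):
    def __init__(self, raw_pairs, text_field, label_field, **kwargs):
        # raw_pairs: an iterable of (text, label) tuples
        fields = [('text', text_field), ('label', label_field)]
        examples = [data.Example.fromlist([t, l], fields) for t, l in raw_pairs]
        super().__init__(examples, fields, **kwargs)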
Not NLP, but vision datasets can also be custom and split; you can read up on that here:
import torch

# (PennFudanDataset, get_transform and get_model_instance_segmentation
# are helpers defined in the tutorial this snippet comes from)

# train on the GPU, or on the CPU if a GPU is not available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has two classes only - background and person
num_classes = 2

# use our dataset and defined transformations
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

# split the dataset into a train and a test set
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4)
data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4)

# get the model using our helper function
model = get_model_instance_segmentation(num_classes)
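As a side note, if you don't need the tutorial's fixed 50-example hold-out, torch.utils.data.random_split does the same kind of split in one call; a minimal sketch, assuming dataset is any map-style Dataset (including a custom one):

import torch
from torch.utils.data import random_split

# hold out roughly 20% of the examples for testing
n_test = len(dataset) // 5
train_set, test_set = random_split(dataset, [len(dataset) - n_test, n_test])

Note the tutorial builds two dataset objects so that train and test can use different transforms; random_split on a single object won't give you that for free.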
Thanks for the reply! In my case, I also want to apply the tokenize function to the data. I am not sure that will be possible in the way you described.
As I understand it, those are two very different goals:
1. Splitting a dataset into train/test/validation/etc.: as alluded to in the earlier post, you can do something with the example from vision.
2. Tokenizing: this is a step earlier in the process, in which you create your dataset of tokens. After you have your dataset of tokens, split it up randomly (see the sketch below).
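To make that order concrete, here is a rough sketch of tokenize-first-then-split, assuming texts and labels are plain Python lists you already have (all names here are placeholders):

import random
import spacy

nlp = spacy.blank('en')  # bare English tokenizer; swap in your own strategy

# 1) tokenize every raw example first, building a dataset of tokens
tokenized = [([tok.text for tok in nlp.tokenizer(text)], label)
             for text, label in zip(texts, labels)]

# 2) only then split the tokenized examples randomly
random.shuffle(tokenized)
cut = int(0.8 * len(tokenized))
train_examples, test_examples = tokenized[:cut], tokenized[cut:]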
If you want to perform the tokenization as part of the splitting, just take my example, replace the part where I split the data, and use your tokenization strategy instead.
Hope I am understanding you correctly!