Truncate an iterable dataset for efficient debugging

Hello,
Here is the problem: I am looking for a concise way to truncate an iterable dataset to a limited number of elements (so I can use it in place of the real dataset and iterate rapidly while debugging).

For a map-style dataset, we can use torch.utils.data.Subset to shorten the dataset, but this doesn't seem to work for the iterable-style case.

The solution I thought of was to transform the iterable-style dataset into a map-style dataset with the following code, before applying torch.utils.data.Subset:

import torch

def convert_to_map_dataset(my_dataset):
    class MyDataset(torch.utils.data.Dataset):
        def __init__(self):
            # materializes the whole iterable in memory
            self.dataset = list(my_dataset)

        def __getitem__(self, idx):
            return self.dataset[idx]

        def __len__(self):
            return len(self.dataset)

    return MyDataset()
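For example (the index range here is arbitrary):

small_dataset = torch.utils.data.Subset(convert_to_map_dataset(train_dataset), indices=range(5))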

I was wondering if there is a more straightforward solution to this problem.

To be more specific, the data is loaded using the following commands:

from torchtext.datasets import Multi30k
train_dataset, val_dataset, test_dataset = Multi30k(language_pair=("en", "de"), split=("train", "valid", "test"))

Thank you 🙂!

Your approach of processing a few samples and wrapping them in a map-style Dataset sounds like a good idea.
However, this line of code:

self.dataset = list(my_dataset)

would try to create all samples, if I’m not mistaken, and could be quite expensive.
Maybe iterating my_dataset for a few steps instead would work, which would allow you to wrap the samples in a TensorDataset.
I’m thinking about something like this:

import torch
from torch.utils.data import TensorDataset

nb_samples = 5  # number of samples to keep for debugging

data = []
target = []
dataset_iter = iter(my_dataset)
for _ in range(nb_samples):
    x, y = next(dataset_iter)
    data.append(x)
    target.append(y)

# stack the collected samples and wrap them in a map-style dataset
data = torch.stack(data)
target = torch.stack(target)
dataset = TensorDataset(data, target)
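Assuming the samples are tensors of matching shape, dataset can then be batched like any map-style dataset:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=2, shuffle=True)
x, y = next(iter(loader))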

Hello,
Thank you for your reply, it’s a good remark indeed!
I replaced my current solution with the following code:

from itertools import islice

m = 5  # number of samples in my new dataset
dataset = list(islice(my_dataset, m))

but I am still not satisfied with this solution, which doesn't seem clean.
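It does work, though: a plain Python list already implements __getitem__ and __len__, so it behaves as a map-style dataset and can be fed to a DataLoader directly. A quick sanity check (the collate_fn returns the raw pairs, since the samples are strings):

from torch.utils.data import DataLoader

# each batch is a plain list of (en, de) sentence pairs
loader = DataLoader(dataset, batch_size=2, collate_fn=lambda batch: batch)
print(next(iter(loader)))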

Concerning your proposed solution: although it is cleaner, it works for regular datasets but not for this one, because the dataset I am working with outputs pairs of sentences of variable length.

If you have any other suggestion to resolve the problem, I'll be glad to hear it.
Thanks 🙂

This would also mean that you are not able to create batches using these samples unless you pad them or return a list, right?
In this case the cleanest approach might be to create a custom IterableDataset using an end counter, as shown e.g. in the example from the docs.
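Something along these lines might work (a minimal sketch; TruncatedIterable is just an illustrative name, and the collate_fn returns the raw sentence pairs as a list so that no padding is needed):

from torch.utils.data import IterableDataset, DataLoader

class TruncatedIterable(IterableDataset):
    # yields at most max_samples items from the wrapped iterable dataset
    def __init__(self, dataset, max_samples):
        super().__init__()
        self.dataset = dataset
        self.max_samples = max_samples

    def __iter__(self):
        for i, sample in enumerate(self.dataset):
            if i >= self.max_samples:
                break
            yield sample

# keep only 5 sentence pairs for quick debugging
small_train = TruncatedIterable(train_dataset, max_samples=5)
loader = DataLoader(small_train, batch_size=2, collate_fn=lambda batch: batch)
for batch in loader:
    print(batch)  # a list of (en, de) sentence pairs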