Take a small subset of data using Dataset object

kirbiyik · July 17, 2018, 2:02pm

I’ve implemented a dataset class for my purpose by subclassing Dataset, and it works properly. I’d like to take a very small subset of the dataset, say 50 samples, to check whether my model can overfit it. However, the data consists of many h5 and json files, so changing this inside my dataset class seems very hard and infeasible.

I tried manipulating the training file by using indexing, but that was not possible since neither the Dataset object nor the enumerate object supports indexing. I can provide additional info or code if requested. The way I use the DataLoader is:

for idx, batch in enumerate(dataloader_train):
    ...

nwesemann · September 23, 2018, 9:37am

Did it work? How did you do it?

pbloem · September 23, 2018, 9:47am

This may not be applicable to your case, but for small sanity checks like these I often just insert a break statement after a few batches:

for idx, batch in enumerate(dataloader_train):
    if idx > 10:
        break
    ...

You should turn shuffling off in the dataloader, to get the same batches each epoch.

You have to edit the lines out afterwards instead of getting a proper regression test, but for quick-and-dirty model development it’s a simple trick.
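
The same cap can also be applied without a break inside the loop body. The following is only a sketch: `itertools.islice` stops after a fixed number of items from any iterable, and the `dataloader_train` list here is a hypothetical stand-in for a real DataLoader so the snippet runs on its own.

```python
from itertools import islice

# Hypothetical stand-in for a DataLoader: any iterable of batches works.
dataloader_train = [[i, i + 1] for i in range(0, 100, 2)]

# 11 batches (idx 0..10), matching the `if idx > 10: break` check.
max_batches = 11

seen = []
for idx, batch in enumerate(islice(dataloader_train, max_batches)):
    seen.append(batch)  # the training step would go here

print(len(seen))
```

Setting `max_batches` from a command-line flag or config value would avoid editing the loop afterwards, though that is a design choice rather than part of the original trick.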

kirbiyik · September 23, 2018, 10:31am

My workaround looked like this:

import numpy as np
from torch.utils.data import DataLoader

# Indices to draw from the dataset: the first 20 samples, in a fixed random order.
picks = np.random.permutation(20)

dataloader_train = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=False,  # sampler and shuffle are mutually exclusive
    sampler=picks,  # any iterable of indices works as a sampler
    collate_fn=dataset.collate_fn,
)
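
An alternative that may be worth trying is `torch.utils.data.Subset`, which wraps an existing dataset and exposes only the chosen indices, so the dataset class and its files stay untouched. This is a minimal sketch, not the approach used above: `ToyDataset`, the size of 1000, the seed and the batch size are all hypothetical placeholders for the real h5/json-backed dataset and its settings.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the custom h5/json-backed dataset."""

    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]


dataset = ToyDataset(1000)

# Fix the seed so the same 50 samples are picked on every run.
generator = torch.Generator().manual_seed(0)
picks = torch.randperm(len(dataset), generator=generator)[:50].tolist()

# Subset only stores the indices and forwards lookups to the dataset.
small = Subset(dataset, picks)

# Shuffling is allowed here because no sampler is passed.
dataloader_small = DataLoader(small, batch_size=10, shuffle=True)

# Count the samples seen in one pass over the small loader.
n_seen = sum(batch.shape[0] for batch in dataloader_small)
```

Note that `Subset` does not forward custom attributes, so a custom `collate_fn` would still need to be taken from the original dataset (`collate_fn=dataset.collate_fn`) rather than from `small`.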