Dataset Multiple Samples per getitem Call

I have a custom Dataset I’m trying to build out. The actual details of my Dataset are below, but for now I’m going to focus on the following example code.

The goal is to load some data in __getitem__(), segment the array into several samples, and then stack those samples and output them with the batch.

from torch.utils.data import Dataset, DataLoader
import torch
import numpy as np

class Example_DS(Dataset):

    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        
        record = self.data[idx]
        
        # split the 10-element record into two 5-element segments
        X1 = record[:5]
        X2 = record[5:]
        
        X1 = torch.from_numpy(X1)
        X2 = torch.from_numpy(X2)
        
        # stack the segments -> one [2, 5] tensor per item
        X = torch.stack([X1, X2])
        
        sample = {'sample': X}
        
        return sample


########################################
data = [
        np.random.randint(0, 10, 10),
        np.random.randint(0, 10, 10),
        np.random.randint(0, 10, 10),
        np.random.randint(0, 10, 10),
        np.random.randint(0, 10, 10),
        np.random.randint(0, 10, 10),
        ]


ds = Example_DS(data)

dls = DataLoader(ds, batch_size=2, shuffle=True, num_workers=1)

for batch in dls:
    print(batch['sample'].shape)

As seen when running the example code, the output tensors for each batch have a shape of [2, 2, 5]. I would like the shape to be [4, 5] instead.

What is the best way to make this happen?

More detail as promised:
I’m trying to work with waveform data. I can get multiple samples from one audio file, but without this method I would have to open each audio file N times, once for each of its N samples. I would prefer to open the file once, slice out the N samples, and concat/stack the resulting tensors along with the rest of the files/samples in the batch.

Current:

loadtime = time_to_load_file * N_files * N_samples

Preferred:

loadtime = time_to_load_file * N_files
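For context, here is a rough sketch of the per-file slicing I have in mind. The class name, the use of torchaudio.load for I/O, and the segment_len / n_segments values are only placeholders for illustration, and it assumes each file is at least n_segments * segment_len frames long:

import torchaudio

class Waveform_DS(Dataset):

    def __init__(self, file_paths, segment_len=16000, n_segments=4):
        self.file_paths = file_paths
        self.segment_len = segment_len
        self.n_segments = n_segments

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # open the file once (waveform shape: [channels, frames])
        waveform, sample_rate = torchaudio.load(self.file_paths[idx])

        # slice N contiguous segments from the first channel
        segments = [
            waveform[0, i * self.segment_len:(i + 1) * self.segment_len]
            for i in range(self.n_segments)
        ]

        # -> [n_segments, segment_len] per file; the DataLoader batches this
        # to [batch_size, n_segments, segment_len], which can be flattened later
        return {'sample': torch.stack(segments)}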

Thanks in advance for any help you can provide!
Will update if I find a solution.

Assuming the returned tensor contains the desired values, you could flatten it in the DataLoader loop via:


for batch in dls:
    sample = batch['sample']
    sample = sample.view(-1, 5)
    print(sample.shape) # should print [4, 5] now

Using a custom collate_fn would probably also work, but I think the view operation might be easier.
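For completeness, a minimal sketch of such a collate_fn for this example (the name flatten_collate is just illustrative; it assumes each item is the dict returned by Example_DS.__getitem__, holding a [2, 5] tensor under 'sample'):

def flatten_collate(batch):
    # batch is a list of dicts, each holding a [2, 5] tensor under 'sample'
    samples = torch.cat([item['sample'] for item in batch], dim=0)  # -> [batch_size * 2, 5]
    return {'sample': samples}


dls = DataLoader(ds, batch_size=2, shuffle=True, num_workers=1, collate_fn=flatten_collate)

for batch in dls:
    print(batch['sample'].shape) # torch.Size([4, 5])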

Thanks for the response!
The view approach is probably not as elegant as the collate_fn idea, but it gets the job done!

Thanks again!

I use PyTorch Lightning, where I have to feed a DataLoader object into the Lightning Trainer, so I don’t get access to the DataLoader loop. That’s where the collate_fn comes in handy.

I am facing the same issue. I would like to know how the collate_fn works here. Could you give a code example?
And if the batch size is 2, how can the splitting be integrated into collate_fn so that the batch won’t end up with a length of 4 after cutting each record in the middle?

Thanks!

The collate_fn will receive a list of samples returned by Dataset.__getitem__ and create the final batch. This code shows its input:

class Example_DS(Dataset):
    def __init__(self):
        self.data = torch.arange(10).float().view(10, 1)
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        x = self.data[idx]
        sample = {
            'sample' : x
            }        
        return sample


def my_collate(batch):
    print(batch)
    print(len(batch))
    return batch


ds = Example_DS()
dls = DataLoader(ds, batch_size=2, shuffle=True, num_workers=1, collate_fn=my_collate)

for batch in dls:
    pass

# Output
# [{'sample': tensor([3.])}, {'sample': tensor([6.])}]
# 2
# [{'sample': tensor([8.])}, {'sample': tensor([2.])}]
# 2
# [{'sample': tensor([5.])}, {'sample': tensor([9.])}]
# 2
# [{'sample': tensor([0.])}, {'sample': tensor([7.])}]
# 2
# [{'sample': tensor([1.])}, {'sample': tensor([4.])}]
# 2

What exactly you want to manipulate, and how, depends on your goal and use case.
The original question was more easily solved by just calling a view op on the final batch.
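As one possible sketch for your case, assuming each __getitem__ returns a dict with a full length-10 record (as in the original example) and you want the splitting done inside the collate_fn while keeping the batch dimension at batch_size, you could stack the halves along a new dimension instead of flattening them (split_collate is just an illustrative name):

def split_collate(batch):
    # batch is a list of dicts, e.g. [{'sample': <length-10 tensor>}, ...]
    halves = [torch.stack([d['sample'][:5], d['sample'][5:]]) for d in batch]
    # stacking keeps the batch dimension at batch_size: [batch_size, 2, 5]
    return {'sample': torch.stack(halves)}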

Also, take a look at default_collate as a reference.