I have a custom Dataset I’m trying to build out. The actual details of my Dataset are below, but for now I’m going to focus on the following example code.
The goal is to load data in __getitem__(), segment the array into several samples, and then stack those samples and output them with the batch.
from torch.utils.data import Dataset, DataLoader
import torch
import numpy as np
class Example_DS(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        record = self.data[idx]
        # split the 10-element record into two 5-element halves
        X1 = torch.from_numpy(record[:5])
        X2 = torch.from_numpy(record[5:])
        # stack the halves into a [2, 5] tensor
        X = torch.stack([X1, X2])
        sample = {
            'sample': X
        }
        return sample
########################################
data = [np.random.randint(0, 10, 10) for _ in range(6)]
ds = Example_DS(data)
dls = DataLoader(ds, batch_size=2, shuffle=True, num_workers=1)
for batch in dls:
    print(batch['sample'].shape)
As the example code shows, the output tensor for each batch has a shape of [2, 2, 5]. I would like the shape to be [4, 5] instead.
What is the best way to make this happen?
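One direction I've been considering (untested sketch, and flatten_collate is just a name I made up) is a custom collate_fn that merges the per-record dimension into the batch dimension after the default stacking:

```python
import torch

def flatten_collate(batch):
    # batch is a list of {'sample': Tensor[2, 5]} dicts, one per __getitem__ call
    stacked = torch.stack([item['sample'] for item in batch])  # [B, 2, 5]
    # merge the batch dim and the per-record dim: [B, 2, 5] -> [B*2, 5]
    return {'sample': stacked.flatten(0, 1)}
```

This would be passed to the loader as DataLoader(ds, batch_size=2, shuffle=True, collate_fn=flatten_collate), but I haven't verified it against my real data yet.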
More detail as promised:
I’m trying to work with waveform data. I can extract multiple samples from a single audio file, but without this approach I would have to open each audio file N times to get its N samples. I would prefer to open each file once, slice out the N samples, and concat/stack the resulting tensors along with the rest of the files/samples in the batch.
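To make the slicing step concrete, here's a rough sketch of what I mean by "slice the N samples" (the fixed window length and the drop-the-remainder behavior are assumptions for illustration, not my real preprocessing):

```python
import torch

def slice_windows(waveform, window_len):
    # waveform: 1-D Tensor[T] loaded once from a single audio file
    # keep only the part divisible by window_len, then view as windows
    n = waveform.shape[0] // window_len
    return waveform[: n * window_len].view(n, window_len)  # [N_samples, window_len]
```

Each file would be loaded once, passed through something like this, and the resulting [N_samples, window_len] tensors stacked across the batch.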
Current:
loadtime = time_to_load_file * N_files * N_samples
Preferred:
loadtime = time_to_load_file * N_files
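To illustrate the two formulas with made-up numbers (the per-file load time is hypothetical):

```python
# hypothetical figures just to show the scaling difference
time_to_load_file_ms = 50   # milliseconds per file open (made up)
N_files = 100
N_samples = 4

current = time_to_load_file_ms * N_files * N_samples   # reopen each file per sample
preferred = time_to_load_file_ms * N_files             # open each file once
```

With these numbers the current approach spends N_samples times as long on I/O as the preferred one, which is the whole motivation for slicing inside __getitem__().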
Thanks in advance for any help you can provide!
Will update if I find a solution.