Data processing as a batch way

ptrblck · March 1, 2018, 11:18pm

You are right. Every batch will grab 10 chunks of size 3600.
I would suggest to preprocess your data in the __getitem__ method, since you will most likely wrap your Dataset into a DataLoader, which can load the batches using multi-processing.
Using this, your DataLoader can grab some batches in the background, while your training loop is still busy.
I have created a small example snippet to play around:

# Create dummy csv data
nb_samples = 110
a = np.arange(nb_samples)
df = pd.DataFrame(a, columns=['data'])
df.to_csv('data.csv', index=False)


# Create Dataset
class CSVDataset(Dataset):
    def __init__(self, path, chunksize, nb_samples):
        self.path = path
        self.chunksize = chunksize
        self.len = nb_samples / self.chunksize

    def __getitem__(self, index):
        x = next(
            pd.read_csv(
                self.path,
                skiprows=index * self.chunksize + 1,  #+1, since we skip the header
                chunksize=self.chunksize,
                names=['data']))
        x = torch.from_numpy(x.data.values)
        return x

    def __len__(self):
        return self.len


dataset = CSVDataset('data.csv', chunksize=10, nb_samples=nb_samples)
loader = DataLoader(dataset, batch_size=10, num_workers=1, shuffle=False)

for batch_idx, data in enumerate(loader):
    print('batch: {}\tdata: {}'.format(batch_idx, data))

The last batch will only have 10 samples.
Is this code of any help?
I think your data should be divisible without a remainder by chunksize, because otherwise the last batch will have different sizes. At least I cannot come up with a fast solution at the moment.