How to iterate through a composed dataset with no overlapping batches?

Hello,
I am looking for a way to combine two Datasets into one so that it can be trained in a single loop. However, the batches are not allowed to mix between the datasets. In the following example there should only be batches in the ranges 1 to 10 and 41 to 50:

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, ConcatDataset

df1 = pd.DataFrame(list(range(1,11)))
df2 = pd.DataFrame(list(range(41,51)))

class testset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # column 0 of the DataFrame holds the values
        return self.data[0][index]

testdataset1 = testset(df1)
testdataset2 = testset(df2)

datasets = []
datasets.append(testdataset1)
datasets.append(testdataset2)

concat_dataset = ConcatDataset(datasets)

loader = DataLoader(
    concat_dataset,
    shuffle=False,
    num_workers=0,
    batch_size=3
)

for data in loader:
    print(data)

tensor([1, 2, 3])
tensor([4, 5, 6])
tensor([7, 8, 9])
tensor([10, 41, 42]) ← That should not exist
tensor([43, 44, 45])
tensor([46, 47, 48])
tensor([49, 50])

In the real case I am combining two time series, where batches that overlap with values from both datasets cause a little bit of trouble…

This shouldn’t be a tough one, right? :stuck_out_tongue:


First, consider whether you really need to iterate over the rows of a DataFrame at all. Iterating through pandas DataFrame objects is generally slow and defeats the whole purpose of using a DataFrame; it is an anti-pattern you should only fall back on when you have exhausted every other option. Prefer a list comprehension, a vectorized solution, or the DataFrame.apply() method instead.

Pandas DataFrame loop using list comprehension

result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]

Pandas DataFrame loop using DataFrame.apply()

result = df.apply(lambda row: row["Name"] + ", " + str(row["TotalMarks"]) + ", " + row["Grade"], axis=1)

You should use two different DataLoaders so that the batches stay separate.

This is just an example that exhausts the first loader before the second, but you could also draw batches randomly from one loader or the other, alternate between them, or whatever fits your use case.

Hope this helps.

dl1 = DataLoader(
    testdataset1,
    shuffle=False,
    num_workers=0,
    batch_size=3
)

dl2 = DataLoader(
    testdataset2,
    shuffle=False,
    num_workers=0,
    batch_size=3
)

iter1 = iter(dl1)
iter2 = iter(dl2)

for i in range(len(dl1)+len(dl2)):
    if i < len(dl1):
        batch = next(iter1)
    else:
        batch = next(iter2)
    print(batch)
#tensor([1, 2, 3])
#tensor([4, 5, 6])
#tensor([7, 8, 9])
#tensor([10])
#tensor([41, 42, 43])
#tensor([44, 45, 46])
#tensor([47, 48, 49])
#tensor([50])
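The manual index bookkeeping above can also be avoided with itertools.chain, which exhausts the first DataLoader before starting the second, so no batch ever mixes values from both datasets. This is a minimal sketch; the RangeSet class is a hypothetical stand-in for your real Dataset:

```python
import torch
from itertools import chain
from torch.utils.data import Dataset, DataLoader

class RangeSet(Dataset):
    """Hypothetical minimal dataset wrapping a plain list of values."""
    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        return self.values[index]

dl1 = DataLoader(RangeSet(list(range(1, 11))), batch_size=3)
dl2 = DataLoader(RangeSet(list(range(41, 51))), batch_size=3)

# chain() yields every batch of dl1, then every batch of dl2,
# so dataset boundaries are never crossed inside a batch
for batch in chain(dl1, dl2):
    print(batch)
```

The last batch of each loader is short (tensor([10]) and tensor([50])), exactly as in the two-iterator version above.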