Hello,
I am looking for a way to combine two Datasets into one so that both can be trained in a single loop. However, a batch must never mix samples from the two datasets. In the following example, each batch should contain only values from 1 to 10 or only values from 41 to 50:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, ConcatDataset

df1 = pd.DataFrame(list(range(1, 11)))
df2 = pd.DataFrame(list(range(41, 51)))

class testset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Column 0 of the DataFrame, row label `index`
        return self.data[0][index]

testdataset1 = testset(df1)
testdataset2 = testset(df2)

datasets = []
datasets.append(testdataset1)
datasets.append(testdataset2)
concat_dataset = ConcatDataset(datasets)

loader = DataLoader(
    concat_dataset,
    shuffle=False,
    num_workers=0,
    batch_size=3
)

for data in loader:
    print(data)
tensor([1, 2, 3])
tensor([4, 5, 6])
tensor([7, 8, 9])
tensor([10, 41, 42]) ← That should not exist
tensor([43, 44, 45])
tensor([46, 47, 48])
tensor([49, 50])
In the real case I am combining two time series, and batches that contain values from both datasets cause a bit of trouble…
First, consider whether you really need to iterate over the rows of a DataFrame. Iterating over pandas DataFrame objects is generally slow and defeats the purpose of using a DataFrame. It is an anti-pattern that you should only resort to when you have exhausted every other option. It is better to look for a list comprehension, a vectorized solution, or the DataFrame.apply() method.
Pandas DataFrame loop using a list comprehension:
result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]
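Coming back to the original question about batches that cross dataset boundaries: one possible approach, sketched here under assumptions rather than taken from the thread, is to generate the batch indices yourself and restart batching at each dataset boundary. The class name `PerDatasetBatchSampler` and its `lengths` argument are hypothetical, made up for this illustration. The sketch is plain Python and relies only on the fact that ConcatDataset numbers its samples consecutively, dataset after dataset.

```python
class PerDatasetBatchSampler:
    """Yield batches of global indices that never span two datasets.

    Hypothetical helper: `lengths` holds the size of each dataset, in the
    same order in which the datasets were passed to ConcatDataset.
    """

    def __init__(self, lengths, batch_size):
        self.lengths = list(lengths)
        self.batch_size = batch_size

    def __iter__(self):
        offset = 0  # global index of the first sample of the current dataset
        for n in self.lengths:
            for start in range(0, n, self.batch_size):
                # Cut the batch short at the end of the current dataset
                stop = min(start + self.batch_size, n)
                yield list(range(offset + start, offset + stop))
            offset += n

    def __len__(self):
        # Number of batches: ceil(n / batch_size), summed over all datasets
        return sum(-(-n // self.batch_size) for n in self.lengths)


# Two datasets of 10 samples each, as in the question
sampler = PerDatasetBatchSampler(lengths=[10, 10], batch_size=3)
batches = list(sampler)
for batch in batches:
    print(batch)
```

With the code from the question, such an object could be passed as `DataLoader(concat_dataset, batch_sampler=PerDatasetBatchSampler([len(d) for d in datasets], 3))`. The `batch_sampler` argument accepts any iterable of index lists and is mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`, so those have to be left out. Under these assumptions the last batch of the first dataset contains only the value 10, and the next batch starts at 41. This sketch keeps the samples in order; shuffling within each dataset would need an extra step.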