How train_test_split and dataloader work together

John_Deatherage · April 5, 2023, 4:46pm

I have a Dataset class (DS) with 21006 rows, 75 feature columns and 1 output column. My Dataset class splits the data into X and y (and converts both into torch tensors…).
DS.X is 21006 x 75 and DS.y has 21006 entries. I've verified this with print(len(DS.X), len(DS.y)) and print(DS.X.shape, DS.y.shape):

len of DS.X & DS.y 21006 21006
shape of … torch.Size([21006, 75]) torch.Size([21006])

I want to pass it through train_test_split: X_train, X_test, y_train, y_test = train_test_split(DS.X, DS.y, test_size=.25, shuffle=True)

After the train_test_split, I get:
len of X_train & y_train 15754 15754
shape of … torch.Size([15754, 75]) torch.Size([15754]) (note: 75% of 21006 is 15754.5, so 15754 rows end up in the training set)
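The split sizes can be checked by hand. The sketch below assumes scikit-learn rounds the test share up and gives the remainder to the training set, which is consistent with the 15754 training rows reported above; the variable names are only for illustration.

```python
import math

n_samples = 21006
test_size = 0.25

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)   # 5251.5 rounds up to 5252
n_train = n_samples - n_test                # 21006 - 5252 = 15754

print(n_train, n_test)
```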

Now I want to pass X_train and y_train into a DataLoader, but how? The DataLoader seems to accept a dataset, not a tuple: dl = DataLoader(ds, batch_size=2, shuffle=True)

I think I want dl = DataLoader(ds=(X_train, y_train), batch_size=2, shuffle = True) …

    training_DS = (X_train, y_train)
    dataloader = DataLoader(ds=training_DS, batch_size=2, shuffle=True)

    for epoch in range(epochs):  # outer loop
        for step, (X_train, y_train) in enumerate(dataloader):  # inner loop
            print(epoch, step, len(X_train), y_train[0])

I get this error message: for step, (X_train, y_train) in enumerate(dataloader):
ValueError: not enough values to unpack (expected 2, got 1)
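One way to see why the tuple does not behave like a dataset: a map-style dataset is something DataLoader can call len() on and index with an integer to get one sample. The sketch below uses random placeholder tensors with the shapes reported above (not the real data) and only inspects the tuple itself.

```python
import torch

# Placeholder tensors with the reported shapes (random values, not the real data).
X_train = torch.randn(15754, 75)
y_train = torch.randn(15754)

training_DS = (X_train, y_train)

# The tuple has two elements, so it looks like a "dataset" with 2 samples,
# where sample 0 is all of X_train and sample 1 is all of y_train.
print(len(training_DS))
print(training_DS[0].shape)
print(training_DS[1].shape)
```

So indexing the tuple never yields a single (features, target) pair, which is what the training loop tries to unpack.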

What am I missing / not understanding?

THANK YOU


AbdulsalamBande · April 5, 2023, 9:34pm

The issue here is that DataLoader expects a Dataset object as its input, not a tuple. You can create a custom Dataset class that takes X_train and y_train as inputs and implements the methods DataLoader requires (__getitem__ and __len__). Check this site on creating Custom Dataloaders. Or use something like the code below:

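    from torch.utils.data import Dataset

    class CustomTensorDataset(Dataset):
        def __init__(self, X, y):
            # X and y are tensors with the same number of rows
            self.X = X
            self.y = y

        def __getitem__(self, index):
            # one sample: (features, target)
            return (self.X[index], self.y[index])

        def __len__(self):
            # number of samples, not number of tensors
            return len(self.X)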
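A minimal usage sketch, assuming the split tensors already exist. It uses torch.utils.data.TensorDataset, which for plain tensors behaves like the wrapper class above (indexing returns one row of each tensor). The tensors here are random placeholders with the reported shapes, and the names train_ds and train_dl are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for the output of train_test_split.
X_train = torch.randn(15754, 75)
y_train = torch.randn(15754)

# Wrap the two tensors so that train_ds[i] returns (X_train[i], y_train[i]).
train_ds = TensorDataset(X_train, y_train)
train_dl = DataLoader(train_ds, batch_size=2, shuffle=True)

# Each iteration now yields one batch of features and one batch of targets.
X_batch, y_batch = next(iter(train_dl))
print(len(train_ds), len(train_dl))
print(X_batch.shape, y_batch.shape)
```

The loop from the question then works unchanged with `for step, (X_batch, y_batch) in enumerate(train_dl):`.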

John_Deatherage · April 5, 2023, 11:18pm

Thank you for taking the time to help me with this… Here is my custom data class:

    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class Dow30_Dataset(Dataset):  # inherits the Dataset class

        def __init__(self):  # download, read into pandas dataframe
            df_Xy = pd.read_csv(r'C:\Users\user\OneDrive\Desktop\NN_1OUT\DataSet_DOW30.csv')

            # define the X features
            self.X = (df_Xy[['RSI_Sc', 'RSIsc-1', 'RSIsc-2', 'RSIsc-3', 'RSIsc-4',
                'RSIsc-5', 'RSIsc-6', 'RSIsc-7', 'RSIsc-8', 'RSIsc-9', 'ROC_Sc',
                'ROCsc-1', 'ROCsc-2', 'ROCsc-3', 'ROCsc-4', 'ROCsc-5', 'ROCsc-6',
                'ROCsc-7', 'ROCsc-8', 'ROCsc-9', 'STOCH_Sc', 'STOCHsc-1', 'STOCHsc-2',
                'STOCHsc-3', 'STOCHsc-4', 'STOCHsc-5', 'STOCHsc-6', 'STOCHsc-7',
                'STOCHsc-8', 'STOCHsc-9', 'CCI_Sc', 'CCIsc-1', 'CCIsc-2', 'CCIsc-3',
                'CCIsc-4', 'CCIsc-5', 'CCIsc-6', 'CCIsc-7', 'CCIsc-8', 'CCIsc-9',
                'Price_Sc-0', 'Price_Sc-1', 'Price_Sc-2', 'Price_Sc-3', 'Price_Sc-4',
                'Price_Sc-5', 'Price_Sc-6', 'Price_Sc-7', 'Price_Sc-8', 'Price_Sc-9',
                'Volume_Sc-0', 'Volume_Sc-1', 'Volume_Sc-2', 'Volume_Sc-3',
                'Volume_Sc-4', 'Volume_Sc-5', 'Volume_Sc-6', 'Volume_Sc-7',
                'Volume_Sc-8', 'Volume_Sc-9', 'VIX_cls_0', 'VIX_cls_1', 'VIX_cls_2',
                'VIX_cls_3', 'VIX_cls_4', 'VIX_cls_5', 'VIX_cls_6', 'VIX_cls_7',
                'VIX_cls_8', 'VIX_cls_9', '3MoSc', '1YrSc', '5YrSc', '10YrSc', '30YrSc']])  # up to but excluding the last column

            self.X = torch.from_numpy(self.X.to_numpy()).float()

            # define the y output class column(s)
            self.y = (df_Xy['Output_NB'])  # includes only the last column
            self.y = torch.from_numpy(self.y.to_numpy()).float()

            # determine the number of rows in the dataframe df_Xy
            self.n_samples = df_Xy.shape[0]

        # support indexing such that DS[i] can be used to get the i-th sample
        def __getitem__(self, index):
            return (self.X[index], self.y[index])  # returns a tuple

        def __len__(self):
            return self.n_samples

Are you suggesting I need a new method in the custom data class? Can you explain in pseudocode what you mean?

Thank you!
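For what it's worth, a class like Dow30_Dataset that already implements __getitem__ and __len__ should not need a new method. One possible approach, sketched below under assumptions, is to split the dataset object itself with torch.utils.data.random_split instead of splitting the tensors. ToyDataset is a hypothetical stand-in for Dow30_Dataset filled with random data, and the split sizes and seed are illustrative.

```python
import torch
from torch.utils.data import DataLoader, Dataset, random_split


class ToyDataset(Dataset):
    """Hypothetical stand-in for Dow30_Dataset, filled with random data."""

    def __init__(self, n_samples=21006, n_features=75):
        self.X = torch.randn(n_samples, n_features)
        self.y = torch.randn(n_samples)

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    def __len__(self):
        return len(self.X)


ds = ToyDataset()

# Roughly a 75/25 split, using explicit integer lengths.
n_test = 5252
n_train = len(ds) - n_test

# random_split returns two Subset objects that index into ds.
train_ds, test_ds = random_split(
    ds, [n_train, n_test], generator=torch.Generator().manual_seed(0)
)

train_dl = DataLoader(train_ds, batch_size=2, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=2, shuffle=False)

X_batch, y_batch = next(iter(train_dl))
print(len(train_ds), len(test_ds), X_batch.shape, y_batch.shape)
```

The alternative that keeps train_test_split is to pass X_train and y_train to a small wrapper such as CustomTensorDataset and hand that object to DataLoader.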