Dataset and Dataloader for a Siamese Neural Network Approach

tarekiraki · April 13, 2022, 11:47am

Hello,

I want to implement the Siamese Neural Networks approach with Pytorch. The approach requires two separate inputs (left and right). My data is split into train and test. I would like to use the entire data set for model training.

For this purpose, I created a custom dataset class. In order to use all data, there is a separate dataset and dataloader instance for each left and right and train and test. However, this means that the data set has to be stored twice in the data loaders. Which leads to memory problems.

Is there a way to do this with just one pair of loaders (for train and test)? The source code is roughly as follows.

Thanks for help.

import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, np_X, np_Y):
        self.np_X = np_X
        self.np_Y = np_Y
        self.len = len(self.np_X)

    def __getitem__(self, index):
        to_X = torch.tensor(self.np_X[index])
        to_Y = torch.tensor(self.np_Y[index])
        return to_X, to_Y

    def __len__(self):
        return self.len

### Load Data from File System
my_X_train, my_X_test, my_Y_train, my_Y_test = LoadData()


### Dataset

# left
train_set_left = MyDataset(np_X=my_X_train, np_Y=my_Y_train)
test_set_left = MyDataset(np_X=my_X_test, np_Y=my_Y_test)

# right
train_set_right = MyDataset(np_X=my_X_train, np_Y=my_Y_train)
test_set_right = MyDataset(np_X=my_X_test, np_Y=my_Y_test)


### DataLoader

# left
train_loader_left = DataLoader(dataset=train_set_left, batch_size=self._batch_size, shuffle=True)
test_loader_left = DataLoader(dataset=test_set_left, batch_size=self._batch_size, shuffle=False)

# right
train_loader_right = DataLoader(dataset=train_set_right, batch_size=self._batch_size, shuffle=True)
test_loader_right = DataLoader(dataset=test_set_right, batch_size=self._batch_size, shuffle=False)

The training loop for a single epoch is as follows:

for i, (data_left, data_right) in (enumerate(zip(self.data_loader_left, self.data_loader_right)):
	# Code

nivek · April 13, 2022, 1:09pm

Hi,

Based on your code snippet, it seems like train_set_left and train_set_right are defined exactly the same.

It should be possible to return from both your left and right inputs from the __getitem__ of your custom Dataset. Is there any reason why that cannot be done?

tarekiraki · April 13, 2022, 3:53pm

I dont think that this will solve the problem. Since it is the same numpy array, the
__getitem__(index) would return the same sample, but I need a different one. In my current setup, this is fullfiled via two loaders and shuffle=True . Actually I need something like __getitem__(index1, index2).

tarekiraki · April 13, 2022, 5:11pm

I found a possible workaround. I omitted the creation of the datasets and dataloaders for an input line (i.e. the right one: train_set_right, test_set_right, train_loader_right, test_loader_right). In my training loop I enumerate twice over the same dataloader, which returns different samples for each input line. My first training experiments seems ok. Does anyone know, if this has any side effects?

Training loop:.

for i, (data_left, data_right) in enumerate(zip(self.data_loader_left, self.data_loader_left)):
# Code

nivek · April 13, 2022, 8:24pm

This is fine but I don’t think you are going to get all possible combination of data_left and data_right, you will only have n random samples, where n is the length of your DataLoader. I’m unsure if that is what you want.

If you want all possible combination of data_left and data_right you should use two for loops:

for data_left in self.data_loader:
    for data_right in self.data_loader:
        pass # Logic to process a pair of data here