DataLoader for custom dataset for Siamese Network is very slow / not responding

Hi there,

I am implementing a Siamese neural network and therefore wrote a custom Dataset for it, implemented to match my folder structure.

The folder structure looks like the following:

root/
  0001/
    0001-normal.jpg
    0001-blurry.jpg
    0001-shifted.jpg
    ....
  0002/
    ....

In total I have more than 6000 folders, each containing 8 images of the same person.
Here’s my implementation of the Dataset:

import os
import random

import matplotlib.pyplot as plt
from torch.utils.data import Dataset
from torchvision import transforms


class SiameseDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        super().__init__()
        self.h_table = {}
        self.root_dir = root_dir
        self.folders = sorted(os.listdir(root_dir))
        self.num_folders = len(self.folders) - 1  # highest zero-based class index
        # store the file names of every class in h_table[label] as a list
        for folder in self.folders:
            self.h_table[str(int(folder) - 1)] = os.listdir("{}/{}".format(root_dir, folder))

    def __len__(self):
        return len(self.h_table)

    def __getitem__(self, index):
        same = random.uniform(0,1) > 0.5
        h_len = len(self.h_table[str(index)]) - 1
        h_ = self.h_table[str(index)]
        if same:
            first_idx = random.randint(0, h_len)
            while True:
                second_idx = random.randint(0, h_len)
                if first_idx != second_idx:
                    break

            first_path = "{}/{:04d}/{}".format(self.root_dir, index + 1, h_[first_idx])
            second_path = "{}/{:04d}/{}".format(self.root_dir, index + 1, h_[second_idx])

            first = plt.imread(first_path)
            second = plt.imread(second_path)

            first = transforms.ToTensor()(first)
            second = transforms.ToTensor()(second)

            return (first, index), (second, index)
        else:
            first_idx = random.randint(0, h_len)
            # pick a different class for the second image
            while True:
                sec_class_idx = random.randint(0, self.num_folders)
                if sec_class_idx != index:
                    break
            other_files = self.h_table[str(sec_class_idx)]
            second_idx = random.randint(0, len(other_files) - 1)

            first_path = "{}/{:04d}/{}".format(self.root_dir, index + 1, h_[first_idx])
            # the second path points into the other class's folder
            second_path = "{}/{:04d}/{}".format(self.root_dir, sec_class_idx + 1, other_files[second_idx])

            first = plt.imread(first_path)
            second = plt.imread(second_path)

            first = transforms.ToTensor()(first)
            second = transforms.ToTensor()(second)

            return (first, index), (second, sec_class_idx)

I think the code is fairly simple. If same is True, __getitem__ picks two different random images from the folder at the given index, applies a ToTensor transform, and returns each image together with that index as its label. If same is False, it takes a random image from the folder at the given index, chooses a different random folder, and takes a random image from that folder instead.
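Stripped of the file I/O, the pair-sampling logic boils down to this stand-alone sketch (sample_pair and the toy table are just for illustration, not my actual code):

```python
import random

def sample_pair(table, index, num_classes):
    # ~50% chance of a same-class pair, otherwise a different-class pair
    same = random.uniform(0, 1) > 0.5
    files = table[index]
    first_idx = random.randint(0, len(files) - 1)
    if same:
        # second image: a different file from the same class
        while True:
            second_idx = random.randint(0, len(files) - 1)
            if second_idx != first_idx:
                break
        return (index, files[first_idx]), (index, files[second_idx])
    # second image: any file from a different class
    while True:
        other = random.randint(0, num_classes - 1)
        if other != index:
            break
    return (index, files[first_idx]), (other, random.choice(table[other]))

# toy table: 10 classes with 8 file names each
table = {i: ["{}-{}.jpg".format(i, j) for j in range(8)] for i in range(10)}
(l1, f1), (l2, f2) = sample_pair(table, 0, 10)
print(l1, l2)
```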

The code works if I do not use a DataLoader.

root_dir = "data/256x256/1"
dataset = SiameseDataset(root_dir)
(x1, y1), (x2, y2) = dataset[0] # <- this works

But if I use a DataLoader object and try to iterate over it in a training loop for my Siamese Network it simply loads forever.

loader = DataLoader(dataset, batch_size=128)
for i, data in enumerate(loader): # <- this takes forever
    # do stuff

I’m working on a large machine provided by my university, so performance should not be an issue, and I never had problems like this before when I used the ImageFolder dataset class.

I tried opening all the images up front and storing them in h_table, so that they are already in memory. But that did not help either.
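Roughly, the caching I mean looks like this (a simplified sketch; fake_imread is a stand-in for plt.imread so the snippet runs on its own):

```python
class ImageCache:
    """Decode each image once and keep the result in RAM."""

    def __init__(self, load_fn):
        self.load_fn = load_fn  # e.g. plt.imread in the real dataset
        self._cache = {}

    def get(self, path):
        # only hit the disk on the first request for a path
        if path not in self._cache:
            self._cache[path] = self.load_fn(path)
        return self._cache[path]

# fake loader in place of plt.imread, so we can count disk reads
calls = []
def fake_imread(path):
    calls.append(path)
    return [[0.0] * 4] * 4  # stand-in for decoded pixel data

cache = ImageCache(fake_imread)
cache.get("a.jpg")
cache.get("a.jpg")
print(len(calls))  # 1 -- the path is decoded only once
```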

I also tried calling __getitem__ manually. This works without issues.

for i in range(10):
    (x1,y1), (x2,y2) = dataset.__getitem__(i)
    print(y1,y2)

This returns:

0 585
1 5848
2 262
3 3
4 4001
5 5
6 1696
7 2027
8 8
9 9

I managed to find the issue. It has nothing to do with the Dataset or the DataLoader, but with torch multiprocessing: I am using multiple GPUs and therefore call torch.multiprocessing.set_start_method('spawn') before running the training loop.

If I don’t execute this call, everything works fine.
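For anyone hitting the same thing: with the spawn start method, child processes re-import the main module, so the dataset construction and the training loop have to sit behind an if __name__ == '__main__' guard. Here is a minimal stand-alone illustration of that rule using plain multiprocessing (load_batch is just a stand-in for a DataLoader worker, not my actual training code):

```python
import multiprocessing as mp

def load_batch(i, out):
    # stand-in for a worker process producing one batch
    out.put(i * 2)

if __name__ == "__main__":
    # Same rule PyTorch enforces: under "spawn", worker processes re-import
    # the main module, so everything below must be behind this guard --
    # otherwise every worker re-executes the top-level script and the
    # loop appears to load forever.
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    p = mp.Process(target=load_batch, args=(21, q))
    p.start()
    print(q.get())  # 42
    p.join()
```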