Hi there,
I am implementing a Siamese neural network and wrote a custom `Dataset` for it, tailored to my folder structure. The structure looks like this:
```
root/
    0001/
        0001-normal.jpg
        0001-blurry.jpg
        0001-shifted.jpg
        ...
    0002/
        ...
```
In total I have more than 6000 folders, each containing 8 images of the same person.
Here’s my implementation of the `Dataset`:

```python
import os
import random

import matplotlib.pyplot as plt
from torch.utils.data import Dataset
from torchvision import transforms


class SiameseDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        super().__init__()
        self.h_table = {}
        self.root_dir = root_dir
        self.transform = transform if transform is not None else transforms.ToTensor()
        self.folders = os.listdir(root_dir)
        self.folders.sort()
        self.num_folders = len(self.folders) - 1
        # store all file names in h_table[label] as a list;
        # folder "0001" gets label "0", folder "0002" gets label "1", and so on
        for folder in self.folders:
            self.h_table[str(int(folder) - 1)] = os.listdir("{}/{}".format(root_dir, folder))

    def __len__(self):
        return len(self.h_table)

    def __getitem__(self, index):
        same = random.uniform(0, 1) > 0.5  # coin flip: positive or negative pair
        h_ = self.h_table[str(index)]
        h_len = len(h_) - 1
        if same:
            # positive pair: two distinct images from the same folder
            first_idx = random.randint(0, h_len)
            while True:
                second_idx = random.randint(0, h_len)
                if first_idx != second_idx:
                    break
            first_path = "{}/{:04d}/{}".format(self.root_dir, index + 1, h_[first_idx])
            second_path = "{}/{:04d}/{}".format(self.root_dir, index + 1, h_[second_idx])
            first = self.transform(plt.imread(first_path))
            second = self.transform(plt.imread(second_path))
            return (first, index), (second, index)
        else:
            # negative pair: one image from this folder, one from a different folder
            first_idx = random.randint(0, h_len)
            while True:
                sec_class_idx = random.randint(0, self.num_folders)
                if sec_class_idx != index:
                    break
            sec_files = self.h_table[str(sec_class_idx)]
            second_idx = random.randint(0, len(sec_files) - 1)
            first_path = "{}/{:04d}/{}".format(self.root_dir, index + 1, h_[first_idx])
            second_path = "{}/{:04d}/{}".format(self.root_dir, sec_class_idx + 1, sec_files[second_idx])
            first = self.transform(plt.imread(first_path))
            second = self.transform(plt.imread(second_path))
            return (first, index), (second, sec_class_idx)
```
I think the code is very simple. If `same` is `True`, the `__getitem__` method simply picks two distinct random images from the same folder, applies a `ToTensor` transform, and returns both images together with their label. If `same` is `False`, it takes a random image from the folder at the given index, chooses a different random folder index, and takes a random image from that folder.
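To make the sampling logic explicit, here is the same idea as a standalone sketch (the helper `sample_pair` is hypothetical, just for illustration; it works on the file lists only and skips the image loading):

```python
import random

def sample_pair(h_table, num_folders, index):
    # hypothetical helper illustrating the pair sampling in __getitem__
    files = h_table[str(index)]
    first = random.choice(files)
    if random.uniform(0, 1) > 0.5:
        # positive pair: a second, different image from the same class
        second = random.choice([f for f in files if f != first])
        return (first, index), (second, index)
    # negative pair: any image from a different class
    other = random.choice([c for c in range(num_folders + 1) if c != index])
    second = random.choice(h_table[str(other)])
    return (first, index), (second, other)
```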
The code works as long as I do not use a `DataLoader`:

```python
root_dir = "data/256x256/1"
dataset = SiameseDataset(root_dir)

(x1, y1), (x2, y2) = dataset[0]  # <- this works
```
But if I wrap the dataset in a `DataLoader` and try to iterate over it in the training loop of my Siamese network, it simply loads forever:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=128)

for i, data in enumerate(loader):  # <- this takes forever
    pass  # do stuff
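```

One check I have in mind to narrow this down is timing a single batch in the main process (a sketch; `num_workers=0` is just to rule out worker-process issues):

```python
import time

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=128, num_workers=0)
start = time.time()
batch = next(iter(loader))  # fetch exactly one batch
print("one batch took {:.2f}s".format(time.time() - start))
```

With 128 items per batch and two images loaded per item, one batch means 256 `plt.imread` calls, so some delay is expected, but it should not hang indefinitely.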
I’m working on a large machine provided by my university, so performance should not be the issue, and I never had problems like this before when I used the `ImageFolder` dataset class.

I tried opening all the images and saving them in `h_table`, so that they are already in memory, but that did not help either.
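Roughly, that variant swapped the file lists in `h_table` for preloaded arrays (a sketch with a hypothetical `preload_images` helper, not my exact code):

```python
import os

import matplotlib.pyplot as plt

def preload_images(root_dir, folders):
    # sketch of the in-memory variant: load every image up front so that
    # __getitem__ only indexes into arrays instead of touching the disk
    table = {}
    for folder in folders:
        folder_path = os.path.join(root_dir, folder)
        table[str(int(folder) - 1)] = [
            plt.imread(os.path.join(folder_path, f))
            for f in sorted(os.listdir(folder_path))
        ]
    return table
```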
I also tried calling `__getitem__` manually, and this works without issues:

```python
for i in range(10):
    (x1, y1), (x2, y2) = dataset.__getitem__(i)
    print(y1, y2)
```
This returns:

```
0 585
1 5848
2 262
3 3
4 4001
5 5
6 1696
7 2027
8 8
9 9
```