LFW dataset train and test distribution

andergalvan · March 5, 2024, 1:58pm

Hello everyone,

I’m a bit new to PyTorch and I’d like to use the LFW dataset for a face recognition problem. I noticed that no person who has images in the training set has any images in the test set (I checked by counting the images per person in each split). Am I doing something wrong? This is the code I used:

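from collections import Counter
from torchvision import datasets
from torchvision.transforms import ToTensor

# Count how many images each person (target index) has in the training split
train_data = datasets.LFWPeople(root="data/LFW", split="train", image_set="original", transform=ToTensor(), download=True)
train_counter = dict(Counter(train_data.targets))

# Same count for the test split
test_data = datasets.LFWPeople(root="data/LFW", split="test", image_set="original", transform=ToTensor(), download=True)
test_counter = dict(Counter(test_data.targets))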

Thank you in advance!
Ander

melgor · October 18, 2024, 2:17pm

I know it has been 8 months, but here is the answer:
LFW is designed not for a classification task but for a face-retrieval task. This means that at test time you build an index (the gallery) of known persons from one half of the dataset.
The other half of the dataset is used as query images.
Then you calculate the distance between each query and every image in the gallery, take the gallery image most similar to the query, and check whether its label matches the query label.
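
A minimal sketch of that gallery/query evaluation, in case it helps. It assumes you already have one embedding per image from some face model; the function names, the toy 2-D embeddings, and the choice of cosine similarity are all illustrative assumptions, and plain Python lists are used so it runs without extra dependencies:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top1_accuracy(gallery, queries):
    """Fraction of queries whose most similar gallery image has the same label.

    gallery and queries are lists of (embedding, label) pairs.
    """
    hits = 0
    for query_embedding, query_label in queries:
        # Pick the gallery entry most similar to this query.
        _, best_label = max(
            gallery,
            key=lambda item: cosine_similarity(query_embedding, item[0]),
        )
        if best_label == query_label:
            hits += 1
    return hits / len(queries)


# Toy embeddings, made up for illustration only.
gallery = [([1.0, 0.0], "alice"), ([0.0, 1.0], "bob")]
queries = [
    ([0.9, 0.1], "alice"),  # nearest gallery image is alice: hit
    ([0.2, 0.8], "bob"),    # nearest gallery image is bob: hit
    ([0.6, 0.4], "bob"),    # nearest gallery image is alice: miss
]
accuracy = top1_accuracy(gallery, queries)
print(f"top-1 accuracy: {accuracy:.3f}")
```

With real data you would replace the toy lists with embeddings computed by your model on the LFW test images; for a large gallery, a vectorized or indexed nearest-neighbour search would be the more practical choice.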

Thanks to that protocol, the training and test sets can contain different identities, as they do in LFW.
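
If you want to confirm this on your side, here is a small sketch that lists the identities shared by two splits. The helper name `identity_overlap` is hypothetical, the toy labels are made up, and it assumes that a given integer label refers to the same person in both splits:

```python
from collections import Counter


def identity_overlap(train_targets, test_targets):
    """Return per-identity image counts for both splits and the shared identities."""
    train_counter = Counter(train_targets)
    test_counter = Counter(test_targets)
    # Identities that appear in both splits (empty if the splits are disjoint).
    shared = set(train_counter) & set(test_counter)
    return train_counter, test_counter, shared


# Toy labels, made up for illustration only.
train_targets = [0, 0, 1, 2, 2, 2]
test_targets = [3, 3, 4]
train_counter, test_counter, shared = identity_overlap(train_targets, test_targets)
print(f"{len(shared)} identities appear in both splits")
```

With the code from the question, you would pass `train_data.targets` and `test_data.targets` instead of the toy lists.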

In my blog post I wrote a little more about it, with links to notebooks with code implementations: Foundation Models for Computer Vision | by Bartosz Ludwiczuk | Medium