How to retrieve matching data pairs from different dataloaders?

Alpha_Sight · February 27, 2022, 6:45am

Hello everyone,

I am doing a binary classification and have four different datasets:

Dataset_1: Ground truth of matching pairs (e.g. columns x, y)
Dataset_2: Graph features of x1, x2, …, xn
Dataset_3: Numerical features of x1, x2, …, xn
Dataset_4: Graph features of y1, y2, …, yn

Due to the nature of the datatype, it is not feasible to join them into a single dataset.

I intend to feed each of these datasets to a different model and concatenate the outputs as the input to an ensemble model:

Dataloader_1(Dataset_1) → Matching pairs (e.g. x10/y10)
Dataloader_2(Dataset_2) → x10 → Model_2 → Output 2
Dataloader_3(Dataset_3) → x10 → Model_3 → Output 3
Dataloader_4(Dataset_4) → y10 → Model_4 → Output 4
Output 2, 3, 4 → Model_5 → Final prediction

However, I am unsure of the following:

1. Positive Training Examples:
How do I get Dataloader_2, Dataloader_3, and Dataloader_4 to return the features for the correct matching IDs?
For example, when training on the x5/y5 pair, Dataloader_2 and 3 should return the features of x5, and Dataloader_4 should return the features of y5. The purpose is to train the model to learn which pairs match.
2. Negative Training Examples:
How do I get Dataloader_2, Dataloader_3, and Dataloader_4 to return the features for deliberately mismatched IDs?
For example, when training on the x6/y6 pair, Dataloader_2 and 3 should return the features of x6, but Dataloader_4 should return the features of any y other than y6. The purpose is to train the model to learn which pairs don’t match.

Thanks for your help!

ptrblck · February 27, 2022, 11:39pm

I think one potential approach would be to define all Datasets first and make sure the internal data is stored in the same order. I.e. assuming you are either lazily loading the data or are pre-loading it from a tensor or numpy array, make sure that these samples are sorted accordingly.
Once this is done, you could then create a custom Dataset that indexes the four internal datasets with the same index in its __getitem__ (so you could also use shuffle=True in the DataLoader).
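A minimal sketch of such a wrapper, assuming the four containers are already sorted so that row i of each describes pair i. The class name AlignedPairs and the toy list data are hypothetical and only stand in for the real graph and numerical features:

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # lets the sketch run without PyTorch installed
    Dataset = object


class AlignedPairs(Dataset):
    """Hypothetical wrapper: row i of every inner dataset describes pair i."""

    def __init__(self, pairs, x_graph, x_numeric, y_graph):
        # Assumption: all four containers were sorted into the same order.
        assert len(pairs) == len(x_graph) == len(x_numeric) == len(y_graph)
        self.pairs = pairs
        self.x_graph = x_graph
        self.x_numeric = x_numeric
        self.y_graph = y_graph

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # One index drives all lookups, so the samples stay matched
        # even when the DataLoader shuffles the indices.
        return self.x_graph[idx], self.x_numeric[idx], self.y_graph[idx], 1


# Toy stand-ins for the real features (assumption: plain Python lists).
ds = AlignedPairs(
    pairs=[("x0", "y0"), ("x1", "y1"), ("x2", "y2")],
    x_graph=["gx0", "gx1", "gx2"],
    x_numeric=[0.0, 1.0, 2.0],
    y_graph=["gy0", "gy1", "gy2"],
)
sample = ds[1]  # features of x1 (graph, numeric), features of y1, label 1
```

The wrapper would then be passed to a single DataLoader instead of four separate ones. If the graph features are custom objects rather than tensors, the default batching will probably need to be replaced with a custom collate_fn.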
For 2: I don’t know when this negative sampling should be done, but I assume you could trigger it on some condition inside the custom Dataset by resampling a new index and using it to fetch the negative example from dataset_4.
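One way the resampling could look, as a rough sketch. The helper names sample_negative_index and get_pair, and the negative_prob parameter, are hypothetical. It also assumes the pairs are one-to-one, so any index other than the current one is a true negative; if an x can match several ys, the drawn index would need to be checked against Dataset_1 first.

```python
import random


def sample_negative_index(idx, n, rng=random):
    """Return a uniformly random index in [0, n) that differs from idx."""
    if n < 2:
        raise ValueError("need at least two samples to draw a negative")
    # Draw from the n - 1 remaining candidates and skip over idx.
    j = rng.randrange(n - 1)
    return j if j < idx else j + 1


def get_pair(x_feats, y_feats, idx, negative_prob=0.5, rng=random):
    """Hypothetical helper: return (x, y, label) for position idx."""
    if rng.random() < negative_prob:
        # Negative example: keep x, swap in a mismatched y, label 0.
        neg = sample_negative_index(idx, len(y_feats), rng)
        return x_feats[idx], y_feats[neg], 0
    # Positive example: matched x and y, label 1.
    return x_feats[idx], y_feats[idx], 1


xs = ["x0", "x1", "x2", "x3"]
ys = ["y0", "y1", "y2", "y3"]
x, y, label = get_pair(xs, ys, idx=2, negative_prob=1.0)  # always a negative
```

The body of get_pair could live inside the custom Dataset's __getitem__, so each call randomly yields either a matched or a mismatched pair along with its label.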