Dataloader with custom modified randomization


I am trying to implement a dataloader with the following sampling characteristics:

Essentially, given that there are N (e.g. N=10) data points in a dataset, I am trying to build a dataloader where the data points sampled in the second half of a batch will depend on the data points sampled in the first half of a batch. Let’s say that in the first half of the batch (batch size of 4, so half the batch size is 2), we sample datapoint 1 and datapoint 5. Let’s also assume that if we sample datapoint 1, we want to prioritize sampling datapoint 9 within that same batch. Since we have sampled datapoint 1 in the first half of the batch, we want to sample datapoint 9 and some other datapoint to complete the 4 datapoint sample, i.e.,

N: {1,2,3,4,5,6,7,8,9,10}
Batch size=4
First 2 elements of a batch: {1, 5}
Desired 4 elements of this batch: {1,5,9,n}, where datapoint 9 is desired based on the presence of datapoint 1 in this batch, and n is some other random datapoint.

Can someone please advise on whether such a custom dataloader can be implemented with PyTorch, and if so, how I can implement it? Thanks a lot in advance!


One simple approach would be to use batch_size=2 in the DataLoader and add the sampling logic to Dataset.__getitem__.
This method would get a single index as its argument on each call, and you could use this index to sample the desired paired sample and return both together. If you want or need, you could also use a BatchSampler, which would pass both indices together and might make your custom sampling easier.
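A minimal sketch of the `__getitem__` idea, under a few assumptions: `PairedDataset` and the `PARTNER` map are hypothetical names, indices are 0-based (so index 0 stands for datapoint 1 and index 8 for datapoint 9), and the toy data is just the numbers 0 to 9. Each item returns the requested sample plus its preferred partner, or a random other sample if no partner is defined.

```python
import random
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical priority map (0-based): drawing the key should pull in the value.
PARTNER = {0: 8}  # datapoint 1 -> datapoint 9 in the question's numbering

class PairedDataset(Dataset):
    def __init__(self, data, partner):
        self.data = data
        self.partner = partner

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Use the preferred partner if one is defined, else a random other sample.
        if index in self.partner:
            other = self.partner[index]
        else:
            other = random.choice([i for i in range(len(self.data)) if i != index])
        # Return both samples stacked along a new leading dimension.
        return torch.stack([self.data[index], self.data[other]])

data = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # 10 samples, 1 feature
loader = DataLoader(PairedDataset(data, PARTNER), batch_size=2, shuffle=True)

for batch in loader:
    # The default collate gives shape [2, 2, 1]; merge the first two dims to get 4 samples.
    batch = batch.flatten(0, 1)
    print(batch.shape)  # torch.Size([4, 1])
    break
```

Note that after flattening the samples are interleaved (sample, partner, sample, partner) rather than split into a first and second half, and this sketch does not prevent the same datapoint from appearing twice in one batch.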

A disadvantage would be that your batch_size is specified as e.g. 2, while the batch your DataLoader actually returns contains 4 samples.

Alternatively, you could also write a custom sampler.
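For the custom sampler route, one possible sketch is a batch sampler that yields complete lists of indices. `PartnerBatchSampler` and its arguments are hypothetical names, indices are 0-based, and it assumes the dataset has at least `batch_size` samples. It draws a random first half, appends any preferred partners, and fills the rest with random indices not already in the batch.

```python
import random

class PartnerBatchSampler:
    """Yields index lists: a random first half, then partners, then random fill."""

    def __init__(self, num_samples, batch_size, partner, seed=None):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.partner = partner  # hypothetical map: index -> preferred partner index
        self.rng = random.Random(seed)

    def __len__(self):
        # Each batch consumes batch_size // 2 "primary" indices.
        return self.num_samples // (self.batch_size // 2)

    def __iter__(self):
        half = self.batch_size // 2
        order = list(range(self.num_samples))
        self.rng.shuffle(order)
        for start in range(0, len(self) * half, half):
            batch = order[start:start + half]
            # Second half: preferred partners first, skipping duplicates.
            for i in list(batch):
                p = self.partner.get(i)
                if p is not None and p not in batch and len(batch) < self.batch_size:
                    batch.append(p)
            # Fill the remainder with random indices not yet in the batch.
            rest = [i for i in range(self.num_samples) if i not in batch]
            batch += self.rng.sample(rest, self.batch_size - len(batch))
            yield batch

sampler = PartnerBatchSampler(num_samples=10, batch_size=4, partner={0: 8}, seed=0)
for batch in sampler:
    print(batch)  # four distinct indices; 8 is present whenever 0 is in the first half
```

An instance can be passed to the DataLoader through its `batch_sampler` argument (in which case `batch_size` and `shuffle` are left at their defaults), so the returned batches have the intended size of 4. Because the second half is drawn in addition to the shuffled first half, a datapoint can show up in more than one batch per epoch.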

Ingenious! You are awesome :slight_smile: