Tip: sampling without replacement from my custom dataloader

I have a custom dataloader where my available ids for picking samples of my dataset are stored during the initialization of the dataloader as follows:

self.tileIds = [4, 56, 78, 10, 23]

and in the __getitem__() function I sample elements as follows:

def __getitem__(self, idx):
    dataId = self.tileIds[idx]
    img = self.getRawSample(dataId)
    meta = self.meta[idx]
    return img, meta

The problem is that I can draw the same id from tileIds within the same epoch, and I wonder if there is a smart way to avoid this behavior, with some sort of callback for example. I would like to use pure PyTorch without Lightning.

Hi,

You might want to look into the sampler/batch_sampler parameters of the DataLoader.

Here are different strategies to create samplers. You could create your own, using your available ids.
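
For example, a minimal custom sampler could look like the sketch below. This is only an illustration (the name NoRepeatSampler is made up here), and it assumes your Dataset defines __len__:

import random

from torch.utils.data import Sampler

class NoRepeatSampler(Sampler):
    # Yields every position of the dataset exactly once per epoch, in random order
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # A fresh permutation of the positions 0..len-1 each time the DataLoader iterates
        return iter(random.sample(range(len(self.data_source)), len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

# Usage sketch: each idx passed to __getitem__ appears exactly once per epoch
# dl = DataLoader(ds, batch_size=8, sampler=NoRepeatSampler(ds))

(Functionally this is what shuffle=True / RandomSampler already gives you, but it shows where your own id logic would go.)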

If you post a bit more of your code, I might be able to help you further.

Hope this helps :smile:

You can sort your self.tileIds to mimic sampling without replacement

I found these samplers, but I didn't get how to integrate them inside the dataloader's __getitem__() method!

What part of the code do you need to understand my problem better?

The list is already sorted; however, __getitem__() will sample at random from this list.

Maybe I’m not understanding right, so please correct me if I’m wrong.

What I think you have is a custom Dataset defined, where you have a list with the possible indexes that can be used.

Maybe something like this:

import random

from torch.utils.data import Dataset

class MyCustomDataset(Dataset):
    def __init__(self, tileIds):
        super().__init__()
        self.tileIds = tileIds
        self.meta = list(range(len(tileIds)))  # Placeholder for whatever your metadata is

    def __getitem__(self, idx):
        dataId = self.tileIds[idx]
        img = self.getRawSample(dataId)
        meta = self.meta[idx]
        return img, meta

    def __len__(self):
        return len(self.tileIds)  # Whatever the actual length is

    def getRawSample(self, idx):
        return random.random()  # Whatever your RawSample is

Then I am assuming that you create your actual DataLoader:

from torch.utils.data import DataLoader

ds = MyCustomDataset(tileIds=[4, 56, 78, 10, 23])
dl = DataLoader(ds, batch_size=8)

What I meant is that when you create your DataLoader, you can set the sampler to something custom that you create on your own, or to something that is already defined, like SubsetRandomSampler. This takes a list of indices and selects values from this list in random order; each selected value is then passed to your Dataset as the index in __getitem__().

With this, you only need to take care of the indices in the DataLoader and not the Dataset.

from torch.utils.data import DataLoader, SubsetRandomSampler

indices = [4, 56, 78, 10, 23]
ds = MyCustomDataset()  # No need to give the indices to the dataset
dl = DataLoader(ds, batch_size=8, sampler=SubsetRandomSampler(indices))

Now every time you iterate through the dataloader, you will only receive indices defined in the list, and they will not repeat. In this example, the list has only 5 elements and the batch is supposed to have 8 elements, so you will only get the 5 indices from the list.

They will NOT repeat during the batch.

However, they WILL repeat during new epochs. (this is normal behavior)

But if you ABSOLUTELY do not want the values to repeat across epochs, you can redefine the DataLoader in each new epoch, changing the indices it can draw items from.

Then you can do something like this:

import random

import torch
from torch.utils.data import Dataset, SubsetRandomSampler

class MyCustomDataset(Dataset):
    def __init__(self):
        super().__init__()

    def __getitem__(self, idx):
        print(idx)  # Print the index when the dataloader fetches something
        return idx

# Create the dataset
ds = MyCustomDataset()
epochs = 3
# 15 indices shuffled once, then split into disjoint chunks of 5, one per epoch
all_items = random.sample(list(range(15)), 15)

for e in range(epochs):
    print(f"Epoch: {e}")
    dl = torch.utils.data.DataLoader(ds, batch_size=1, sampler=SubsetRandomSampler(all_items[e*5:e*5+5]))
    for batch in dl:
        pass
# Output:
Epoch: 0
7
13
3
2
6
Epoch: 1
11
14
9
10
0
Epoch: 2
4
5
8
1
12

But I would NOT recommend doing this.

Hope this helps,

If something is not clear, please let me know :wink:

Thanks for your reply. However, this solution is not right, as you pass indices to the random sampler (which are nothing but tileIds=[4, 56, 78, 10, 23]). But the idx of the __getitem__ method is the position of the sample, not one of these ids, in my opinion. So for example if idx=2 I will get 78 from my list. So probably the solution is just to pass a list of integers of the size of the dataset, like indices=[0,1,2,3,4,5,6,7,8,9] if my dataset has just 10 elements.

This dataloader seems weird, but tileIds=[4, 56, 78, 10, 23] represents the ids of the images of interest within a single grid of images stored in memory, essentially. This is because I don't have individual image files but a single file, which I access with byte shifting.
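
Just to make that concrete, here is a tiny sketch (an illustration only, using the 5-element list from above) of what would happen if the raw ids themselves were used as positions:

tileIds = [4, 56, 78, 10, 23]   # only 5 entries, so valid positions are 0..4

# If the sampler yielded the raw ids, __getitem__ would index with them directly:
for idx in [4, 56, 78, 10, 23]:
    dataId = tileIds[idx]       # idx=56 (and 78, 10, 23) raises IndexError: list index out of range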

The Sampler gives back numbers from the list that you give as input.

In this example I passed a list from 0 to 14 (range(15)); that is why the outputs range from 0 to 14.

If you pass a list with the numbers such as [4, 56, 78, 10, 23], then you will only get these values back, which will be passed to __getitem__().

You can try this yourself and see what actually happens.

Also, that example was only to show you how you could change which values the DataLoader can draw from.

The example in the next post is with your data.

So, the numbers will not repeat.
For each batch you will get numbers from this list.

I will try to explain again in the next post to make it simpler.

With this simple example you can see that the indices are taken from the list I have given the sampler.

In each epoch, each value appears exactly once.
They do not repeat.

import torch
from torch.utils.data import Dataset, SubsetRandomSampler

class MyCustomDataset(Dataset):
    def __init__(self):
        super().__init__()

    def __getitem__(self, idx):
        return idx

ds = MyCustomDataset()

all_items = [4, 56, 78, 10, 23, 6, 7, 105, 90, 38, 76, 49, 123, 321, 0]
dl = torch.utils.data.DataLoader(ds, batch_size=5, sampler=SubsetRandomSampler(all_items))

epochs = 3
for e in range(epochs):
    print(f"\nEpoch: {e}")
    for batch in dl:
        print(batch)
# Output:
Epoch: 0
tensor([  0, 321,  10,  76,   7])
tensor([ 56, 123,   6,   4,  49])
tensor([105,  23,  90,  78,  38])

Epoch: 1
tensor([ 7, 78,  6, 23, 49])
tensor([ 76,   0,  10,  90, 123])
tensor([105,  56,   4, 321,  38])

Epoch: 2
tensor([ 76,  56,  38,  10, 321])
tensor([ 0,  4, 78,  7,  6])
tensor([ 49,  90, 123, 105,  23])

Hi! Thanks, I get the point with your first answer: the solution is to keep these indices inside my dataloader as before and to create a list of sample ids from 0 to the length of the dataset in order to obtain the correct idxs for my batches!
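
For anyone landing here later, a minimal sketch of that final setup (assuming the MyCustomDataset shown earlier in the thread, which keeps tileIds internally):

from torch.utils.data import DataLoader, SubsetRandomSampler

ds = MyCustomDataset(tileIds=[4, 56, 78, 10, 23])

# Positional indices 0..len(ds)-1; the sampler yields each one exactly once per epoch
positions = list(range(len(ds)))
dl = DataLoader(ds, batch_size=2, sampler=SubsetRandomSampler(positions))

for img, meta in dl:
    pass  # every tile id is visited at most once per epoch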
