Correct way to implement Sampler

Mriganka_Nath · December 9, 2021, 7:12am

I am working on a regression model, and the target distribution is strongly multimodal. I wanted to write a sampler so that examples with target values greater than some x (in the code below, x = 8) make up at least 1% of the samples drawn in each epoch. Basically, oversampling.
But with this sampler my models overfit badly.

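import numpy as np
from torch.utils.data import DataLoader, Sampler

class My_sampler(Sampler):
    def __init__(self, dataset, pct=0.1):
        # Target column as a NumPy array (dataset.df is the dataset's DataFrame)
        self.targets = dataset.df.Target.to_numpy()
        self.pct = pct  # fraction of each epoch drawn from targets > 8 (0.1 means 10%)

    def __len__(self):
        return len(self.targets)

    def __iter__(self):
        n = len(self.targets)
        greater_idx = np.where(self.targets > 8)[0]
        rest_idx = np.where(self.targets <= 8)[0]
        n_greater = int(self.pct * n)
        n_rest = n - n_greater  # both parts sum to exactly n, so no truncation is needed
        # Rare samples are drawn with replacement, so they repeat within an epoch
        greater = np.random.choice(greater_idx, n_greater, replace=True)
        # Common samples are drawn without replacement unless there are too few of them
        rest = np.random.choice(rest_idx, n_rest, replace=n_rest > len(rest_idx))
        idxs = np.hstack([greater, rest])
        np.random.shuffle(idxs)
        return iter(idxs.tolist())

our_sampler = My_sampler(dataset)
loader = DataLoader(dataset, sampler=our_sampler, batch_size=8, drop_last=True)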

I don’t know why it’s overfitting; maybe I am not implementing the Sampler class properly.

Also, is it possible that with a sampler declared like this, the same examples are used in every batch, and that not every image is used as input within an epoch?
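
One way to check the last question empirically is to count how often each index appears in a single epoch. The following is only a minimal sketch: `epoch_index_stats` and `toy_epoch` are hypothetical helper names, and `toy_epoch` is a pure-Python stand-in that imitates the sampling scheme above on made-up data (1000 targets, 20 of them above the threshold) so that the snippet runs without PyTorch.

```python
import random
from collections import Counter


def epoch_index_stats(indices, dataset_len):
    """Summarize how one epoch of sampler indices covers the dataset."""
    counts = Counter(int(i) for i in indices)
    return {
        "drawn": sum(counts.values()),           # total indices yielded in the epoch
        "unique": len(counts),                   # distinct dataset items seen
        "never_seen": dataset_len - len(counts), # items skipped in this epoch
        "max_repeats": max(counts.values()) if counts else 0,
    }


def toy_epoch(targets, threshold=8, pct=0.1, seed=0):
    """Hypothetical stand-in for the sampler's __iter__ (same oversampling idea)."""
    rng = random.Random(seed)
    n = len(targets)
    greater_idx = [i for i, t in enumerate(targets) if t > threshold]
    rest_idx = [i for i, t in enumerate(targets) if t <= threshold]
    n_greater = int(pct * n)
    greater = rng.choices(greater_idx, k=n_greater)  # with replacement
    rest = rng.sample(rest_idx, k=n - n_greater)     # without replacement
    idxs = greater + rest
    rng.shuffle(idxs)
    return idxs


# Made-up data: 20 targets above the threshold, 980 at or below it.
targets = [9.0] * 20 + [1.0] * 980
stats = epoch_index_stats(toy_epoch(targets), len(targets))
print(stats)
```

With the real sampler, the same check would presumably be `epoch_index_stats(list(our_sampler), len(dataset))`. In the toy setup, 100 draws are spread over only 20 rare items, so each rare item appears about five times per epoch on average, and at least 80 of the 980 common items are skipped in any given epoch. Heavy repetition of a few rare examples may be one contributor to overfitting, though this check alone does not prove it.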