I am working on a regression model, and the target distribution is multimodal. I wanted to write a sampler that makes examples with target values greater than some x (x = 8 in the code below) account for at least 1% of the examples seen in each epoch. Basically, oversampling.
But since I started using it, my models are overfitting badly.
import numpy as np
from torch.utils.data import DataLoader, Sampler

class My_sampler(Sampler):
    def __init__(self, dataset, pct=0.1):
        self.df = dataset.df.Target  # target column of the dataset's DataFrame
        self.pct = pct  # fraction of each epoch drawn from Target > 8

    def __len__(self):
        return len(self.df)

    def __iter__(self):
        greater_idx = np.where(self.df > 8)[0]
        rest_idx = np.where(self.df <= 8)[0]
        # Drawn with replacement (the np.random.choice default)
        greater = np.random.choice(greater_idx, int(self.pct * len(self.df)))
        # Drawn without replacement
        rest = np.random.choice(rest_idx, int((1 - self.pct) * len(self.df)) + 1, replace=False)
        idxs = np.hstack([greater, rest])
        np.random.shuffle(idxs)
        idxs = idxs[:len(self.df)]
        return iter(idxs)

our_sampler = My_sampler(dataset)
loader = DataLoader(dataset, sampler=our_sampler, batch_size=8, drop_last=True)
I don't know why it's overfitting; maybe I am not writing the Sampler class properly.
Also, is it possible that with a sampler declared like this, the same examples end up being used in every batch, and that not every image is used as input within an epoch?
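One way to check the last question is to run the same index-drawing logic on toy data and count how often each index comes up in one epoch. The sketch below is only an illustration: `epoch_indices` is a hypothetical standalone copy of the `__iter__` logic above, and the 1000 synthetic targets (5 of them above 8) are made up, not taken from the real dataset.

```python
import numpy as np

def epoch_indices(targets, pct=0.1, threshold=8, rng=None):
    """Hypothetical standalone copy of the index logic in My_sampler.__iter__."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(targets)
    greater_idx = np.where(targets > threshold)[0]
    rest_idx = np.where(targets <= threshold)[0]
    # With replacement: a small rare group is repeated many times
    greater = rng.choice(greater_idx, int(pct * n))
    # Without replacement: only part of the common group is drawn
    rest = rng.choice(rest_idx, int((1 - pct) * n) + 1, replace=False)
    idxs = np.hstack([greater, rest])
    rng.shuffle(idxs)
    return idxs[:n]

# Assumed toy data: 1000 targets, of which the first 5 are above the threshold
rng = np.random.default_rng(0)
targets = rng.uniform(0, 8, size=1000)
targets[:5] = 9.0

idxs = epoch_indices(targets, rng=rng)
counts = np.bincount(idxs, minlength=len(targets))

n_unique = int((counts > 0).sum())         # distinct examples seen this epoch
never_seen = int((counts == 0).sum())      # examples skipped this epoch
max_repeats_rare = int(counts[:5].max())   # most-repeated rare example

print(f"unique: {n_unique}, never seen: {never_seen}, "
      f"max repeats of a rare example: {max_repeats_rare}")
```

On this toy data each rare example is drawn roughly 20 times per epoch, while close to a tenth of the common examples are not drawn at all in that epoch. The skipped examples change from epoch to epoch because the draw is random, but the rare examples are repeated heavily every time, which may be contributing to the overfitting. The exact numbers depend on `pct` and on how many targets exceed the threshold in the real data.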