Let’s say I have a pandas dataframe with ‘name’ and ‘count’ columns. Is there a way to build a custom dataset so that, when I wrap it in a dataloader, rows with a higher ‘count’ are sampled more often, in proportion to their count relative to all the other rows?
You can do this by repeating the rows with “higher values” more often than the rows with lower values, so that they show up more often in the training data. E.g., you can repeat those rows 50% more often if you want them to carry 50% additional weight in the weight updates.
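For illustration, here is a minimal sketch of the repetition idea in pandas. The toy dataframe is made up for the example, and repeating each row exactly ‘count’ times is an assumption about how the proportions would be set.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({"name": ["a", "b", "c"], "count": [1, 2, 4]})

# Repeat each row `count` times so its frequency is proportional to its count.
repeated = df.loc[df.index.repeat(df["count"])].reset_index(drop=True)

print(len(repeated))  # number of rows equals the sum of the counts
```

Note that the repeated frame holds one row per unit of count, so its size grows with the sum of the counts rather than with the number of original rows.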
I thought about doing that, but this dataset has 5 million rows and some have counts of 19000+. Something more reasonable in terms of memory and computation is preferred.
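One option that avoids materializing repeated rows is to leave the dataset as it is and let the sampler do the weighting. Below is a minimal sketch, assuming PyTorch and its `WeightedRandomSampler`; the `NameCountDataset` class, the toy dataframe, and the batch size are hypothetical names and values chosen for the example.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler


class NameCountDataset(Dataset):
    """Hypothetical dataset wrapping a dataframe with 'name' and 'count' columns."""

    def __init__(self, df):
        self.names = df["name"].tolist()
        # One weight per row; the weights do not need to sum to 1.
        self.counts = torch.as_tensor(df["count"].to_numpy(), dtype=torch.double)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        return self.names[idx], self.counts[idx]


# Toy data standing in for the real 5-million-row dataframe.
df = pd.DataFrame({"name": ["a", "b", "c"], "count": [1, 10, 19000]})
dataset = NameCountDataset(df)

# Draw indices with probability proportional to each row's count.
sampler = WeightedRandomSampler(
    weights=dataset.counts,
    num_samples=len(dataset),  # how many draws make up one "epoch"
    replacement=True,          # allow high-count rows to be drawn repeatedly
)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for names, counts in loader:
    print(names, counts)
```

With `replacement=True`, each draw picks row i with probability count_i divided by the sum of all counts, and the only extra memory is one weight per row. As far as I know, the sampler draws through `torch.multinomial`, which caps the number of categories at 2^24, so 5 million rows should fit, but a much larger table would need a different approach.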