Hello,
The data I am trying to train my model on is stored as a table of unique values with two columns. The first column contains the datapoint value, and the second column contains an integer >= 1 giving that value's frequency in the dataset.
The data looks like this:
| Value | Frequency |
|-------|-----------|
| AAA   | 1         |
| BBB   | 1         |
| CCC   | 3         |
| DDD   | 1         |
| ...   | ...       |
| ZZZ   | 1         |
When training my model, I would like to account for the fact that some values are more frequent in the dataset than others. I therefore cannot just feed the model the values in the first column (a list of unique values), because every value would then be presented to the model once per epoch, whether its frequency is 1 or 100.
What I would like is for my DataLoader to sample randomly from the dataset, but to draw each row with an associated frequency of N exactly N times before the end of the epoch. It would be as if, instead of a list of unique values, I had one big list in which each value is repeated the correct number of times, and I were sampling from a shuffled copy of that list.
In other words, it should be equivalent to sampling from:
| Value |
|-------|
| AAA   |
| BBB   |
| CCC   |
| CCC   |
| CCC   |
| DDD   |
| ...   |
| ZZZ   |
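To make the intended behaviour concrete, here is a minimal sketch in plain Python of the index mapping I have in mind: it behaves like the expanded list without actually storing the repeats. The class name `FrequencyExpandedDataset` and the toy `values` / `frequencies` lists are made up for illustration, and I am not sure this is the idiomatic approach.

```python
import bisect
import random
from itertools import accumulate


class FrequencyExpandedDataset:
    """Hypothetical map-style view over (value, frequency) rows.

    Index i refers to the i-th item of the virtual expanded list, in which
    each value is repeated `frequency` times.
    """

    def __init__(self, values, frequencies):
        if len(values) != len(frequencies):
            raise ValueError("values and frequencies must have the same length")
        if any(f < 1 for f in frequencies):
            raise ValueError("every frequency must be an integer >= 1")
        self.values = list(values)
        # cumulative[k] = number of expanded items covered by rows 0..k
        self.cumulative = list(accumulate(frequencies))

    def __len__(self):
        # Total size of the virtual expanded list
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First row whose cumulative count exceeds the index
        row = bisect.bisect_right(self.cumulative, index)
        return self.values[row]


# Toy data mirroring the first rows of the table above
values = ["AAA", "BBB", "CCC", "DDD"]
frequencies = [1, 1, 3, 1]
dataset = FrequencyExpandedDataset(values, frequencies)

# One shuffled epoch: every expanded index is visited exactly once,
# so "CCC" is drawn three times and the other values once each.
order = list(range(len(dataset)))
random.Random(0).shuffle(order)
epoch = [dataset[i] for i in order]
```

Because the class defines `__len__` and `__getitem__`, my understanding is that it could be passed to `torch.utils.data.DataLoader` with `shuffle=True` as a map-style dataset, which should draw each row as many times per epoch as its frequency. I have not checked whether this is the recommended way to do it.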
What would be the best way of going about doing this?
Thank you very much in advance,
Yuta