Smooth Sampling Rate Adjustment for Different Datasets


When training a language model, we draw from multiple datasets concurrently, and we want to adjust the sampling ratios of the different datasets as training progresses. Ideally, this should support a nested structure: weights can be modified at each level for a category of datasets, and sampling proceeds layer by layer. Sampling without replacement does not need to be enforced globally; we only require that within each sub-dataset, no example is repeated until the sub-dataset has been fully traversed.
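Roughly, the behaviour I have in mind looks like the minimal sketch below. This is not an existing library; the class names (`LeafDataset`, `WeightedGroup`) and the weight values are just placeholders to illustrate the intended semantics: leaves are iterated without replacement and reshuffled only after a full pass, while each internal node picks a child per draw according to weights that can be changed mid-training.

```python
import random


class LeafDataset:
    """Yields items without replacement; reshuffles only after a full traversal."""

    def __init__(self, items, seed=0):
        self.items = list(items)
        self.rng = random.Random(seed)
        self._order = []  # remaining indices for the current pass

    def sample(self):
        if not self._order:  # full traversal finished -> start a new shuffled pass
            self._order = list(range(len(self.items)))
            self.rng.shuffle(self._order)
        return self.items[self._order.pop()]


class WeightedGroup:
    """Internal node: on each draw, picks one child according to its current weights."""

    def __init__(self, children, weights, seed=0):
        self.children = children  # LeafDataset or nested WeightedGroup instances
        self.weights = list(weights)
        self.rng = random.Random(seed)

    def set_weights(self, weights):
        self.weights = list(weights)  # adjust ratios as training progresses

    def sample(self):
        child = self.rng.choices(self.children, weights=self.weights, k=1)[0]
        return child.sample()  # recurse layer by layer


# Example: two top-level categories, with weights shifted partway through.
code = LeafDataset([f"code_{i}" for i in range(3)])
web = LeafDataset([f"web_{i}" for i in range(5)])
books = LeafDataset([f"book_{i}" for i in range(4)])
root = WeightedGroup([WeightedGroup([web, books], [0.7, 0.3]), code], [0.8, 0.2])

for step in range(10):
    if step == 5:
        root.set_weights([0.5, 0.5])  # e.g. upweight the code category later on
    print(root.sample())
```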

Is there an existing solution for this?
