Smooth Sampling Rate Adjustment for Different Datasets


When training a language model, we draw from multiple datasets concurrently, and we want to adjust the sampling ratios of the different datasets as training progresses. Ideally, this should support a nested structure: weights can be modified at each level for a category of datasets, and sampling proceeds layer by layer. Sampling without replacement does not need to be enforced globally; we only require that within each sub-dataset, no example is repeated until the sub-dataset has been fully traversed.
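Roughly, the behaviour I have in mind looks like the minimal sketch below. This is not an existing library; the class names (`LeafDataset`, `WeightedGroup`) and the weight values are just placeholders to illustrate the intended semantics: leaves are iterated without replacement and reshuffled only after a full pass, while each internal node picks a child per draw according to weights that can be changed mid-training.

```python
import random


class LeafDataset:
    """Yields items without replacement; reshuffles only after a full traversal."""

    def __init__(self, items, seed=0):
        self.items = list(items)
        self.rng = random.Random(seed)
        self._order = []  # remaining indices for the current pass

    def sample(self):
        if not self._order:  # full traversal finished -> start a new shuffled pass
            self._order = list(range(len(self.items)))
            self.rng.shuffle(self._order)
        return self.items[self._order.pop()]


class WeightedGroup:
    """Internal node: on each draw, picks one child according to its current weights."""

    def __init__(self, children, weights, seed=0):
        self.children = children  # LeafDataset or nested WeightedGroup instances
        self.weights = list(weights)
        self.rng = random.Random(seed)

    def set_weights(self, weights):
        self.weights = list(weights)  # adjust ratios as training progresses

    def sample(self):
        child = self.rng.choices(self.children, weights=self.weights, k=1)[0]
        return child.sample()  # recurse layer by layer


# Example: two top-level categories, with weights shifted partway through.
code = LeafDataset([f"code_{i}" for i in range(3)])
web = LeafDataset([f"web_{i}" for i in range(5)])
books = LeafDataset([f"book_{i}" for i in range(4)])
root = WeightedGroup([WeightedGroup([web, books], [0.7, 0.3]), code], [0.8, 0.2])

for step in range(10):
    if step == 5:
        root.set_weights([0.5, 0.5])  # e.g. upweight the code category later on
    print(root.sample())
```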

Is there an existing solution for this?
