What is the best way to sample batches from different datasets in sequence in DDP

I have many datasets with different sample shapes that I would like to train on jointly. More specifically, each batch should contain samples from only one dataset, drawn randomly from that dataset, and the datasets should be visited in sequence from batch to batch. In short, the training looks like this (taking 3 datasets for illustration): [batch 1 from dataset 1]-[batch 2 from dataset 2]-[batch 3 from dataset 3]-[batch 4 from dataset 1]- etc…
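For concreteness, here is a minimal single-GPU sketch of the batch order I have in mind (the toy datasets, sizes, and batch size are purely illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy datasets with different sample shapes (illustrative only).
datasets = [
    TensorDataset(torch.randn(100, 8)),
    TensorDataset(torch.randn(100, 16)),
    TensorDataset(torch.randn(100, 32)),
]
loaders = [DataLoader(ds, batch_size=4, shuffle=True) for ds in datasets]
iters = [iter(dl) for dl in loaders]

for step in range(30):
    # Round-robin over datasets: step 0 -> dataset 1, step 1 -> dataset 2, ...
    i = step % len(loaders)
    try:
        (batch,) = next(iters[i])
    except StopIteration:
        iters[i] = iter(loaders[i])  # restart an exhausted loader
        (batch,) = next(iters[i])
    # `batch` contains samples from dataset i only
    print(step, i, batch.shape)
```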
On the Internet, I found two commonly used approaches to achieve this: constructing one dataloader per dataset, or writing a custom batch sampler. I implemented both and ran some simple experiments, and neither seems well optimized for DDP training with multiple nodes and GPUs: the wall time per batch during DDP training is significantly larger than when training on a single GPU (e.g., 0.8 s vs. 0.3 s). I suspect the bottleneck is the CPU and data I/O, since the data are loaded on the fly. To reduce the I/O time, I set num_workers=4 for each dataloader, and in the batch sampler approach I also set num_replicas=4 for each distributed sampler. However, when I have many datasets (>10), the number of processes I create exceeds the number of available CPU cores, which causes a huge overhead.
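To make the batch-sampler variant concrete, here is a simplified sketch of the kind of sampler I mean: one DistributedSampler per dataset feeding a single DataLoader over a ConcatDataset. The class name and details are illustrative, not my exact code.

```python
import torch.distributed as dist
from torch.utils.data import ConcatDataset, DataLoader, DistributedSampler

class InterleavedBatchSampler:
    """Sketch for illustration: yield index batches for a ConcatDataset such
    that each batch comes from a single underlying dataset, cycling through
    the datasets in order. Per-dataset DistributedSampler instances handle
    the sharding across ranks."""
    def __init__(self, concat_dataset, batch_size):
        self.batch_size = batch_size
        # Offset of each underlying dataset inside the ConcatDataset index space.
        self.starts = [0] + concat_dataset.cumulative_sizes[:-1]
        self.samplers = [DistributedSampler(d) for d in concat_dataset.datasets]

    def set_epoch(self, epoch):
        for s in self.samplers:
            s.set_epoch(epoch)

    def __iter__(self):
        iters = [iter(s) for s in self.samplers]
        active = [True] * len(iters)
        while any(active):
            for i, it in enumerate(iters):
                if not active[i]:
                    continue
                batch = []
                for _ in range(self.batch_size):
                    try:
                        batch.append(self.starts[i] + next(it))
                    except StopIteration:
                        active[i] = False  # this dataset is exhausted for the epoch
                        break
                if len(batch) == self.batch_size:
                    yield batch

    def __len__(self):
        return sum(len(s) // self.batch_size for s in self.samplers)


# Usage (inside an initialized DDP process group), with hypothetical `datasets`:
# concat = ConcatDataset(datasets)
# batch_sampler = InterleavedBatchSampler(concat, batch_size=4)
# loader = DataLoader(concat, batch_sampler=batch_sampler, num_workers=4)
```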

I wonder if my reasoning makes sense here. If so, is there a recommended way to fix it, or perhaps a better approach to achieve the multi-dataset sampling behavior I described?

Many thanks.