I recommend using a custom sampler.
Related thread: DistributedSampler
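By default, DistributedSampler splits the dataset evenly across the processes (usually one per GPU). When the dataset size is not divisible by the number of processes, it pads the index list by repeating samples so that every process gets the same count, which means some samples are seen more than once.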
In the thread above, I provided an example modification of the sampler that avoids duplicating data.
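For illustration, here is a minimal sketch of what such a sampler could look like. It is not necessarily the code from the linked thread. The class name `NoDuplicateDistributedSampler` and its arguments are hypothetical, and it is written in plain Python on the assumption that `DataLoader` accepts any iterable with `__len__` as its `sampler`. In a real job, `num_replicas` and `rank` would typically come from `torch.distributed.get_world_size()` and `torch.distributed.get_rank()`.

```python
import random


class NoDuplicateDistributedSampler:
    """Hypothetical sampler: shards indices across ranks without padding.

    Because nothing is padded, shard lengths may differ by one when the
    dataset size is not divisible by the number of processes.
    """

    def __init__(self, dataset_len, num_replicas, rank, shuffle=False, seed=0):
        if not 0 <= rank < num_replicas:
            raise ValueError("rank must be in [0, num_replicas)")
        self.dataset_len = dataset_len
        self.num_replicas = num_replicas
        self.rank = rank
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call once per epoch so every rank reshuffles identically.
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.dataset_len))
        if self.shuffle:
            # Same seed on every rank gives the same permutation everywhere.
            random.Random(self.seed + self.epoch).shuffle(indices)
        # Strided slice: no padding, so no index is repeated across ranks.
        return iter(indices[self.rank::self.num_replicas])

    def __len__(self):
        return len(range(self.rank, self.dataset_len, self.num_replicas))


# 10 samples over 4 processes: every index is used exactly once.
shards = [list(NoDuplicateDistributedSampler(10, 4, r)) for r in range(4)]
```

Note that the shards are no longer the same length on every process. That is usually fine for evaluation, but during DDP training an uneven number of batches per process can stall gradient synchronization, so this kind of sampler is best suited to validation and test loops.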