How to use my own sampler when I already use DistributedSampler?

Thanks for the info.

I checked, and yes - you’re right.

Just found DistributedSamplerWrapper from here. It allows you to wrap a DistributedSampler on top of an existing sampler. It might be a good feature to add to PyTorch!
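
For context, here is a condensed sketch of the idea behind such a wrapper (the names DatasetFromSampler and DistributedSamplerWrapper follow the Catalyst code referenced here, but this is a simplified approximation, not the library's exact implementation): the wrapped sampler's indices are exposed as a tiny dataset, and a plain DistributedSampler then splits positions into that list across processes.

    from operator import itemgetter
    from torch.utils.data import Dataset, DistributedSampler

    class DatasetFromSampler(Dataset):
        """Exposes the indices produced by a sampler as a dataset."""
        def __init__(self, sampler):
            self.sampler = sampler
            self.sampler_list = None

        def __getitem__(self, index):
            if self.sampler_list is None:
                # materialize the wrapped sampler's indices once per iterator
                self.sampler_list = list(self.sampler)
            return self.sampler_list[index]

        def __len__(self):
            return len(self.sampler)

    class DistributedSamplerWrapper(DistributedSampler):
        """Splits the indices of an arbitrary sampler across processes."""
        def __init__(self, sampler, num_replicas=None, rank=None, shuffle=True):
            super().__init__(DatasetFromSampler(sampler),
                             num_replicas=num_replicas, rank=rank, shuffle=shuffle)
            self.sampler = sampler

        def __iter__(self):
            # rebuild the view so the inner sampler is re-drawn this epoch
            self.dataset = DatasetFromSampler(self.sampler)
            indexes_of_indexes = super().__iter__()
            # map the positions picked by DistributedSampler back to the
            # indices the wrapped sampler actually produced
            return iter(itemgetter(*indexes_of_indexes)(self.dataset))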


I think this is indeed the most natural solution: a sub-sampler to be passed to the distributed sampler.

It is good that Catalyst offers this, but I am wondering whether this component should belong to Catalyst or to PyTorch.

This cannot be right, @ptrblck: you do not shuffle the targets. When this code modifies the indices, it does not do the same for the targets. Could you please assist by providing a modified implementation? Thanks.

I don’t understand your claim. Could you explain a bit more what is not shuffled in which code and what is expected?

Someone can correct me if I’m wrong, but does this solution correctly coordinate the (non-distributed) sampler’s sampling among processes?

I think this implementation works if you use it to return indices sequentially, but I’ve been trying to make it work with SubsetRandomSampler (I would like it to work with a general sampler, including randomized ones).

As far as I can tell, every time you fetch a new iterator it returns a new randomized list of indices (using SubsetRandomSampler’s __iter__), and this randomization is not coordinated among processes. This means that although each DistributedSamplerWrapper is using its own subset of DistributedSampler indices to access the list of indices found in DatasetFromSampler, each process has a different randomized list of indices in DatasetFromSampler, and so the subsets are no longer guaranteed to be separate.

This problem is fixed in DistributedSampler, as well as in the DistributedWeightedSampler example above, by shuffling deterministically based on the epoch.
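
Roughly (and modulo the fact that newer PyTorch versions also mix a base seed into this), DistributedSampler starts its __iter__ like the following, so every process draws the same permutation for a given epoch as long as set_epoch(epoch) is called on the sampler each epoch:

    # simplified sketch of the start of DistributedSampler.__iter__: the
    # permutation is seeded by the epoch, so all processes shuffle
    # identically before splitting the indices among themselves
    g = torch.Generator()
    g.manual_seed(self.epoch)
    indices = torch.randperm(len(self.dataset), generator=g).tolist()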

@glyn, @ptrblck thank you for the implementations. I found an issue (at least for me): my whole dataset is unbalanced, not only the “subsampled” versions of it. So I wrote this version of __iter__ where I calculate the weights and do the weighted sampling on the whole self.dataset.targets, and then subsample the balanced indices for each GPU. Without this topic I would have been stuck always predicting the most numerous class of my dataset.

    def __iter__(self):
        # deterministically shuffle based on epoch
        g = torch.Generator()
        g.manual_seed(self.epoch)
        if self.shuffle:
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
        else:
            indices = list(range(len(self.dataset)))

        # add extra samples to make it evenly divisible
        indices += indices[:(self.total_size - len(indices))]
        assert len(indices) == self.total_size

        # subsample indices
        indices = indices[self.rank:self.total_size:self.num_replicas]
        assert len(indices) == self.num_samples

        # get targets (you can alternatively pass them in __init__, if this op is expensive)
        targets = self.dataset.targets.clone()
        # calculate weights on the complete targets
        weights = self.calculate_weights(targets)
        # do the weighted sampling
        subsample_balanced_indices = torch.multinomial(weights, self.total_size, self.replacement)
        # subsample the balanced indices
        subsample_balanced_indices = subsample_balanced_indices[indices]

        return iter(subsample_balanced_indices.tolist())

I think your second to last line is wrong. It should be:

        subsample_balanced_indices = indices[subsample_balanced_indices]
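
For completeness, a rough usage sketch, whichever of the two indexing variants you settle on. The constructor arguments and the set_epoch call below are assumptions based on the sampler subclassing DistributedSampler, so adjust them to your actual class:

    import torch.distributed as dist
    from torch.utils.data import DataLoader

    # hypothetical wiring: DistributedWeightedSampler is the custom sampler
    # discussed above and is assumed to follow DistributedSampler's interface
    sampler = DistributedWeightedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
    )
    loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)

    for epoch in range(num_epochs):
        # keep the epoch-seeded shuffle in sync across processes
        sampler.set_epoch(epoch)
        for batch in loader:
            ...  # training step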