Weird performance differences with RandomSampler/torch.randperm

Not sure what I am doing wrong, but I've hit a weird issue: if I use torch.utils.data.RandomSampler with torch.utils.data.DataLoader (i.e. DataLoader(shuffle=True)), I get dramatically different F1 scores depending on whether I load identical data from a local drive or from an NFS drive. Both locations contain exactly the same files (verified with md5sum on each file). If I substitute my own version of RandomSampler (below), the performance difference disappears. Assuming my code is correct, I think the culprit might be torch.randperm (the issue occurs on both CPU and GPU).
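In case it helps anyone narrow this down, here is a minimal sketch (toy dataset and seed are my own placeholders, not from my actual setup) that prints the index order the stock sampler produces; comparing this output between a local-drive run and an NFS run should show whether the shuffle order itself diverges:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset whose values are just their own indices, so the printed
# batches directly reveal the shuffle order.
dataset = TensorDataset(torch.arange(100))

torch.manual_seed(0)  # arbitrary seed, only so runs are comparable
loader = DataLoader(dataset, batch_size=10, shuffle=True)  # stock RandomSampler -> torch.randperm

for (batch,) in loader:
    print(batch.tolist())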

import numpy as np
import torch

class RandomSampler(torch.utils.data.RandomSampler):
    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            # Draw n indices uniformly with replacement.
            return iter(np.random.randint(0, n, size=n).tolist())
        # Without replacement: shuffle a full index array in place.
        arr = np.arange(n)
        np.random.shuffle(arr)
        return iter(arr.tolist())
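Swapping it in is just a matter of passing it as the sampler (batch size here is a placeholder); note that shuffle must be left unset when an explicit sampler is given:

sampler = RandomSampler(dataset)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)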

I would post a minimal working example, but I'm not sure how to reduce my working code to something that reproduces this (and the dataset is over 1 GB).

Is this effect reproducible, i.e. how many times have you trained your model from the local drive and from the NFS drive?

I’ve been able to reproduce it dozens of times now (took me a long time to figure out this was the difference). Very puzzled as to why.