Hello,
We're currently trying to train a model on a large dataset (>60M samples).
However, during training we notice that our CPU memory usage steadily increases.
This also happens when we use a dummy Dataset, e.g.:
```python
import torch
from torch.utils.data import Dataset

class DummyDataset(Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 60_000_000

    def __getitem__(self, idx: int):
        return torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]), torch.tensor([7.0])
```
We're using num_workers > 0, and each worker seems to add an additional ~1.5 GB of RAM consumption.
Our assumption is that this happens due to shuffle=True in our DataLoader: the RandomSampler permutes the indices for sampling and converts them into a list of Python ints (here).
For our example, and on our machine (where sys.getsizeof(some_int) == 28), the size of this list would then be around 28 B * 60,000,000 ≈ 1.68 GB.
Even though fork should share this memory copy-on-write, Python's reference counting presumably writes to those pages, so each worker ends up with its own copy, which would explain the memory consumption.
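To sanity-check the estimate, here is a small script we used (note that sys.getsizeof counts only the int objects; the list's pointer array adds roughly another 8 B per element on top):

```python
import sys
import torch

n = 1_000_000  # scaled down from 60M so this runs quickly
indices = torch.randperm(n).tolist()  # what RandomSampler builds internally

per_int = sys.getsizeof(indices[0])  # 28 bytes on our machine
print(f"int objects: {per_int * n / 1e6:.0f} MB for n={n}")
print(f"list itself: {sys.getsizeof(indices) / 1e6:.0f} MB of pointers")
# Extrapolated to 60M indices: 28 B * 60e6 ≈ 1.68 GB for the int objects,
# plus roughly 8 B * 60e6 ≈ 0.48 GB for the list's pointer array.
```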
Is there a solution or alternative that would still enable shuffling the dataset after each epoch?
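One idea we've been experimenting with ourselves (a rough sketch, not yet validated; NumpyRandomSampler is just our own name for it) is a custom sampler that keeps the permutation in a numpy array instead of a list of Python ints, since numpy's buffer has no per-element refcounts and should therefore stay shared copy-on-write after fork:

```python
import numpy as np
from torch.utils.data import DataLoader, Sampler

class NumpyRandomSampler(Sampler):
    """Fresh permutation each epoch, stored as a single int64 numpy array
    (~480 MB for 60M indices) instead of a list of Python ints."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # numpy's storage has no per-element refcounts, so forked workers
        # should be able to share these pages copy-on-write.
        return iter(np.random.permutation(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

dataset = DummyDataset()
loader = DataLoader(
    dataset,
    batch_size=256,
    sampler=NumpyRandomSampler(dataset),  # shuffle must stay False with a custom sampler
    num_workers=4,
)
```

We're not sure this is the intended way to solve it, hence the question.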
Furthermore, are there any best practices for shuffling large datasets, especially ones that eventually no longer fit into memory?
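For the out-of-memory case, the pattern we've seen elsewhere (e.g. tf.data's shuffle buffer, or shard shuffling in WebDataset) is approximate shuffling through a bounded buffer over a streamed dataset; a minimal sketch of what we mean (buffered_shuffle is just an illustrative helper, not a PyTorch API):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(source: Iterable[T], buffer_size: int = 10_000) -> Iterator[T]:
    """Approximate shuffle in O(buffer_size) memory: keep a buffer of
    samples and emit a random one whenever a new sample streams in."""
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        pick = random.randrange(buffer_size)
        yield buffer[pick]
        buffer[pick] = item
    random.shuffle(buffer)  # drain whatever is left at the end
    yield from buffer
```

Is something like this, combined with reshuffling the shard order each epoch, the recommended approach?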
Thank you for your help!
The same (or a similar) problem was also mentioned in this issue.