Hello,
We're currently trying to train a model on a large dataset (>60M samples).
However, during training we notice that our CPU memory usage steadily increases.
This also happens when we use a dummy Dataset, e.g.:
```python
import torch
from torch.utils.data import Dataset

class DummyDataset(Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 60_000_000

    def __getitem__(self, idx: int):
        return torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]), torch.tensor([7.0])
```
We're using num_workers > 0, and each worker seems to add an additional ~1.5 GB of RAM consumption.
Our assumption is that this happens due to shuffle=True in our DataLoader: the RandomSampler permutes the indices for sampling and converts them into a list of Python ints (here).
For our example, and on our machine (where sys.getsizeof(some_int) == 28), the size of this list would then be around 28 B * 60,000,000 ≈ 1.68 GB.
Even though fork should share this memory copy-on-write, Python's reference counting presumably writes to those pages, so each worker ends up with its own copy, which would explain the memory consumption.
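To sanity-check the estimate, here is a small script we used (note that sys.getsizeof counts only the int objects; the list's pointer array adds roughly another 8 B per element on top):

```python
import sys
import torch

n = 1_000_000  # scaled down from 60M so this runs quickly
indices = torch.randperm(n).tolist()  # what RandomSampler builds internally

per_int = sys.getsizeof(indices[0])  # 28 bytes on our machine
print(f"int objects: {per_int * n / 1e6:.0f} MB for n={n}")
print(f"list itself: {sys.getsizeof(indices) / 1e6:.0f} MB of pointers")
# Extrapolated to 60M indices: 28 B * 60e6 ≈ 1.68 GB for the int objects,
# plus roughly 8 B * 60e6 ≈ 0.48 GB for the list's pointer array.
```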
Is there a solution or alternative that would still enable shuffling the dataset after each epoch?
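One idea we've been experimenting with ourselves (a rough sketch, not yet validated; NumpyRandomSampler is just our own name for it) is a custom sampler that keeps the permutation in a numpy array instead of a list of Python ints, since numpy's buffer has no per-element refcounts and should therefore stay shared copy-on-write after fork:

```python
import numpy as np
from torch.utils.data import DataLoader, Sampler

class NumpyRandomSampler(Sampler):
    """Fresh permutation each epoch, stored as a single int64 numpy array
    (~480 MB for 60M indices) instead of a list of Python ints."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # numpy's storage has no per-element refcounts, so forked workers
        # should be able to share these pages copy-on-write.
        return iter(np.random.permutation(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

dataset = DummyDataset()
loader = DataLoader(
    dataset,
    batch_size=256,
    sampler=NumpyRandomSampler(dataset),  # shuffle must stay False with a custom sampler
    num_workers=4,
)
```

We're not sure this is the intended way to solve it, hence the question.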
Furthermore, are there any best practices for shuffling large datasets, especially ones that eventually no longer fit into memory?
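For the out-of-memory case, the pattern we've seen elsewhere (e.g. tf.data's shuffle buffer, or shard shuffling in WebDataset) is approximate shuffling through a bounded buffer over a streamed dataset; a minimal sketch of what we mean (buffered_shuffle is just an illustrative helper, not a PyTorch API):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(source: Iterable[T], buffer_size: int = 10_000) -> Iterator[T]:
    """Approximate shuffle in O(buffer_size) memory: keep a buffer of
    samples and emit a random one whenever a new sample streams in."""
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        pick = random.randrange(buffer_size)
        yield buffer[pick]
        buffer[pick] = item
    random.shuffle(buffer)  # drain whatever is left at the end
    yield from buffer
```

Is something like this, combined with reshuffling the shard order each epoch, the recommended approach?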
Thank you for your help!
The same (or a similar) problem was also mentioned in this issue.