DataLoader memory usage keeps increasing

Hello,
I am trying to use PyTorch's Dataset and DataLoader to load a large dataset of several hundred GB. This is of course too large to be stored in RAM, so parallel, lazy loading is needed. I am trying to load one large HDF5 file with a combination of a custom Dataset and the DataLoader. This happens on a cluster where jobs are submitted with HTCondor, so I can request an amount of RAM for my jobs.
However, when I try to load data, the job always runs out of allocated memory and gets terminated, regardless of how much memory I request. The job log shows the used RAM continuously increasing until it overshoots the allocated memory at some point and the job is killed. Of course I have tried to increase the requested memory, but the RAM still overflows eventually.

The Dataset code:

import os

import h5py
import numpy as np
import torch


class DataSet(torch.utils.data.Dataset):
    def __init__(self, file_name, file_dir):
        path = os.path.join(file_dir, file_name)
        file = h5py.File(path, 'r')
        # h5py datasets are lazy handles; nothing is read into RAM here yet
        self.labels = file['labels']
        self.specs = file['spectrogram']
        # some irrelevant lines defining self.means, self.stds, self.specmean, self.specstd, ...

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        label = self.labels[idx]
        label = torch.from_numpy(label)
        label = (label - self.means) / self.stds

        spec = self.specs[idx, ...]
        spec = np.array(spec).astype(float)
        spec = (torch.from_numpy(spec) - self.specmean) / self.specstd
        spec = torch.unsqueeze(spec, dim=0)  # feature dimension (convolutional layer)

        return label, spec

The main code I have is:

train_set = DataSet(
    filename, 
    data_directory
)

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256, 
    shuffle=True, 
    drop_last=True, 
    num_workers=12, 
    pin_memory=False
)

for index, (labels, conditions) in enumerate(train_loader):
    labels, conditions = labels.to('cuda'), conditions.to('cuda')
    del labels, conditions

The body of the for loop is essentially empty, since I am only trying to monitor the loading of my data.
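For reference, this is roughly how the memory of the main process can be watched from inside the script (assuming psutil is available; the worker processes only show up in the HTCondor log or in htop):

import os
import psutil

process = psutil.Process(os.getpid())

for index, (labels, conditions) in enumerate(train_loader):
    if index % 50 == 0:
        # resident set size of the parent process only
        print(f"batch {index}: RSS = {process.memory_info().rss / 1e9:.2f} GB")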

So I generated some smaller files and checked what happens when I load those. I have one file of ~7 GB and one of ~70 GB, which I loaded with the same code. Iterating over the 7 GB file takes around 3 s and does not run out of memory, but it uses more RAM than I would expect: on one run the memory usage was almost 8 GB. That is more than the size of the file itself, which suggests to me that the whole file is being loaded into RAM. This is exactly what I wanted to avoid, and it would explain why the job on the larger file fails.

The 70 GB file again cannot be loaded, even when I request more than 70 GB of RAM (it would probably finish at some point, but I did not want to waste even more time and resources watching the used memory increase).
When I instead wrap an additional range(10) for loop around the loading of the 7 GB file, it does work, and it still runs fairly quickly at ~30 s. But the used RAM went up to 36 GB in that particular run, which is far larger than the file itself and should not happen; this looks like yet another memory leak. I can work around it by running gc.collect() after each pass over the file in the outer loop (sketched below), but I would prefer the leak not to be there at all.
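The workaround looks roughly like this (dataset and loader constructed as above):

import gc

for repetition in range(10):
    # iterate over the full 7 GB file once
    for index, (labels, conditions) in enumerate(train_loader):
        pass

    # without this, the RSS grows with every repetition (36 GB after 10 passes);
    # with it, the memory is released between passes
    gc.collect()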

Some remarks:

  • When opening the HDF file's datasets with file['dataset'], the data is not loaded yet. Only when I access an event with file['dataset'][index] is the actual data read. (I have also checked this using @profile from memory_profiler on the __init__() of the Dataset; see the sketch after this list.)
  • I am using chunking in my HDF files
  • pin_memory = False seems to be the better choice for me (faster, less memory usage)
  • reducing the batch size did not seem to help
  • removing the normalization steps from __getitem__() does not stop the memory leak
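For completeness, this is roughly how I checked the lazy opening, assuming memory_profiler is installed (line-by-line memory increments are printed when the decorated __init__ runs):

import os

import h5py
import torch
from memory_profiler import profile

class DataSet(torch.utils.data.Dataset):
    @profile
    def __init__(self, file_name, file_dir):
        path = os.path.join(file_dir, file_name)
        file = h5py.File(path, 'r')
        # neither of the following lines shows a memory increment in the
        # profiler output, i.e. the h5py datasets are only handles at this point
        self.labels = file['labels']
        self.specs = file['spectrogram']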

This seems to be an issue with large files plus some kind of memory leak in my code that I have not been able to find yet. To me it looks like the code tries to load the entire file at some point. I have seen some other posts here describing similar problems, but no solution yet. If you can spot my mistake I would be very happy. An alternative might be to write many small files and load them together with one DataLoader (roughly as sketched below), but I could not make that work either so far, so pointers on that approach would also be appreciated. I have been stuck on this far too long now, so any help would be much appreciated!
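What I mean by the many-small-files alternative, as a rough sketch (the shard naming is made up, the per-file Dataset is the class from above, and ConcatDataset comes from torch.utils.data):

import glob
import os

import torch

# hypothetical directory containing many small HDF5 shards instead of one big file
shard_paths = sorted(glob.glob(os.path.join(data_directory, 'shard_*.h5')))

shards = [DataSet(os.path.basename(p), os.path.dirname(p)) for p in shard_paths]
train_set = torch.utils.data.ConcatDataset(shards)

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,
    shuffle=True,
    num_workers=12,
)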

One issue I see in your code is that each worker is going to use the entire dataset (it should use a subset). But that doesn’t explain the increasing memory.

There may be something going on with hdf5. The classic reason for this behaviour is storing the data in Python lists; see this issue: DataLoader num_workers > 0 causes CPU memory from parent process to be replicated in all worker processes · Issue #13246 · pytorch/pytorch · GitHub
It doesn't seem like you're doing that, though.

This is a good blog post on the topic: Demystify RAM Usage in Multi-Process Data Loaders - Yuxin's Blog
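A minimal sketch of the pattern that issue describes (not your code): with a Dataset backed by a plain Python list and num_workers > 0, merely reading the elements in forked workers touches their reference counts, so the copy-on-write pages get duplicated and each worker's RSS creeps toward the full list size. Storing the same data as one numpy array avoids that.

import numpy as np
import torch

class ListBackedDataset(torch.utils.data.Dataset):
    def __init__(self):
        # a big Python list of small arrays: every access touches refcounts,
        # so forked DataLoader workers gradually copy the pages holding it
        self.data = [np.random.randn(128).astype(np.float32) for _ in range(1_000_000)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.from_numpy(self.data[idx])

class ArrayBackedDataset(torch.utils.data.Dataset):
    def __init__(self):
        # the same data as one contiguous numpy array: the read-only pages stay shared
        self.data = np.random.randn(1_000_000, 128).astype(np.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.from_numpy(self.data[idx])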

Thanks for the answer. However, the blog post and that GitHub discussion both start from big lists/arrays that are already loaded into memory completely, if I'm not mistaken, which technically should not happen with an HDF5 file. I would think the copying from multiprocessing described in the blog post is probably still happening and filling my RAM, though, so I might be able to fix that part. My first attempt was to set num_workers=0 to avoid multiprocessing, but that did not seem to help much; I still saw memory consumption on the order of the file size.
What do you mean by the subsets? Should I make subset objects from torch.utils.data…? Should I create one DataLoader with one worker for each of these subsets?

Ignore my comment about subsetting data, I was wrong.

If you still see the memory growth with num_workers=0, then it's not related to the issues I mentioned. I would check the rest of your code for whether it accumulates more data as training continues.

I am experiencing the same issue. I created a custom dataset that reads 2 GB of data which I want to keep in memory for the entire program (5 epochs). I tried two versions of the approach (sketched below):

Version 1: the custom dataset is initialized with only the storage path and I load the data from storage as needed. This program takes less than 1 GB of host memory.
Version 2: the custom dataset loads the entire data into memory while initializing. Then I pass the dataset to the DataLoader. I allocated 6 GB of memory.
With Version 2, I ran out of memory during the 2nd epoch, even though I checked that the initialization happens only once.
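Roughly, the two variants look like this (names and the reading helpers are placeholders, not my exact code):

import torch

class LazyDataset(torch.utils.data.Dataset):
    # Version 1: keep only the path, read each item from storage on demand
    def __init__(self, path):
        self.path = path

    def __len__(self):
        return num_items(self.path)  # placeholder

    def __getitem__(self, idx):
        return read_item_from_disk(self.path, idx)  # placeholder reader

class EagerDataset(torch.utils.data.Dataset):
    # Version 2: read the full ~2 GB into memory once, in __init__
    def __init__(self, path):
        self.data = load_all(path)  # placeholder, returns the whole dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]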

I also ran out of memory even though I am only loading the data in my custom Dataset's __init__; it reaches the memory limit. The code structure is as below:

def run(rank, size, loss_fn, comm, imagenet_config):

    # Initializing dataset and dataloader.
    # The file is read here, in CustomDataset.__init__(); the data is not even
    # accessed in __getitem__(), and I verified with print statements that the
    # initialization happens only once.
    dataset = CustomDataset()

    epoch_no = 3
    dataloader = DataLoader(dataset, num_workers=1)

    for epoch in range(epoch_no):
        for data, label in dataloader:
            continue

In this setup I run out of memory. To rule out confounding factors, I used a single worker and tried reading the file in different ways and from different locations.

  • I also tried reading the data outside the custom Dataset, in a separate class, and accessing the loaded data from __getitem__() through an instance of that data-loading class. That did not work either; same memory-limit issue.

Please let me know if anyone has solved this situation before.

Thanks

Dataloader’s memory usage keeps increasing during one single epoch. · Issue #20433 · pytorch/pytorch (github.com)

This discussion is probably the one that can help you fix the issue. In short, except for these options:

  • replacing the list with a numpy array
  • wrapping the list in multiprocessing.Manager
  • encoding a list of strings as a numpy array of integers

everything else creates a copy on access, which drastically increases the memory. This is a PyTorch-side memory issue; I hope they fix it soon.

So, converting your HDF5 data to one of these structures can solve the problem, I believe. I solved mine by converting PIL images to numpy arrays: I had to convert each image to an np.array and then enclose all of them in one final np.array (roughly as sketched below). If I keep any of these as a Python list, the memory issue occurs.
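A rough sketch of that conversion (the image paths are placeholders, and it assumes all images have the same shape so they can be stacked):

import numpy as np
import torch
from PIL import Image

class ImageDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths):
        # convert each PIL image to an np.array and stack them into one final
        # np.array; storing a Python list of per-image arrays here instead is
        # what caused the worker memory to keep growing in my case
        arrays = [np.asarray(Image.open(p).convert("RGB")) for p in image_paths]
        self.images = np.stack(arrays)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return torch.from_numpy(self.images[idx])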

I hope this will solve your problem too.

Thanks.

An update on this:
I am currently dealing with several issues. The DataLoader “memory leak” can apparently be fixed as one of the comments suggests; in particular, the blog post Demystify RAM Usage … highlighted in one of the comments is a helpful contribution. However, I could not confirm this yet, since I have an additional issue with the storage of my data, which I will not go into here. My current workaround is a denser dataset that fits into my RAM directly, without any use of the DataLoader. This works very well, but at some point I may come back here with additional comments on larger datasets.