Hello,
I am trying to use PyTorch's Dataset and DataLoader to load a large dataset of several hundred GB. This is of course far too large to keep in RAM, so parallel, lazy loading is needed. I am trying to load one large HDF5 file with a combination of a custom Dataset and a DataLoader. This runs on a cluster where jobs are submitted via HTCondor, so I can request a fixed amount of RAM per job.
However, when I try to load the data, the job always runs out of its allocated memory and gets terminated, regardless of how much memory I request. In the job log I can see the used RAM increasing continuously until it overshoots the allocation at some point and the job is killed. I have of course tried increasing the requested memory, but the RAM still overflows eventually.
The Dataset code:
import os

import h5py
import numpy as np
import torch


class DataSet(torch.utils.data.Dataset):
    def __init__(self, file_name, file_dir):
        path = os.path.join(file_dir, file_name)
        file = h5py.File(path, 'r')
        self.labels = file['labels']
        self.specs = file['spectrogram']
        # some irrelevant lines defining self.means, self.stds, ...

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        label = self.labels[idx]
        label = torch.from_numpy(label)
        label = (label - self.means) / self.stds
        spec = self.specs[idx, ...]
        spec = np.array(spec).astype(float)
        spec = (torch.from_numpy(spec) - self.specmean) / self.specstd
        spec = torch.unsqueeze(spec, dim=0)  # add feature dimension (for the convolutional layer)
        return label, spec
The main code I have is:
train_set = DataSet(
    filename,
    data_directory,
)
train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,
    shuffle=True,
    drop_last=True,
    num_workers=12,
    pin_memory=False,
)

for index, (labels, conditions) in enumerate(train_loader):
    labels, conditions = labels.to('cuda'), conditions.to('cuda')
    del labels, conditions
The body of the for loop is essentially empty, since at this point I am only trying to monitor the loading of my data.
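For completeness, this is roughly how the growth can be watched from inside the loop (a sketch only; in practice I read the numbers from the HTCondor job log, so psutil is an assumption on my side):

import os
import psutil

process = psutil.Process(os.getpid())

for index, (labels, conditions) in enumerate(train_loader):
    labels, conditions = labels.to('cuda'), conditions.to('cuda')
    if index % 50 == 0:
        # sum the RSS of the main process and the 12 DataLoader worker processes
        # (shared pages get counted more than once, but the trend is what matters)
        rss = process.memory_info().rss
        rss += sum(child.memory_info().rss for child in process.children(recursive=True))
        print(f'batch {index}: total RSS = {rss / 1e9:.2f} GB')
    del labels, conditions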
To narrow this down, I generated some smaller files and watched what happens when I load those. I have one file of ~7 GB and one of ~70 GB, which I load with the same code. Iterating over the 7 GB file takes around 3 s and does not run out of memory. But it uses more RAM than I would expect: on one run the memory usage was almost 8 GB. That is more than the size of the file itself, which suggests to me that the whole file is loaded into RAM. This is exactly what I wanted to avoid, and it would explain why the jobs on the larger files fail.
The 70 GB file again cannot be iterated over, even when I requested more than 70 GB of RAM (it would probably finish at some point, but I did not want to waste even more time and resources watching the used memory increase).
When I instead wrap another range(10) for loop around the loading of the 7 GB file, it works, and it still runs fairly quickly at ~30 s. But the used RAM went up to 36 GB in that particular run. That is of course far larger than the file itself, which should not happen; this is some additional memory leak. I can fix it by running gc.collect() after each pass over the file in the outer loop, but I would prefer the leak not be there at all.
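Concretely, the outer loop I mean looks roughly like this (a sketch, with the gc.collect() workaround at the end of each pass):

import gc

for run in range(10):
    train_set = DataSet(filename, data_directory)
    train_loader = torch.utils.data.DataLoader(
        train_set,
        batch_size=256,
        shuffle=True,
        drop_last=True,
        num_workers=12,
        pin_memory=False,
    )
    for labels, conditions in train_loader:
        pass
    gc.collect()  # without this call the used RAM keeps growing with every pass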
Some remarks:
- When opening a dataset of the HDF5 file with file['dataset'], the data is not loaded yet. Only when I access an element with file['dataset'][index] is the actual data read from disk. (I have also checked this with profile from memory_profiler on the __init__() of the Dataset; see the small check after this list.)
- I am using chunking in my HDF5 files.
- pin_memory=False seems to be the better choice for me (faster, less memory usage).
- Reducing the batch size did not seem to help.
- Removing the normalization steps from __getitem__() does not stop the memory leak.
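For the first remark, this is roughly the check I ran (a sketch; profile comes from memory_profiler, and the dataset names match the ones above):

import h5py
from memory_profiler import profile

@profile
def open_file(path):
    file = h5py.File(path, 'r')
    labels = file['labels']      # only an h5py.Dataset handle, no data is read
    specs = file['spectrogram']  # same here, only metadata is touched
    sample = specs[0, ...]       # this line triggers the first actual read from disk
    return labels, specs, sample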
This seems to be an issue with large files combined with some kind of memory leak in my code that I have not been able to find yet. To me it looks like the code tries to load the entire file at some point. I have also seen some other posts here describing similar problems, but no solution so far. If you can spot my mistake, I would be very happy. An alternative might be to write many small files instead and load them together with one DataLoader, but I could not make that work so far either; a sketch of what I have in mind follows below, and pointers on that approach would be equally appreciated. I have been stuck on this for way too long now, so any help would be much appreciated!
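For reference, this is roughly what I have in mind for the many-small-files variant (a sketch only; MultiFileDataSet is my own guess at how this could look, the files are opened lazily so that each DataLoader worker gets its own handles, and the normalization from above is omitted for brevity):

import bisect
import os

import h5py
import torch


class MultiFileDataSet(torch.utils.data.Dataset):
    def __init__(self, file_names, file_dir):
        self.paths = [os.path.join(file_dir, name) for name in file_names]
        self.files = [None] * len(self.paths)  # opened lazily in __getitem__
        # read only the lengths up front; no handles are kept open here
        self.cumulative = []
        total = 0
        for path in self.paths:
            with h5py.File(path, 'r') as f:
                total += len(f['labels'])
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1]

    def __getitem__(self, idx):
        # translate the global index into (file index, local index)
        file_idx = bisect.bisect_right(self.cumulative, idx)
        local_idx = idx - (self.cumulative[file_idx - 1] if file_idx > 0 else 0)
        if self.files[file_idx] is None:  # first access in this worker process
            self.files[file_idx] = h5py.File(self.paths[file_idx], 'r')
        f = self.files[file_idx]
        label = torch.from_numpy(f['labels'][local_idx])
        spec = torch.from_numpy(f['spectrogram'][local_idx, ...].astype(float))
        return label, torch.unsqueeze(spec, dim=0)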