DataLoader eating RAM

I have a dataset of 9 GB of wav files for music synthesis. To manage batches across different files, I load each file into a custom WavFileDataset, which I then combine into a ConcatDataset to use as the dataset for a DataLoader. Problems begin when I try to sample from the DataLoader: even with batch_size = 1 and sequences of 100 samples, RAM quickly fills up to 21 GB, stays there, and then the notebook session restarts. It looks like the problem is that, while sampling, the DataLoader tries to load the entire dataset into RAM, since everything works fine when I use just one file to create the ConcatDataset. So, is this intended behavior, and if it is, how do I get around it?

Some code:

WavFileDataset:

import numpy as np
import pydub
from torch.utils.data import Dataset


class WavFileDataset(Dataset):
  def __init__(self, file_path, seq_length):
    # Decode the file once just to find out how many samples it contains;
    # the decoded array is not kept around.
    sequence = pydub.AudioSegment.from_file(file_path)
    sequence = sequence.set_channels(1)
    sequence = np.asarray(sequence.get_array_of_samples())
    self.length = len(sequence) - (seq_length + 1)
    self.seq_length = seq_length + 1  # seq_length samples as features and one sample as the label
    self.file_path = file_path

  def __len__(self):
    return self.length

  def __getitem__(self, idx):
    # Note: this re-decodes the entire wav file for every single sample drawn.
    sequence = pydub.AudioSegment.from_file(self.file_path)
    sequence = sequence.set_channels(1)
    sequence = np.asarray(sequence.get_array_of_samples())
    seq = sequence[idx:idx + self.seq_length]
    seq = seq / (1 << 15)  # scale 16-bit PCM to [-1, 1)
    feature = seq[:-1].astype('float32')
    label = seq[-1:].astype('float32')
    return feature, label

Dataset construction code:

data_dir = r"./datasets/rammwav"
save_dir = r'/content/drive/My Drive/checkpoints/rammwav/model.tar'
files = [os.path.join(data_dir,f) for f in os.listdir(data_dir) if os.path.isfile(os.path.join(data_dir, f))]
#files = [files[0]] #Take one file to speed up dataset processing
datasets = []
files_count = len(files)
seq_length = 100
batch_size = 1
print("Processing datasets...")
for i, file in enumerate(files, start=1):
  datasets.append(WavFileDataset(file, seq_length))
  print_inline("Processed {}/{}".format(i, files_count))
print("\n Datasets processed")

model = LSTM(input_size = 1, hidden_layer_size = 1, output_size = 1, batch_size_ = batch_size).cuda()
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

dataset = torch.utils.data.ConcatDataset(datasets)

data_loader = torch.utils.data.DataLoader(dataset,
                                   batch_size=batch_size,
                                   shuffle=True,
                                   num_workers=0,
                                   pin_memory=False,
                                   drop_last=True)

I wouldn’t say “intentional” behavior, but it is expected behavior.

This doesn't really have to do with PyTorch: whenever you load that data, it all ends up in RAM. What you need to do is implement lazy loading for your data.

An easy way of doing this would be to load only one file at a time, train on that data, load the next file, and repeat. However, this may not be ideal depending on your task: your model may become biased toward that ordering (e.g. training a model on only dog pictures and then only cat pictures).
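
Something along these lines, for example; this is just a rough sketch reusing WavFileDataset, files, seq_length and batch_size from the post above, while num_epochs and train_step are placeholders:

import torch

for epoch in range(num_epochs):               # placeholder epoch count
    for file in files:                        # only one file's samples are touched at a time
        loader = torch.utils.data.DataLoader(WavFileDataset(file, seq_length),
                                             batch_size=batch_size,
                                             shuffle=True,
                                             drop_last=True)
        for feature, label in loader:
            train_step(feature, label)        # placeholder for the usual forward/backward/step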

The better solution may be to save all your data into a format that allows lazy loading. hdf5 may be a good contender.
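
Roughly, a lazily-read dataset on top of an HDF5 file could look like this. It is only a sketch and assumes all samples have already been written into a single 1-D int16 dataset named "samples" (an arbitrary choice):

import h5py
from torch.utils.data import Dataset

class H5WavDataset(Dataset):
    """Reads one window per __getitem__; h5py only pulls the requested slice from disk."""
    def __init__(self, h5_path, seq_length, dataset_name="samples"):  # dataset name is an assumption
        self.h5_path = h5_path
        self.seq_length = seq_length + 1   # seq_length features + 1 label, as in the original code
        self.dataset_name = dataset_name
        with h5py.File(h5_path, "r") as f:
            self.length = len(f[dataset_name]) - self.seq_length
        self.file = None                   # opened lazily, so the handle is created per process

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.h5_path, "r")
        seq = self.file[self.dataset_name][idx:idx + self.seq_length] / (1 << 15)
        return seq[:-1].astype("float32"), seq[-1:].astype("float32")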

Hmm, I'll try the h5 format.

I used ConcatDataset because I thought it would manage loading different pieces of data across different files without having to load them all into memory at once, but it looks like that is exactly what it is doing.

There is a problem with converting wav to the h5 format: it either eats up all 25 GB of RAM available in Google Colab if I create the data array first and then save it to the dataset, or it takes a very long time if I append to the dataset sample by sample. Approximately a month, to be exact: ~250 minutes per file * 192 files = 48,000 minutes for the entire dataset, or about 33.3 days.

Should I just implement a lazy-loading dataloader myself?

The main goal would be to have your data in a format that supports lazy loading. I don’t know enough about audio files to give a great answer on what the solution would be.

If you were to use the h5 format, I would do something along the lines of creating a new member of the root group per file: /file1, /file2, ..., or allocating enough room in the file up front and having each file write to its designated portion. This might take a while, but it should be doable (and it would be a one-time thing).

I do want to say that HDF5 may not be the answer; you're going to want to look for any method that allows you to load a subset of a wav file without loading everything. Something like that should be ideal for your task.
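
For reference, a sketch of that per-file layout, writing each file's samples as one whole array instead of appending sample by sample (the output path is a placeholder; files is the list from the original post):

import h5py
import numpy as np
import pydub

with h5py.File("dataset.h5", "w") as out:                 # output path is a placeholder
    for i, file in enumerate(files):                      # `files` as built in the original post
        audio = pydub.AudioSegment.from_file(file).set_channels(1)
        samples = np.asarray(audio.get_array_of_samples(), dtype=np.int16)
        out.create_dataset("file{}".format(i), data=samples)  # one whole-array write per file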

So: I tested the h5 dataset, and the DataLoader still crashed my session.

It turns out dataset shuffling needs RAM, and in my case it needs a lot of it.

With shuffling turned off, next(iter(dataloader)) successfully yields new batches of data.

I still have to figure out how to do shuffling.

But at least it is working now.

It's not about shuffling. You can reduce the amount of RAM by reducing the number of workers. Each worker loads a full batch, so if your batch size is large and you have a lot of workers, the RAM needed grows linearly with the worker count.

There is a ramp-up period before the DataLoader reaches its steady-state workflow, so you cannot really test it just by calling next(iter(dataloader)).

You can also use a NumPy memory map so that the big files you mentioned are loaded on demand instead of all at once. SoundFile allows on-demand reads from wav files too.
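
For example, a sketch of an on-demand read with soundfile, assuming mono wav files (read_window is just an illustrative helper):

import soundfile as sf

def read_window(path, idx, seq_length):
    # Reads only seq_length + 1 frames starting at sample idx, not the whole file.
    # soundfile scales 16-bit PCM to [-1.0, 1.0) when asked for dtype='float32'.
    window, _ = sf.read(path, frames=seq_length + 1, start=idx, dtype='float32')
    return window[:-1], window[-1:]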

My batch size is 1 and num_workers is 0; I think the problem is in how PyTorch implements shuffling:

As I understand it, it takes the list of indices from 0 to len(dataset), randomly permutes it, turns it into a Python list, and then iterates over that to sample from the dataset.

I think the fact that len() of my dataset is 2,548,407,963 is creating problems: maybe this index list alone occupies all the available RAM?

EDIT:
And since my dataset is normalized to float32 (-1 to 1), an index list of int32 with length len(dataset) takes just as much memory as the dataset itself.

And if the indices are int64, the shuffling list by itself takes up more space than the dataset.
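
Back-of-the-envelope, that checks out:

n = 2_548_407_963        # len(dataset) quoted above
print(n * 8 / 1e9)       # ~20.4 GB just for torch.randperm(n), which is int64
print(n * 4 / 1e9)       # ~10.2 GB for the same count of int32 indices or float32 samples

The ~20 GB int64 permutation alone roughly matches the 21 GB observed, and converting it to a Python list only adds to that.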

Yes, it does work that way.
I have never worked with such a big dataset. The problem is ensuring there is no repetition.

torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None)

You can code your own batch_sampler

class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.

    Arguments:
        data_source (Dataset): dataset to sample from
    """

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)



class RandomSampler(Sampler):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    If with replacement, then user can specify :attr:`num_samples` to draw.

    Arguments:
        data_source (Dataset): dataset to sample from
        replacement (bool): samples are drawn with replacement if ``True``, default=``False``
        num_samples (int): number of samples to draw, default=`len(dataset)`. This argument
            is supposed to be specified only when `replacement` is ``True``.
    """

    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self._num_samples = num_samples

        if not isinstance(self.replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(self.replacement))

        if self._num_samples is not None and not replacement:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(self.num_samples))

    @property
    def num_samples(self):
        # dataset size might change at runtime
        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

    def __len__(self):
        return self.num_samples

If you look at the sequential sampler, it uses an iterator, so it doesn't require extra RAM. Sampling without replacement, however, requires storing a full permutation of indices, which (given the size of the dataset) can eat the RAM.
You can do the shuffling offline and then read the next indices on demand using your own batch sampler. The only problem is that you would have static batches across epochs. Alternatively, you can simply sample a random index from a uniform distribution… but that wouldn't ensure you use every sample in the dataset.
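
A sketch of that last idea as a drop-in sampler (with replacement, so there is no guarantee that every sample is used once per epoch):

import torch
from torch.utils.data import Sampler

class StreamingRandomSampler(Sampler):
    """Yields uniformly random indices one at a time instead of materializing
    a full permutation, so memory use does not grow with len(data_source)."""
    def __init__(self, data_source, num_samples=None):
        self.data_source = data_source
        self.num_samples = num_samples or len(data_source)

    def __iter__(self):
        for _ in range(self.num_samples):
            yield torch.randint(len(self.data_source), (1,)).item()

    def __len__(self):
        return self.num_samples

It would go into DataLoader(dataset, batch_size=batch_size, sampler=StreamingRandomSampler(dataset)) in place of shuffle=True.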

Well, my current "At least it works™" solution is to just disable shuffling and hope for the best.

My dataset has such a big length because I built it so I could sample starting at any point, so len(dataset) = len(h5dataset) - seq_length, and since I use 44100 Hz wav files, that creates a dataset of crazy length.

One solution might be to sample only at idx*seq_length points. It limits data variety a little bit, but it might enable shuffling.

I'll try it out and write back with the results.

Alright, now I am sampling at idx*seq_length points, and shuffling works nicely. The lesson: a giant dataset will be giant at least somewhere. If not in the RAM used to store it, then in the RAM used to shuffle it, or in the time you will spend debugging a model trained on an unshuffled dataset.
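
For completeness, a sketch of how that changes the HDF5 dataset sketched earlier in the thread (non-overlapping windows, so the index space shrinks by a factor of seq_length):

class WindowedH5Dataset(H5WavDataset):   # H5WavDataset is the earlier sketch class
    def __len__(self):
        window = self.seq_length - 1     # the original seq_length; self.seq_length includes the label
        return self.length // window

    def __getitem__(self, idx):
        window = self.seq_length - 1
        return super().__getitem__(idx * window)

With seq_length = 100, len(dataset) drops from ~2.5 billion to ~25 million, and torch.randperm over that fits comfortably in RAM (roughly 200 MB of int64).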