How to load complete dataset to RAM?

My network training is particularly slow right now, but I have enough memory to read the complete training set into RAM at once, which should improve training speed. How can I achieve this?


If you are using a PyTorch Dataset/DataLoader, load all the data in the dataset's `__init__` method. If you want it on the GPU, move it to the GPU inside `__init__` as well.
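A minimal sketch of that idea, assuming the data here is random placeholder tensors (replace them with your actual loading code, e.g. reading files from disk):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class InMemoryDataset(Dataset):
    def __init__(self, num_samples=100, device="cpu"):
        # Load everything up front and keep it in RAM (or on the GPU
        # if device="cuda"). The random tensors below stand in for
        # your real loading logic.
        self.data = torch.randn(num_samples, 3, 32, 32).to(device)
        self.targets = torch.randint(0, 10, (num_samples,)).to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Pure indexing -- no disk I/O at training time.
        return self.data[index], self.targets[index]

dataset = InMemoryDataset()
loader = DataLoader(dataset, batch_size=16, shuffle=True)
x, y = next(iter(loader))
print(x.shape)  # torch.Size([16, 3, 32, 32])
```

Note that if the tensors live on the GPU, you should keep `num_workers=0`, since CUDA tensors don't interact well with worker processes.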


How do I read all the data in the `__init__` method?

Is it possible to do this: in the dataset's `__init__`, build a list and use a for loop to append the entire dataset to it, then in `__getitem__` retrieve samples from the list by index? Would this load the data into RAM?

It depends. That's the straightforward pipeline, but there are two options. The typical one is to build a list of file paths (or memory-mapped tensors) that are loaded in `__getitem__`; however, to hold the data in RAM, you need to build a list of the loaded files instead.
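The two options can be sketched side by side. `load_sample` and the paths are placeholders for your own I/O (e.g. `torch.load` or an image decode):

```python
import torch
from torch.utils.data import Dataset

def load_sample(path):
    # Placeholder for slow disk I/O, e.g. torch.load(path).
    return torch.zeros(3, 32, 32)

class LazyDataset(Dataset):
    """Keeps only file paths in RAM; loads on every access."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        return load_sample(self.paths[index])  # disk I/O each call

class EagerDataset(Dataset):
    """Loops over all paths once in __init__ and keeps the loaded
    tensors in a list, so __getitem__ is a pure RAM lookup."""
    def __init__(self, paths):
        self.samples = [load_sample(p) for p in paths]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]
```

The eager version is exactly the list-in-`__init__` approach described above; it trades startup time and RAM for fast epochs.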


Since my dataset is fairly large, loading it all into RAM takes a lot of time (though the RAM is large enough). I'm wondering if each sample could be loaded into RAM the first time it is used.

That is, I want to create an empty list in `__init__` to hold the loaded data, and have samples written into that list in `__getitem__`. Since `__getitem__` is called by a DataLoader with multiple workers, the list could be filled very quickly.

However, this doesn't work. I think it's because each worker process has its own copy of the dataset, so the lists filled by the workers are discarded and the cache is empty again after each epoch.

Do you have any suggestions for this (i.e., speeding up loading the data into RAM)? Maybe creating the data list in shared memory?


You could check this post, which uses shared arrays as a "caching" mechanism.
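A hedged sketch of that shared-array idea: tensors placed in shared memory via `share_memory_()` in `__init__` keep the same underlying storage across DataLoader worker processes (with the default fork start method on Linux), so a sample loaded once by any worker stays cached for later epochs. `_load` is a placeholder for your slow disk I/O:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CachedDataset(Dataset):
    def __init__(self, num_samples=8, shape=(3, 4, 4)):
        # Shared cache plus a per-sample "already loaded" flag.
        # share_memory_() moves the storage into shared memory,
        # so writes by worker processes are visible everywhere.
        self.cache = torch.zeros(num_samples, *shape).share_memory_()
        self.cached = torch.zeros(num_samples, dtype=torch.bool).share_memory_()

    def _load(self, index):
        # Placeholder for slow disk I/O.
        return torch.full((3, 4, 4), float(index))

    def __len__(self):
        return len(self.cache)

    def __getitem__(self, index):
        if not self.cached[index]:
            # First use: load from disk and fill the shared cache.
            self.cache[index] = self._load(index)
            self.cached[index] = True
        return self.cache[index]

dataset = CachedDataset()
loader = DataLoader(dataset, batch_size=4, num_workers=2)
for _ in range(2):  # the second epoch reads entirely from the cache
    for batch in loader:
        pass
```

This trades a slow first epoch for fast subsequent ones, without having to preload everything in `__init__`.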
