Load new training files after previous one is exhausted

Hello everyone,
I know that something like this was asked before, but I never saw any details (maybe I missed it; feel free to point me to the right thread).

I have 60 data sets with hourly data, split by year. In total they are around 60 GB, which is too large to load into memory at once.
I should be able to load 4 of them at a time.
My current idea is to change the dataset class so that it loads new files once the current set of training files is exhausted.
So basically: load 4 years of data → train on the daily data → load the next 4 years → continue training → … → finish the epoch.
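
Very roughly, I imagine something like the sketch below, just with the loading moved into the dataset class (untested; year_files and the assumption that each .pt file holds a (data, labels) tuple of tensors are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Untested sketch of the idea as a plain loop; year_files and the assumption
# that each .pt file holds a (data, labels) tuple of tensors are placeholders.
year_files = [f"year_{y}.pt" for y in range(1960, 2020)]  # 60 yearly files
chunk_size = 4   # how many yearly files fit into memory at once
num_epochs = 10  # placeholder

for epoch in range(num_epochs):
    for i in range(0, len(year_files), chunk_size):
        # Load 4 years worth of data into memory
        data, labels = zip(*(torch.load(f) for f in year_files[i:i + chunk_size]))
        dataset = TensorDataset(torch.cat(data), torch.cat(labels))
        loader = DataLoader(dataset, batch_size=32, shuffle=True)
        for x, y in loader:
            pass  # training step on the daily samples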

My questions are:

  1. I assume that saving the data sets in PyTorch format is the fastest option for training, so that I don't have to convert them to PyTorch tensors every time?
  2. How/where does the dataloader keep track of which indices it has already used and how many are still available?
  3. My current idea would be to somehow tell the dataset class, once all indices have been used, to load the next data sets and reset the internal index tracker of the dataloader. Does this idea work at all, and how would I pull it off? Or should I rather create some sort of custom dataloader?

My current dataset class is very simple:

import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, data, labels):
        # Everything is kept in memory
        self.data = data
        self.targets = torch.LongTensor(labels)

    def __getitem__(self, index):
        x = self.data[index]
        y = self.targets[index]

        return x, y

    def __len__(self):
        return len(self.data)

I think a better approach would be to define multiple MyDataset instances and use torch.utils.data.ConcatDataset (link) to concatenate them; the dataloader will then only put one batch at a time into memory.
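
Roughly the wiring would look like the sketch below. LazyYearDataset is just a stand-in for a per-year dataset that loads its samples lazily (here it only returns dummy tensors), so treat it as an illustration rather than working code for your data:

import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class LazyYearDataset(Dataset):
    """Stand-in for a per-year dataset that loads samples lazily in __getitem__."""

    def __init__(self, path, samples_per_year=365 * 24):
        self.path = path  # the file is not read here
        self.samples_per_year = samples_per_year

    def __getitem__(self, index):
        # A real implementation would read only sample `index` from self.path.
        x = torch.zeros(10)  # dummy sample
        y = torch.tensor(0)  # dummy label
        return x, y

    def __len__(self):
        return self.samples_per_year


# One dataset per yearly file (file names are placeholders), chained together.
full_dataset = ConcatDataset([LazyYearDataset(f"year_{y}.pt") for y in range(1960, 2020)])

# The DataLoader pulls one batch at a time from the combined dataset.
loader = DataLoader(full_dataset, batch_size=32, shuffle=False)
for x, y in loader:
    pass  # training step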

Thank you for the suggestion. I have several questions about your approach, because I am new to PyTorch.

  1. Does ConcatDataset actually concat the datasets (load them all into memory and create one large data set), or does it basically just link them (tell the data loader to treat all the dataset classes as one data set)? As I said: I can't read all the data into memory at the same time due to the size.

  2. If I concat the datasets, does it then only pick from one dataset until it has used everything and then move on to the next?

  3. If I create multiple MyDataset instances with the different data, do I have to somehow prevent them from loading the data at the point of creation?

Sorry for all the questions, I am just worried about the available amount of RAM and am trying to understand how these systems work.

So, torch's datasets should not load everything into memory. The dataloader's workers call the __getitem__ method of the dataset and are responsible for putting the data into memory. (To confirm this, make sure that you are not loading all the data in the __init__ function of the Dataset; it should only read a metadata file that contains the location of each actual data point, which you then load in the __getitem__ method. An example is below.)

  1. ConcatDataset, therefore, doesn't put anything into memory but just links the datasets together.
  2. If you turn off shuffle in the DataLoader, then this is the expected behaviour.
  3. As I said before, this works as long as you don't load the data in the __init__ function and only keep metadata pointing to the data point at a certain index. A common approach is to have a CSV file with the file names, which you load in __init__ instead, and then load the actual file in __getitem__. My idea would be to do something like:
import pandas as pd
import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, metadata_file_loc):
        # Only the metadata (file locations and targets) is kept in memory
        self.data = pd.read_csv(metadata_file_loc)

    def __getitem__(self, index):
        # The actual sample is read from disk only when it is requested
        x = torch.load(self.data['location'][index])
        y = self.data['target'][index]
        return x, y

    def __len__(self):
        return len(self.data)

So you do not actually load the entire data set but rather iterate over it, which is how it should be with large datasets.

Where metadata.csv should have a format similar to

location, target
1.png, 0
2.png, 0
3.png, 1
...
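
For completeness, training would then just iterate over a DataLoader built on top of this dataset; a minimal sketch (assuming the location column points to files that torch.load can read):

from torch.utils.data import DataLoader

dataset = MyDataset("metadata.csv")
# shuffle=False keeps the order of the metadata file; num_workers > 0 lets the
# workers do the per-sample torch.load calls in the background.
loader = DataLoader(dataset, batch_size=32, shuffle=False, num_workers=4)

for x, y in loader:
    pass  # training step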

Ah that explains a couple of things and is an interesting approach.
The only problem is that this approach might not work that well for me.
My files each contain one large numpy array (the size is something like 365x24x173x373). I could follow your approach by creating a new numpy file (or pytorch file) for each day, but this would result in around 22,000 separate files and I am not sure about the performance implications.
I just saw that numpy.memmap is a thing. Maybe I could get it working in some way, but it is a function rather than a file that would get loaded, so I am not sure yet how it would fit in.
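
Something like the sketch below is what I have in mind (untested; it assumes one .npy file per year with shape 365x24x173x373 and a placeholder label per day):

import numpy as np
import torch
from torch.utils.data import Dataset


class MemmapYearDataset(Dataset):
    """Hypothetical per-year dataset backed by a memory-mapped .npy file."""

    def __init__(self, npy_path, labels):
        # mmap_mode='r' maps the file instead of reading it into RAM;
        # only the slices accessed in __getitem__ are actually read from disk.
        self.array = np.load(npy_path, mmap_mode='r')  # shape (365, 24, 173, 373)
        self.targets = torch.LongTensor(labels)        # one label per day (placeholder)

    def __getitem__(self, index):
        # Copy the day's slice out of the memmap before turning it into a tensor
        x = torch.from_numpy(np.array(self.array[index]))
        y = self.targets[index]
        return x, y

    def __len__(self):
        return self.array.shape[0]

All 60 of these per-year datasets could then be chained with ConcatDataset as suggested above, without creating 22,000 separate files.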