Reuse DataLoader across epochs without reloading from disk

My training code looks like the following:

for epoch in range(n_epoch):
    for i, batch in enumerate(dataset):
        train(batch)

dataset here is created using DataLoader, and I have only one data file. As far as I can tell, the DataLoader loads everything in this file into memory and then extracts batches from it. I want to train on this data file for multiple epochs, but I noticed that every time a new epoch begins, the file is loaded from disk again, and that loading takes a lot of time. How can I load the file only once, during the first epoch, and then reuse the in-memory data for the following epochs instead of fetching it from disk repeatedly? Thanks a lot!

I assume you are loading the data in the Dataset.__init__ method?
If that’s the case, you could preload the data before creating the Dataset and pass it to its __init__ method:

import torch
from torch.utils.data import DataLoader

data = torch.load(...)        # load the file from disk exactly once
dataset = MyDataset(data)     # hand the in-memory data to __init__
loader = DataLoader(dataset)

This should avoid reloading the data. One caveat: Dataset.__init__ only runs when the Dataset object is constructed, so make sure you create the Dataset and DataLoader once, outside of the epoch loop. If they are recreated inside the loop, any heavy loading in __init__ will run again at every epoch.
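Since MyDataset isn't shown in the thread, here is a minimal, self-contained sketch of the pattern I mean, assuming the file stores a dict with "inputs" and "targets" tensors (the file name, keys, and hyperparameters below are illustrative, not from your post):

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """Serves samples from tensors that are already in memory."""
    def __init__(self, data):
        # `data` was loaded once, outside this class; no disk access here.
        self.inputs = data["inputs"]
        self.targets = data["targets"]

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Pure in-memory indexing; called once per sample.
        return self.inputs[idx], self.targets[idx]

data = torch.load("train_data.pt")  # hypothetical file; read once, up front
dataset = MyDataset(data)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(10):  # iterating the loader again does not touch the disk
    for i, (inputs, targets) in enumerate(loader):
        pass  # your train() step goes here

As a side note, if you ever set num_workers > 0, the worker processes are restarted at the beginning of each epoch and each one receives a copy of the dataset; in newer PyTorch versions, persistent_workers=True keeps them alive between epochs.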

Hi @ptrblck, thank you for your reply! The training uses the same data file and trains on it for many epochs. How can I avoid reloading the file at every new epoch and instead reuse the data loaded in the previous epoch? Thanks!

Is my suggestion not working for you?
If it isn't, could you tell me what goes wrong when you try it?