Large input files for each channel

Hi all,
I’m trying to train a neural network with multiple channels as inputs. Each channel is approximately 6 GB. Training works when I only use 5 channels, but it fails with 8 channels because it hits the memory limit. The code for my dataset is the following:

import numpy as np
import torch
import xarray as xa
from tqdm import tqdm

n_features = len(forcings)
# Open the first channel once just to get the array shape
f_data = xa.open_dataarray(path + forcings[0] + '.nc')
# np.zeros allocates float64 by default, so this buffer is twice the
# size of the float32 data it will eventually hold
self.x_d = np.zeros((
    f_data.shape[0], n_features, f_data.shape[1], f_data.shape[2]))
for i, forcing in enumerate(tqdm(forcings, desc='Forcings')):
    # Load the full channel into memory and replace NaNs with zeros
    f_data = xa.open_dataarray(path + forcing + '.nc').data
    print(forcing, 'loaded')
    f_data[np.isnan(f_data)] = 0
    self.x_d[:, i] = f_data
    del f_data
# self.y_d (the targets) is built elsewhere in __init__
self.x_d = torch.from_numpy(self.x_d.astype('float32'))
self.y_d = torch.from_numpy(self.y_d.astype('float32'))

And __getitem__ is simple:

def __getitem__(self, index):
    return self.x_d[index], self.x_s, self.y_d[index]

Is it possible to do ‘lazy’ loading of the different channels? And is there any way to speed up the I/O? For the first four channels it takes about 30 s to load one channel, but the time increases to 15 minutes per channel for the rest.

Thanks!

It is possible to lazily load your data using torch.utils.data.IterableDataset; you can find the documentation here. You should be able to write an __iter__ method that opens all the files and yields one tuple of (x_d, x_s, y_d) at a time.
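
A minimal sketch could look like the following (the class name, the constructor arguments, and how x_s and y_d are obtained are just assumptions about your setup):

import numpy as np
import torch
import xarray as xa
from torch.utils.data import IterableDataset

class ForcingIterableDataset(IterableDataset):
    """Streams one timestep at a time instead of loading everything up front."""

    def __init__(self, path, forcings, x_s, y_d):
        self.path = path
        self.forcings = forcings
        self.x_s = x_s          # static inputs, assumed to fit in memory
        self.y_d = y_d          # targets, assumed to fit in memory

    def __iter__(self):
        # open_dataarray only reads metadata; values are loaded on demand
        channels = [xa.open_dataarray(self.path + f + '.nc')
                    for f in self.forcings]
        n_time = channels[0].shape[0]
        for t in range(n_time):
            # Read a single timestep from each channel and stack them
            x_d = np.stack([np.nan_to_num(c[t].values) for c in channels])
            yield (torch.from_numpy(x_d.astype('float32')),
                   self.x_s,
                   self.y_d[t])

One caveat: if you pass this to a DataLoader with num_workers > 0, each worker gets its own copy of the iterator and would yield duplicate samples, so you would need to shard the time indices per worker as described in the IterableDataset docs.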

You can also use IterDataPipe from the TorchData library, but do note that the project is still in the prototype stage (will be in beta soon).
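
If you go that route, a custom datapipe can carry essentially the same logic as the IterableDataset sketch above; a rough sketch, with placeholder names again:

import numpy as np
import torch
import xarray as xa
from torchdata.datapipes.iter import IterDataPipe

class ForcingPipe(IterDataPipe):
    # Same streaming idea as the IterableDataset sketch above
    def __init__(self, path, forcings, x_s, y_d):
        self.path, self.forcings = path, forcings
        self.x_s, self.y_d = x_s, y_d

    def __iter__(self):
        channels = [xa.open_dataarray(self.path + f + '.nc')
                    for f in self.forcings]
        for t in range(channels[0].shape[0]):
            x_d = np.stack([np.nan_to_num(c[t].values) for c in channels])
            yield (torch.from_numpy(x_d.astype('float32')),
                   self.x_s, self.y_d[t])

# Datapipes can then be composed with the usual operators, e.g.
# pipe = ForcingPipe(path, forcings, x_s, y_d).shuffle(buffer_size=64)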

Thanks! I have checked IterableDataset and it works for me. I have put the I/O in the __iter__ method. I assume we cannot avoid this I/O slowdown during training, right?