Large input files for each channel

Hi all,
I’m trying to train a neural network with multiple channels as inputs. Each channel is approximately 6 GB. Training works when I only use 5 channels, but it fails with 8 channels because it hits the memory limit. The code for my dataset is the following:

import numpy as np
import torch
import xarray as xa
from tqdm import tqdm

n_features = len(forcings)
# Open the first channel once just to get the array shape
f_data = xa.open_dataarray(path + forcings[0] + '.nc')
# np.zeros allocates float64 by default, so this buffer is twice the
# size of the float32 data it will eventually hold
self.x_d = np.zeros((
    f_data.shape[0], n_features, f_data.shape[1], f_data.shape[2]))
for i, forcing in enumerate(tqdm(forcings, desc='Forcings')):
    # Load the full channel into memory and replace NaNs with zeros
    f_data = xa.open_dataarray(path + forcing + '.nc').data
    print(forcing, 'loaded')
    f_data[np.isnan(f_data)] = 0
    self.x_d[:, i] = f_data
    del f_data
# self.y_d (the targets) is built elsewhere in __init__
self.x_d = torch.from_numpy(self.x_d.astype('float32'))
self.y_d = torch.from_numpy(self.y_d.astype('float32'))

And __getitem__ is simple:

def __getitem__(self, index):
    return self.x_d[index], self.x_s, self.y_d[index]

Is it possible to do ‘lazy’ loading of the different channels? And is there any way to speed up the I/O? For the first four channels it takes about 30 s to load one channel, but the time increases to 15 minutes per channel for the rest.

Thanks!

It is possible to lazily load your data using torch.utils.data.IterableDataset; you can find the documentation here. You should be able to write an __iter__ method that opens all the files and yields one tuple of (x_d, x_s, y_d) at a time.
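
A minimal sketch could look like the following (the class name, the constructor arguments, and how x_s and y_d are obtained are just assumptions about your setup):

import numpy as np
import torch
import xarray as xa
from torch.utils.data import IterableDataset

class ForcingIterableDataset(IterableDataset):
    """Streams one timestep at a time instead of loading everything up front."""

    def __init__(self, path, forcings, x_s, y_d):
        self.path = path
        self.forcings = forcings
        self.x_s = x_s          # static inputs, assumed to fit in memory
        self.y_d = y_d          # targets, assumed to fit in memory

    def __iter__(self):
        # open_dataarray only reads metadata; values are loaded on demand
        channels = [xa.open_dataarray(self.path + f + '.nc')
                    for f in self.forcings]
        n_time = channels[0].shape[0]
        for t in range(n_time):
            # Read a single timestep from each channel and stack them
            x_d = np.stack([np.nan_to_num(c[t].values) for c in channels])
            yield (torch.from_numpy(x_d.astype('float32')),
                   self.x_s,
                   self.y_d[t])

One caveat: if you pass this to a DataLoader with num_workers > 0, each worker gets its own copy of the iterator and would yield duplicate samples, so you would need to shard the time indices per worker as described in the IterableDataset docs.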

You can also use IterDataPipe from the TorchData library, but do note that the project is still in the prototype stage (will be in beta soon).
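
If you go that route, a custom datapipe can carry essentially the same logic as the IterableDataset sketch above; a rough sketch, with placeholder names again:

import numpy as np
import torch
import xarray as xa
from torchdata.datapipes.iter import IterDataPipe

class ForcingPipe(IterDataPipe):
    # Same streaming idea as the IterableDataset sketch above
    def __init__(self, path, forcings, x_s, y_d):
        self.path, self.forcings = path, forcings
        self.x_s, self.y_d = x_s, y_d

    def __iter__(self):
        channels = [xa.open_dataarray(self.path + f + '.nc')
                    for f in self.forcings]
        for t in range(channels[0].shape[0]):
            x_d = np.stack([np.nan_to_num(c[t].values) for c in channels])
            yield (torch.from_numpy(x_d.astype('float32')),
                   self.x_s, self.y_d[t])

# Datapipes can then be composed with the usual operators, e.g.
# pipe = ForcingPipe(path, forcings, x_s, y_d).shuffle(buffer_size=64)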

Thanks! I have checked IterableDataset and it works for me. I have put the I/O in the __iter__ method. I assume we cannot avoid this I/O slowdown during training, right?