Lazy loading in ConcatDataset

Hello, I have around 400 GB of image data stored in n .nc files (I'm working with xarray). I instantiate n Dataset objects, one per file, and combine them with ConcatDataset. I do not want to load all of them into memory at once, so I lazily load each file inside __getitem__ using the logic below:

if not self.isDataLoaded:
    # read this file's contents the first time an item is requested
    self.data = xr.load_dataset(self.data_file, engine="h5netcdf")
    self.isDataLoaded = True
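
For reference, the full setup looks roughly like this (heavily simplified; the class name, the "images" variable, the sample counts, and the file paths are just placeholders):

import xarray as xr
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class PerFileDataset(Dataset):
    # One instance per .nc file; the file is only read the first time
    # __getitem__ is called on this instance.
    def __init__(self, data_file, num_samples):
        self.data_file = data_file
        self.num_samples = num_samples
        self.data = None
        self.isDataLoaded = False

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not self.isDataLoaded:
            self.data = xr.load_dataset(self.data_file, engine="h5netcdf")
            self.isDataLoaded = True
        # "images" is a placeholder variable name
        return torch.from_numpy(self.data["images"][idx].values)

files = ["chunk_000.nc", "chunk_001.nc"]  # ... n files in total
full_dataset = ConcatDataset([PerFileDataset(f, 1000) for f in files])
loader = DataLoader(full_dataset, batch_size=32, shuffle=False)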

With shuffle=False, my expectation was that only one file would be loaded and iterated over before moving on to the next one. However, all the files are loaded into memory at the beginning of the epoch, which leads to OOM issues. I would understand if memory usage increased gradually over the epoch, because I still have not figured out how to detect the end of iteration for a single dataset and free its loaded data. Could anyone spot an issue here? Thanks!

Hi! Hope you are doing well.

Regarding your issue: it happens because the documentation for xr.load_dataset says:

“Open, load into memory, and close a Dataset from a file or file-like object.”

Unless you have more than 400 GB of RAM, this will not work. Instead, try xr.open_dataset, which opens the file lazily; passing chunks gives you dask-backed arrays (the dask package is required for that).
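
A minimal sketch of what that would look like inside your __getitem__ (the "images" and "sample" names are placeholders for whatever your files actually contain):

import xarray as xr

# Lazy variant of the loading logic: open_dataset does NOT read the arrays
# into RAM. chunks={} wraps each variable in a dask array using the engine's
# preferred (on-disk) chunk sizes.
if not self.isDataLoaded:
    self.data = xr.open_dataset(self.data_file, engine="h5netcdf", chunks={})
    self.isDataLoaded = True

# Only the slice you index is actually read from disk:
sample = self.data["images"].isel(sample=idx).values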

If you are trying to load multiple .nc files, use xr.open_mfdataset and specify the dimension (or coordinates) along which the files should be concatenated.
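
Something along these lines (the glob pattern, the combine/concat_dim values, and the dimension name are illustrative and depend on how your files are laid out):

import xarray as xr

# Opens all files lazily as one concatenated dataset; requires dask.
# Nothing is read into memory until you index or call .compute()/.load().
ds = xr.open_mfdataset(
    "data/*.nc",           # glob matching your n files
    engine="h5netcdf",
    combine="nested",      # or "by_coords" if your coordinate values line up
    concat_dim="sample",   # dimension the files should be stacked along
    chunks={},             # dask-backed lazy arrays
)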

Note: this is not related to the question, it just happened to me. If you install dask alongside xarray in a new environment and xarray does not recognize it, try reinstalling dask after the environment has been created.