I am trying to set up a network that trains on fMRI task data (sequential volumes) alongside a T1 image (single volume). The fMRI data has 1200 volumes for a single subject. What is a smart and efficient way to set up the data loader to avoid memory issues? Help me brainstorm on this.
Thank you in advance,
It depends on your resources. The fastest way is usually NumPy arrays with memory mapping, but of course that is raw, uncompressed data, so you need the disk space to store it. If your data is compressed, you will pay the price of reading and decoding it on every access.
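As a toy sketch of the memory-map idea (the file name and shapes below are made up; a real fMRI run would typically be converted once from NIfTI to a raw `.npy` file, e.g. with nibabel):

```python
import numpy as np

# Create a small raw array on disk to stand in for the converted fMRI.
# A real run might be shaped (1200, 91, 109, 91): one volume per time point.
data = np.arange(5 * 2 * 2 * 2, dtype=np.float32).reshape(5, 2, 2, 2)
np.save("fmri_demo.npy", data)

# mmap_mode="r" maps the file without loading it into RAM.
fmri = np.load("fmri_demo.npy", mmap_mode="r")

# Indexing reads only the requested volume from disk.
volume = np.asarray(fmri[3])   # copy a single volume into memory
print(volume.shape)            # (2, 2, 2)
```

Only the indexed volume is ever materialized in RAM, so the full 4D series never has to fit in memory at once.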
If your data is too big to fit in memory, you can use a Dataset/DataLoader similar to what is used for ImageNet: every time a sample is needed, it is read from disk, preprocessed, then returned. The DataLoader's
num_workers argument allows multiple worker processes to do this loading/preprocessing on the side so it does not become a bottleneck.
In particular, you can implement your own torch.utils.data.Dataset, which requires implementing:
__getitem__, which retrieves one sample from your dataset, and
__len__, which returns the length of your dataset.
Do the loading and any required preprocessing in the
__getitem__ method and return one sample ready to be forwarded through your network.
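A minimal sketch of such a Dataset for this use case, combining it with the memory-map idea above. The file paths and the assumption that the data was pre-converted to `.npy` are mine; real fMRI/T1 files would usually be NIfTI loaded with nibabel instead:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class FMRIDataset(Dataset):
    """Pairs each fMRI volume (time point) with the subject's single T1 volume.

    Hypothetical sketch: assumes both files were converted to raw .npy.
    """

    def __init__(self, fmri_path, t1_path):
        # Memory-map the 4D fMRI so volumes are read lazily, one at a time.
        self.fmri = np.load(fmri_path, mmap_mode="r")
        # The single T1 volume is small enough to keep in RAM.
        self.t1 = torch.from_numpy(np.load(t1_path)).float()

    def __len__(self):
        # One sample per fMRI time point (e.g. 1200 for this subject).
        return self.fmri.shape[0]

    def __getitem__(self, idx):
        # Copy one volume out of the memmap and convert it to a tensor;
        # any preprocessing (normalization, masking, ...) would go here.
        vol = torch.from_numpy(np.asarray(self.fmri[idx])).float()
        return vol, self.t1
```

Each `__getitem__` call touches only one volume on disk, so memory use stays flat regardless of how many time points the run has.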
Then you can use the standard DataLoader with a few workers to make the dataset loading parallel! Note that the best number of workers varies across machines, so find it by experiment.