Create a custom dataloader for 3-d segmented images in .mat

Hi! I am a beginner in PyTorch. I have a folder containing thousands of 3-d segmented images (H x W x D x C) stored as .mat (MATLAB) files. I have searched for similar topics on the PyTorch forum (e.g., this link), but my dataloader still does not work. Specifically, the elapsed time is far too long when I call 'real_batch = next(iter(dataloader))'. My dataloader is written as below. Can anyone offer any ideas? Your time is highly appreciated. Many thanks!

import os
import torch
import torch.utils.data
import scipy.io as spio

# Create the dataset
class customDataset(torch.utils.data.Dataset):
    '''
    Custom dataset for .mat images
    '''
    def __init__(self, image_folder):
        self.image_folder = os.path.abspath(image_folder)
        self.image_list = os.listdir(self.image_folder)

    def __getitem__(self, index):
        # load one H x W x D x C volume from its .mat file
        image_path = self.image_list[index]
        image = spio.loadmat(os.path.join(self.image_folder, image_path))['data']
        return image

    def __len__(self):
        return len(self.image_list)

dataroot = r"C:\dataset"
batch_size = 16
workers = 2
dataset = customDataset(os.path.join(dataroot, 'images'))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         shuffle=True, num_workers=workers)
real_batch = next(iter(dataloader))  # problem: this line hangs for a very long time

The code looks generally alright.
How long does it take to load a batch and where are you storing the data (HDD, SSD, M.2, etc.)?

Hi, ptrblck. Thanks a lot for your reply! I know you always offer helpful ideas.
When calling 'real_batch = next(iter(dataloader))', I waited for about 10 min and then stopped the program :smiling_face_with_tear:. For comparison, I ran the PyTorch DCGAN tutorial, and its dataloader for the CelebA dataset (about 202K images) returned a batch quite fast, in around 1 sec.
In addition, the dataset is stored in my OneDrive folder.
Thank you!

Do you have a local copy of the data, or are you lazily downloading each sample?
In any case, how long does spio.loadmat take for a single sample and for all 16 samples?
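For example, something along these lines (a rough sketch reusing the dataset defined above; 16 matches your batch size) should show where the time goes:

import time

start = time.perf_counter()
sample = dataset[0]  # one spio.loadmat call via __getitem__
print(f'single sample: {time.perf_counter() - start:.3f} s')

start = time.perf_counter()
batch = [dataset[i] for i in range(16)]  # one batch worth of samples
print(f'16 samples: {time.perf_counter() - start:.3f} s')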

Thanks for your kind help, ptrblck. Unfortunately, moving the dataset to a local folder does not change the situation. To narrow down the bug, I ran another test: I created a new dataset containing only 50 images and loaded data from it. Surprisingly, an error pops up within a few seconds (attached below). It seems that the dataloader cannot get data, but why? I have googled the error, but have no idea yet. Could you offer any insight into this error?

---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:761, in _MultiProcessingDataLoaderIter._try_get_data(self, timeout)
    760 try:
--> 761     data = self._data_queue.get(timeout=timeout)
    762     return (True, data)

File ~\anaconda3\envs\pythonProject\lib\multiprocessing\queues.py:108, in Queue.get(self, block, timeout)
    107     if not self._poll(timeout):
--> 108         raise Empty
    109 elif not self._poll():

Empty: 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 real_batch = next(iter(dataloader))

File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:345, in _BaseDataLoaderIter.__next__(self)
    344 def __next__(self):
--> 345     data = self._next_data()
    346     self._num_yielded += 1
    347     if self._dataset_kind == _DatasetKind.Iterable and \
    348             self._IterableDataset_len_called is not None and \
    349             self._num_yielded > self._IterableDataset_len_called:

File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:841, in _MultiProcessingDataLoaderIter._next_data(self)
    838     return self._process_data(data)
    840 assert not self._shutdown and self._tasks_outstanding > 0
--> 841 idx, data = self._get_data()
    842 self._tasks_outstanding -= 1
    844 if self._dataset_kind == _DatasetKind.Iterable:
    845     # Check for _IterableDatasetStopIteration

File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:808, in _MultiProcessingDataLoaderIter._get_data(self)
    804     # In this case, `self._data_queue` is a `queue.Queue`,. But we don't
    805     # need to call `.task_done()` because we don't use `.join()`.
    806 else:
    807     while True:
--> 808         success, data = self._try_get_data()
    809         if success:
    810             return data

File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:774, in _MultiProcessingDataLoaderIter._try_get_data(self, timeout)
    772 if len(failed_workers) > 0:
    773     pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 774     raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    775 if isinstance(e, queue.Empty):
    776     return (False, None)

RuntimeError: DataLoader worker (pid(s) 8268, 18776) exited unexpectedly

Could you set num_workers=0 and rerun the test, please? This should hopefully give a better error message.
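I.e. something like this (the same DataLoader, just without worker processes, so any exception is raised directly in the main process):

dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         shuffle=True, num_workers=0)
real_batch = next(iter(dataloader))  # errors now surface with a full traceback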

ptrblck, your solution is really amazing! It works and the dataset is successfully loaded in a few seconds. Thank you so much! This weird error has been bothering me for days. Btw, could you tell me why 'num_workers=0' helps?

I don’t know why multiple workers fail to load the data, but I would guess that you are running into issues with multiprocessing in loadmat or HDF5. Based on the docs, HDF5 seems to be used internally:

You will need an HDF5 Python library to read MATLAB 7.3 format mat files.
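If your files do turn out to be in the 7.3 format, you could also try loading them directly with h5py to see if it behaves differently. A minimal sketch, assuming the array is stored under the 'data' key as in your dataset:

import h5py
import numpy as np

def load_mat_v73(path, key='data'):
    # works only for MATLAB 7.3 (HDF5-based) .mat files;
    # older formats still need scipy.io.loadmat
    with h5py.File(path, 'r') as f:
        arr = np.array(f[key])
    # MATLAB stores arrays column-major, so the axes come back
    # reversed (C x D x W x H here); transpose restores H x W x D x C
    return arr.transpose()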


Got it. Thank you so much for your kind help! Have a nice day!

@ptrblck Hi, I am new to PyTorch. Can you please have a look at the error? It would be really helpful if I could get some insight into how to solve this issue. Thanks in advance.

I have a custom .mat dataset. The data looks like this:

print(x_train.shape, x_train, y_train.shape, y_train)

((30508, 256),
 array([[ 0.42129787-0.16991965j,  0.36276836-0.26392504j,
          0.26979125-0.33810524j, ..., -0.02828086-0.17962576j,
         -0.13922529-0.11955454j, -0.18388582+0.02993972j],
        [-0.16232005+0.22424035j, -0.01326727+0.28548271j,
          0.16299008+0.25424894j, ..., -0.05188078+0.1124093j ,
          0.03477429+0.12717485j,  0.10394195+0.0970679j ],
        [-0.14510996-0.06126717j, -0.16250012+0.0387307j ,
         -0.09767821+0.1080583j , ...,  0.3880517 -0.02151187j,
          0.28764753-0.23641774j,  0.09376167-0.34648794j],
        ...,
        [-0.15492494-0.05131445j, -0.13771235+0.04141434j,
         -0.08409637+0.10275555j, ...,  0.11677223+0.18459846j,
          0.20143423+0.08588961j,  0.19610319-0.09028822j],
        [-0.09103348-0.05533604j, -0.13293657+0.00201364j,
         -0.05136824+0.10219111j, ...,  0.15189247+0.15919279j,
          0.16143821+0.11385279j,  0.16244496+0.00631943j],
        [-0.20020891-0.08525992j, -0.14457089+0.05950821j,
         -0.0999863 +0.18009997j, ...,  0.05141494-0.10477183j,
          0.00139747-0.14153811j, -0.07764385-0.10034507j]]),
 (30508, 2048),
 array([[1.85357980e-03, 1.74808095e-03, 1.40078473e-03, ...,
         4.68181052e-05, 4.92985346e-05, 5.30191787e-05],
        [2.31476919e-04, 2.12018706e-04, 1.54907589e-04, ...,
         1.33759553e-04, 1.29119840e-04, 1.18369283e-04],
        [2.46307219e-04, 2.45744874e-04, 2.40683767e-04, ...,
         1.82749537e-04, 1.44496678e-04, 1.15648177e-04],
        ...,
        [1.50155607e-03, 1.23499299e-03, 6.21390513e-04, ...,
         4.37290822e-05, 4.56725970e-05, 4.64284083e-05],
        [3.46533912e-04, 3.50528539e-04, 3.49529882e-04, ...,
         7.40785739e-05, 3.70392870e-05, 5.15122217e-05],
        [7.22670712e-03, 6.78763140e-03, 5.49826145e-03, ...,
         1.67431347e-04, 1.65770322e-04, 1.62448272e-04]]))

In your __len__ method you are returning self.m directly instead of its length (len(self.m)).
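I.e. something like this (a sketch, assuming self.m holds your training samples):

def __len__(self):
    return len(self.m)  # number of samples, not the object itself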
