Hi! I am a beginner in PyTorch. I have a folder containing thousands of 3-D segmented images (H x W x D x C) stored as .mat (MATLAB) files. I have searched for similar topics on the PyTorch forum (e.g., this link), but my dataloader still does not work. Specifically, the elapsed time is far too long when I call 'real_batch = next(iter(dataloader))'. My dataloader is written as below. Can anyone offer any ideas? Your time is highly appreciated. Many thanks!
import os
import torch
import torch.utils.data
import scipy.io as spio

# Create the dataset
class customDataset(torch.utils.data.Dataset):
    '''
    Custom dataset for .mat files
    '''
    def __init__(self, image_folder):
        self.image_folder = os.path.abspath(image_folder)
        self.image_list = os.listdir(self.image_folder)

    def __getitem__(self, index):
        image_path = self.image_list[index]
        image = spio.loadmat(os.path.join(self.image_folder, image_path))['data']
        return image

    def __len__(self):
        return len(self.image_list)

dataroot = r"C:\dataset"
batch_size = 16
workers = 2

dataset = customDataset(os.path.join(dataroot, 'images'))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         shuffle=True, num_workers=workers)
real_batch = next(iter(dataloader))  # Wrong: this line does not work
Hi, ptrblck. Thanks a lot for your reply! I know you always offer helpful ideas.
When calling 'real_batch = next(iter(dataloader))', I waited for about 10 minutes and then stopped the program. For comparison, I ran the PyTorch DCGAN tutorial, and its dataloader for the CelebA dataset (about 202K images) fetched a batch quite fast, in around 1 second.
In addition, the dataset is stored in my OneDrive folder.
Thank you!
Do you have a local copy of the data, or are you lazily downloading each sample?
In any case, how long does spio.loadmat take for a single sample and for all 16 samples?
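One way to answer that question is to time spio.loadmat directly, outside the DataLoader. A minimal sketch (the file is synthetic and the array shape is invented; swap in a path to one of the real .mat files):

```python
import os
import tempfile
import time

import numpy as np
import scipy.io as spio

# Create one synthetic .mat file standing in for a real segmented image.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.mat")
spio.savemat(path, {"data": np.random.rand(64, 64, 32, 1)})

# Time a single load, then extrapolate to a batch of 16.
start = time.perf_counter()
image = spio.loadmat(path)["data"]
single = time.perf_counter() - start
print(f"one sample: {single:.4f} s, estimated batch of 16: {16 * single:.4f} s")
print(image.shape)  # (64, 64, 32, 1)
```

If a single load is already slow on a real file, the bottleneck is I/O or decompression, not the DataLoader itself.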
Thanks for your kind help, ptrblck. Unfortunately, moving the dataset to a local folder does not change the situation. To narrow down the bug, I ran another test: I created a new dataset containing only 50 images and loaded data from it. Surprisingly, an error pops up within a few seconds (attached below). It seems the dataloader cannot get the data, but why? I have googled the error but have no idea yet. Could you offer any thoughts on this error?
---------------------------------------------------------------------------
Empty Traceback (most recent call last)
File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:761, in _MultiProcessingDataLoaderIter._try_get_data(self, timeout)
760 try:
--> 761 data = self._data_queue.get(timeout=timeout)
762 return (True, data)
File ~\anaconda3\envs\pythonProject\lib\multiprocessing\queues.py:108, in Queue.get(self, block, timeout)
107 if not self._poll(timeout):
--> 108 raise Empty
109 elif not self._poll():
Empty:
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 real_batch = next(iter(dataloader))
File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:345, in _BaseDataLoaderIter.__next__(self)
344 def __next__(self):
--> 345 data = self._next_data()
346 self._num_yielded += 1
347 if self._dataset_kind == _DatasetKind.Iterable and \
348 self._IterableDataset_len_called is not None and \
349 self._num_yielded > self._IterableDataset_len_called:
File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:841, in _MultiProcessingDataLoaderIter._next_data(self)
838 return self._process_data(data)
840 assert not self._shutdown and self._tasks_outstanding > 0
--> 841 idx, data = self._get_data()
842 self._tasks_outstanding -= 1
844 if self._dataset_kind == _DatasetKind.Iterable:
845 # Check for _IterableDatasetStopIteration
File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:808, in _MultiProcessingDataLoaderIter._get_data(self)
804 # In this case, `self._data_queue` is a `queue.Queue`,. But we don't
805 # need to call `.task_done()` because we don't use `.join()`.
806 else:
807 while True:
--> 808 success, data = self._try_get_data()
809 if success:
810 return data
File ~\anaconda3\envs\pythonProject\lib\site-packages\torch\utils\data\dataloader.py:774, in _MultiProcessingDataLoaderIter._try_get_data(self, timeout)
772 if len(failed_workers) > 0:
773 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 774 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
775 if isinstance(e, queue.Empty):
776 return (False, None)
RuntimeError: DataLoader worker (pid(s) 8268, 18776) exited unexpectedly
ptrblck, your solution is really amazing! It works and the dataset is successfully loaded in a few seconds. Thank you so much! This weird error had been bothering me for days. Btw, could you tell me why 'num_workers = 0' helps?
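For anyone landing here later, the working configuration boils down to passing num_workers=0. A minimal sketch, with a small in-memory dataset standing in for the .mat files (the sample shape is invented):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class InMemoryDataset(Dataset):
    """Stand-in for the .mat dataset: 50 random 'images' already in memory."""
    def __init__(self, n=50, shape=(8, 8, 4, 1)):
        self.samples = [torch.rand(shape) for _ in range(n)]

    def __getitem__(self, index):
        return self.samples[index]

    def __len__(self):
        return len(self.samples)

dataset = InMemoryDataset()
# num_workers=0 loads data in the main process, avoiding the worker
# subprocesses that exited unexpectedly in the traceback above.
dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)
real_batch = next(iter(dataloader))
print(real_batch.shape)  # torch.Size([16, 8, 8, 4, 1])
```

With num_workers=0 there is no multiprocessing at all, so any library that misbehaves when called from a forked/spawned worker never gets the chance to.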
I don’t know why multiple workers fail to load the data, but I would guess you are running into multiprocessing issues in loadmat or HDF5. Based on the docs, HDF5 seems to be used internally:
You will need an HDF5 Python library to read MATLAB 7.3 format mat files.
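Since MATLAB 7.3 .mat files are HDF5 containers, they can also be read directly with h5py, sidestepping loadmat entirely. A minimal sketch (the 'data' key is an assumption, and a plain HDF5 file written with h5py stands in for a real v7.3 file):

```python
import os
import tempfile

import h5py
import numpy as np

# Write an HDF5 file standing in for a v7.3 .mat file with a 'data' variable.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.mat")
arr = np.random.rand(16, 16, 8).astype(np.float32)
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=arr)

# Read it back the way a Dataset.__getitem__ could, instead of spio.loadmat.
with h5py.File(path, "r") as f:
    image = f["data"][()]  # load the full array into memory
print(image.shape)  # (16, 16, 8)
```

Note that MATLAB stores arrays column-major, so on a file actually saved from MATLAB the dimensions read via h5py will appear reversed relative to the MATLAB shape.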