I am developing with torchvision and its data infrastructure, such as Dataset and DataLoader. For the past few months this code ran without issues on Win10.
Yesterday I moved to a fresh Linux installation and set up the whole environment: CUDA, Python 3.9, and the packages from the requirements.txt created from the venv I used on the Win10 workstation. Now the GPU memory (2 GB) is no longer sufficient, even with a batch size of 1 on the dataloaders.
But the strangest things happen with the multiprocessing dataloaders, regardless of whether I use a single worker or several (4). In the code snippet below the timeout is reached almost every time; only occasionally does the call self._data_queue.get(timeout=timeout) return a record:
def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
    # Tries to fetch data from `self._data_queue` once for a given timeout.
    # This can also be used as inner loop of fetching without timeout, with
    # the sender status as the loop condition.
    #
    # This raises a `RuntimeError` if any worker died unexpectedly. This error
    # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
    # (only for non-Windows platforms), or the manual check below on errors
    # and timeouts.
    #
    # Returns a 2-tuple:
    #   (bool: whether successfully get data, any: data if successful else None)
    try:
        data = self._data_queue.get(timeout=timeout)
        return (True, data)
    except Exception as e:
        # At timeout and error, we manually check whether any worker has
        # failed. Note that this is the only mechanism for Windows to detect
        # worker failures.
        failed_workers = []
        for worker_id, w in enumerate(self._workers):
            if self._workers_status[worker_id] and not w.is_alive():
                failed_workers.append(w)
                self._mark_worker_as_unavailable(worker_id)
        if len(failed_workers) > 0:
            pids_str = ', '.join(str(w.pid) for w in failed_workers)
            raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
        if isinstance(e, queue.Empty):
            return (False, None)
        import tempfile
        import errno
        try:
            # Raise an exception if we are this close to the FDs limit.
            # Apparently, trying to open only one file is not a sufficient
            # test.
            # See NOTE [ DataLoader on Linux and open files limit ]
            fds_limit_margin = 10
            fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
        except OSError as e:
            if e.errno == errno.EMFILE:
                raise RuntimeError(
                    "Too many open files. Communication with the"
                    " workers is no longer possible. Please increase the"
                    " limit using `ulimit -n` in the shell or change the"
                    " sharing strategy by calling"
                    " `torch.multiprocessing.set_sharing_strategy('file_system')`"
                    " at the beginning of your code") from None
        raise
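To make clear what the timeout path above actually does, here is a minimal sketch with a plain multiprocessing queue from the standard library, independent of PyTorch: a get(timeout=...) on an empty queue raises queue.Empty, which _try_get_data translates into the (False, None) return, i.e. "no batch arrived within the status-check interval". (The try_get name below is my own, for illustration only.)

```python
import multiprocessing as mp
import queue


def try_get(q, timeout):
    # Mirrors the success/timeout split in _try_get_data:
    # (True, data) if a record arrived, (False, None) on queue.Empty.
    try:
        return (True, q.get(timeout=timeout))
    except queue.Empty:
        return (False, None)


if __name__ == "__main__":
    q = mp.Queue()
    print(try_get(q, timeout=0.1))  # empty queue -> (False, None)
    q.put("batch-0")
    print(try_get(q, timeout=1.0))  # -> (True, 'batch-0')
```

So every miss in my run stalls the loop for the full timeout before the worker status is re-checked, which would explain the multi-second gaps I see between records.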
I started to debug into the details, and what I see is that the Dataset (a Subset) is properly set in the _MultiProcessingDataLoaderIter.
This is how I set up the dataloaders:
# split the dataset in train and test set
train_size = int(len(dataset) * train_val_split)
val_size = len(dataset) - train_size
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

# define training and validation data loaders
data_loader_train = torch.utils.data.DataLoader(
    train_set, batch_size=config['batch_size'], shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)
data_loader_val = torch.utils.data.DataLoader(
    val_set, batch_size=config['batch_size'], shuffle=False, num_workers=1,
    collate_fn=utils.collate_fn)
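To rule out the dataset itself being slow inside the worker processes, I am considering a stdlib-only wrapper around the dataset before handing it to the DataLoader. This is a hypothetical diagnostic sketch (the TimedDataset name and threshold are my own, not part of the code above):

```python
import time


class TimedDataset:
    """Hypothetical wrapper that logs __getitem__ calls slower than a threshold."""

    def __init__(self, inner, threshold=1.0):
        self.inner = inner          # any object supporting len() and indexing
        self.threshold = threshold  # seconds

    def __len__(self):
        return len(self.inner)

    def __getitem__(self, idx):
        t0 = time.perf_counter()
        item = self.inner[idx]
        dt = time.perf_counter() - t0
        if dt > self.threshold:
            # Printed from the worker process, so it shows up per worker.
            print(f"slow __getitem__({idx}): {dt:.2f}s")
        return item
```

Usage would be train_set = TimedDataset(train_set) just before constructing data_loader_train; if no slow calls are logged, the delay must be in the inter-process communication rather than in the dataset.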
I can observe that from the start the timeout is reached three times in a row before the first record is returned from the dataset. Then, after a few records have been returned, I keep getting timeouts again, and this pattern continues, so training is not feasible due to the enormous delays.
Why is the behavior different on Linux? When I set num_workers to 0, the blocking is gone.
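Following the hint in the _try_get_data snippet about the open-files limit (the default tensor sharing strategy on Linux is file_descriptor, which consumes file descriptors for data passed between processes), one thing I plan to check is the per-process fd limit. A diagnostic sketch using the POSIX-only resource module, not a fix:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors;
# DataLoader workers on Linux can exhaust it under the default
# 'file_descriptor' sharing strategy.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-files limit: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit (equivalent to `ulimit -n` in the shell).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

The alternative mentioned in the error message itself would be torch.multiprocessing.set_sharing_strategy('file_system') at the beginning of the code.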