DataLoader randomly crashes after a few epochs

I’m training with a DataLoader and it randomly crashes with this error after three epochs:

Traceback (most recent call last):
  File "train.py", line 46, in <module>
    for batch_idx, (song, label) in enumerate(train_loader):
  File "/home/sauhaarda/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 280, in __next__
    idx, batch = self._get_batch()
  File "/home/sauhaarda/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
    return self.data_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/sauhaarda/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 178, in
 handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 9777) exited unexpectedly with exit code 1.

My code is available here:
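For context, here is a stripped-down sketch of the kind of setup that matches the loop in the traceback (train.py line 46). The SongDataset class, file list, and tensor shapes are simplified placeholders, not the real code:

import torch
from torch.utils.data import Dataset, DataLoader

class SongDataset(Dataset):
    # Placeholder dataset that returns (song, label) pairs, as the training loop expects.
    def __init__(self, files):
        self.files = files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Placeholder: in the real code an audio clip and its label are loaded from disk.
        song = torch.randn(1, 16000)
        label = torch.tensor(0)
        return song, label

train_loader = DataLoader(SongDataset(files=["a.wav", "b.wav"]),
                          batch_size=32, shuffle=True, num_workers=4)

for epoch in range(5):
    for batch_idx, (song, label) in enumerate(train_loader):
        pass  # forward/backward pass omitted; the crash happens while fetching the next batch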

Could you set num_workers=0, run it again, and check whether the worker still crashes?
If not, set it to 1 and try again. It would be interesting to see whether your data is somehow corrupt.
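Something along these lines, where train_dataset is just a placeholder for whatever Dataset you already build in train.py:

from torch.utils.data import DataLoader

# `train_dataset` is a placeholder for your own Dataset object.
for workers in (0, 1):
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                        num_workers=workers)
    # With num_workers=0 everything runs in the main process, so an exception
    # inside your Dataset is raised with a full traceback instead of just
    # killing a worker process.
    for batch_idx, (song, label) in enumerate(loader):
        pass  # just iterate to see whether loading itself fails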

Hi, I have the same problem. It works with num_workers=0, but it is very slow.
I am running my code on a cluster with 4 GPUs, 48 CPUs, and 360 GB of memory.