CNN training crashes every time (20+ attempts): blue screens and "DataLoader worker exited unexpectedly"

:bug: Bug

I have not been able to finish a single 5-6 epoch training run in more than 20 attempts. Runs end either in a blue screen (with varying stop codes) or with "DataLoader worker exited unexpectedly".

To Reproduce

Steps to reproduce the behavior:

  1. Train my network (based on ResNet101) on ~98,000 training images.
  2. Epochs 1-3 usually start normally, then the run crashes somewhere in the middle. The odd thing is that training worked perfectly before I did some data cleaning (i.e., moved some images around). Since the first epoch completes most of the time, I doubt there is an issue with my data.

When running under Spyder or Jupyter Notebook I constantly got blue screens. This is my first time running the script directly from the Anaconda Prompt, and the traceback below is what this run produced.

Traceback (most recent call last):
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "I:\School\Ana\lib\queue.py", line 178, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Dan/Desktop/shopee.py", line 70, in <module>
    for b, (X_train, y_train) in enumerate(train_loader):
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
    data = self._next_data()
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 798, in _get_data
    success, data = self._try_get_data()
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 8180) exited unexpectedly
The relevant part of my training script:

train_data, test_data = random_split(master_data, (n - n_test, n_test))
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=8)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True, pin_memory=False, num_workers=8)

model = models.resnet101(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

model.fc = nn.Sequential(nn.Linear(2048, 2048),
                         nn.ReLU(inplace=True),
                         nn.Linear(2048, 42),
                         nn.LogSoftmax(dim=1))

model.cuda()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.00002)

start_time = time.time()

epochs = 6

for i in range(epochs):
    tst_corr = 0
    total_loss = 0

    for b, (X_train, y_train) in enumerate(train_loader):
        b += 1

        loss = criterion(model(X_train.to(device)), y_train.to(device))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if b % 300 == 0:
            print(f'epoch:{i+1:2}  batch: {batch_size*b:5}/97920')

    with torch.no_grad():
        for b, (X_test, y_test) in enumerate(test_loader):
            b += 1
            X_test = X_test.cuda()
            y_val = model(X_test).to(torch.device('cpu'))

            predicted = torch.max(y_val.data, 1)[1]
            tst_corr += (predicted == y_test).sum()
            total_loss += criterion(y_val, y_test)
        print(f'epoch {i+1} test loss: {total_loss/b:5} test accuracy: {tst_corr.item()*100/(batch_size*b):7.3f}%\n')

Environment

PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 451.22
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] numpy==1.18.5
[pip3] numpydoc==1.0.0
[pip3] torch==1.5.1
[pip3] torchvision==0.6.1
[conda] _pytorch_select           0.1                       cpu_0  
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              h74a9793_1  
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] libmklml                  2019.0.5                      0  
[conda] mkl                       2019.4                      245  
[conda] mkl-service               2.3.0            py37hb782905_0  
[conda] mkl_fft                   1.1.0            py37h45dec08_0  
[conda] mkl_random                1.1.0            py37h675688f_0  
[conda] numpy                     1.18.5           py37h6530119_0  
[conda] numpy-base                1.18.5           py37hc3f5095_0  
[conda] numpydoc                  1.0.0                      py_0  
[conda] pytorch                   1.5.1           py3.7_cuda102_cudnn7_0    pytorch
[conda] torchvision               0.6.1                py37_cu102    pytorch

Additional context

The blue screens show different stop codes: IRQL_NOT_LESS_OR_EQUAL, KMODE_EXCEPTION_NOT_HANDLED, KERNEL_SECURITY_CHECK_FAILURE, and some other "exception not handled" code has also appeared before.

I run on an RTX 2080 Ti at about 82% CUDA utilization; the CPU sits at about 50% and RAM is not maxed out.

Edit: I have been hitting the issue above at different points while iterating over the training set.
I also ran it again on a freshly installed Anaconda, PyTorch, CUDA, and cuDNN, and got an "interrupt exception not handled" blue screen at about epoch 2.
Moving the data to a different M.2 drive made no difference; same errors.

What is your batch size? It could also be some corrupted images. Did you try running the code with a minimal set of images?
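For example, something along these lines (just a rough sketch reusing master_data and the loader settings from your snippet; it only exercises the loading/decoding path in the workers, no training):

from torch.utils.data import DataLoader, Subset

# Rough sketch: iterate a small slice of the dataset with the same worker
# settings to see whether the worker crash still reproduces.
small_data = Subset(master_data, list(range(1000)))   # first 1000 samples only
small_loader = DataLoader(small_data, batch_size=32, shuffle=True,
                          pin_memory=True, num_workers=8)

for epoch in range(6):
    for X, y in small_loader:
        pass   # decoding happens in the workers; nothing else is needed here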

I've been using batch sizes of 24 or 32.
The corrupted-image theory actually makes sense: while I was cleaning my data, some of the images looked corrupt. I'll go through the dataset, roughly along the lines of the scan sketched below.
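Something like this should catch any files Pillow can't decode (rough sketch; data_root is a placeholder for wherever the images live):

import os
from PIL import Image

data_root = 'I:/dataset/train'   # placeholder path
bad_files = []
for dirpath, _, filenames in os.walk(data_root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with Image.open(path) as img:
                img.load()   # force a full decode; truncated/corrupt files raise here
        except Exception as exc:
            bad_files.append((path, exc))

print(f'{len(bad_files)} unreadable file(s)')
for path, exc in bad_files:
    print(path, exc)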

Only 1 of the ~98,000 images turned out to be faulty, so that doesn't seem to be the issue. Any corrupted image ends up in either the train or the test set, and every epoch iterates over both because I track test loss and accuracy, so a bad image should already break epoch 1. Since epoch 1 survives, I don't think the images are the problem.

Currently training again after removing that image; will update.

I:\Ana\lib\site-packages\torch\utils\data\dataloader.py in _try_get_data(self, timeout)
    760         try:
--> 761             data = self._data_queue.get(timeout=timeout)
    762             return (True, data)

I:\Ana\lib\multiprocessing\queues.py in get(self, block, timeout)
    104                     if not self._poll(timeout):
--> 105                         raise Empty
    106                 elif not self._poll():

Empty: 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-12-85b51d9eba09> in <module>
     20     #fetch_time = time.time()
     21 
---> 22     for b, (X_train, y_train) in enumerate(train_loader):
     23         if b == max_trn_batch:
     24             break

I:\Ana\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    343 
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

I:\Ana\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
    839 
    840             assert not self._shutdown and self._tasks_outstanding > 0
--> 841             idx, data = self._get_data()
    842             self._tasks_outstanding -= 1
    843 

I:\Ana\lib\site-packages\torch\utils\data\dataloader.py in _get_data(self)
    806         else:
    807             while True:
--> 808                 success, data = self._try_get_data()
    809                 if success:
    810                     return data

I:\Ana\lib\site-packages\torch\utils\data\dataloader.py in _try_get_data(self, timeout)
    772             if len(failed_workers) > 0:
    773                 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 774                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    775             if isinstance(e, queue.Empty):
    776                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 2636) exited unexpectedly

This error popped up about halfway through epoch 3.