Bug
I can never finish a single 5-6 epoch training run. I have been getting blue screens with varying stop codes, as well as "DataLoader worker exited unexpectedly" errors.
To Reproduce
Steps to reproduce the behavior:
- Train my net (based on ResNet101) on ~98,000 training images.
- Epochs 1-3 usually run normally, then it crashes somewhere in the middle. The odd part is that everything worked perfectly before I did some data cleaning (i.e., moving some images around). Since the first epoch usually completes fine, I doubt there's an issue with my data.
When running in Spyder or Jupyter Notebook I constantly got blue screens. This is my first time running the script directly from the Anaconda Prompt, and this was the error for this run:
```
Traceback (most recent call last):
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "I:\School\Ana\lib\queue.py", line 178, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Dan/Desktop/shopee.py", line 70, in <module>
    for b, (X_train, y_train) in enumerate(train_loader):
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
    data = self._next_data()
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 798, in _get_data
    success, data = self._try_get_data()
  File "I:\School\Ana\lib\site-packages\torch\utils\data\dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 8180) exited unexpectedly
```
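Before the script itself: a common first diagnostic for this error (my suggestion, not something from the original run) is to temporarily set `num_workers=0`, so the dataset code executes in the main process and any failure in loading a sample surfaces as an ordinary traceback instead of a dead worker and a queue timeout. A minimal variant of the loaders from the script below:

```python
from torch.utils.data import DataLoader

# Diagnostic variant of the loaders in the script below (train_data,
# test_data, batch_size as defined there): num_workers=0 runs the
# dataset's __getitem__ in the main process, so a failing sample raises
# directly rather than killing a worker process.
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True,
                          pin_memory=True, num_workers=0)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True,
                         pin_memory=False, num_workers=0)
```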
The relevant part of my script:

```python
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import models

# master_data, n, n_test, batch_size and device (cuda) are defined earlier.
train_data, test_data = random_split(master_data, (n - n_test, n_test))
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True,
                          pin_memory=True, num_workers=8)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True,
                         pin_memory=False, num_workers=8)

model = models.resnet101(pretrained=True)
for param in model.parameters():
    param.requires_grad = False  # freeze the backbone; only the new head trains
model.fc = nn.Sequential(nn.Linear(2048, 2048),
                         nn.ReLU(inplace=True),
                         nn.Linear(2048, 42),
                         nn.LogSoftmax(dim=1))
model.cuda()

# The head ends in LogSoftmax, so NLLLoss is the matching criterion;
# CrossEntropyLoss would apply log-softmax a second time.
criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.00002)

start_time = time.time()
epochs = 6
for i in range(epochs):
    tst_corr = 0
    total_loss = 0

    # training pass
    for b, (X_train, y_train) in enumerate(train_loader):
        b += 1
        loss = criterion(model(X_train.to(device)), y_train.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if b % 300 == 0:
            print(f'epoch:{i+1:2} batch: {batch_size*b:5}/97920')

    # evaluation pass
    with torch.no_grad():
        for b, (X_test, y_test) in enumerate(test_loader):
            b += 1
            X_test = X_test.cuda()
            y_val = model(X_test).to(torch.device('cpu'))
            predicted = torch.max(y_val.data, 1)[1]
            tst_corr += (predicted == y_test).sum()
            total_loss += criterion(y_val, y_test)
    print(f'epoch {i+1} test loss: {total_loss/b:5} '
          f'test accuracy: {tst_corr.item()*100/(batch_size*b):7.3f}%\n')
```
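One structural thing worth double-checking (an assumption on my part, not confirmed by the report): on Windows, DataLoader workers are created with the spawn start method, which re-imports the main script in each worker process, so with `num_workers=8` the script needs the standard main-module guard. A minimal sketch:

```python
def main():
    # build master_data, the loaders, the model, and run the
    # training/evaluation loops shown above
    ...

if __name__ == '__main__':
    # Needed on Windows when num_workers > 0: each spawned worker
    # re-imports this module, and the guard keeps the workers from
    # recursively starting their own training runs.
    main()
```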
Environment
PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Microsoft Windows 10 Pro
GCC version: Could not collect
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 451.22
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] numpy==1.18.5
[pip3] numpydoc==1.0.0
[pip3] torch==1.5.1
[pip3] torchvision==0.6.1
[conda] _pytorch_select 0.1 cpu_0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 h74a9793_1
[conda] efficientnet-pytorch 0.6.3 pypi_0 pypi
[conda] libmklml 2019.0.5 0
[conda] mkl 2019.4 245
[conda] mkl-service 2.3.0 py37hb782905_0
[conda] mkl_fft 1.1.0 py37h45dec08_0
[conda] mkl_random 1.1.0 py37h675688f_0
[conda] numpy 1.18.5 py37h6530119_0
[conda] numpy-base 1.18.5 py37hc3f5095_0
[conda] numpydoc 1.0.0 py_0
[conda] pytorch 1.5.1 py3.7_cuda102_cudnn7_0 pytorch
[conda] torchvision 0.6.1 py37_cu102 pytorch
Additional context
The blue screens show different stop codes: IRQL_NOT_LESS_OR_EQUAL, KMODE_EXCEPTION_NOT_HANDLED, and KERNEL_SECURITY_CHECK_FAILURE have all occurred, plus at least one other "exception not handled" code.
I run on a 2080 Ti at about 82% CUDA utilization; the CPU sits at about 50% and RAM is not maxed out.
Edit: I have been getting the above error at different points while training on train_data.
I also ran it again on a freshly installed Anaconda, PyTorch, CUDA, and cuDNN, and got an "interrupt exception not handled" blue screen at about epoch 2.
I then moved the data to a different M.2 drive; same errors.
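Since the crashes started after images were moved around, one possibility worth ruling out (my guess, not established by the report) is a file that PIL can no longer read cleanly; a truncated or corrupt image can raise inside a DataLoader worker and take it down. A quick standalone scan, assuming the images sit under a single root directory (the path below is a placeholder, not from the report):

```python
from pathlib import Path
from PIL import Image

root = Path(r'I:\data\images')  # hypothetical image root used to build master_data
for path in root.rglob('*'):
    if path.suffix.lower() not in {'.jpg', '.jpeg', '.png'}:
        continue
    try:
        with Image.open(path) as img:
            img.verify()  # checks file integrity without fully decoding it
    except Exception as exc:
        print(f'bad image: {path} ({exc})')
```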