Hello all,
I know this particular error, as in the heading, has been encountered and discussed at length in previous threads on this forum, but I am simply not able to find the reason why I am encountering it. I get the error - Device side assert triggered -
when I print my loss. I am training a segmentation network on the PASCAL VOC dataset, and my training loop is as follows -
for epoch in range(100):
    epoch_loss = 0
    num_nan = 0
    for batch_idx, data in enumerate(dataloader):
        image = data['image'].cuda()
        mask = data['ground_truth'].cuda()
        optim.zero_grad()
        with autocast():
            loss = model((image, mask))
        print(loss)
        scaler.scale(loss).backward()
        scaler.unscale_(optim)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        scaler.step(optim)
        scaler.update()
        epoch_loss += loss.item()
        del loss
        torch.cuda.empty_cache()
    print(f'Epoch Loss = {epoch_loss / len(dataloader)}, Number of Nans = {num_nan}')
    # scheduler.step()
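For context, this is the sanity check I was planning to run on the raw masks, since other threads suggest that a device-side assert in a segmentation loss usually means target labels fall outside [0, num_classes - 1] (PASCAL VOC uses 255 as the border/ignore label). The names NUM_CLASSES and check_mask_labels here are just my own, and 21 classes (20 + background) is an assumption -

```python
import torch

NUM_CLASSES = 21  # assumption: 20 PASCAL VOC classes + background

def check_mask_labels(mask: torch.Tensor, num_classes: int = NUM_CLASSES,
                      ignore_index: int = 255) -> bool:
    """Return True if every label is a valid class id or the ignore index."""
    labels = torch.unique(mask.cpu())
    bad = [int(l) for l in labels
           if not (0 <= int(l) < num_classes or int(l) == ignore_index)]
    if bad:
        print(f'Invalid labels found: {bad}')
    return not bad

# A mask with only valid ids (and the 255 ignore label) passes,
# while an out-of-range label like 28 is flagged.
print(check_mask_labels(torch.tensor([[0, 5, 255], [20, 1, 0]])))  # True
print(check_mask_labels(torch.tensor([[0, 28, 255]])))             # False
```

Running this once over the dataloader on CPU (before any `.cuda()` call) should rule labels in or out as the cause.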
and the stack trace is as follows -
RuntimeError Traceback (most recent call last)
<ipython-input-1-686f5bbeb585> in <module>
1138 with autocast():
1139 loss = model((image, mask))
-> 1140 print(loss)
1141
1142 scaler.scale(loss).backward()
/opt/conda/lib/python3.7/site-packages/torch/tensor.py in __repr__(self)
177 return handle_torch_function(Tensor.__repr__, relevant_args, self)
178 # All strings are unicode in Python 3.
--> 179 return torch._tensor_str._str(self)
180
181 def backward(self, gradient=None, retain_graph=None, create_graph=False):
/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in _str(self)
370 def _str(self):
371 with torch.no_grad():
--> 372 return _str_intern(self)
/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in _str_intern(self)
350 tensor_str = _tensor_str(self.to_dense(), indent)
351 else:
--> 352 tensor_str = _tensor_str(self, indent)
353
354 if self.layout != torch.strided:
/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in _tensor_str(self, indent)
239 return _tensor_str_with_formatter(self, indent, summarize, real_formatter, imag_formatter)
240 else:
--> 241 formatter = _Formatter(get_summarized_data(self) if summarize else self)
242 return _tensor_str_with_formatter(self, indent, summarize, formatter)
243
/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py in __init__(self, tensor)
87
88 else:
---> 89 nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
90
91 if nonzero_finite_vals.numel() == 0:
RuntimeError: CUDA error: device-side assert triggered
Now, as can be seen, the error occurs in the print call. I monitor my memory, and memory is not really an issue, since I clear the cache at the end of every iteration.
If I do not print the loss, the error instead comes in the backward call, as follows -
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-1-000e70ad2e12> in <module>
1139 loss = model((image, mask))
1140
-> 1141 scaler.scale(loss).backward()
1142 scaler.unscale_(optim)
1143 torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
/opt/conda/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
219 retain_graph=retain_graph,
220 create_graph=create_graph)
--> 221 torch.autograd.backward(self, gradient, retain_graph, create_graph)
222
223 def register_hook(self, hook):
/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
130 Variable._execution_engine.run_backward(
131 tensors, grad_tensors_, retain_graph, create_graph,
--> 132 allow_unreachable=True) # allow_unreachable flag
133
134
RuntimeError: CUDA error: device-side assert triggered
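From older threads on this error, I understand that CUDA kernels launch asynchronously, so the line that reports the assert (print or backward here) need not be the op that actually triggered it. I plan to rerun with synchronous launches so the traceback points at the real failing kernel - this has to be set before the first CUDA call, i.e. in a fresh process or restarted notebook -

```python
import os

# Force synchronous CUDA kernel launches so the Python traceback
# identifies the op that actually triggered the device-side assert.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
```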
What should I do to debug this and ensure smooth training?
TIA