I am rewriting this tutorial with PyTorch Lightning. With the following training_step:
def training_step(self, batch, batch_idx):
    images = batch[0]
    targets = batch[1]
    # torchvision's MaskRCNN returns a dict of partial losses in training mode
    loss_dict = self.model(images, targets)
    loss = torch.stack([loss for loss in loss_dict.values()])
    # replace NaN partial losses and cap everything at 10.0 before summing
    loss[torch.isnan(loss)] = 10.0
    loss = loss.clamp(min=0.0, max=10.0)
    loss = loss.sum()
    # log each partial loss to TensorBoard
    for l_name, l_value in loss_dict.items():
        try:
            self.logger.experiment.add_scalar(
                f"train_{l_name}", l_value, self.current_epoch
            )
        except RuntimeError:
            pass
    self.log("train_loss", loss)
    return loss
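For reference, my understanding is that the NaN-masking step (the exact line the traceback below points at) is only meant to replace NaN partial losses and cap each of them at 10.0; a minimal sketch of what I believe is an equivalent, out-of-place formulation:

losses = torch.stack(list(loss_dict.values()))
# replace NaN entries out of place instead of assigning through a boolean index
losses = torch.where(torch.isnan(losses), torch.full_like(losses, 10.0), losses)
loss = losses.clamp(min=0.0, max=10.0).sum()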
I get the following output:
tensorboard --logdir=trash/1606745694/tensorboard
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
| Name | Type | Params
-------------------------------------------------
0 | model | MaskRCNN | 43 M
1 | criterion | MSELoss | 0
2 | criterion_1 | CrossEntropyLoss | 0
3 | accuracy | Accuracy | 0
Validation sanity check: 0it [00:00, ?it/s]/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py:446: UserWarning: Using a target size (torch.Size([1, 420, 360])) that is different to the input size (torch.Size([10, 1, 420, 360])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
Epoch 0: 0%| | 0/5 [00:00<?, ?it/s]/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
Epoch 0: 80%|████████████████/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py:446: UserWarning: Using a target size (torch.Size([1, 420, 360])) that is different to the input size (torch.Size([0, 1, 420, 360])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py:446: UserWarning: Using a target size (torch.Size([1, 420, 360])) that is different to the input size (torch.Size([2, 1, 420, 360])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:07<00:00, 1.42s/it, loss=4.556, v_num=0]Epoch 0: train_loss reached 10.22216 (best 10.22216), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=00-val_loss=30.00.ckpt as top 3████████| 1/1 [00:00<00:00, 1.02it/s]
Epoch 1: 80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 4/5 [00:03<00:00, 1.16it/s, loss=3.630, v_num=0]
Validating: 0it [00:00, ?it/s]
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00, 1.07s/it, loss=3.630, v_num=0]Epoch 1: train_loss reached 0.13001 (best 0.13001), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=01-val_loss=0.00.ckpt as top 3███████████| 1/1 [00:01<00:00, 1.91s/it]
Epoch 2: 80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 4/5 [00:03<00:00, 1.14it/s, loss=3.790, v_num=0]
Validating: 0it [00:00, ?it/s]
Epoch 2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00, 1.08s/it, loss=3.790, v_num=0]Epoch 2: train_loss reached 5.96821 (best 0.13001), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=02-val_loss=0.00.ckpt as top 3███████████| 1/1 [00:01<00:00, 1.87s/it]
Epoch 3: 80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 4/5 [00:03<00:00, 1.18it/s, loss=4.178, v_num=0]
Validating: 0it [00:00, ?it/s]
Epoch 3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00, 1.04s/it, loss=4.178, v_num=0]Epoch 3: train_loss reached 0.07344 (best 0.07344), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=03-val_loss=0.00.ckpt as top 3███████████| 1/1 [00:01<00:00, 1.82s/it]
Epoch 4: 20%|████████████████████████████▍ | 1/5 [00:01<00:04, 1.03s/it, loss=3.938, v_num=0]Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/vitor/Projects/ocr/segmentation/maskrcnn/train.py", line 173, in <module>
trainer.fit(model, tr_dl, val_dl)
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 63, in train
results = self.train_or_test()
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
results = self.trainer.train()
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 493, in train
self.train_loop.run_training_epoch()
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 728, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 469, in optimizer_step
self.trainer.accelerator_backend.optimizer_step(
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 122, in optimizer_step
model_ref.optimizer_step(
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1262, in optimizer_step
optimizer_closure()
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 718, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 813, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 320, in training_step
training_step_output = self.trainer.accelerator_backend.training_step(args)
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 69, in training_step
output = self.__training_step(args)
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 79, in __training_step
output = self.trainer.model.training_step(*args)
File "/home/vitor/Projects/ocr/segmentation/maskrcnn/models.py", line 60, in training_step
loss[torch.isnan(loss)] = 10.0
RuntimeError: CUDA error: an illegal memory access was encountered
Exception ignored in: <function tqdm.__del__ at 0x7f14df8e3160>
Traceback (most recent call last):
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1128, in __del__
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1341, in close
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1520, in display
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1131, in __repr__
File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1481, in format_dict
TypeError: cannot unpack non-iterable NoneType object
I have tried running with CUDA_LAUNCH_BLOCKING=1; however, it did not solve the problem.
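For reference, my understanding is that the flag only takes effect if it is in the environment before CUDA is initialized, so I set it before anything touches the GPU, roughly like this:

import os
# must be set before the first CUDA call (or exported in the shell before
# launching the script, e.g. CUDA_LAUNCH_BLOCKING=1 python -m ...)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"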
How can I debug this error and get rid of this behavior?