MaskRCNN training_step in PyTorch Lightning

I am rewriting this tutorial with PyTorch Lightning, and within the following training_step:

    def training_step(self, batch, batch_idx):

        images = batch[0]
        targets = batch[1]

        # in training mode, MaskRCNN returns a dict of losses
        loss_dict = self.model(images, targets)

        loss = torch.stack([loss for loss in loss_dict.values()])

        # replace NaNs and clip the individual losses before summing
        loss[torch.isnan(loss)] = 10.0
        loss = loss.clamp(min=0.0, max=10.0)
        loss = loss.sum()

        # log every individual loss to TensorBoard
        for l_name, l_value in loss_dict.items():
            try:
                self.logger.experiment.add_scalar(
                    f"train_{l_name}", l_value, self.current_epoch
                )
            except RuntimeError:
                pass

        self.log("train_loss", loss)

        return loss

I get the following output:

tensorboard --logdir=trash/1606745694/tensorboard
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.

  | Name        | Type             | Params
-------------------------------------------------
0 | model       | MaskRCNN         | 43 M  
1 | criterion   | MSELoss          | 0     
2 | criterion_1 | CrossEntropyLoss | 0     
3 | accuracy    | Accuracy         | 0     
Validation sanity check: 0it [00:00, ?it/s]/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py:446: UserWarning: Using a target size (torch.Size([1, 420, 360])) that is different to the input size (torch.Size([10, 1, 420, 360])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)


Epoch 0:   0%|                                                                                                                                                                           | 0/5 [00:00<?, ?it/s]/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. 
  warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
Epoch 0:  80%|████████████████/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py:446: UserWarning: Using a target size (torch.Size([1, 420, 360])) that is different to the input size (torch.Size([0, 1, 420, 360])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py:446: UserWarning: Using a target size (torch.Size([1, 420, 360])) that is different to the input size (torch.Size([2, 1, 420, 360])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)


Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:07<00:00,  1.42s/it, loss=4.556, v_num=0]Epoch 0: train_loss reached 10.22216 (best 10.22216), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=00-val_loss=30.00.ckpt as top 3████████| 1/1 [00:00<00:00,  1.02it/s]
Epoch 1:  80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                            | 4/5 [00:03<00:00,  1.16it/s, loss=3.630, v_num=0]
Validating: 0it [00:00, ?it/s]                                                                                                                                                                                 
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.07s/it, loss=3.630, v_num=0]Epoch 1: train_loss reached 0.13001 (best 0.13001), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=01-val_loss=0.00.ckpt as top 3███████████| 1/1 [00:01<00:00,  1.91s/it]
Epoch 2:  80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                            | 4/5 [00:03<00:00,  1.14it/s, loss=3.790, v_num=0]
Validating: 0it [00:00, ?it/s]                                                                                                                                                                                 
Epoch 2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.08s/it, loss=3.790, v_num=0]Epoch 2: train_loss reached 5.96821 (best 0.13001), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=02-val_loss=0.00.ckpt as top 3███████████| 1/1 [00:01<00:00,  1.87s/it]
Epoch 3:  80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                            | 4/5 [00:03<00:00,  1.18it/s, loss=4.178, v_num=0]
Validating: 0it [00:00, ?it/s]                                                                                                                                                                                 
Epoch 3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.04s/it, loss=4.178, v_num=0]Epoch 3: train_loss reached 0.07344 (best 0.07344), saving model to /home/vitor/Projects/ocr/trash/1606745694/checkpoints/weights-epoch=03-val_loss=0.00.ckpt as top 3███████████| 1/1 [00:01<00:00,  1.82s/it]
Epoch 4:  20%|████████████████████████████▍                                                                                                                 | 1/5 [00:01<00:04,  1.03s/it, loss=3.938, v_num=0]Traceback (most recent call last):                                                                                                                                                                             
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/vitor/Projects/ocr/segmentation/maskrcnn/train.py", line 173, in <module>
    trainer.fit(model, tr_dl, val_dl)
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 63, in train
    results = self.train_or_test()
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 493, in train
    self.train_loop.run_training_epoch()
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 728, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 469, in optimizer_step
    self.trainer.accelerator_backend.optimizer_step(
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 122, in optimizer_step
    model_ref.optimizer_step(
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1262, in optimizer_step
    optimizer_closure()
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 718, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 813, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 320, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 69, in training_step
    output = self.__training_step(args)
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 79, in __training_step
    output = self.trainer.model.training_step(*args)
  File "/home/vitor/Projects/ocr/segmentation/maskrcnn/models.py", line 60, in training_step
    loss[torch.isnan(loss)] = 10.0
RuntimeError: CUDA error: an illegal memory access was encountered
Exception ignored in: <function tqdm.__del__ at 0x7f14df8e3160>
Traceback (most recent call last):
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1128, in __del__
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1341, in close
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1520, in display
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1131, in __repr__
  File "/home/vitor/Projects/ocr/venv/lib/python3.8/site-packages/tqdm/std.py", line 1481, in format_dict
TypeError: cannot unpack non-iterable NoneType object


I have tried running with CUDA_LAUNCH_BLOCKING=1; however, it did not solve the problem.
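
For context, the variable only takes effect if it is set before CUDA is initialized; one way to make sure of that (a sketch, assuming the entry script is the first place that imports torch) is:

    # set the env var before the first torch import, so the CUDA context is
    # created with synchronous kernel launches and the stack trace points at
    # the failing call
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # imported only after the env var is in place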

How can I debug and get rid of this behavior?

Which PyTorch version are you using? Could you update to the latest stable release (1.7.0) and rerun the script?
Did a run with blocking calls yield another stack trace?

Thanks for the reply @ptrblck, I am already on PyTorch 1.7.0.

I have investigated this behavior further and I suspect it has to do with the loss's grad_fn, because:

MaskRCNN returns the following object during training:

    {'loss_classifier': tensor(1.2836, device='cuda:0', grad_fn=<NllLossBackward>),
     'loss_box_reg': tensor(0.0359, device='cuda:0', grad_fn=<DivBackward0>),
     'loss_mask': tensor(1123.8214, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
     'loss_objectness': tensor(0.8418, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
     'loss_rpn_box_reg': tensor(0.0594, device='cuda:0', grad_fn=<DivBackward0>)}

After stacking, one gets:

    tensor([1.2836e+00, 3.5866e-02, 1.1238e+03, 8.4183e-01, 5.9387e-02],
           device='cuda:0', grad_fn=<StackBackward>)

Then, the NaN cleansing yields:

    tensor([1.2836e+00, 3.5866e-02, 1.1238e+03, 8.4183e-01, 5.9387e-02],
           device='cuda:0', grad_fn=<IndexPutBackward>)

Clamping gives:

    tensor([ 1.2836,  0.0359, 10.0000,  0.8418,  0.0594], device='cuda:0',
           grad_fn=<ClampBackward>)

At last, after the sum reduction, one gets:

    tensor(12.2206, device='cuda:0', grad_fn=<SumBackward0>)

I have read some of the "theory" here, but it is still not clear to me how to make it work in practice.

In short, I guess the question is how to reduce the losses so that I get a single loss that can back-propagate.
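
Concretely, something like the sketch below is what I have in mind (the same steps as in my training_step above, but with the in-place NaN assignment swapped for an out-of-place torch.where, in case the index_put is the culprit):

    def training_step(self, batch, batch_idx):
        images, targets = batch[0], batch[1]

        loss_dict = self.model(images, targets)

        loss = torch.stack(list(loss_dict.values()))
        # replace NaNs without the in-place index_put of the original version
        loss = torch.where(torch.isnan(loss), torch.full_like(loss, 10.0), loss)
        loss = loss.clamp(min=0.0, max=10.0).sum()

        self.log("train_loss", loss)
        return loss  # Lightning runs the backward pass on this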

Note: if I choose any single loss alone, say loss = loss_dict["loss_classifier"], the training_step runs.

I don’t think the illegal memory access is related to the code usage in your script, but is an internal bug.
Could you post an executable code snippet and add the information about your setup, please?

I see, so I am able to reproduce one of the errors I got (perhaps related) with the snippet below:

    import torch
    import pickle
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
    from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

    # pretrained Mask R-CNN with the box and mask heads replaced for 2 classes
    num_classes = 2
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.roi_heads.detections_per_img = 10

    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, hidden_layer, num_classes
    )

    device = torch.device("cuda")

    # one batch that triggers the error; the pickled images load as CUDA
    # tensors already (see the shapes posted further down)
    with open('batch_sample.pkl', 'rb') as f:
        batch = pickle.load(f)

    images, targets = batch

    model.to(device)
    model.train()
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

    # in training mode the model returns a dict of losses
    loss_dict = model(images, targets)
    loss = sum([loss for loss in loss_dict.values()])
    loss.backward()

You can download the sample input data here.
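
If the pickle is not handy, a synthetic batch along these lines exercises the same code path (model and device as defined above; the shapes match the sample, the contents are made up):

    num_images, height, width = 5, 420, 360

    model.to(device)
    model.train()  # the loss dict is only returned in training mode

    # images: list of float tensors in [0, 1], one per image
    images = [torch.rand(3, height, width, device=device) for _ in range(num_images)]

    # one target per image: a single box with a matching binary instance mask
    targets = []
    for _ in range(num_images):
        mask = torch.zeros(1, height, width, dtype=torch.uint8, device=device)
        mask[:, 100:300, 50:250] = 1
        targets.append({
            "boxes": torch.tensor([[50.0, 100.0, 250.0, 300.0]], device=device),
            "labels": torch.ones(1, dtype=torch.int64, device=device),
            "masks": mask,
        })

    loss_dict = model(images, targets)
    sum(loss_dict.values()).backward()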

My setup is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   57C    P8    15W /  N/A |    682MiB /  7982MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

(venv) vitor@vitor-Oryx-Pro:~/Projects/ocr$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04 LTS
Release:	20.04
Codename:	focal

Python 3.8.5
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
torchvision 0.8.1

Does it suffice, @ptrblck?

Could you post the shapes as well as the min and max values for images and targets? Alternatively, could you upload the batch_sample.pkl data?

Thanks for the hint, @ptrblck. The requested shapes are:

    IMAGES: shape =  torch.Size([5, 3, 420, 360]) , min =  tensor(0., device='cuda:0') , max =  tensor(0.9765, device='cuda:0') , type =  torch.cuda.FloatTensor
    MASK_0: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(0.0157, device='cuda:0') , type =  torch.cuda.FloatTensor
    MASK_1: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(0.0078, device='cuda:0') , type =  torch.cuda.FloatTensor
    MASK_2: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(0.0078, device='cuda:0') , type =  torch.cuda.FloatTensor
    MASK_3: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(0.0078, device='cuda:0') , type =  torch.cuda.FloatTensor
    MASK_4: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(0.0078, device='cuda:0') , type =  torch.cuda.FloatTensor
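
A small helper along these lines reproduces that kind of summary (a hypothetical helper, not part of the training code):

    def describe(name, tensor):
        # shape, value range and tensor type, for a quick sanity check
        print(f"{name}: shape = {tensor.shape}, min = {tensor.min()}, "
              f"max = {tensor.max()}, type = {tensor.type()}")

    describe("IMAGES", images)
    for i, target in enumerate(targets):
        describe(f"MASK_{i}", target["masks"])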

This enabled me to identify the origin of the error; for this snippet, the fix was to multiply both images and masks by 255, which gives:

    IMAGES: shape =  torch.Size([5, 3, 420, 360]) , min =  tensor(0., device='cuda:0') , max =  tensor(249., device='cuda:0') , type =  torch.cuda.FloatTensor
    MASK_0: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(4., device='cuda:0')   , type =  torch.cuda.FloatTensor
    MASK_1: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(2., device='cuda:0')   , type =  torch.cuda.FloatTensor
    MASK_2: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(2., device='cuda:0')   , type =  torch.cuda.FloatTensor
    MASK_3: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(2., device='cuda:0')   , type =  torch.cuda.FloatTensor
    MASK_4: shape =  torch.Size([1, 420, 360])    , min =  tensor(0., device='cuda:0') , max =  tensor(2., device='cuda:0')   , type =  torch.cuda.FloatTensor

With these ranges, the snippet yields no error.
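
The change in the snippet boils down to the scaling below (a sketch, assuming images is the batched tensor from the pickle; whether 255 is the right factor of course depends on how the images and masks were encoded upstream):

    # bring images and masks back to their original integer-valued ranges
    images = images * 255
    for target in targets:
        target["masks"] = target["masks"] * 255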

I have adjusted my code to comply with the expected ranges and types; however, the backward pass within Lightning still fails.

Any suggestions?