I have implemented a CNN-based object detection model in PyTorch with three heads (classification, object detection, and segmentation), running on Google Colab. The model comes from a research paper, and when I train it as published there is no problem: the training time per epoch is consistent. I then created a second model by adding a new classification head to the backbone of model 1, which originally only extracted feature maps and passed them to an FPN. The backbone is dla34 from timm, created like this:
self.backbone = timm.create_model(model_name, pretrained=True, features_only=True, out_indices=model_out_indices)
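The extra classification head is attached roughly like this (a minimal sketch with placeholder layer sizes, class counts, and names, not my exact code; it pools the deepest feature map and runs it through a linear layer):

import torch.nn as nn
import timm

class BackboneWithClsHead(nn.Module):
    # Sketch: features_only backbone that also returns an image-level class prediction.
    def __init__(self, model_name="dla34", out_indices=(1, 2, 3, 4), num_classes=10):
        super().__init__()
        self.backbone = timm.create_model(
            model_name, pretrained=True, features_only=True, out_indices=out_indices
        )
        last_ch = self.backbone.feature_info.channels()[-1]  # channels of the deepest feature map
        self.cls_head = nn.Sequential(                        # the layers appended to the backbone
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(last_ch, num_classes),
        )

    def forward(self, x):
        feats = self.backbone(x)                 # feature maps that go to the FPN
        class_logits = self.cls_head(feats[-1])  # extra image-level classification output
        return feats, class_logits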
With this extra head, the backbone classifies the image while still producing the feature maps. Since the change, the training and validation losses only decrease slowly, like this:
$$TRAIN$$ epoch 0 ====>: loss_cls = 10.37930 loss_reg_xytl = 0.07201 loss_iou = 3.33917 loss_seg = 0.23536 loss_class_cls = 0.13680 Train Time: 00:15:57
$$VALID$$ epoch 0 ====>: loss_cls = 3.64299 loss_reg_xytl = 0.06027 loss_iou = 3.27866 loss_seg = 0.21605 loss_class_cls = 0.13394 Val Time: 00:02:51
$$TRAIN$$ epoch 1 ====>: loss_cls = 2.90086 loss_reg_xytl = 0.04123 loss_iou = 2.82772 loss_seg = 0.18830 loss_class_cls = 0.13673 Train Time: 00:06:28
$$VALID$$ epoch 1 ====>: loss_cls = 2.42524 loss_reg_xytl = 0.02885 loss_iou = 2.43828 loss_seg = 0.16975 loss_class_cls = 0.13383 Val Time: 00:00:21
$$TRAIN$$ epoch 2 ====>: loss_cls = 2.51989 loss_reg_xytl = 0.02749 loss_iou = 2.29531 loss_seg = 0.16370 loss_class_cls = 0.13665 Train Time: 00:08:08
$$VALID$$ epoch 2 ====>: loss_cls = 2.31358 loss_reg_xytl = 0.01987 loss_iou = 2.15709 loss_seg = 0.15870 loss_class_cls = 0.13372 Val Time: 00:00:20
$$TRAIN$$ epoch 3 ====>: loss_cls = 2.45530 loss_reg_xytl = 0.02143 loss_iou = 2.04151 loss_seg = 0.15327 loss_class_cls = 0.13663 Train Time: 00:09:41
$$VALID$$ epoch 3 ====>: loss_cls = 2.16958 loss_reg_xytl = 0.01639 loss_iou = 1.93723 loss_seg = 0.14761 loss_class_cls = 0.13373 Val Time: 00:00:21
$$TRAIN$$ epoch 4 ====>: loss_cls = 2.28015 loss_reg_xytl = 0.01871 loss_iou = 1.95341 loss_seg = 0.14816 loss_class_cls = 0.13662 Train Time: 00:11:24
$$VALID$$ epoch 4 ====>: loss_cls = 2.10085 loss_reg_xytl = 0.01300 loss_iou = 1.72231 loss_seg = 0.14628 loss_class_cls = 0.13366 Val Time: 00:00:20
$$TRAIN$$ epoch 5 ====>: loss_cls = 2.26286 loss_reg_xytl = 0.01951 loss_iou = 1.85480 loss_seg = 0.14490 loss_class_cls = 0.13656 Train Time: 00:12:51
$$VALID$$ epoch 5 ====>: loss_cls = 2.06082 loss_reg_xytl = 0.01709 loss_iou = 1.70226 loss_seg = 0.13609 loss_class_cls = 0.13360 Val Time: 00:00:21
$$TRAIN$$ epoch 6 ====>: loss_cls = 2.10616 loss_reg_xytl = 0.02187 loss_iou = 1.75277 loss_seg = 0.14173 loss_class_cls = 0.13654 Train Time: 00:14:36
$$VALID$$ epoch 6 ====>: loss_cls = 1.80460 loss_reg_xytl = 0.01411 loss_iou = 1.64604 loss_seg = 0.13180 loss_class_cls = 0.13360 Val Time: 00:00:20
$$TRAIN$$ epoch 7 ====>: loss_cls = 1.95502 loss_reg_xytl = 0.01975 loss_iou = 1.70851 loss_seg = 0.14052 loss_class_cls = 0.13655 Train Time: 00:16:06
$$VALID$$ epoch 7 ====>: loss_cls = 1.80424 loss_reg_xytl = 0.01560 loss_iou = 1.69335 loss_seg = 0.13176 loss_class_cls = 0.13355 Val Time: 00:00:20
$$TRAIN$$ epoch 8 ====>: loss_cls = 1.90833 loss_reg_xytl = 0.02100 loss_iou = 1.73520 loss_seg = 0.14235 loss_class_cls = 0.13649 Train Time: 00:17:46
$$VALID$$ epoch 8 ====>: loss_cls = 1.53639 loss_reg_xytl = 0.01386 loss_iou = 1.68395 loss_seg = 0.13792 loss_class_cls = 0.13350 Val Time: 00:00:21
$$TRAIN$$ epoch 9 ====>: loss_cls = 1.61048 loss_reg_xytl = 0.01840 loss_iou = 1.81451 loss_seg = 0.14155 loss_class_cls = 0.13642 Train Time: 00:19:23
$$VALID$$ epoch 9 ====>: loss_cls = 1.39604 loss_reg_xytl = 0.01234 loss_iou = 1.69770 loss_seg = 0.14150 loss_class_cls = 0.13345 Val Time: 00:00:20
$$TRAIN$$ epoch 10 ====>: loss_cls = 1.58478 loss_reg_xytl = 0.01784 loss_iou = 1.73858 loss_seg = 0.14001 loss_class_cls = 0.13636 Train Time: 00:21:11
$$VALID$$ epoch 10 ====>: loss_cls = 1.49616 loss_reg_xytl = 0.01216 loss_iou = 1.60697 loss_seg = 0.13105 loss_class_cls = 0.13335 Val Time: 00:00:20
$$TRAIN$$ epoch 11 ====>: loss_cls = 1.59138 loss_reg_xytl = 0.01954 loss_iou = 1.70157 loss_seg = 0.13825 loss_class_cls = 0.13628 Train Time: 00:23:13
$$VALID$$ epoch 11 ====>: loss_cls = 1.37387 loss_reg_xytl = 0.01493 loss_iou = 1.72290 loss_seg = 0.14186 loss_class_cls = 0.13325 Val Time: 00:00:20
$$TRAIN$$ epoch 12 ====>: loss_cls = 1.56931 loss_reg_xytl = 0.01929 loss_iou = 1.69895 loss_seg = 0.13726 loss_class_cls = 0.13621 Train Time: 00:24:55
$$VALID$$ epoch 12 ====>: loss_cls = 1.47095 loss_reg_xytl = 0.01358 loss_iou = 1.64010 loss_seg = 0.12568 loss_class_cls = 0.13314 Val Time: 00:00:21
$$TRAIN$$ epoch 13 ====>: loss_cls = 1.47089 loss_reg_xytl = 0.01883 loss_iou = 1.69151 loss_seg = 0.13617 loss_class_cls = 0.13627 Train Time: 00:26:49
$$VALID$$ epoch 13 ====>: loss_cls = 1.37469 loss_reg_xytl = 0.01444 loss_iou = 1.57538 loss_seg = 0.13452 loss_class_cls = 0.13308 Val Time: 00:00:20
$$TRAIN$$ epoch 14 ====>: loss_cls = 1.39732 loss_reg_xytl = 0.01801 loss_iou = 1.66951 loss_seg = 0.13488 loss_class_cls = 0.13614 Train Time: 00:28:04
$$VALID$$ epoch 14 ====>: loss_cls = 1.22657 loss_reg_xytl = 0.01389 loss_iou = 1.66898 loss_seg = 0.14039 loss_class_cls = 0.13286 Val Time: 00:00:21
$$TRAIN$$ epoch 15 ====>: loss_cls = 1.30442 loss_reg_xytl = 0.01737 loss_iou = 1.69497 loss_seg = 0.13358 loss_class_cls = 0.13607 Train Time: 00:29:14
$$VALID$$ epoch 15 ====>: loss_cls = 1.25604 loss_reg_xytl = 0.01460 loss_iou = 1.65997 loss_seg = 0.12326 loss_class_cls = 0.13268 Val Time: 00:00:20
$$TRAIN$$ epoch 16 ====>: loss_cls = 1.32521 loss_reg_xytl = 0.01644 loss_iou = 1.70964 loss_seg = 0.13379 loss_class_cls = 0.13590 Train Time: 00:30:58
$$VALID$$ epoch 16 ====>: loss_cls = 1.28813 loss_reg_xytl = 0.01189 loss_iou = 1.62254 loss_seg = 0.13013 loss_class_cls = 0.13239 Val Time: 00:00:20
The training time is also increasing every epoch. I checked the problem with ChatGPT and tried the following modifications, but in the end the results were the same:
changing the optimizer
changing the lr scheduler
freezing some of the first layers of the backbone (see the sketch after this list)
changing the weights of the losses
removing some of the losses (loss_class_cls and loss_seg)
changing the number of workers and batch_size
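For reference, the layer freezing looked roughly like this (a minimal sketch; the number of frozen stages and the optimizer are illustrative, not my exact settings):

import torch
import torch.nn as nn

def freeze_first_stages(backbone: nn.Module, n_stages: int = 2) -> None:
    # Turn off gradients for the first n_stages top-level children of the backbone.
    for i, child in enumerate(backbone.children()):
        if i >= n_stages:
            break
        for p in child.parameters():
            p.requires_grad = False

# usage (model here stands for my detection model with the timm backbone):
# freeze_first_stages(model.backbone, n_stages=2)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )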
But the results were exactly the same and the training time kept increasing (running on a GPU on Google Colab), so I desperately need some suggestions on how to solve this problem.