Training time is increasing per epoch, can somebody help me?

I have implemented an object detection model with CNNs in PyTorch, with 3 heads (classification, object detection, and segmentation), on Google Colab. This model is from a research paper, and when I run it there is no problem and the training time is consistent. I then modified it: since model 1 was only extracting feature maps and using them via an FPN, I added a new classification head to its backbone and created a second model. The backbone is dla34 from timm, and the code is this:
self.backbone = timm.create_model(model_name, pretrained=True, features_only=True, out_indices=model_out_indices)

I added some layers to the end of the backbone so that it classifies the image while still returning the feature maps; the idea is roughly what the sketch below shows. With this change, the training and validation results are decreasing, but only at a slow rate.
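This is not my exact code, just a simplified sketch of the modification (the head layers, sizes, class count, and out_indices are illustrative):

    import torch.nn as nn
    import timm

    class BackboneWithCls(nn.Module):
        # sketch: timm features_only backbone plus an extra image-level classification head
        def __init__(self, model_name="dla34", model_out_indices=(1, 2, 3, 4, 5), num_classes=10):
            super().__init__()
            self.backbone = timm.create_model(
                model_name, pretrained=True, features_only=True, out_indices=model_out_indices
            )
            last_ch = self.backbone.feature_info.channels()[-1]  # channels of the deepest feature map
            self.cls_head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(last_ch, num_classes),
            )

        def forward(self, x):
            feats = self.backbone(x)               # feature maps consumed by the FPN / detection heads
            cls_logits = self.cls_head(feats[-1])  # new image-level classification output
            return feats, cls_logits

The per-epoch losses and times look like this: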

$$TRAIN$$ epoch 0 ====>: loss_cls = 10.37930 loss_reg_xytl = 0.07201 loss_iou = 3.33917 loss_seg = 0.23536 loss_class_cls = 0.13680 Train Time: 00:15:57 
$$VALID$$ epoch 0 ====>: loss_cls = 3.64299 loss_reg_xytl = 0.06027 loss_iou = 3.27866 loss_seg = 0.21605 loss_class_cls = 0.13394 Val Time: 00:02:51 
$$TRAIN$$ epoch 1 ====>: loss_cls = 2.90086 loss_reg_xytl = 0.04123 loss_iou = 2.82772 loss_seg = 0.18830 loss_class_cls = 0.13673 Train Time: 00:06:28 
$$VALID$$ epoch 1 ====>: loss_cls = 2.42524 loss_reg_xytl = 0.02885 loss_iou = 2.43828 loss_seg = 0.16975 loss_class_cls = 0.13383 Val Time: 00:00:21 
$$TRAIN$$ epoch 2 ====>: loss_cls = 2.51989 loss_reg_xytl = 0.02749 loss_iou = 2.29531 loss_seg = 0.16370 loss_class_cls = 0.13665 Train Time: 00:08:08 
$$VALID$$ epoch 2 ====>: loss_cls = 2.31358 loss_reg_xytl = 0.01987 loss_iou = 2.15709 loss_seg = 0.15870 loss_class_cls = 0.13372 Val Time: 00:00:20 
$$TRAIN$$ epoch 3 ====>: loss_cls = 2.45530 loss_reg_xytl = 0.02143 loss_iou = 2.04151 loss_seg = 0.15327 loss_class_cls = 0.13663 Train Time: 00:09:41 
$$VALID$$ epoch 3 ====>: loss_cls = 2.16958 loss_reg_xytl = 0.01639 loss_iou = 1.93723 loss_seg = 0.14761 loss_class_cls = 0.13373 Val Time: 00:00:21 
$$TRAIN$$ epoch 4 ====>: loss_cls = 2.28015 loss_reg_xytl = 0.01871 loss_iou = 1.95341 loss_seg = 0.14816 loss_class_cls = 0.13662 Train Time: 00:11:24 
$$VALID$$ epoch 4 ====>: loss_cls = 2.10085 loss_reg_xytl = 0.01300 loss_iou = 1.72231 loss_seg = 0.14628 loss_class_cls = 0.13366 Val Time: 00:00:20 
$$TRAIN$$ epoch 5 ====>: loss_cls = 2.26286 loss_reg_xytl = 0.01951 loss_iou = 1.85480 loss_seg = 0.14490 loss_class_cls = 0.13656 Train Time: 00:12:51 
$$VALID$$ epoch 5 ====>: loss_cls = 2.06082 loss_reg_xytl = 0.01709 loss_iou = 1.70226 loss_seg = 0.13609 loss_class_cls = 0.13360 Val Time: 00:00:21 
$$TRAIN$$ epoch 6 ====>: loss_cls = 2.10616 loss_reg_xytl = 0.02187 loss_iou = 1.75277 loss_seg = 0.14173 loss_class_cls = 0.13654 Train Time: 00:14:36 
$$VALID$$ epoch 6 ====>: loss_cls = 1.80460 loss_reg_xytl = 0.01411 loss_iou = 1.64604 loss_seg = 0.13180 loss_class_cls = 0.13360 Val Time: 00:00:20 
$$TRAIN$$ epoch 7 ====>: loss_cls = 1.95502 loss_reg_xytl = 0.01975 loss_iou = 1.70851 loss_seg = 0.14052 loss_class_cls = 0.13655 Train Time: 00:16:06 
$$VALID$$ epoch 7 ====>: loss_cls = 1.80424 loss_reg_xytl = 0.01560 loss_iou = 1.69335 loss_seg = 0.13176 loss_class_cls = 0.13355 Val Time: 00:00:20 
$$TRAIN$$ epoch 8 ====>: loss_cls = 1.90833 loss_reg_xytl = 0.02100 loss_iou = 1.73520 loss_seg = 0.14235 loss_class_cls = 0.13649 Train Time: 00:17:46 
$$VALID$$ epoch 8 ====>: loss_cls = 1.53639 loss_reg_xytl = 0.01386 loss_iou = 1.68395 loss_seg = 0.13792 loss_class_cls = 0.13350 Val Time: 00:00:21 
$$TRAIN$$ epoch 9 ====>: loss_cls = 1.61048 loss_reg_xytl = 0.01840 loss_iou = 1.81451 loss_seg = 0.14155 loss_class_cls = 0.13642 Train Time: 00:19:23 
$$VALID$$ epoch 9 ====>: loss_cls = 1.39604 loss_reg_xytl = 0.01234 loss_iou = 1.69770 loss_seg = 0.14150 loss_class_cls = 0.13345 Val Time: 00:00:20 
$$TRAIN$$ epoch 10 ====>: loss_cls = 1.58478 loss_reg_xytl = 0.01784 loss_iou = 1.73858 loss_seg = 0.14001 loss_class_cls = 0.13636 Train Time: 00:21:11 
$$VALID$$ epoch 10 ====>: loss_cls = 1.49616 loss_reg_xytl = 0.01216 loss_iou = 1.60697 loss_seg = 0.13105 loss_class_cls = 0.13335 Val Time: 00:00:20 
$$TRAIN$$ epoch 11 ====>: loss_cls = 1.59138 loss_reg_xytl = 0.01954 loss_iou = 1.70157 loss_seg = 0.13825 loss_class_cls = 0.13628 Train Time: 00:23:13 
$$VALID$$ epoch 11 ====>: loss_cls = 1.37387 loss_reg_xytl = 0.01493 loss_iou = 1.72290 loss_seg = 0.14186 loss_class_cls = 0.13325 Val Time: 00:00:20 
$$TRAIN$$ epoch 12 ====>: loss_cls = 1.56931 loss_reg_xytl = 0.01929 loss_iou = 1.69895 loss_seg = 0.13726 loss_class_cls = 0.13621 Train Time: 00:24:55 
$$VALID$$ epoch 12 ====>: loss_cls = 1.47095 loss_reg_xytl = 0.01358 loss_iou = 1.64010 loss_seg = 0.12568 loss_class_cls = 0.13314 Val Time: 00:00:21 
$$TRAIN$$ epoch 13 ====>: loss_cls = 1.47089 loss_reg_xytl = 0.01883 loss_iou = 1.69151 loss_seg = 0.13617 loss_class_cls = 0.13627 Train Time: 00:26:49 
$$VALID$$ epoch 13 ====>: loss_cls = 1.37469 loss_reg_xytl = 0.01444 loss_iou = 1.57538 loss_seg = 0.13452 loss_class_cls = 0.13308 Val Time: 00:00:20 
$$TRAIN$$ epoch 14 ====>: loss_cls = 1.39732 loss_reg_xytl = 0.01801 loss_iou = 1.66951 loss_seg = 0.13488 loss_class_cls = 0.13614 Train Time: 00:28:04 
$$VALID$$ epoch 14 ====>: loss_cls = 1.22657 loss_reg_xytl = 0.01389 loss_iou = 1.66898 loss_seg = 0.14039 loss_class_cls = 0.13286 Val Time: 00:00:21 
$$TRAIN$$ epoch 15 ====>: loss_cls = 1.30442 loss_reg_xytl = 0.01737 loss_iou = 1.69497 loss_seg = 0.13358 loss_class_cls = 0.13607 Train Time: 00:29:14 
$$VALID$$ epoch 15 ====>: loss_cls = 1.25604 loss_reg_xytl = 0.01460 loss_iou = 1.65997 loss_seg = 0.12326 loss_class_cls = 0.13268 Val Time: 00:00:20 
$$TRAIN$$ epoch 16 ====>: loss_cls = 1.32521 loss_reg_xytl = 0.01644 loss_iou = 1.70964 loss_seg = 0.13379 loss_class_cls = 0.13590 Train Time: 00:30:58 
$$VALID$$ epoch 16 ====>: loss_cls = 1.28813 loss_reg_xytl = 0.01189 loss_iou = 1.62254 loss_seg = 0.13013 loss_class_cls = 0.13239 Val Time: 00:00:20

The training time is increasing every epoch. I also went through it with ChatGPT and made the following modifications, but in the end the results were the same. The modifications are:

changing the optimizer
changing the lr scheduler
freezing some of the first layers of the backbone (roughly as in the sketch after this list)
changing the weights of the losses
removing some of the losses (loss_class_cls and loss_seg)
changing the number of workers and batch_size
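
For reference, the layer freezing was done roughly like this (a sketch; the dla34 stage names and the cutoff are illustrative):

    # freeze the earliest backbone stages so their weights are not updated
    for name, param in model.backbone.named_parameters():
        if name.startswith(("base_layer", "level0", "level1", "level2")):
            param.requires_grad = False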
But the results were exactly the same and the training time kept increasing (running on a GPU on Google Colab), so I desperately need some suggestions on how to solve this problem.

None of these changes would explain why the iteration time increases. I would expect a lower overall iteration time (e.g. if early layers were frozen), but no change in the way the iteration time grows over epochs.

To your actual issue: are you also seeing an increase in memory usage on the GPU? If so, you might accidentally be appending each iteration to an ever larger computation graph, which would increase not only the memory usage but also the iteration time. I don’t have your code, but if that’s the case I would expect to see .backward(retain_graph=True) calls.
If not, could you check the clock frequencies of your system (you could start with the GPU) to see if you might be running into thermal issues?
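
E.g. printing the allocated memory once per epoch (or every N iterations) would show if it keeps growing (just a quick sketch):

    import torch

    # quick check inside the training loop; steadily growing numbers point to a stored computation graph
    print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB, "
          f"reserved: {torch.cuda.memory_reserved() / 1024**2:.1f} MB")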


Thanks, that was the problem. Thank you for your precise answer and for the time you spared to help me.
I have been struggling with this for a week.


I checked the original model and saw that it also uses .backward(retain_graph=True), and it works fine there without the training time increasing, but in my modified model the training time does increase. Since I have several losses, I can’t compute the backward passes without retain_graph=True. I tried to detach and clone the predictions and then pass them through the loss functions, but that didn’t work: the model wasn’t learning anything because no gradients were passed. So I tried to combine the losses before .backward and then do a single backward, but that gives me this error:

Although all the loss values have gradients attached, after two batches of data this happens:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

I don’t understand why this happens for my model 2 while model 1 works just fine; I run both models on Google Colab under the same conditions. Could you please guide me on how to handle this?

I have an idea about using:

    loss_cls = output['loss_cls']
    loss_reg_xytl = output['loss_reg_xytl']
    loss_iou = output['loss_iou']
    loss_class_cls = output['loss_class_cls']
    loss_seg = output['loss_seg']

    loss_cls.backward(retain_graph=True)
    loss_reg_xytl.backward(retain_graph=True)
    loss_iou.backward(retain_graph=True)
    loss_class_cls.backward(retain_graph=True)
    loss_seg.backward(retain_graph=False)

In this way the graph is freed at the end of the last loss, and this is working, but ChatGPT suggests using the combined losses and that doesn’t work; I don’t understand why.
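
What I tried for the combined version is roughly this (simplified; no loss weights shown):

    # sum the losses into a single scalar and call backward once, so no retain_graph is needed
    total_loss = loss_cls + loss_reg_xytl + loss_iou + loss_class_cls + loss_seg
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()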

Using retain_graph=True as a workaround for the actual error:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

is usually wrong unless you can explain why you want to retain the graph.

To debug the issue further you need to check where your computation graph is reused, extended, etc.
E.g. here is a simple code snippet showing the error caused by appending the outputs of the model to a new tensor and trying to backpropagate through both tensors:

import torch
import torchvision.models as models

model = models.resnet18()
x = torch.randn(1, 3, 224, 224)

outputs = torch.tensor([])

# 1st iter
out = model(x)
outputs = out

out.mean().backward() # this call computes the gradients and deletes the computation graph of the corresponding forward pass

# 2nd iter
out = model(x)
outputs = torch.stack((outputs, out))

# this will fail since the backward pass now tries to backpropagate through the computation graphs of the 2nd and 1st iterations!
outputs.mean().backward()
# RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad().

I found the problem: I was storing too many embeddings on the device, so I fixed it. Thanks for the tip.
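
Roughly, the fix was to stop keeping them on the GPU with their graphs attached before storing them (simplified; the variable names are illustrative):

    # before: embeddings were stored on the GPU with their computation graphs still attached
    # after: detach and move to CPU before keeping them around
    stored_embeddings.append(embedding.detach().cpu())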

Would you mind taking a look at this question too and helping me further?
CNN Model is not learning after some epochs