Keypoint RCNN training time increases per epoch


I am training a keypoint rcnn model and noticed that on every epoch, the training time increases.

For the first epoch, it takes about 32 seconds, then the next takes about 45, then it goes upto 60 and increases about a second every two epochs.

Just wanted to know if this a normal behaviour for this model.

Using the latest pytorch nightly on cuda 12.1. I did test on the stable release as well with cuda 11.7 and 11.8 but noticed the same issues.

Running on a 40 series laptop GPU with nvidia 535 drivers

To narrow down the issue properly you should profile the workload and check what exactly slows down during the later epochs. In case you are also seeing and increase in device memory usage in each iteration, check if you are storing model outputs or the loss in a list without detaching it as it would also store the entire computation graph.
If that’s not the case, my guess is that your laptop might be overheating and could throttle some clocks which might slow down the training.

This is the result of the profiler. I believe the torch.where is taking most of the cpu time which is present throughtout the Keypoint RCNN torch model.

But why would it get slower per epoch/optimization step?

Compare profiles of a fast vs. slow epoch and check where the difference is. I would also recommend using a visual profiler, such as Nsight Systems, to see the real timeline instead of a summary of the kernels.


So I tried using NSight and found that there are long paues for cudaStreamSynchronize while loading the training images. The image size is ‘1536,1536’, I push them to GPU and it gets resized internally to 512,512 in Keypoint RCNN.

I tried a few things:

  1. Resizing the image to 256,256 before passing it to keypoint RCNN. Note that the all the target data is wrong since everything is calcluated with respect to the 1536x1536 image.

In this scenario, I get a constant 14-15 seconds training time per epoch.

  1. Using non-blocking = True with 256x256, this remains the same as in 1.

  2. Using non_blocking=True with 1536x1536 image sizes, this looks like it reduces the training time a bit but there is still an increase in training time per epoch, it goes from 26 seconds on the first epoch, to 32 seconds, then on the 6th epoch its on 46 seconds.

Using non-blocking the data loads really quickly, however the cudaStreamSynchronize call moves into the model() call of keypoint rcnn

It seems your data loading might be the bottleneck. Did you check where the sync is coming from in the nsys profile?
Using non_blocking=True would allow the CPU to run ahead and not to block on the data transfer. However, if the next operation (e.g. the forward method) requires the data on the GPU you would then still need to wait until the transfer is done before the kernel can be launched.

Sorry, I am not sure if I followed it right. Is this the way to check it?

Yes, as the stacktrace shows the sync is triggered by the copy kernel, so indeed your model seems to wait for the data transfer.

Does that mean I am bound by my hardware?

Also, why would it get slower over time?


I did some more testing and found that the code slows down only when I call the backward function.

Loading the data and calling forward seem to be fast enough. When I introduce the backward call, the loader seems to get slower over time

Did you compare the profiles of both runs and checked if the sync caused by the data loading is still the bottleneck? I wouldn’t know how the backward call could slow down your data loading, so a profile might again help here.

    def train_one_epoch(self, train=True):
        loader = self.train_loader if train else self.val_loader
        with torch.set_grad_enabled(train):
            for images, targets in loader:
                torch.cuda.nvtx.range_push("Gradient Step")
                torch.cuda.nvtx.range_push("Loading Data")
                torch.cuda.nvtx.range_push("Loading Images")
                images = [ for image in images]
                torch.cuda.nvtx.range_push("Loading Targets")
                targets = [
                    {k: for k, v in t.items()} for t in targets
                torch.cuda.nvtx.range_push("Forward Pass")
                loss_dict = self.model(images, targets)
                losses = sum(loss for loss in loss_dict.values())
                if train:
                    torch.cuda.nvtx.range_push("Backward Pass")

The above screenshot is obatined by omitting losses.backward() and optimizer.step(). There is very little time spent on synchronize during the data loading process.

This is when I have the backward and step running.

Sorry, but the screenshots are not giving me enough information. Could you post a minimal and executable code snippet which would show the behavior so that I could profile it?

import torch
import torchvision
from import DataLoader
from import Dataset
from torchvision.models import ResNet50_Weights
from torchvision.models.detection.rpn import AnchorGenerator

class FakeDataset(Dataset):
    def __init__(self, num_keypoints, num_classes):
        num_images = 800
        num_instances_per_image = 3
        width, height = 1536, 1536
        self.images = torch.rand(num_images, 1, width, height)
        self.labels = torch.randint(
        bboxes_min = torch.rand(num_images, num_instances_per_image, 2) * width // 2
        bboxes_max = bboxes_min + torch.randint(
            1, width // 2, size=(num_images, num_instances_per_image, 2)
        self.bboxes =[bboxes_min, bboxes_max], dim=2)
        self.keypoints = (
            torch.rand(num_images, num_instances_per_image, num_keypoints, 2)
            * width
            // 2
        self.keypoints =
            [self.keypoints, torch.ones_like(self.keypoints)], dim=3

    def __getitem__(self, index):
        data_dict = {}
        data_dict["labels"] = self.labels[index]
        data_dict["boxes"] = self.bboxes[index]
        data_dict["keypoints"] = self.keypoints[index]
        return self.images[index], data_dict

    def __len__(self):
        return len(self.images)

def timer(func):
    def wrapper(*args, **kwargs):
        import time

        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"Time taken to run {func.__name__}: {end - start} seconds")
        return result

    return wrapper

class Trainer:
    def __init__(self, model, train_loader, val_loader, config):
        self.model =["device"])
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.optimizer = config["optimizer"]
        self.device = config["device"]
        self.num_epochs = config["num_epochs"]
        self.eval_iter = 10

    def train_one_epoch(self, train=True):

        loader = self.train_loader if train else self.val_loader
        total_loss = 0
        with torch.set_grad_enabled(train):
            for images, targets in loader:
                images = [ for image in images]
                targets = [
                    {k: for k, v in t.items()} for t in targets
                loss_dict = self.model(images, targets)
                losses = sum(loss for loss in loss_dict.values())
                total_loss += losses.item()
                if train:
        return total_loss

    def train(self):
        for epoch in range(1, self.num_epochs + 1):
            train_loss = self.train_one_epoch(train=True)
            print("Epoch: {}, Train Loss: {}".format(epoch, train_loss))
            if epoch > 1 and epoch % self.eval_iter == 0:
                val_loss = self.train_one_epoch(train=False)

if __name__ == "__main__":
    anchor_generator = AnchorGenerator(
        sizes=(32, 64, 128, 256, 512),
        aspect_ratios=(0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 4.0),
    model = torchvision.models.detection.keypointrcnn_resnet50_fpn(
        # Background is the first class, object is the second class
        fixed_size=(512, 512),

    dataset = FakeDataset(num_keypoints=10, num_classes=10)
    data_loader = DataLoader(
        collate_fn=lambda x: tuple(zip(*x)),

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)

    config = {
        "optimizer": optimizer,
        "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
        "num_epochs": 30,

    train_loader = data_loader
    val_loader = data_loader

    trainer = Trainer(model, train_loader, val_loader, config)

Added a random dataset too. After every epoch, the training time increases 4 seconds. Should be able to run the script as is


Is there any update on this?

@ptrblck Sorry, but were you able to reproduce it?