ResNet training speed degrades quickly over time


I’m trying to integrate a ResNet50 backbone into my hand-rolled Faster R-CNN implementation, using the torchvision resnet50 model. Training speed declines from an initial 18 it/sec to ~8 it/sec by the end of the first epoch, and by the second epoch it hits (and remains at) 3 it/sec. I have no idea what is causing this or how to investigate. Using VGG-16 for the backbone layers doesn’t have this issue.

I’m plucking out the relevant layers as follows (the “Backbone” object is my own and basically consists of the feature extractor layers and the necessary classifier layer for the detector stage that follows the RoI pool operation):

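import torch.nn as nn
import torch.nn.functional as F
import torchvision
from math import ceil

class PoolToFeatureVector(nn.Module):
  def __init__(self, resnet):
    super().__init__()  # required before assigning submodules to an nn.Module
    self._layer4 = resnet.layer4

  def forward(self, rois):
    y = self._layer4(rois)  # (N, 1024, 7, 7) -> (N, 2048, 4, 4)
    #y = y.mean(-1).mean(-1) # use mean to remove last two dimensions -> (N, 2048)
    # Max-pool to 1x1, then flatten -> (N, 2048). flatten(start_dim = 1) keeps the
    # batch dimension even when N == 1, which a bare squeeze() would drop.
    y = F.adaptive_max_pool2d(y, output_size = 1).flatten(start_dim = 1)
    return y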

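class FeatureExtractor(nn.Module):
  def __init__(self, resnet):
    super().__init__()

    # Feature extractor layers. NOTE: the layer list was cut off in the post; the
    # stem through layer3 is assumed here because it matches the 1024-channel,
    # 1/16-scale output described in ResNetBackbone.
    self._feature_extractor = nn.Sequential(
      resnet.conv1,
      resnet.bn1,
      resnet.relu,
      resnet.maxpool,
      resnet.layer1,
      resnet.layer2,
      resnet.layer3
    )

    # Freeze initial layers (the calls to self._freeze(...) were cut off in the post)

  def forward(self, image_data):
    y = self._feature_extractor(image_data)
    return y

  @staticmethod
  def _freeze(layer):
    for name, parameter in layer.named_parameters():
      parameter.requires_grad = False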

class ResNetBackbone(Backbone):
  def __init__(self):
    super().__init__()  # assumes Backbone.__init__ takes no arguments

    # Backbone properties
    self.feature_map_channels = 1024  # feature extractor output channels
    self.feature_pixels = 16          # ResNet feature maps are 1/16th of the original image size, similar to VGG-16 feature extractor
    self.feature_vector_size = 2048   # linear feature vector size after pooling

    # Construct model and pre-load with ImageNet weights
    resnet = torchvision.models.resnet50(weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
    print("Loaded IMAGENET1K_V1 pre-trained weights for Torchvision ResNet50 feature extractor")

    # Feature extractor: given image data of shape (batch_size, channels, height, width),
    # produces a feature map of shape (batch_size, 1024, ceil(height/16), ceil(width/16))
    self.feature_extractor = FeatureExtractor(resnet = resnet)

    # Conversion of pooled features to head input
    self.pool_to_feature_vector = PoolToFeatureVector(resnet = resnet)

  def compute_feature_map_shape(self, image_shape):
    """
    Computes feature map shape given input image shape. Unlike VGG-16, ResNet
    convolutional layers use padding and the resultant dimensions are therefore
    not simply an integral division by 16. The calculation here works well
    enough but it is not guaranteed that the simple conversion of feature map
    coordinates to input image pixel coordinates is absolutely correct.

    Parameters
    ----------
    image_shape : Tuple[int, int, int]
      Shape of the input image, (channels, height, width). Only the last two
      dimensions are relevant, allowing image_shape to be either the shape
      of a single image or the entire batch.

    Returns
    -------
    Tuple[int, int, int]
      Shape of the feature map produced by the feature extractor,
      (feature_map_channels, feature_map_height, feature_map_width).
    """
    image_width = image_shape[-1]
    image_height = image_shape[-2]
    return (self.feature_map_channels, ceil(image_height / self.feature_pixels), ceil(image_width / self.feature_pixels))
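
The ceil(height / 16) calculation can be sanity-checked against the standard convolution output-size formula. This is a minimal sketch that assumes the downsampling path up to layer3 consists of four stride-2 stages (conv1 with kernel 7 and padding 3, maxpool with kernel 3 and padding 1, and the 3x3, padding-1 stride-2 convolutions in layer2 and layer3). The names conv_out, STRIDE2_STAGES and feature_map_size are made up for illustration.

```python
from math import ceil

def conv_out(size, kernel, stride, padding):
    # Standard output-size formula for conv/pool layers (dilation 1, floor mode)
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) of the assumed stride-2 stages up to layer3
STRIDE2_STAGES = [(7, 2, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

def feature_map_size(size):
    # Apply each stride-2 stage in turn to a single spatial dimension
    for kernel, stride, padding in STRIDE2_STAGES:
        size = conv_out(size, kernel, stride, padding)
    return size

# Compare against the backbone's ceil(size / 16) rule over a range of image sizes
mismatches = [s for s in range(32, 2049) if feature_map_size(s) != ceil(s / 16)]
print("mismatches:", mismatches)
```

Under these assumptions the list comes out empty, so the shape itself matches ceil(size / 16). That says nothing about whether feature map coordinates map back to input pixel coordinates exactly, which is the caveat raised in the docstring.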

In the end, the ResNetBackbone.feature_extractor and .pool_to_feature_vector members are used in my Faster R-CNN implementation. Any ideas on how I can debug what’s going on here?
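
One way to narrow this down is to time each part of the training step separately and watch which part grows as the epoch progresses. The sketch below is only an illustration: SectionTimer and its methods are hypothetical names, not part of PyTorch. CUDA kernels launch asynchronously, so the timer accepts an optional sync callable (for example torch.cuda.synchronize) that it calls before reading the clock.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class SectionTimer:
    """Accumulates wall-clock time per named section of the training loop."""

    def __init__(self, sync=None, clock=time.perf_counter):
        self._sync = sync or (lambda: None)  # e.g. torch.cuda.synchronize
        self._clock = clock
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def section(self, name):
        self._sync()  # make sure earlier GPU work is not billed to this section
        start = self._clock()
        try:
            yield
        finally:
            self._sync()  # wait for this section's GPU work to finish
            self.totals[name] += self._clock() - start
            self.counts[name] += 1

    def report_and_reset(self):
        # Mean seconds per call for each section since the last report
        means = {name: self.totals[name] / self.counts[name] for name in self.totals}
        self.totals.clear()
        self.counts.clear()
        return means
```

In a training loop this could wrap data loading, the forward pass, the backward pass and the optimizer step in separate `with timer.section("...")` blocks, printing `timer.report_and_reset()` every few hundred iterations to see which mean time climbs.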

Separately, and probably worthy of its own post, ResNet-50 converges more slowly than VGG-16 and never achieves anywhere near the same mAP score (71% with VGG-16, ~60% with ResNet using the same training schedule), which is the opposite of what I would expect.


– Bart

I would start by checking the system’s health status to see whether any clocks are being reduced because the system is overheating.
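
A rough sketch of such a check, assuming an NVIDIA GPU with nvidia-smi on the PATH. The query field names are the ones I would expect `nvidia-smi --help-query-gpu` to list, so verify them against your driver version. The helper names parse_gpu_status and query_gpu_status are made up for illustration.

```python
import subprocess

# Fields to poll; check `nvidia-smi --help-query-gpu` for what your driver supports
QUERY_FIELDS = ["temperature.gpu", "clocks.sm", "clocks.max.sm", "utilization.gpu"]

def parse_gpu_status(csv_text):
    # One CSV row per GPU, values in the same order as QUERY_FIELDS
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows

def query_gpu_status():
    # Runs nvidia-smi once and returns a list of per-GPU dictionaries
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(command, capture_output=True, text=True, check=True).stdout
    return parse_gpu_status(output)
```

Calling query_gpu_status() periodically during training and logging the result next to the it/sec figure would show whether the SM clock drops or the temperature rises at the point where throughput falls.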

All good on that front. CUDA utilization begins to decline after about 40% of the epoch has been processed. This doesn’t happen with the VGG-16 backbone (I’ve implemented both my own from scratch as well as an option to just pick layers from Torchvision’s implementation).

The only difference between the ResNet and VGG-16 backbones is that some different layers are used. I’m being careful not to create any tensors that may hang around outside the training loop. I do believe the slowdown is somehow linked to the new backbone’s poor training performance as well, but I’m not yet sure where the problem is.

One peculiarity is that no matter how many of the backbone layers I freeze, there is no effect. I’ve confirmed by manually inspecting weights before and after training that I am freezing layers correctly.
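
As a complement to inspecting weights by hand, the freezing can also be checked by counting parameters by their requires_grad flag. This is a small sketch with a hypothetical helper name, summarize_trainable. It accepts any iterable of (name, parameter) pairs, such as the result of a module's named_parameters().

```python
def summarize_trainable(named_parameters):
    """Returns (trainable_count, frozen_count, frozen_names) for an iterable of
    (name, parameter) pairs, e.g. module.named_parameters()."""
    trainable = 0
    frozen = 0
    frozen_names = []
    for name, parameter in named_parameters:
        if parameter.requires_grad:
            trainable += parameter.numel()
        else:
            frozen += parameter.numel()
            frozen_names.append(name)
    return trainable, frozen, frozen_names
```

Running it on the feature extractor's named_parameters() with different numbers of frozen layers should show the counts shift. If they do and iteration time still does not change, that would suggest the time is being spent somewhere other than the backbone's weight gradients.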