Hi,
I’m trying to integrate a ResNet50 backbone into my hand-rolled Faster R-CNN implementation. I’m using the torchvision resnet50 model. As I train, by the end of the first epoch, training speed declines from an initial 18 it/sec to ~8 it/sec and by the second epoch hits (and remains at) 3 it/sec. I have no idea what is causing this nor how to investigate. Using VGG-16 for the backbone layers doesn’t have this issue.
I’m plucking out the relevant layers as follows (the “Backbone” object is my own and basically consists of the feature extractor layers and the necessary classifier layer for the detector stage that follows the RoI pool operation):
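from math import ceil

import torch.nn as nn
import torch.nn.functional as F
import torchvision

class PoolToFeatureVector(nn.Module):
    def __init__(self, resnet):
        super().__init__()
        self._layer4 = resnet.layer4

    def forward(self, rois):
        y = self._layer4(rois)  # (N, 1024, 7, 7) -> (N, 2048, 4, 4)
        #y = y.mean(-1).mean(-1)    # use mean to remove last two dimensions -> (N, 2048)
        # Flatten from dim 1 rather than squeeze() so that a single RoI (N = 1) keeps its batch dimension -> (N, 2048)
        y = F.adaptive_max_pool2d(y, output_size = 1).flatten(start_dim = 1)
        return y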
class FeatureExtractor(nn.Module):
    def __init__(self, resnet):
        super().__init__()

        # Feature extractor layers
        self._feature_extractor = nn.Sequential(
            resnet.conv1,
            resnet.bn1,
            resnet.relu,
            resnet.maxpool,
            resnet.layer1,
            resnet.layer2,
            resnet.layer3
        )

        # Freeze initial layers
        self._freeze(self._feature_extractor[0])
        self._freeze(self._feature_extractor[1])
        self._freeze(self._feature_extractor[4])

    def forward(self, image_data):
        y = self._feature_extractor(image_data)
        return y

    @staticmethod
    def _freeze(layer):
        for name, parameter in layer.named_parameters():
            parameter.requires_grad = False
class ResNetBackbone(Backbone):
    def __init__(self):
        super().__init__()

        # Backbone properties
        self.feature_map_channels = 1024    # feature extractor output channels
        self.feature_pixels = 16            # ResNet feature maps are 1/16th of the original image size, similar to VGG-16 feature extractor
        self.feature_vector_size = 2048     # linear feature vector size after pooling

        # Construct model and pre-load with ImageNet weights
        resnet = torchvision.models.resnet50(weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
        print("Loaded IMAGENET1K_V1 pre-trained weights for Torchvision ResNet50 feature extractor")

        # Feature extractor: given image data of shape (batch_size, channels, height, width),
        # produces a feature map of shape (batch_size, 1024, ceil(height/16), ceil(width/16))
        self.feature_extractor = FeatureExtractor(resnet = resnet)

        # Conversion of pooled features to head input
        self.pool_to_feature_vector = PoolToFeatureVector(resnet = resnet)

    def compute_feature_map_shape(self, image_shape):
        """
        Computes feature map shape given input image shape. Unlike VGG-16, ResNet
        convolutional layers use padding and the resultant dimensions are therefore
        not simply an integral division by 16. The calculation here works well
        enough but it is not guaranteed that the simple conversion of feature map
        coordinates to input image pixel coordinates in anchors.py is absolutely
        correct.

        Parameters
        ----------
        image_shape : Tuple[int, int, int]
            Shape of the input image, (channels, height, width). Only the last two
            dimensions are relevant, allowing image_shape to be either the shape
            of a single image or the entire batch.

        Returns
        -------
        Tuple[int, int, int]
            Shape of the feature map produced by the feature extractor,
            (feature_map_channels, feature_map_height, feature_map_width).
        """
        image_width = image_shape[-1]
        image_height = image_shape[-2]
        return (self.feature_map_channels, ceil(image_height / self.feature_pixels), ceil(image_width / self.feature_pixels))
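To sanity-check the ceil(size / 16) shape calculation, here is a minimal sketch that applies the standard convolution output-size formula to each strided layer up to layer3. It assumes the usual torchvision ResNet-50 hyperparameters (7x7 stride-2 conv1 with padding 3, 3x3 stride-2 max pool with padding 1, and one 3x3 stride-2 padded convolution in each of layer2 and layer3). The helper names conv_out and resnet50_feature_size are made up for illustration and are not part of my implementation. It only checks the feature map size, not the mapping of feature map coordinates back to image pixels.

```python
from math import ceil

def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet50_feature_size(size):
    """Spatial size after conv1, maxpool, layer1, layer2 and layer3 (assumed hyperparameters)."""
    size = conv_out(size, kernel = 7, stride = 2, padding = 3)  # conv1
    size = conv_out(size, kernel = 3, stride = 2, padding = 1)  # maxpool
    # layer1 has stride 1 and preserves the spatial size
    size = conv_out(size, kernel = 3, stride = 2, padding = 1)  # layer2, first block
    size = conv_out(size, kernel = 3, stride = 2, padding = 1)  # layer3, first block
    return size

# Compare against the ceil(size / 16) shortcut over a range of image sizes
mismatches = [n for n in range(1, 2049) if resnet50_feature_size(n) != ceil(n / 16)]
print("sizes where ceil(n / 16) disagrees:", mismatches)
```

Under those assumptions every strided layer computes ceil(n / 2), so the list of mismatches is empty and the shortcut in compute_feature_map_shape matches the layer-by-layer result.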
In the end, the ResNetBackbone.feature_extractor and .pool_to_feature_vector members are used in my Faster R-CNN implementation. Any ideas on how I can debug what’s going on here?
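The only approach I can think of is to time each stage of the training step separately and watch which one slows down over the epoch. Below is a rough sketch of that idea, not code from my implementation: StageTimer and its sync argument are hypothetical names. On a GPU, sync would presumably need to be torch.cuda.synchronize, since kernels run asynchronously and the time would otherwise be attributed to the wrong stage.

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self, sync = None):
        # Optional callable run before each clock read, e.g. torch.cuda.synchronize
        self._sync = sync
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def measure(self, name, fn, *args, **kwargs):
        """Calls fn(*args, **kwargs), records its duration under name, returns its result."""
        if self._sync is not None:
            self._sync()
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        if self._sync is not None:
            self._sync()
        self.totals[name] += time.perf_counter() - start
        self.counts[name] += 1
        return result

    def mean_ms(self, name):
        """Mean duration of the named stage in milliseconds."""
        return 1000.0 * self.totals[name] / self.counts[name]

    def reset(self):
        """Clears accumulated timings, e.g. every few hundred iterations."""
        self.totals.clear()
        self.counts.clear()
```

The idea would be to wrap calls such as timer.measure("feature_extractor", backbone.feature_extractor, images), print mean_ms for each stage every few hundred iterations, and then call reset, so that a stage whose time grows during the epoch stands out.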
Separately, and probably worthy of its own post, ResNet-50 converges more slowly than VGG-16 and never achieves anywhere near the same mAP score (71% with VGG-16, ~60% with ResNet using the same training schedule), which is the opposite of what I would expect.
Thanks!
– Bart