Keypoint detection with R-CNN feature extraction backbone

I'm training a keypoint detection model using the built-in PyTorch R-CNN class. It requires a backbone feature-extraction network. I got decent results using EfficientNet and ConvNeXt backbones, but would like to try other architectures, such as one of the built-in vision transformers. The model works when I access the EfficientNet or ConvNeXt ".features" attribute; if I understand correctly, this attribute accesses the network without the top/classification layer. I managed to access the corresponding layer of the transformer using PyTorch's feature_extraction method, but I am not able to use it as a backbone (I get incorrect dimension errors that change each time I run it). I realize this post is light on details. Please let me know what additional information would be helpful. Any thoughts greatly appreciated.

Check the shape of the output features of the feature-extractor models in the working configs and compare it to the feature shape in your failing use case.
Based on your description it seems that the current model outputs features which cause shape mismatches and which also change between runs. If so, you might need an adaptive pooling layer to make sure the features have a static shape.
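
For reference, here is a quick way to do that comparison (a minimal sketch, assuming the same 224x224 inputs and e.g. the convnext_large config from your working runs):

import torch
import torchvision

x = torch.randn(2, 3, 224, 224)

# working config: .features returns a 4D feature map, which KeypointRCNN expects
convnext = torchvision.models.convnext_large(weights='DEFAULT').features.eval()
with torch.no_grad():
    print(convnext(x).shape)  # torch.Size([2, 1536, 7, 7])

# full ViT: the forward pass ends in classification logits, not a spatial map
vit = torchvision.models.vit_b_16(weights='DEFAULT').eval()
with torch.no_grad():
    print(vit(x).shape)       # torch.Size([2, 1000])

If the spatial size of the extracted features varies between runs, an nn.AdaptiveAvgPool2d((7, 7)) (for example) appended to the backbone would force a static shape.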

Here's the structure of the model I'm using, with vit_b_16 as the backbone to the KeypointRCNN:

import torch.nn as nn
import torchvision
from torchvision.models.detection import KeypointRCNN
from torchvision.models.detection.rpn import AnchorGenerator


def get_model(num_keypoints):

    # backbone = torchvision.models.convnext_large(weights='DEFAULT').features
    # backbone.out_channels = 1536  # 1536 for convnext_large

    model = torchvision.models.vit_b_16(weights='DEFAULT')
    backbone = nn.Sequential(*list(model.children())[:-1])
    backbone.out_channels = 768

    anchor_generator = AnchorGenerator(sizes=((64, 128, 256),),
                                       aspect_ratios=((0.83, 1.7, 1.2),))  # these need to be tweaked for the very large dataset

    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                    output_size=7,
                                                    sampling_ratio=2)
    keypoint_roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                             output_size=14,
                                                             sampling_ratio=2)

    model = KeypointRCNN(backbone,
                         num_classes=2,
                         rpn_anchor_generator=anchor_generator,
                         box_roi_pool=roi_pooler,
                         keypoint_roi_pool=keypoint_roi_pooler,
                         num_keypoints=num_keypoints)

    return model

When I train the model I get this error. Input images are of size (224, 224, 3), batch size 2.

AssertionError: Expected (batch_size, seq_length, hidden_dim) got torch.Size([2, 768, 46, 46])

Thanks for the code snippet.
Based on your code I see this error:

model.eval()
x = torch.randn(2, 3, 224, 224)
out = model(x)
# AssertionError: Expected (batch_size, seq_length, hidden_dim) got torch.Size([2, 768, 50, 50])

which I guess might come from rewrapping the VisionTransformer model into an nn.Sequential container.
This step would remove all functional API calls from the original forward method of VisionTransformer.
The assert is raised in the Encoder class, which is now missing the reshape (and permute) from the _process_input call as well as the torch.cat of the class token.
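
To make that concrete, here is a rough paraphrase of the steps the original forward performs (simplified, not the exact torchvision source):

import torch
import torchvision

vit = torchvision.models.vit_b_16(weights='DEFAULT').eval()

# roughly what VisionTransformer.forward does internally (simplified paraphrase)
def vit_forward(model, x):
    x = model._process_input(x)                         # conv_proj + flatten + permute -> (B, 196, 768)
    cls = model.class_token.expand(x.shape[0], -1, -1)
    x = torch.cat([cls, x], dim=1)                      # prepend the class token -> (B, 197, 768)
    x = model.encoder(x)                                # expects (batch_size, seq_length, hidden_dim)
    return model.heads(x[:, 0])

with torch.no_grad():
    print(vit_forward(vit, torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1000])

nn.Sequential(*vit.children()) only chains the registered submodules (conv_proj, encoder, heads), so the flatten/permute and the torch.cat above are lost and the encoder receives the raw (B, 768, H/16, W/16) conv output, which matches the shape in your error message.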

How can I access the feature-extraction layers (which KeypointRCNN() requires as a backbone) without wrapping them in nn.Sequential?

Maybe you could use a forward hook or torchvision's create_feature_extractor method as described in the feature-extraction tutorial.
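
Something along these lines could work as a starting point with create_feature_extractor (just a sketch; the 'encoder.ln' node name and the token reshape assume vit_b_16 with 16x16 patches, so double-check the available node names with get_graph_node_names in your torchvision version):

import torch
import torch.nn as nn
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor, get_graph_node_names


class ViTBackbone(nn.Module):
    """Sketch: expose vit_b_16 encoder tokens as a 2D feature map for KeypointRCNN."""

    def __init__(self):
        super().__init__()
        vit = torchvision.models.vit_b_16(weights='DEFAULT')
        # print(get_graph_node_names(vit)) to confirm the node name in your version
        self.body = create_feature_extractor(vit, return_nodes={'encoder.ln': 'feat'})
        self.out_channels = 768  # hidden_dim of vit_b_16, as KeypointRCNN expects

    def forward(self, x):
        tokens = self.body(x)['feat']     # (B, 1 + num_patches, 768)
        tokens = tokens[:, 1:]            # drop the class token
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)             # assumes a square patch grid
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)


backbone = ViTBackbone().eval()
with torch.no_grad():
    print(backbone(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 768, 14, 14])

One caveat: vit_b_16 asserts a fixed 224x224 input, while KeypointRCNN resizes images internally, so you may also need to pin its min_size/max_size so that the resized images stay at 224.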

Thanks for the response. I tried to configure it with the feature extractor. Here is the traceback I get:
AssertionError                            Traceback (most recent call last)
Cell In[65], line 8
      5 lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.3)
      7 for epoch in range(num_epochs):
----> 8 train_one_epoch(model, optimizer, data_loader_train, device, epoch, print_freq=100)
      9 lr_scheduler.step()
     10 evaluate(model, data_loader_test, device)

File /mnt/f/patella/engine.py:31, in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq, scaler)
     29 targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
     30 with torch.cuda.amp.autocast(enabled=scaler is not None):
---> 31 loss_dict = model(images, targets)
     32 losses = sum(loss for loss in loss_dict.values())
     34 # reduce losses over all GPUs for logging purposes

File ~/anaconda3/envs/insall_salvati/lib/python3.10/site-packages/torch/nn/modules/module.py:1488, in Module._call_impl(self, *args, **kwargs)
   1483 # If we don't have any hooks, we want to skip the rest of the logic in
   1484 # this function, and just call forward.
   1485 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1486         or _global_backward_pre_hooks or _global_backward_hooks
   1487         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1488 return forward_call(*args, **kwargs)
   1489 # Do not call functions when jit is used
   1490 full_backward_hooks, non_full_backward_hooks = [], []

   1126 if type(condition) is not torch.Tensor and has_torch_function((condition,)):
   1127 return handle_torch_function(_assert, (condition,), condition, message)
-> 1128 assert condition, message

AssertionError: Wrong image height! Expected 224 but got Proxy(getitem_2)!

I get the same AssertionError if I create the feature extractor with all nodes (create_feature_extractor(network, eval_return_nodes=eval_nodes[:], train_return_nodes=train_nodes[:])) or with only the last layer (create_feature_extractor(network, return_nodes=['getitem_5'])).