Change model output shape

Is there a way that I can have my model output a specific shape? For example, my input (ground truth) has a shape of [15, 3], but the output (x_kp) from the model has a shape of [15, 1].

import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, num_classes, batch_size):
        super(Model, self).__init__()
        self.conv1 = nn.Conv3d(in_channels=256, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.conv2 = nn.Conv3d(in_channels=512, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.conv3 = nn.Conv3d(in_channels=512, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.conv4 = nn.Conv3d(in_channels=512, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.conv5 = nn.Conv3d(in_channels=512, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.conv6 = nn.Conv3d(in_channels=512, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.conv7 = nn.Conv3d(in_channels=512, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.conv8 = nn.Conv3d(in_channels=512, out_channels=512, kernel_size=(3, 3, 3), stride=(1, 1, 1))
        self.cls_score = nn.Linear(471040, num_classes)   # classification head
        self.kp_score = nn.Linear(471040, 17*batch_size)   # keypoint regression head
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = F.relu(self.conv5(x))
        x = F.relu(self.conv6(x))
        x = F.relu(self.conv7(x))
#         x = F.relu(self.conv8(x))
        x = x.flatten(start_dim=1)
        x_label = F.relu(self.cls_score(x))
        x_kp = F.relu(self.kp_score(x))
        return x_label, x_kp
    
model = Model(num_classes=2, batch_size=batch_size).to(device)

My criterion and loss are calculated using the following:

pred_label, pred_kp = model(cat_features)

kp_criterion = nn.SmoothL1Loss()
label_criterion = nn.CrossEntropyLoss()

kp_loss = kp_criterion(torch.mean(pred_kp), torch.mean(cat_kp).to(device))
label_loss = label_criterion(pred_label, crossing_labels)

The output shape of [15, 1] is a bit weird, since it should be [batch_size, 17*batch_size] based on your model definition.

You can define the output shape via the out_features of the linear layer.

That being said, it’s also unusual to define a specific shape relative to the batch size, as the model definition is independent of the batch size in a standard scenario.
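As a quick sketch (with a placeholder in_features, not your exact model), out_features sets the per-sample output size, independent of the batch size, and the result can be viewed into whatever shape you need afterwards:

import torch
import torch.nn as nn

in_features = 471040                     # placeholder; must match the flattened conv output
kp_score = nn.Linear(in_features, 51)    # e.g. 51 values per sample (17 keypoints x 3)

x = torch.randn(4, in_features)          # works for any batch size
out = kp_score(x)
print(out.shape)                         # torch.Size([4, 51])
print(out.view(-1, 17, 3).shape)         # torch.Size([4, 17, 3])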

I’m probably doing something wrong here. But this is my workflow:

  1. Extract features with features = pose_estimator.backbone(img_list.tensor.float()). The shape of the extracted features is torch.Size([1, 256, 272, 480]) via print(features[0].shape). Each image's features are appended to a list called features_list.
  2. Then I concatenate the images using:
    cat_features = torch.cat([i[0].cpu() for i in features_list], 0). The shape is torch.Size([3, 256, 34, 60]), where 3 indicates that the features of 3 images were concatenated (i.e. the batch size).
  3. Then I apply cat_features = cat_features.transpose(1, 0).unsqueeze(0).to(device) to get the format I need: torch.Size([1, 256, 3, 272, 480]), where the batch size is now 1, since I want the concatenated features to be seen as a single batch by the model, and 3 represents the depth (i.e. 3 images); see the sketch after this list.
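To illustrate, here is a rough sketch of steps 2 and 3 with dummy tensors (assuming each backbone output has the [1, 256, 272, 480] shape from step 1 and leaving out the .cpu()/indexing details of my actual pipeline):

import torch

# Dummy stand-ins for the three per-image feature maps (a few hundred MB of RAM).
features_list = [torch.randn(1, 256, 272, 480) for _ in range(3)]

# Step 2: concatenate along dim 0 -> one tensor holding all three images.
cat_features = torch.cat(features_list, 0)
print(cat_features.shape)   # torch.Size([3, 256, 272, 480])

# Step 3: move channels first, add a batch dim of 1 so the 3 images become the depth.
cat_features = cat_features.transpose(1, 0).unsqueeze(0)
print(cat_features.shape)   # torch.Size([1, 256, 3, 272, 480])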

I hope that so far it’s all making sense.

What I would like to do is use these features to predict future actions of pedestrians. So, for that, I also obtain keypoints using the keypoint_rcnn model. Each keypoint tensor has a shape of torch.Size([1, 17, 3]), and when I concatenate the keypoints of the 3 images (as I did with the image features in step 2) using cat_kp = torch.cat(kypnt_instances[i]['pred_keypoints'], 0).to(device), I get a shape of torch.Size([51, 3]).

When I run my model, pred_label, pred_kp = model(cat_features), the shape of pred_kp is torch.Size([51, 1]), so when I try to calculate the loss with kp_loss = kp_criterion(pred_kp, cat_kp), I get an error saying that pred_kp and cat_kp are not the same size.

I think the issue is that I want the model to see it as a batch size of 1, but output the keypoints as a batch of 3, one for each of the images, which doesn’t make sense now that I think about it. Is there a way that I can combine/reshape cat_kp so that it is torch.Size([51, 1]) instead of torch.Size([51, 3]) without losing too much feature quality?

Sorry for the long post and I hope that it all makes some sense.

Based on your description, the error is raised since nn.SmoothL1Loss expects the output and target to have the same shape.
If you want to use a batch size of 1 and use 3 different images with 17 points each, you could make sure the output and target have the shape [batch_size, 3, 17] or you could also flatten the points to [batch_size, 51]. The calculated loss would be the same.
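A small sketch of that last point, with random stand-in tensors rather than your actual predictions:

import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()

pred = torch.randn(1, 3, 17)
target = torch.randn(1, 3, 17)

loss_a = criterion(pred, target)                           # shapes [1, 3, 17]
loss_b = criterion(pred.view(1, 51), target.view(1, 51))   # flattened to [1, 51]
print(loss_a.item(), loss_b.item())                        # identical values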

I don’t know, as I’m not familiar with your use case.
However, I would recommend fixing the shape issues and keeping the model “idea” unchanged for now.

Thank you for the quick reply.

If you want to use a batch size of 1 and use 3 different images with 17 points each, you could make sure the output and target have the shape [batch_size, 3, 17].

Sorry, probably a silly question, but how would I make sure that the output is [batch_size, 3, 17]? That is, how would I change self.kp_score = nn.Linear(66078720, 17, 3) to get the output that I require? I hope that is what you meant with your suggestion.

you could also flatten the points to [batch_size, 51]

So, I followed your advice to flatten the keypoints, since I couldn’t figure out how to implement your first suggestion (“you could make sure the output and target have the shape [batch_size, 3, 17]”), probably a misunderstanding on my part about how to do it.

I did it by setting self.kp_score = nn.Linear(66078720, 17*3*3), where 17 is the number of keypoints, 3 is the number of images, and the other 3 is there to match the predicted keypoints shape [3, 51]. I then flattened both tensors with cat_kp = cat_kp.flatten(start_dim=0) and pred_kp = pred_kp.flatten(start_dim=0).

However, there are 2 “issues” that I am having now:

  1. The kp_loss is very high (over 200). I’m not sure if that’s related to problem number 2, as I can’t begin to train the model.
  2. I get an error saying the GPU is out of memory. I have an RTX 2080 Ti with 11GB, which I thought would be able to handle this. Is it because I am using Conv3d? It works fine when training for detection with a batch size of 8 images.
  1. Could you flatten the tensors starting in dim1 (see the sketch after these points)? This might not change the high loss value (it depends on the used criterion), but it would most likely fit the expected input shape of [batch_size, *]. I would recommend checking some outputs and targets manually and making sure that the values are in the expected range and that no unwanted permutation was applied somewhere in the code.

  2. Your model might create large intermediate tensors, which would cause this OOM issue. The number of input features in the linear layer is also relatively large, which also points towards high memory usage.
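A quick sketch of the flattening in point 1, using an illustrative [1, 3, 17, 3] tensor (batch, images, keypoints, coordinates) rather than your real data:

import torch

kp = torch.randn(1, 3, 17, 3)

print(kp.flatten(start_dim=1).shape)   # torch.Size([1, 153]) -> keeps the batch dim, [batch_size, *]
print(kp.flatten(start_dim=0).shape)   # torch.Size([153])    -> batch dim is lost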

Thank you again for your suggestions.

This doesn’t seem to make a difference to the high loss value. However, I have tried changing the optimizer from SGD to Adam, and the loss values have begun to drop (although they are still really high).

When I view the outputs and targets, they look like they’re within the expected range (I don’t see any large deviations in the predicted keypoints compared to the ground-truth values).

You were right. I adjusted my model by adding some more convolutional layers, etc., and now the model is working.