Neural networks with 2 inputs of different shapes


I would like to discuss something here, as I am going to use PyTorch to build the NN architecture. I have two sets of images: grayscale images (1 channel, size (500, 1200)) and depth images (size (40, 110)). I have prepared a dataset with the following annotation format: x1, y1, x2, y2, class, depth, where x1, y1, x2, y2 are the coordinates of the bounding box. You might be wondering why there is a depth value in the annotations. I wrote an algorithm that uses the depth image and point cloud to calculate the distance of each object from the sensor. This algorithm is applied to the pre-annotated bounding boxes, and the resulting distance value is appended to each object's annotation as described above. I would like to build a neural network that takes the grayscale and depth images as inputs and predicts the bounding boxes, classes, and depth. Using another algorithm, I have also drawn bounding boxes inside the depth image that correspond to the boxes in the grayscale image.
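For concreteness, here is one way the data described above could be wrapped in a PyTorch `Dataset`. This is only a sketch of the pairing of inputs and annotations; the class name and the assumption that everything is already loaded as tensors are my own choices, not part of the actual pipeline:

```python
import torch
from torch.utils.data import Dataset

class GrayDepthDataset(Dataset):
    """Pairs one grayscale image, one depth image, and the per-object
    annotations (x1, y1, x2, y2, class, depth) for each frame."""

    def __init__(self, gray_images, depth_images, annotations):
        # gray_images:  list of (1, 500, 1200) tensors
        # depth_images: list of (1, 40, 110) tensors
        # annotations:  list of (N_objects, 6) tensors with columns
        #               x1, y1, x2, y2, class, depth
        self.gray_images = gray_images
        self.depth_images = depth_images
        self.annotations = annotations

    def __len__(self):
        return len(self.gray_images)

    def __getitem__(self, idx):
        return self.gray_images[idx], self.depth_images[idx], self.annotations[idx]
```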

My question is twofold:

  1. My idea is to have two different subnetworks: one that does the classification and bounding-box regression, and another that predicts depth. My doubt here concerns the second subnetwork. It would be a CNN regression network that regresses over the values lying inside each bounding box and predicts a depth value. This prediction would then be compared to the ground-truth value, and an RMSE loss function could be used to train it. How do I build this subnetwork, though? Or is there a better alternative?

  2. My first idea is to use an encoder-decoder network that takes in the two inputs and outputs what's needed. My question, however, concerns the fact that the inputs are of different sizes: do I need to upsample or downsample one of the images before feeding them to the encoder-decoder network? Also, how will the network decide which features to use for classification and which for depth prediction? And how can the network focus only on the regions inside the bounding boxes of the depth image to figure out the depth value, or is there no need?
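On the size mismatch: one common alternative to resizing the raw images is to give each input its own encoder, then resample one feature map onto the other's grid with `F.interpolate` before fusing. A minimal sketch, where the strides, channel counts, and fusion point are arbitrary assumptions of mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamEncoder(nn.Module):
    """Encodes the grayscale and depth images separately, then fuses them."""

    def __init__(self):
        super().__init__()
        # The large grayscale input gets a strided encoder; the small
        # depth input keeps full resolution.
        self.gray_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(128, 128, 1)  # mix the two streams

    def forward(self, gray, depth):
        g = self.gray_enc(gray)             # (B, 64, 63, 150) for (500, 1200)
        d = self.depth_enc(depth)           # (B, 64, 40, 110)
        # Resample the depth features onto the grayscale feature grid.
        d = F.interpolate(d, size=g.shape[-2:], mode='bilinear',
                          align_corners=False)
        return self.fuse(torch.cat([g, d], dim=1))
```

Upsampling the raw (40, 110) depth image to (500, 1200) before the network would also work, but interpolating a low-resolution feature map is cheaper and avoids asking the network to treat invented fine detail as real. With shared fused features like this, separate heads (detection vs. depth) attached on top can each learn which features they need, and restricting the depth prediction to the box interiors is typically done with RoI pooling at the head rather than by masking the input.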

Would really appreciate your help.
Thank you.