I have to implement a convolutional neural network that takes a Kinect image (1 x 640 x 480) and returns a 1 x 8 tensor predicting the class to which the object belongs, plus a 1 x 4 tensor predicting the bounding box around the object, if it is present.
How can I implement a suitable model that gives these two outputs, and how do I calculate the loss and backpropagate in that case?
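To make the question concrete, here is a minimal sketch of the kind of setup I have in mind, assuming PyTorch. The class name `TwoHeadNet`, the layer sizes, the `has_box` mask, the `box_weight` factor, and the choice of cross-entropy plus smooth L1 are all my own assumptions, not a validated design. The idea is a shared backbone with two heads, and a single scalar loss that sums the classification loss and the box regression loss, where the box term is only counted for images that actually contain an object.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadNet(nn.Module):
    """Hypothetical small CNN: shared backbone, one class head, one box head."""

    def __init__(self, num_classes: int = 8):
        super().__init__()
        # Shared feature extractor for a single-channel input (sizes are placeholders).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.cls_head = nn.Linear(64, num_classes)  # class logits, shape (N, 8)
        self.box_head = nn.Linear(64, 4)            # box coordinates, shape (N, 4)

    def forward(self, x):
        feats = self.backbone(x)
        # Sigmoid assumes box targets are normalized to [0, 1].
        return self.cls_head(feats), torch.sigmoid(self.box_head(feats))


def joint_loss(cls_logits, box_pred, cls_target, box_target, has_box, box_weight=1.0):
    """Cross-entropy on the class plus smooth L1 on the box, masked by has_box."""
    cls_loss = F.cross_entropy(cls_logits, cls_target)
    per_sample = F.smooth_l1_loss(box_pred, box_target, reduction="none").sum(dim=1)
    # Average the box loss only over samples that have a box (avoid dividing by 0).
    box_loss = (per_sample * has_box).sum() / has_box.sum().clamp(min=1.0)
    return cls_loss + box_weight * box_loss


model = TwoHeadNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch standing in for real data; I assume height 480 and width 640.
images = torch.randn(4, 1, 480, 640)
cls_target = torch.randint(0, 8, (4,))
box_target = torch.rand(4, 4)
has_box = torch.tensor([1.0, 1.0, 0.0, 1.0])  # third image has no object

cls_logits, box_pred = model(images)
loss = joint_loss(cls_logits, box_pred, cls_target, box_target, has_box)

optimizer.zero_grad()
loss.backward()   # one backward pass sends gradients through both heads and the backbone
optimizer.step()
```

Since the two terms are added into one scalar, a single `loss.backward()` call should be enough; what I am unsure about is how to choose `box_weight` so that neither task dominates.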
Also, I have only around 6000 training images. How can I achieve the best possible results with such a limited number of training images?
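One thing I have considered for the small dataset is augmentation. With a box target, any geometric transform has to be applied to the box as well. Below is a rough sketch of a random horizontal flip, assuming PyTorch and assuming the box is stored as `(x_min, y_min, x_max, y_max)` normalized to [0, 1]. Both the helper name `random_hflip` and that box format are my assumptions.

```python
import torch


def random_hflip(image, box, p=0.5):
    """Flip a (C, H, W) image left-right with probability p and mirror its box.

    box is assumed to be (x_min, y_min, x_max, y_max), normalized to [0, 1].
    """
    if torch.rand(()) < p:
        image = torch.flip(image, dims=[-1])  # reverse the width axis
        x_min, y_min, x_max, y_max = box.unbind()
        # Mirroring swaps the roles of the left and right edges.
        box = torch.stack([1.0 - x_max, y_min, 1.0 - x_min, y_max])
    return image, box


image = torch.randn(1, 480, 640)
box = torch.tensor([0.1, 0.2, 0.4, 0.6])
flipped_image, flipped_box = random_hflip(image, box, p=1.0)  # p=1.0 forces a flip
```

Applied on the fly in the data loader, this would effectively double the variety of box positions the model sees. I do not know whether it is enough on its own for 6000 images.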
Some examples of such projects would be highly helpful.