After normalizing image x and y coordinates from [0, 512] to [-1, 1] and training a regression CNN to predict them, the network output falls outside the bounds of [-1, 1]

Warm regards to all!
I am currently trying to predict x and y locations (landmarks) from a 512x512 image for a circle-detection problem. I have a segmented image which shows circles as black and the background as white. For each image I have the x and y coordinates of the centroid of each circle, with a total of 435 points to predict (so 870 values in total for x and y). I had previously worked on facial landmark detection, where I was only predicting 138 values instead of 870. However, I am currently not able to predict the correct values for the centroids, and my main question is about normalization. I have followed this procedure to normalize to [-1, 1]:

Y_C /= 512
Y_C = 2*Y_C - 1
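For reference, a quick sanity check of that normalization and its inverse (a minimal sketch; the function names are just for illustration):

```python
import torch

def normalize_coords(y, size=512):
    # map pixel coordinates from [0, size] to [-1, 1]
    return 2 * (y / size) - 1

def denormalize_coords(y, size=512):
    # map normalized coordinates from [-1, 1] back to [0, size]
    return (y + 1) / 2 * size

coords = torch.tensor([0.0, 256.0, 512.0])
normalized = normalize_coords(coords)       # tensor([-1., 0., 1.])
recovered = denormalize_coords(normalized)  # tensor([0., 256., 512.])
```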

I am using SmoothL1Loss, and after training it turns out the predicted values are outside the range [-1, 1]. Is there a reason for this that I am missing? Also, if you have any advice for predicting a large number of landmarks for this type of problem, I would really appreciate the help, as I am currently not successful.

If you don’t clip the output or use any other condition, the outputs of e.g. a linear layer would be unbounded so that the values might be outside of [-1, 1].
That being said, I don’t think that e.g. using a sigmoid for a regression task would perform better than using no activation function so you might need to play around with some hyperparameters to improve the predictions.
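As a sketch of that point: a plain linear head is unbounded, while adding e.g. `torch.tanh` would force the outputs into [-1, 1] (the module below is a hypothetical example, not the original model):

```python
import torch
import torch.nn as nn

class LandmarkHead(nn.Module):
    def __init__(self, in_features=128, num_coords=870, bounded=False):
        super().__init__()
        self.fc = nn.Linear(in_features, num_coords)
        self.bounded = bounded

    def forward(self, x):
        out = self.fc(x)
        # a linear layer alone can produce values outside [-1, 1];
        # tanh squashes the outputs into that range
        return torch.tanh(out) if self.bounded else out

features = torch.randn(4, 128)
bounded_out = LandmarkHead(bounded=True)(features)
print(bounded_out.abs().max().item() <= 1.0)  # True
```

Whether the bounded head actually trains better is exactly the hyperparameter question mentioned above.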

@ptrblck Thank you so much for your reply! It is very helpful, since it answers the main question I had about normalization and its bounds.
I’ll try to describe my project better, in case you have any additional advice from your experience about which methods to use for this problem.

I have 30,000+ images (256x256) of circular particles in space. I have already collected their ground truth using conventional computer vision segmentation, which also outputs the radius, perimeter, and x-y location of each particle; it is very slow, so machine learning could be an attractive approach. My initial task, which I have already completed successfully, was to implement a simpler version of U-Net to reproduce the segmented (ground truth) figure. The results were great: a nearly perfect segmentation that is also substantially faster than the conventional computer vision methods.
Now I am trying to do the second part, which is to predict the radius and x-y locations as well. I have tried convolutional neural nets, even transfer learning from pre-trained models, but the results are quite undesirable.

My first thought about what is causing this: since some images have more particles than others, I fixed the output vector of x-y locations and radii at a constant size (101 values each for x, y, and radius, so 303 in total) to make the output a fixed size, which means some outputs contain zeros. For example, an image with only 51 particles leaves 50 slots empty, so 150 of the 303 values (50 * 3) will be 0. My second thought regarding the source of error is the sparsity of the x-y locations, which are scattered all over the image.
Is there an approach you would advise that could work better for this specific problem? I am thinking of trying the YOLO network today by creating bounding boxes; however, can YOLO work with only one class (particle or no particle), where some particles even overlap? If you think there is a different route I should take, I would really appreciate your advice.

That’s an interesting use case.
If I understand your current approach correctly, you are using the first output indices for the valid points and set the rest to the zero target.
Assuming that’s the case, then this would explain the “drift” of your points towards the zero location.
Some of the “later” outputs will get zero targets more often than the earlier ones and the model could try to output the “mean” location, which would be somewhere between a valid location and the zero point.
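If you want to keep the fixed-size output, one option might be to mask the padded slots out of the loss so they cannot pull the predictions towards zero. A minimal sketch (the `valid_mask` layout is an assumption about how the targets could be stored):

```python
import torch
import torch.nn.functional as F

def masked_smooth_l1(pred, target, valid_mask):
    # pred / target: [batch, num_slots, 3] holding (x, y, radius)
    # valid_mask:    [batch, num_slots] bool, True for real particles
    loss = F.smooth_l1_loss(pred, target, reduction='none')  # [B, N, 3]
    loss = loss * valid_mask.unsqueeze(-1)  # zero out padded slots
    return loss.sum() / valid_mask.sum().clamp(min=1)

pred = torch.randn(2, 101, 3)
target = torch.randn(2, 101, 3)
valid_mask = torch.zeros(2, 101, dtype=torch.bool)
valid_mask[:, :51] = True  # e.g. only 51 real particles per image
loss = masked_smooth_l1(pred, target, valid_mask)
```

With this loss, changing the padded target values has no effect on the gradient, so the zero slots no longer act as targets.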

I’m not sure what the best approach would be, but I guess you could try to reuse some approaches of object detection models, e.g. sorting the outputs based on their mse loss to the target value (in case your model just predicts the points in a different order) or work with proposals and keep the best ones.
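The sorting/matching idea could be sketched as finding the assignment of predictions to targets that minimizes the loss before backpropagating. The brute-force version below is only feasible for a handful of points; for hundreds of points you would want the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations
import torch

def best_assignment(pred, target):
    # pred / target: [num_points, 2]; try every permutation of the
    # targets and keep the one with the smallest squared error
    best_perm, best_loss = None, float('inf')
    for p in permutations(range(target.shape[0])):
        loss = ((pred - target[list(p)]) ** 2).sum().item()
        if loss < best_loss:
            best_loss, best_perm = loss, p
    return best_perm, best_loss

pred = torch.tensor([[0.9, 0.9], [0.1, 0.1]])
target = torch.tensor([[0.0, 0.0], [1.0, 1.0]])
perm, loss = best_assignment(pred, target)
# perm == (1, 0): the first prediction is matched to the second target
```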


@ptrblck Thank you so much for your response and for the several tips on solving this issue. It makes a lot more sense now why the points are shifting towards zero, and it confirms the source of error in this approach.
I will keep trying to solve it using ConvNets along with your suggestion of sorting the outputs; that’s a very clever idea. It would be very interesting to show that this problem can be solved with a simple ConvNet approach.

I wanted to share the good news, since you have been so helpful and responsive to my question: I tried the YOLO network right away over the past couple of days, since it allows a variable number of objects per image. The results using the YOLO network are nearly perfect!


Again, thank you so much for your help! I will keep trying with ConvNets though, as that would be a more attractive solution requiring less computational power.


That’s pretty cool. You could also try a “fast” YOLO model. If I remember correctly, newer YOLO architectures are quite fast and might also work pretty well.
