Get the position of an object out of an image

Hello! This is not really a PyTorch question, and I apologize for that, but I hope someone can help me anyway. I have some images with a fixed background and a single object which is placed, in each image, at a different position on that background. I want to find a way to extract, in an unsupervised way, the position of that object. For example, we, as humans, would record the x and y location of the object. Of course the NN doesn’t have a notion of x and y, but I would like, given an image, the NN to produce 2 numbers that preserve as much as possible of the actual relative position of objects on the background. For example, if 3 objects are equally spaced on a straight line (in 3 of the images), I would like the 2 numbers produced by the NN for each of the 3 images to preserve this ordering, even if they don’t form a straight line. They can form a weird curve, but as long as the order is correct it can be topologically transformed into the right, straight line. Can someone suggest a paper/architecture that did something similar? Thank you!

If I understand correctly, your problem is similar to the usual bounding-box based detection, but you want a point representing the center of the object rather than the bbox.

I am not sure I follow the second part though; how would that be different from detection?

Thank you for your reply. I was also thinking of using something similar to bounding boxes. However, don’t they require supervised learning, i.e. don’t you need to actually label the right boxes for the training set? I want to find the position in an unsupervised way.

A bit more information is needed to properly answer this question. What is the background, and what are the objects you are trying to identify? Depending on the simplicity of the task, you may be able to use generic, non-neural-network computer vision techniques. For instance, if the background is a solid color or completely stationary, you could try background subtraction and clustering. If the objects are roughly circular, you could try using a Hough transform. See the OpenCV packages and documentation for this.
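As a rough illustration of the background-subtraction idea, here is a minimal OpenCV/NumPy sketch. It assumes you have the fixed background as its own image of the same size, and the file names and threshold value are placeholders, not anything from this thread:

```python
import cv2
import numpy as np

# Assumed inputs: a clean background image and a frame containing the object,
# both the same size (file names are hypothetical).
background = cv2.imread("background.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("frame_with_object.png", cv2.IMREAD_GRAYSCALE)

# Absolute difference highlights pixels that changed relative to the background.
diff = cv2.absdiff(frame, background)
_, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)  # 30 is an arbitrary guess

# The centroid of the changed pixels approximates the object's (x, y) position.
ys, xs = np.nonzero(mask)
if len(xs) > 0:
    cx, cy = xs.mean(), ys.mean()
    print(f"estimated object position: ({cx:.1f}, {cy:.1f})")
```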

However, in general, if the specific identities of the objects are important (i.e. if they are unique), you’ll likely have to use supervised learning to train a model to recognize these objects. Again, there may be an easy cheat depending on the details of the task, but it’s hard to say based on your description.

The background and object can be anything. For example, assume I have 1000 pictures with a (fixed) forest background and a (fixed) cat image at a different position in each of these 1000 images. I want a NN able to output 2 numbers that are related to the real x and y of the cat’s position. By this I mean that by applying a (possibly highly non-linear) transformation to the 2 numbers output for each image, I can get the actual x and y of the cat. Of course I need the same transformation for each image, and that transformation function might be highly complicated, but what I want is an output that somehow stores the actual position of the cat. For example, if the NN outputs for each image, instead of (x, y), something like (log(sqrt(x)), atan(exp(y))), that’s fine, as the functions are invertible. However, I can’t hardcode this by hand, as later I might have another dataset (say, a dog on a bedroom background). So a good NN shouldn’t care too much about the object and background (assuming they are reasonable). Of course you would need to retrain it for each case, but for the purpose of my project I don’t need a validation set. I just want the NN to discover the x and y on its own for the training data, and that’s it (so basically, heavy overfitting would be ideal in this case). Thank you!

@smu226 “I want to find a way to extract, in an unsupervised way, the positions of that object”

You cannot expect a neural network to output ANY specific target (e.g. x, y positions or any function thereof) without having trained it to predict such outputs. Neural networks are a supervised learning technique. Thus, your options are:
1) train the model with labeled examples, or
2) use a non-trained image processing technique such as convolving the target with the image, background subtraction, or perhaps clustering by pixel values if the color intensities are different (see the sketch below).
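A hedged sketch of option 2 ("convolving the target with the image"), here done via OpenCV template matching rather than a hand-rolled convolution; the file names are placeholders and this assumes you have a crop of the object available:

```python
import cv2

# Assumed inputs: the full scene and a small template crop of the object.
scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("object_template.png", cv2.IMREAD_GRAYSCALE)

# Slide the template over the scene and take the best-scoring location.
response = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(response)
x, y = max_loc  # top-left corner of the best match
print(f"best match at ({x}, {y}) with score {max_val:.3f}")
```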

I don’t think that’s true. For example, one thing I tried was to use a bottleneck, made of the 2 nodes which I want as the output, and force the NN to reconstruct the original image using these 2 numbers. Ideally the “encoder” (going from the image to the 2 numbers) would output some function of x and y, and the “decoder” (going from the 2 numbers back to the original image) would learn the inverse of that function of x and y and place the object there. I did that and it works ok-ish, in the sense that for most of the points the ordering is correct, but there are quite a few whose position is pretty messed up. So my point is, this can be done without using labeled examples or non-NN techniques (probably having 10 or 100 times more data would fix the problem even with my approach?). However, I was hoping for something more clever than just an encoder-decoder. This is the most obvious thing to try first, but I was hoping (and given the rate at which AI evolves, I am pretty sure) that someone has a more effective way of doing this and can point me towards that paper/code.
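For readers following along, here is a minimal PyTorch sketch of the kind of 2-unit-bottleneck autoencoder described above. The image size, layer widths, and training step are assumptions for illustration, not the poster’s actual code:

```python
import torch
import torch.nn as nn

class BottleneckAE(nn.Module):
    def __init__(self, img_size=64):
        super().__init__()
        d = img_size * img_size
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, 2),                     # the 2-number "position" code
        )
        self.decoder = nn.Sequential(
            nn.Linear(2, 256), nn.ReLU(),
            nn.Linear(256, d),
            nn.Unflatten(1, (1, img_size, img_size)),
        )

    def forward(self, x):
        z = self.encoder(x)                        # (batch, 2)
        return self.decoder(z), z

model = BottleneckAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(8, 1, 64, 64)                       # stand-in batch of images
opt.zero_grad()
recon, z = model(x)
loss = loss_fn(recon, x)                           # the input itself is the target
loss.backward()
opt.step()
```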

For the encoder-decoder example you’ve listed above, you are in fact training the model in a supervised fashion. The key detail here is that you use the same image as both the input and the label, and this is used to train both the encoder and the decoder portions of the model; the loss is backpropagated through both modules. While the encoder may learn something similar to x, y coordinates if the hidden state is of size two, there’s no guarantee of exactly what it is learning.

It doesn’t really seem that the information you’ve given is sufficient to provide you with the sort of solution you seem to be after. Is your dataset limited? If so, try applying some transforms to increase the effective size of your dataset; this may yield better results from your encoder-decoder approach.
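One hedged way to do that with torchvision transforms; the specific transforms and image size are just examples, and since this is an autoencoder the same augmented image serves as both input and reconstruction target:

```python
from PIL import Image
import numpy as np
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])

# Dummy PIL image standing in for one of the generated pictures.
img = Image.fromarray(np.uint8(np.random.rand(64, 64, 3) * 255))
x = augment(img)   # the same tensor would be used as input and target
```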

By unsupervised I meant that the images have no labels. I didn’t think that what I am doing counts as supervised (what is unsupervised learning then?). The dataset is not limited; I am generating the images, so I can generate as many as I want (i.e. I can place the object wherever I want on the background). But I am looking for something more data-efficient, hence why I was asking for some new insight besides the plain old encoder-decoder architecture.

Nowadays “bottleneck” is a very abused term; what do you mean by bottleneck? For instance, inside some ResNet models there is a bottleneck block.


You should keep trying if you are using convolutional networks, because convolution is essentially pattern matching.

Do you have the image you are actually searching for inside the bigger image?

Oh, sorry (I am quite new to the field in general)! By bottleneck I meant the middle layer between the input and output which has only 2 nodes (all the other layers have more than that). So basically all the useful information in the input image should be stored in this 2-node layer. The problem with CNNs, as far as I understand (I read an Uber paper about this a while ago), is that they can identify the presence of an object in the image, but they are translationally invariant, so they can’t really locate the position of that object within the image (but I have no better tool so far, hence here I am). I do have both the small image (object) and the big image (background) separately, as I am generating the overall combination of them (the position of the object on the background) for training, but I don’t want to use that information, as ideally I want to build a NN that can be used for cases where you don’t have access to it. Thank you!
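The Uber paper mentioned here is presumably the CoordConv one (“An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution”). A minimal sketch of that idea, appending normalized x/y coordinate channels before a convolution so the network can see position; the channel counts and sizes are just assumptions:

```python
import torch
import torch.nn as nn

class AddCoords(nn.Module):
    """Append normalized x and y coordinate grids as extra channels."""
    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return torch.cat([x, xs, ys], dim=1)      # two extra channels

coord_conv = nn.Sequential(
    AddCoords(),
    nn.Conv2d(3 + 2, 16, kernel_size=3, padding=1),  # assumes a 3-channel input image
)

out = coord_conv(torch.rand(4, 3, 64, 64))
print(out.shape)  # torch.Size([4, 16, 64, 64])
```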

If you are generating the images, I guess I don’t understand why you don’t just label them automatically during generation? That’s the neatest way to get a function that maps your input space to your desired outputs.

That’s not the point… If I wanted a supervised learning approach that way, I wouldn’t be here asking for help. The point is that I want to apply this, later, to images for which I don’t have access to the actual x and y of the object, or to the background and the object separately. I use these generated images as a training set for different architectures; the fact that I know the real x and y in this case is irrelevant for the purpose of my project or for the purpose of my original question here.

You have an interesting use case and I assume you’ve already seen Unsupervised learning of foreground object detection?
Could this approach work if you add some postprocessing, i.e. using the soft masks to get a center of mass for your object localization?
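A hedged sketch of that postprocessing step, turning a soft foreground mask into a single (x, y) location via its center of mass; the mask here is a toy stand-in, not output from the paper’s method:

```python
import torch

def soft_mask_center(mask):
    """mask: (H, W) tensor of non-negative foreground scores."""
    h, w = mask.shape
    ys = torch.arange(h, dtype=mask.dtype).view(h, 1)
    xs = torch.arange(w, dtype=mask.dtype).view(1, w)
    total = mask.sum().clamp(min=1e-8)
    cy = (mask * ys).sum() / total
    cx = (mask * xs).sum() / total
    return cx.item(), cy.item()

mask = torch.zeros(64, 64)
mask[20:30, 40:50] = 1.0           # toy blob standing in for a soft mask
print(soft_mask_center(mask))      # roughly (44.5, 24.5)
```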

You can also look at saliency filters and then get the coordinates with contour detection. If you want a way without a NN, this would use OpenCV.
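A rough sketch of that non-NN route using OpenCV’s spectral-residual saliency plus contour detection; it assumes opencv-contrib-python is installed, and the file name is a placeholder:

```python
import cv2

img = cv2.imread("frame_with_object.png")
saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
ok, sal_map = saliency.computeSaliency(img)

# Binarize the saliency map and take the largest contour as the object.
mask = (sal_map * 255).astype("uint8")
_, mask = cv2.threshold(mask, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    c = max(contours, key=cv2.contourArea)
    M = cv2.moments(c)
    if M["m00"] > 0:
        cx, cy = M["m10"] / M["m00"], M["m01"] / M["m00"]
        print(f"object roughly at ({cx:.1f}, {cy:.1f})")
```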