Identification of objects in 2D Boxes

I don’t really know if this is the right category or forum to post this question, but here it is: I have a set of images with 2D boxes drawn around objects. In some images there are objects behind buildings that still have their 2D box drawn. My question is how I could recognize whether the object in a box is a car or a human (those are the only two object types with boxes in my images). The coordinates of all the 2D boxes in an image are known, so I can easily draw all the boxes with OpenCV.
Online I have found many articles, but they all try to detect objects in an image and construct the boxes. I already have those boxes; I just want to check whether a box actually contains a car or a human.

One of the tools I have found is YOLO, but as I said, it starts by predicting the bounding boxes.

Thank you for your answer; I know I have written a lot.

One potential approach would be to use the bounding boxes to create image crops and pass these (after e.g. resizing to a fixed size) to a classification model.
Would this approach make sense?
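As a sketch of that idea (the array layout and box format here are assumptions), creating the crops is just array slicing:

```python
import numpy as np

# Assumed setup: image as an H x W x 3 array, boxes as (x1, y1, x2, y2).
image = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = [(10, 20, 110, 220), (300, 50, 400, 150)]

def crop_box(img, box):
    """Cut the region inside one bounding box out of the image."""
    x1, y1, x2, y2 = box
    return img[y1:y2, x1:x2]

crops = [crop_box(image, b) for b in boxes]
# Each crop can then be resized to the classifier's input size
# (e.g. 224x224) and passed through the model.
print([c.shape for c in crops])   # [(200, 100, 3), (100, 100, 3)]
```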

Thank you for your answer.

I had tried that with the torchvision AlexNet model, but it has 1000 classes, so the classification is not effective. Could I build and train my own model and save the weights, so that I can later use it as a pre-trained model?

Sure! This approach is called fine-tuning or transfer learning, and you could follow this tutorial.

I have followed the tutorial and tested it; thank you for the link.

I will generate my own dataset with the PyTorch Dataset structure and save the weights.

Could you please tell me what the best way is to handle negative images?

What do you mean by negative images?

I mean images that belong to none of the classes. If my classes are {ants, bees}, then a car or a human would be a negative image.

To take the example I am working on: I want to classify vehicles (cars, motorbikes, and bikes) and pedestrians. As I explained in the first post, sometimes the objects are behind buildings or trees, so we see those buildings or trees in the image instead of the objects, and my goal is to discard all such images.

If you only want two classes, positive vs. negative, you can:

  1. First label the data, if you don’t already have labels. Then add one or two layers at the end of the network you pre-trained, freeze the pre-trained network, and only train the last few layers on the new data.

  2. The simplest is just to write a function to pick out the positive and negative labels. For instance, {ants, bees} could be one-hot encoded as {0001, 0100}, and you could write a “filter function” such as

if <pred_label> in [0001, 0100, ...]:
    <is_positive> = True   # positive
else:
    <is_positive> = False  # negative

I am interested in the second option. Does this mean that I have to download negative images and store them? I am using the dataset structure data/train/bees and data/train/ants; just to say, I don’t use one-hot encoding.