Hello, I want to fully understand how YOLO and SSD work for object detection.
Here is what I understand:
- in a simple CNN for image recognition, features are extracted, flattened, and used for classification
- YOLO divides the image into a grid. For each grid cell, values such as class probabilities and bounding box parameters are predicted.
- SSD uses not just one grid but a combination of grids of different sizes, to better detect objects of any size.
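To make this concrete, here is a small Python sketch of how I picture the outputs. The numbers are assumptions based on the original YOLO setup (7x7 grid, 2 boxes per cell, 20 classes) and the SSD300 setup as I understand them, and the function names are made up for illustration only.

```python
def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of a YOLOv1-style prediction tensor: one vector per grid cell."""
    # Each box carries 5 numbers (x, y, w, h, confidence);
    # the class scores are shared by all boxes of a cell.
    values_per_cell = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, values_per_cell)


def responsible_cell(center_x, center_y, image_size, grid_size=7):
    """Grid cell (row, col) that contains an object's center (pixel coordinates)."""
    cell_size = image_size / grid_size
    # Clamp so a center exactly on the right/bottom edge stays inside the grid.
    col = min(int(center_x / cell_size), grid_size - 1)
    row = min(int(center_y / cell_size), grid_size - 1)
    return row, col


def ssd_total_boxes(feature_map_sizes, boxes_per_location):
    """Total number of default boxes when several grids each make predictions."""
    return sum(s * s * b for s, b in zip(feature_map_sizes, boxes_per_location))


print(yolo_output_shape())                          # (7, 7, 30)
print(responsible_cell(224, 100, image_size=448))   # (1, 3)
# Assumed SSD300 layout: six grids, each with 4 or 6 default boxes per location.
print(ssd_total_boxes([38, 19, 10, 5, 3, 1], [4, 6, 6, 6, 4, 4]))  # 8732
```

So in this picture YOLO produces one prediction vector per grid cell, while SSD produces predictions at every location of several grids at once.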
What I don’t understand:
- in YOLO and SSD, is there a classification per grid cell? If so, how do we know which features belong to that specific cell?
- are features extracted per grid cell, per bounding box, or neither?
- how does SSD combine the results from the different grids?
Thank you!