YOLO/SSD: how are features localized?

MC95 · December 3, 2019, 1:54pm

Hello, I want to fully understand how YOLO and SSD work for object detection.

Here is what I understand:

In a simple CNN for image recognition, features are extracted, flattened, and used for classification.
YOLO divides the image into a grid. For each grid cell, values such as class probabilities and bounding box parameters are predicted.
SSD uses not just one grid but a combination of grids of different sizes, to better detect objects of any size.
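
To make this concrete, here is a minimal sketch of how I picture the YOLO output. The numbers (7x7 grid, 2 boxes per cell, 20 classes, 448x448 input) and the ordering of values inside each cell vector are my own assumptions for illustration, and the random array only stands in for a real network output.

```python
import numpy as np

# Assumed YOLOv1-style settings (illustrative only).
S, B, C = 7, 2, 20          # grid size, boxes per cell, number of classes
IMG_W, IMG_H = 448, 448     # hypothetical input resolution

# Stand-in for the network output: one vector of B*5 + C values per grid cell.
rng = np.random.default_rng(0)
output = rng.random((S, S, B * 5 + C))

def cell_for_point(x, y):
    """Return (row, col) of the grid cell that contains pixel (x, y)."""
    col = min(int(x * S / IMG_W), S - 1)
    row = min(int(y * S / IMG_H), S - 1)
    return row, col

# Look up the predictions of the cell that contains pixel (300, 100).
row, col = cell_for_point(300, 100)
cell = output[row, col]

# Assumed layout of the cell vector: B boxes first, then class scores.
boxes = cell[: B * 5].reshape(B, 5)   # each row: x, y, w, h, confidence
class_probs = cell[B * 5 :]           # C class scores for this cell

print(output.shape, (row, col), boxes.shape, class_probs.shape)
```

So in my picture, every cell simply owns one slice of the output tensor; what I am missing is how that slice is tied to the image content of the cell.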

What I don’t understand:

In YOLO and SSD, is there a classification per grid cell? If so, how do we know which features belong to that specific cell?
Are features extracted per grid cell, per bounding box, or neither?
How does SSD combine results from the different grids?

Thank you!