YOLO/SSD: how are features localized?

Hello, I want to fully understand how YOLO and SSD work for object detection.

Here is what I understand:

  • In a simple CNN for image recognition, features are extracted, flattened, and used for classification.
  • YOLO divides the image into a grid. For each grid cell, values such as class probabilities and bounding box parameters are calculated.
  • SSD uses not just one grid but several grids of different sizes, to better detect objects of any size.
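
To make my current picture concrete, here is a minimal Python sketch of how I imagine the outputs are laid out. It uses dummy zeros instead of a real network. The numbers are the ones I believe the YOLOv1 paper (448×448 input, 7×7 grid, 2 boxes per cell, 20 classes) and the SSD300 paper (six grids with 4 or 6 default boxes per cell) use; the function names are my own and purely hypothetical.

```python
# Sketch of my mental model only; not taken from any real YOLO/SSD implementation.

def yolo_cell_slice(output, row, col, num_boxes=2, num_classes=20):
    """Split the prediction vector of one grid cell into boxes and class scores."""
    vec = output[row][col]
    # Each box is assumed to be 5 numbers: (x, y, w, h, confidence).
    boxes = [vec[i * 5:(i + 1) * 5] for i in range(num_boxes)]
    class_scores = vec[num_boxes * 5:num_boxes * 5 + num_classes]
    return boxes, class_scores


def cell_for_pixel(x, y, image_size=448, grid_size=7):
    """Return (row, col) of the grid cell that contains pixel (x, y)."""
    stride = image_size / grid_size  # 64 pixels per cell for 448 / 7
    return int(y // stride), int(x // stride)


# YOLO: one S x S grid, each cell holds B * 5 + C numbers.
S, B, C = 7, 2, 20
output = [[[0.0] * (B * 5 + C) for _ in range(S)] for _ in range(S)]
boxes, class_scores = yolo_cell_slice(output, row=3, col=4)

# SSD300 (as I understand it): (grid size, default boxes per cell) per feature map.
ssd_grids = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total_boxes = sum(size * size * k for size, k in ssd_grids)

print(len(boxes), len(class_scores))   # 2 20
print(cell_for_pixel(300, 200))        # (3, 4)
print(total_boxes)                     # 8732
```

So in this picture every cell is just one position in the final output tensor, which is exactly why I am unsure where "its" features come from.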

What I don’t understand:

  • In YOLO and SSD, is there a classification per grid cell? If so, how do we know which features belong to that specific cell?
  • Are features extracted per grid cell, per bounding box, or neither?
  • How does SSD combine the results from the different grids?

Thank you!