Object Detection Training Procedure


I am trying to train an object detection model for my custom data. While I’ve constructed the model and data loading, I am having some doubts over how to prepare targets and calculate the loss.

The details of the training are:

  1. The problem is a binary classification task with localization.
  2. I am using YOLOv3 as the base model, with a bit of customization for my project needs.
  3. For each scale in YOLOv3, I construct a zero tensor with the same shape as the model output for that scale. I then calculate the midpoint of each object in the image, find which grid cell the midpoint belongs to, and fill that grid cell with the coordinates and height/width of the object. This tensor is used as the target tensor for the scale.
  4. Find which grid cells are responsible for objects then save indices for the non-zero positions.
  5. Filter the target and feature tensors for both the objectness score and coordinate regression by using the indices from step 4.
  6. Reshape the feature and target tensors to (-1, n), which drops all information about batch and grid cell and leaves a plain 1:1 comparison.
  7. Calculate IOU between the predicted box and target box.
  8. Object Loss is being calculated as (IOU * Object Confidence).
  9. Add to Total Loss.
  10. Repeat for another scale.
  11. Backprop Object Loss.
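
For reference, steps 3 to 7 can be sketched roughly as follows. This is only a minimal sketch under assumptions that are not stated in the post: one box per grid cell (no anchors), normalized (cx, cy, w, h) boxes, and a hypothetical [batch, grid, grid, 5] layout with objectness in channel 0. The names `build_targets` and `box_iou_cxcywh` are made up for illustration.

```python
import torch

def build_targets(boxes, grid_size):
    """boxes: one tensor per image, shape [n, 4], normalized (cx, cy, w, h).
    Returns [batch, grid, grid, 5] holding (objectness, cx, cy, w, h).
    Assumed layout: a single box slot per cell, no anchors."""
    target = torch.zeros(len(boxes), grid_size, grid_size, 5)
    for b, img_boxes in enumerate(boxes):
        for cx, cy, w, h in img_boxes.tolist():
            # Grid cell that contains the box midpoint (clamped to the last cell).
            col = min(int(cx * grid_size), grid_size - 1)
            row = min(int(cy * grid_size), grid_size - 1)
            target[b, row, col] = torch.tensor([1.0, cx, cy, w, h])
    return target

def box_iou_cxcywh(a, b, eps=1e-7):
    """Row-wise IoU between two [n, 4] tensors of (cx, cy, w, h) boxes."""
    a_x1, a_y1 = a[:, 0] - a[:, 2] / 2, a[:, 1] - a[:, 3] / 2
    a_x2, a_y2 = a[:, 0] + a[:, 2] / 2, a[:, 1] + a[:, 3] / 2
    b_x1, b_y1 = b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2
    b_x2, b_y2 = b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2
    inter_w = (torch.min(a_x2, b_x2) - torch.max(a_x1, b_x1)).clamp(min=0)
    inter_h = (torch.min(a_y2, b_y2) - torch.max(a_y1, b_y1)).clamp(min=0)
    inter = inter_w * inter_h
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    # eps keeps the division finite when both boxes have zero area.
    return inter / (union + eps)

# Two images with one object each, on a 13x13 scale.
boxes = [torch.tensor([[0.5, 0.5, 0.2, 0.2]]),
         torch.tensor([[0.1, 0.9, 0.3, 0.1]])]
target = build_targets(boxes, grid_size=13)
pred = torch.rand(2, 13, 13, 5)  # random stand-in for decoded model output

# Steps 4 to 6: keep only the responsible cells, flattened to (-1, n).
obj_mask = target[..., 0] == 1
pred_pos = pred[obj_mask].reshape(-1, 5)
tgt_pos = target[obj_mask].reshape(-1, 5)

# Step 7: IoU between predicted and target boxes, one value per object.
iou = box_iou_cxcywh(pred_pos[:, 1:], tgt_pos[:, 1:])
```

Boolean-mask indexing already returns a flat [n_pos, 5] tensor, so the reshape is only there to mirror step 6.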

Some training shenanigans:

  1. For each scale, I count the cells with object == 0 (negatives) and object == 1 (positives) and pass negative/positive as pos_weight to BCEWithLogitsLoss, which means a new criterion is created for each scale in each batch.
  2. Adam optimizer with a learning rate of 1e-3 and ReduceLROnPlateau on eval_loss, with a patience of 2 and a reduction factor of 0.1. I added the scheduler because my loss stops decreasing at some point and starts increasing, and I read in multiple GitHub issues that something like this could help, but it hasn’t in my case. Also, train_loss oscillates a lot when monitored in TensorBoard.
  3. Batch Size of 16, can go up to 256.
  4. Clipping gradients to 10.
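
The setup above might look roughly like the sketch below. It is not the actual training code: the 1x1 convolution is a hypothetical stand-in for the detector, `objectness_loss` is a made-up helper name, and clipping by gradient norm (rather than by value) is an assumption, since the post only says "to 10".

```python
import torch
import torch.nn as nn

def objectness_loss(logits, target_obj):
    """BCE on objectness with pos_weight = negatives / positives,
    recomputed for every call (i.e. per scale, per batch)."""
    n_pos = target_obj.sum()
    n_neg = target_obj.numel() - n_pos
    # clamp avoids dividing by zero when a scale has no positive cell
    pos_weight = (n_neg / n_pos.clamp(min=1)).reshape(1)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction="mean")
    return criterion(logits, target_obj)

model = nn.Conv2d(3, 1, kernel_size=1)  # hypothetical stand-in for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2
)

images = torch.randn(16, 3, 13, 13)      # batch size of 16
target_obj = torch.zeros(16, 1, 13, 13)
target_obj[:, 0, 6, 6] = 1.0             # one positive cell per image

loss = objectness_loss(model(images), target_obj)
optimizer.zero_grad()
loss.backward()
# Assumption: clip the global gradient norm to 10.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()

# In a real loop this would receive eval_loss once per evaluation pass.
scheduler.step(loss.item())
```

With one positive cell out of 169 per image, pos_weight here is 168, so each positive cell contributes as much to the loss as 168 negative cells.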

My questions are:

  1. Is it okay to assign only a single grid cell to each target? I only consider whether the grid cell that is “supposed” to detect the object succeeds or not. I have gone through the paper and several implementations from various sources and have never been able to grasp how they build targets.
  2. What if the object is sufficiently large and the scale at which I am predicting is not able to fully detect it? Do I still calculate the loss for that scale?
  3. I am using “mean” as the reduction for BCEWithLogitsLoss, but reading the paper I get the impression that they sum the losses, which corresponds to the reduction “sum”. Does this affect the training procedure much?
  4. Before using (IOU * Object Confidence), I tried training with Object Confidence + (1 - IOU) or MSE_Loss(IOU, 1). With these formulations, the model output for Object Confidence became NaN after anywhere from a few steps to a few epochs, depending on the learning rate. What might have caused this?
  5. Should I use MSE_Loss on the coordinates and height/width directly instead of on IOUs? I use IOUs because the raw bounding box attributes give larger numbers, while IOUs supply the same information to the criterion and keep the numbers in check.
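
To make question 3 concrete, here is a small sketch of how the two reductions relate in BCEWithLogitsLoss. The tensor shapes and the random data are made up for illustration. It only shows the arithmetic relationship and does not by itself say which choice trains better.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 13, 13, requires_grad=True)
target = (torch.rand(4, 13, 13) > 0.95).float()  # sparse positives

loss_mean = nn.BCEWithLogitsLoss(reduction="mean")(logits, target)
loss_sum = nn.BCEWithLogitsLoss(reduction="sum")(logits, target)

# "sum" equals "mean" times the number of elements in the tensor.
n = target.numel()

# The gradients differ by the same factor n.
(grad_mean,) = torch.autograd.grad(loss_mean, logits)
(grad_sum,) = torch.autograd.grad(loss_sum, logits)

print(loss_mean.item(), loss_sum.item() / n)
```

For a tensor of fixed size, the two reductions therefore differ only by a constant scale on the loss and its gradients. When the tensors are filtered by object indices first, the number of elements changes from batch to batch, so that scale is no longer constant.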

I am sorry for the lengthy post.
