Bounty: $500 to debug, fix, and *explain* why my implementation of FasterRCNN is underperforming


I’ve implemented FasterRCNN from scratch but it doesn’t quite work. It definitely learns but there are issues. I’m struggling to debug it and tried comparing against a simple PyTorch implementation that does work. I don’t see a meaningful difference between the two and even after grafting over code from it, I still have an unexplained discrepancy in performance (as judged by convergence over time, mAP score, and visual inspection of predicted bounding boxes).

This was intended to be an exercise to learn how to implement a model from scratch and how to debug such performance issues but after a few months of little progress, I’m stumped.

So, I’d like to offer $500 (the fee is negotiable) to be paid in two installments for expert help. I figure a week of time should be sufficient for someone to dig into both code bases and we can do a video call initially for a quick orientation of the source code and a demonstration of the problem. The final deliverable would be a video session of up to 1.5-2 hours detailing exactly how the problem was identified and solved (rewriting it completely from scratch to conform more closely to existing FasterRCNN implementations does not count).

Prefer US-based individuals using Venmo. May consider BTC/ETH. PayPal is right out (closed my account recently).

If interested, my email can be found on my personal site. Inquire there or here.

Thanks, :slight_smile:


Hi Bart,

So the difference in performance shows up with just the models swapped and the training procedure kept the same?

Best regards


Hey Thomas,

Yeah, as far as I can tell, that’s what is going on. There’s a lot I can write but here is a “short” braindump:

The training procedure is pretty simple: SGD, learning rate = 1e-3 for 10 epochs or so and then 1e-4 for 4 epochs, momentum 0.9, weight decay of 5e-4. The only form of augmentation is random horizontal flipping. It’s not exactly clear what the original paper did but they did use these parameters and they seem to indicate that they trained their region proposal network using a similar number of epochs. The implementation by Yun Chen (GitHub - chenyuntc/simple-faster-rcnn-pytorch: A simplified implemention of Faster R-CNN that replicate performance from origin paper) uses that schedule.
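For reference, that schedule is straightforward to express in PyTorch; a minimal sketch (the model here is just a placeholder module, not the actual detector):

```python
import torch

# Placeholder module standing in for the actual Faster R-CNN model.
model = torch.nn.Linear(4, 2)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,           # 1e-3 for the first ~10 epochs
    momentum=0.9,
    weight_decay=5e-4,
)
# Decay to 1e-4 after epoch 10 for the remaining ~4 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

for epoch in range(14):
    # ... train one epoch (with random horizontal flips as the only augmentation) ...
    optimizer.step()   # stands in for the per-batch updates
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # lr is now 1e-4
```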

He does a few things I don’t do and that I think the original Caffe model does: learning rate for biases is 2x as large, for example. I haven’t observed a meaningful impact when this behavior is removed.
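The 2x bias learning rate (with its usual companion, no weight decay on biases) can be reproduced with per-parameter option groups; a sketch with a placeholder module standing in for the detector:

```python
import torch

# Placeholder module standing in for the detector network.
model = torch.nn.Linear(8, 4)

# Caffe-style convention: biases get 2x the base learning rate and no weight decay.
base_lr = 1e-3
optimizer = torch.optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters() if n.endswith("bias")],
         "lr": 2 * base_lr, "weight_decay": 0.0},
        {"params": [p for n, p in model.named_parameters() if not n.endswith("bias")],
         "lr": base_lr, "weight_decay": 5e-4},
    ],
    lr=base_lr,
    momentum=0.9,
)
print([g["lr"] for g in optimizer.param_groups])
```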

Ultimately, I think my region proposal network performs the same as his. He follows the paper’s bizarre choice of using two class outputs per anchor point and performing a softmax, and then only using the foreground score. I replaced it with a single output and a sigmoid activation (and went back to Chen’s code and did the same to verify no impact). I think the problem manifests itself somewhere after that, when proposals are selected, labeled, and then passed into the final detection stage. There is some difference in how we compute class loss (I use the formula for categorical cross entropy, he uses nn.CrossEntropyLoss, which is a bit different) but I don’t think that is the problem. What I observe is that within only 1 or 2 epochs, his mAP (on a 60-image subset of data I’ve been using to rapidly iterate with) jumps from almost 0 to a whopping 30%! It takes me several epochs to start improving mAP. Weirdly, the magnitudes of our loss components are almost identical throughout.
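On the softmax-vs-sigmoid point: the two parameterizations are mathematically equivalent for the foreground probability (the 2-logit softmax reduces to a sigmoid over the logit difference), which is presumably why swapping them has no impact. A quick check:

```python
import torch

# Two logits per anchor (background, foreground), softmaxed, keeping only the
# foreground probability -- as in the paper and in Chen's code.
logits = torch.tensor([[0.3, 1.2], [-1.0, 0.5]])
fg_softmax = torch.softmax(logits, dim=1)[:, 1]

# Single-output sigmoid parameterization: identical when the single logit is
# the difference of the two.
fg_sigmoid = torch.sigmoid(logits[:, 1] - logits[:, 0])

print(torch.allclose(fg_softmax, fg_sigmoid))  # True
```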

We are both using VGG-16 as the feature extractor. The basic flow of FasterRCNN as I understand it is, during training:

  1. Generate anchor boxes, regularly spaced every 16 pixels in x and y. There are 9 boxes per anchor location, differing in aspect ratio and area. Then, label anchors as positive, negative, or neutral depending on their IoU overlap with the ground truth object boxes in the image. What the network will ultimately compute is the delta between anchors and the final boxes.

  2. Pass image through pre-trained VGG-16’s convolutional layers to produce a feature map that is image_dimensions // 16 in size. It so happens that each cell in this map corresponds to an anchor position (and there are 9 anchors at each position).

  3. Region proposal network: a simple network that takes the feature map and spits out two predictions: 1) an object score per anchor (0 = background, 1 = object), and 2) box regressions (for each positive anchor, 4 numbers that can be used to scale and translate the anchor to a full-size box). Two loss values are computed here, which will be used during backprop: RPN class loss and RPN box regression loss.

  4. The proposals are sorted by score from highest to lowest and the best N (N=6000) are taken for further processing. Note that you do NOT take only the positive-scoring proposals (e.g., score > 0.5), but the top N whatever they may be. The proposals are represented as a score plus the four coordinates of the box they are predicting, clipped to image boundaries. Non-max suppression is performed and then the top M are taken (M=2000 during training, 300 during evaluation/inference).

  5. The proposal boxes are where an object might be. The data set has 20 object classes. The next step is to label each proposal box and generate a map of target box parameters (just like for the RPN, except that you can think of the proposals as being our new anchors at this stage). Proposal boxes are matched with ground truth boxes based on their overlap (the highest-overlapping ground truth box assigns its label, provided the overlap exceeds a threshold). Proposal boxes that do not have sufficient overlap are labeled as class=0 (background). Backprop is not carried out through these proposal box coordinate calculations. Summary: two ground truth items are generated – proposal class labels and, for non-background proposals, regression targets.

  6. Proposals are sampled. We take 128 proposals (although I think technically the paper used 64) per image and we want 25% of them to be positive ones. They are randomly sampled. In practice, you almost never have as many as 25% positive samples and negatives tend to dominate a little.

  7. Now the feature map from step [2] is passed into an RoIPool layer along with the proposals. Proposals are (N,4), a series of box coordinates, and the feature map is (512,H,W). The RoIPool layer produces (N,512,7,7) little max-pooled maps. I actually wrote my own implementation of this in Keras and eventually want to revisit and get that model working. In PyTorch, this layer type already exists (torchvision provides an RoIPool op).

  8. The pooled features are passed through the detector stage. The detector is very similar to the RPN: it treats the proposals as anchors and then tries to predict regressions that will transform them into the ground truth bounding boxes. Two more loss values are computed: detector class loss and detector regression loss.
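The anchor generation in step [1] can be sketched roughly as follows (scales and ratios follow the paper's defaults; real implementations differ in rounding and centering conventions, which is exactly the kind of subtle discrepancy that comes up later in this thread):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (x1, y1, x2, y2)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center each anchor set on the feature cell's image location.
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for scale in scales:
                for ratio in ratios:  # ratio = h / w here
                    # Keep area ~scale**2 while varying the aspect ratio.
                    w = scale / np.sqrt(ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

a = generate_anchors(2, 3)
print(a.shape)  # (54, 4) -- 2*3 positions x 9 anchors each
```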

That’s pretty much it. For evaluation/testing, there is of course no label assignment anywhere. The top 300 post-NMS region proposals are passed forward and then at the very end, when the detector network spits out a predicted label and boxes, NMS is performed again per-class.

Thank you,


Hey, you almost have an entire blog post on what you do there. :slight_smile:
Fun fact: @chenyuntc used to be very active here in the early days.

But so the reason I’m asking is that from the sound of it, half the effort could be setting up the experiment, but if the reference is Yun Chen’s code and your code uses roughly the same dataset/loader/training-loop, that probably simplifies things a lot.

Have you tried swapping bits of the model out (like backbone first, then anchors, then RPN, then head or so)?

Best regards


Thanks for engaging on this, Thomas. It’s been a tough one to ask questions about because ultimately there isn’t one isolated thing I can point to and ask about. Finding where to even look is the challenge for me :slight_smile:

My goal, if I can figure this out, is to write both a Keras and PyTorch version and throw them up on GitHub with a blog post that details the issues I ran into and how I got it working. Of course, comparing against Yun Chen’s code feels like “cheating” (imagine if I were trying to replicate a detailed paper with no code to reference) but at this point, I’ll do whatever I can :wink:

We don’t share a common data loader. Mine is actually an absurd monstrosity and one of the things I’ve narrowed in on is that the anchor labeling code is wrong in some non-obvious way. It generates anchors that are similar but not identical to Chen’s. We have a lot of the same positive anchors but weirdly, due to some sort of slight numerical difference, I sometimes end up with a few more or less, or different aspect ratios. Originally I was too conservative with my anchor assignment. I intend to completely rewrite this at some point but to unblock myself, I’ve adapted Chen’s code to proceed (and that is what made me confident that our RPN functionality is the same).

The key difference in my code is that I don’t generate 2D tensors of anchor boxes, regression targets, etc. I actually prepare a large 4D tensor of shape (H,W,k,8), where k is the number of anchors and that last dimension is stuffed with ground truth data: anchor valid flag, object/not object, highest overlap GT object box class index (unused), and the 4 regression targets.
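To illustrate that layout (the field ordering and the unused 8th slot are my own guesses at the described format), here is how such a tensor could be unpacked into the flat per-anchor form most implementations use:

```python
import numpy as np

# Hypothetical (H, W, k, 8) ground-truth tensor; last axis holds
# [valid_flag, objectness, gt_class_index, ty, tx, th, tw, unused].
H, W, k = 4, 4, 9
gt = np.zeros((H, W, k, 8), dtype=np.float32)

# Mark one anchor as a valid positive with some regression targets.
gt[1, 2, 3] = [1, 1, 5, 0.1, -0.2, 0.05, 0.3, 0]

# Flatten to the "list of anchors" representation for the loss functions.
valid = gt[..., 0].reshape(-1) > 0
labels = gt[..., 1].reshape(-1)[valid]
targets = gt[..., 3:7].reshape(-1, 4)[valid]
print(labels.shape, targets.shape)  # one valid anchor, four targets
```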

This complicates my loss functions a bit but I’ve written test programs that convert between my format and what his loss functions take and have convinced myself we are computing the same loss in the end, at least for the RPN and probably for the regression targets of the detector. The detector class score might be different.

I’ve encountered all kinds of very subtle bugs in my code over the last several weeks but none of them have made a large difference. Now, there doesn’t seem to be much left to examine, although I do suspect that the bounding box regression targets of the final detector stage might not be converging rapidly and I’m not sure why. They aren’t wrong enough to point to a problem in proposal labeling.

Initial weights for the layers taken from VGG-16 (not just the conv layers at the start but also the two fully connected layers in the final detector stage) may have an impact but I load the same weights that are loaded into Chen’s model (the Caffe weights, which assume images are preprocessed using the original VGG-16 procedure of ImageNet mean subtraction).

One thing the paper neglects to mention but which every implementation seems to do is scale the detector regression targets. I’m doing this too. During training, I also print out the statistics (mean and, more importantly, std dev) of each of the final regression targets: ty, tx, th, tw. I think that the prediction stats should converge to the ground truth target stats over time. They seem to do so in my model but slowly. RPN is much quicker. This is actually the one statistic I have not yet observed in Chen’s model and will do so tomorrow. It may confirm that is where the problem is or it could be a red herring.
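A sketch of that statistic logging (random tensors stand in for real targets; the commonly used normalization constants from py-faster-rcnn are target stds of (0.1, 0.1, 0.2, 0.2), so after scaling the ground truth stds should sit near 1.0):

```python
import torch

def target_stats(t):
    """Per-component mean and std of (ty, tx, th, tw) over positive samples."""
    return t.mean(dim=0), t.std(dim=0)

# Random stand-ins for a batch of ground-truth regression targets, (N, 4).
torch.manual_seed(0)
stds = torch.tensor([0.1, 0.1, 0.2, 0.2])
gt_targets = torch.randn(128, 4) * stds

# Normalize by the assumed stds; the network is trained against these.
norm = gt_targets / stds

gt_mean, gt_std = target_stats(gt_targets)
norm_mean, norm_std = target_stats(norm)
print(gt_std, norm_std)  # normalized stds land near 1
```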

But if Chen’s regression target statistics do converge faster, I’m at a loss to explain what is causing it. At this point, I am very confident that I am feeding in roughly the same number of positive/negative samples (at one point I discovered I was feeding in too few due to a bug in the anchor assignment code). I’ve made some visualizations of the proposal targets and how they evolve over time and both our models look the same at this point.



P.S. Re: those differences in anchor assignment I mentioned, I’m actually quite puzzled that they are having such a strong effect. Yes, this is a very important and fundamental part of the model but the paper suggests the precise sizing of the anchors is not so important and that the scales and ratios they used were selected arbitrarily. I spent a lot of time staring at images of anchors generated by my system and I could never pinpoint any major difference. The sensitivity of FasterRCNN to this is honestly a bit of a surprise.

At any rate, I’ve sidestepped the issue for now and can always return and fix that part once and for all. :slight_smile: