Curriculum learning with CNNs

I’m currently working on a project that uses the methodology described in this paper:

Essentially, the target labels are hierarchical (binary --> 7 categories --> 25 categories). As I understand it, the authors first train a model on the binary labels, then fine-tune that same model on the next level of labels (7 categories), and finally on the third level (25 categories).
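To make the staged targets concrete, here is a minimal sketch of deriving each stage's labels from the finest level. The hierarchy below is a made-up toy (5 fine classes, 3 mid-level classes) rather than the paper's 25/7/2 split, and the names FINE_TO_MID, MID_TO_BINARY and labels_at_level are hypothetical.

```python
# Hypothetical toy hierarchy: fine class -> mid-level class -> binary class.
# The real mapping would come from the dataset's label definitions.
FINE_TO_MID = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
MID_TO_BINARY = {0: 0, 1: 1, 2: 1}


def labels_at_level(fine_labels, level):
    """Map fine-grained labels to the targets used at a given training stage."""
    if level == "fine":
        return list(fine_labels)
    mid = [FINE_TO_MID[y] for y in fine_labels]
    if level == "mid":
        return mid
    if level == "binary":
        return [MID_TO_BINARY[m] for m in mid]
    raise ValueError(f"unknown level: {level}")


fine = [0, 2, 4, 1]
stage_targets = {lvl: labels_at_level(fine, lvl) for lvl in ("binary", "mid", "fine")}
```

Each stage would then train on the same inputs with stage_targets[level] as the targets, going from "binary" to "mid" to "fine".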

I’m struggling a bit to understand the correct implementation. I fine-tuned a ResNet model on the binary labels and got fairly good accuracy. I then took that same model, replaced the final classifier layer, and began training on the second level of labels (7 categories), but the accuracy is terrible.
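The head swap described above can be sketched roughly as follows, assuming PyTorch. This is an illustration under stated assumptions, not the paper's implementation: TinyBackbone is a hypothetical stand-in for the ResNet (torchvision ResNets also expose their final layer as `fc`), and swap_head, make_optimizer and the learning rates are made up for the example.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a ResNet; its final layer is named `fc`."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(self.features(x))


def swap_head(model: nn.Module, num_classes: int) -> nn.Module:
    """Replace the final layer with a fresh one; all other weights are kept."""
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


def make_optimizer(model: nn.Module, backbone_lr: float = 1e-4, head_lr: float = 1e-2):
    """Assumed setup: small LR for pretrained layers, larger LR for the new head."""
    head_params = list(model.fc.parameters())
    head_ids = {id(p) for p in head_params}
    backbone_params = [p for p in model.parameters() if id(p) not in head_ids]
    return torch.optim.SGD(
        [
            {"params": backbone_params, "lr": backbone_lr},
            {"params": head_params, "lr": head_lr},
        ],
        momentum=0.9,
    )


model = TinyBackbone(num_classes=2)      # stage 1: binary labels
# ... stage-1 training would happen here ...
model = swap_head(model, num_classes=7)  # stage 2: 7 categories
optimizer = make_optimizer(model)        # built AFTER the swap so it sees the new head
criterion = nn.CrossEntropyLoss()        # expects integer class indices in 0..6

logits = model(torch.randn(4, 3, 32, 32))
loss = criterion(logits, torch.tensor([0, 3, 6, 1]))
```

Two things that may be worth checking in a setup like this: an optimizer created before the swap still holds the old head's parameters and will not update the new layer, and the stage-2 targets need to be class indices in the new label space rather than the binary labels.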

Has anyone implemented a model in a similar fashion, or does anyone know whether this approach is correct?