The problem I see with this approach is that the 7 classes are not independent. Admittedly, I cannot give a solid, “scientific-y” explanation why this causes issues.
Intuitively, I would go with a “one-vs-rest” approach – have 3 binary classifiers, one for each label A, B, and C:
Should the text be labelled with A? Yes/No
Should the text be labelled with B? Yes/No
Should the text be labelled with C? Yes/No
You only need to adjust your training data for each of the 3 classifiers accordingly.