Hierarchical Multi-Label Classification

Hi everyone! This is my first post! I’m excited to be here!

I’m currently exploring multi-label text classification and I was hoping to get some advice.

Specifically, I’m working with over 700 abstracts that I want to classify against more than 1,100 labels. However, the labels have a hierarchical structure, with some labels being subcategories of others. For instance, “Libraries” is a parent label, while “Public libraries” and “Reference libraries” are child labels under the “Libraries” category.

I’m facing a few challenges and I could really use some input.

Problem 1:
There isn’t enough training data available for each label.

Problem 2:
There are simply too many labels to handle, especially considering the hierarchy. On top of that, the hierarchy itself may evolve over time, which is another issue I need to address.

Problem 3:
Lastly, I’m not sure how to input the hierarchy into a pre-trained language model for sequence classification.

Is there another softmax I can use, like a hierarchical softmax?
How do I feed a structure such as a directed acyclic graph of text labels into the model as the prediction targets?

My end goal is to assist cataloguers in the library domain with indexing books.
Any advice would be greatly appreciated!

I’m not sure this will solve your problem exactly, but I found myself in a similar situation a few years ago, so I’d like to just dump here what I did in the hope that it can be of some help. I was studying a location dataset for my country, and the location data provider had developed quite an impressive taxonomy for classifying all the locations (venues). Similar to your case (although this was an exhaustive classification scheme, not a labelling scheme: one location, one class), the taxonomy was hierarchical. For example, “Chinese restaurants” was a subclass of “Asian restaurants”, which was a subclass of “Restaurants”, which was a child of the “Dining and Drinking” category.

I wasn’t doing ML at the time, but I was still trying to statistically profile these class distributions, and the logical problems were manifold. My biggest headache was figuring out how to deal with the fact that many location entities were not fully classified all the way down to the leaf nodes of the taxonomy. To make matters worse, there was no standard “depth” of the class tree at which these entities were classified: some locations had class labels fully resolved all the way down to the leaves, while others were classified just below the root (the coarsest classes). Most of the data mass fell somewhere in the interior of the tree, and the tree was not regular either, so some subsets of classes branched into far more fine-grained classes than others.

For this situation I took a step back from what I was doing and tackled the class taxonomy head on by formally modeling it. By this I mean that I went to the source documentation of the taxonomy, formally modeled the class tree, and built a Python object (using the anytree package, in case you’re interested) so that I could compute statistics on the tree itself and have a way to computationally interrogate the hierarchy in relation to my data. It became a sort of mini-project within the larger one.

My conclusion was that you can’t define a single probability distribution over categories at different levels of the tree if an ancestor–descendant relation exists between any of those categories, because it would violate the laws of probability. It’s easy to see why. In your case, the events “abstract x has label LIBRARIES” and “abstract x has label PUBLIC LIBRARIES” are not independent, since an abstract having the Public Libraries label (or any child label of a parent label) necessarily implies the abstract should also have the Libraries label. For this type of tree, a probability distribution cannot be formally defined on the entire node set (all classes), since some classes are related; you can only define a distribution over class subsets in which no two classes are related by a line of descent. Note that those nodes don’t all have to sit on the same level of the tree.

Although you’re dealing with multi-label tagging rather than an exhaustive classification, the same idea applies. The difference is that instead of one tree for a single exhaustive scheme, you have a set of trees (a forest). Each tree in the taxonomy connects the labels that are related to each other, defining a label class. Labels in different trees are independent of each other, so each tree can be treated separately.

I would suggest (and this is also what I ended up doing) studying all your labels (or, if there’s documentation, just reading that) and modeling them as trees. For each tree/label class, examine how your abstract data is distributed by finding the deepest level of the tree that can be consistently defined for the data you intend to model. The aim is, for each label class, to determine the most resolved level of that particular labelling scheme that your data can support. For example, suppose some abstracts carry “Libraries” with no further specification that would refine the label, while others carry “Public libraries”. Here the lowest you can consistently go is “Libraries”, since the bare “Libraries” label cannot be broken down into any of its child labels without additional information; “Public libraries” (and all the other child labels), however, can be coalesced into the “Libraries” label. Ideally you’d have all the data labelled at the leaves (full resolution), but that’s probably not the case. Coalesce up to the common ancestor label for each label class separately; the labels that remain will then be mutually exclusive and exhaustive (once you account for the null label). Conveniently, this doesn’t just fix the hierarchy problem for probability distributions: it also reduces the dimensionality of your effective label set and aggregates the data, improving the number of samples you have per label. Three for the price of one! That said, studying the labels and their relationships might take a while and will require some tinkering, but in my opinion it might be that or have the data remain unusable.

Hope there’s something there that can help!