One model with multiple outputs vs. multiple single-output models

I need to classify a text (document) against each of 7 label sets (3 binary sets, 3 multi-label sets of 4 to 20 labels each, and one set of 1,500 labels).

To reduce compute costs and data-scientist effort, I am considering building one model that predicts all 7 label sets (a multi-output model) rather than building and fine-tuning 7 separate deep-learning models. I have taken the multi-output approach in the past, but I find it difficult to find good discussions of the pros and cons of each approach.

I would appreciate advice on the criteria you would use to choose between one multi-output model and several single-output models. I understand that the specifics of each deep-learning problem matter in the decision, but what I am after are overarching questions to ask / decision criteria.

Here is one of my takes. Since transfer learning with transformers generally performs well - even when the weights of the underlying pre-trained transformer are frozen - one can assume that adding 7 independent linear layers on top of the shared transformer layers should not significantly reduce accuracy, given a large supervised training dataset.
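For concreteness, here is a minimal PyTorch sketch of that idea: a shared, frozen encoder with 7 independent linear heads, one per label set. The toy `nn.TransformerEncoder` and the head sizes are placeholders (a real setup would use a pre-trained model such as BERT, and the actual label-set sizes), so treat this as an illustration of the architecture, not an implementation.

```python
import torch
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    """Shared frozen encoder with one independent linear head per label set."""

    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        # Toy stand-in for a pre-trained transformer trunk.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Freeze the shared trunk; only the heads are trained.
        for p in self.embed.parameters():
            p.requires_grad = False
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Hypothetical sizes mirroring the question:
        # 3 binary sets, 3 multi-label sets of 4-20 labels, one 1500-label set.
        head_sizes = [1, 1, 1, 4, 10, 20, 1500]
        self.heads = nn.ModuleList(nn.Linear(d_model, n) for n in head_sizes)

    def forward(self, token_ids):
        # Mean-pool the encoder output into one vector per document.
        h = self.encoder(self.embed(token_ids)).mean(dim=1)
        # One logit tensor per label set.
        return [head(h) for head in self.heads]

model = MultiHeadClassifier()
logits = model(torch.randint(0, 1000, (2, 16)))  # batch of 2 docs, 16 tokens each
print([tuple(t.shape) for t in logits])
```

In training, the total loss would be a (possibly weighted) sum of per-head losses, e.g. `BCEWithLogitsLoss` for the binary and multi-label heads and `CrossEntropyLoss` for the 1,500-label set if it is single-label; how those 7 losses are balanced is one of the practical costs of the multi-output approach.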

Thank you in advance for your input.