Hello, I have trained a German BERT model and run it against my test set. I know that my data is imbalanced, but I will leave that be for now.
The point of my experiment was to see what I can get out of the model with the data at hand. However, I am still quite new to this topic and would very much appreciate it if someone could answer some of my questions. The table below shows part of my classification report.
Question 1
If I understand correctly, precision means how many of the predicted instances of a class are actually correct, and recall says how many samples of that class have been found. So when we look at the class fonds, you could say the model finds quite a lot of the actual fonds samples, yet at the same time many samples from other classes are falsely classified as fonds, because precision is not that high. So other classes suffer because fonds 'steals' samples from them, is that correct?
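That reading matches the standard definitions. A minimal pure-Python sketch with invented labels (not your actual data) shows the high-recall, lower-precision pattern you describe for fonds:

```python
# Toy illustration of per-class precision and recall.
# The labels below are made up for illustration only.
y_true = ["fonds", "fonds", "fonds", "allgemein", "allgemein", "none_cat"]
y_pred = ["fonds", "fonds", "fonds", "fonds",     "fonds",     "none_cat"]

# Count true positives, false positives, false negatives for "fonds".
tp = sum(1 for t, p in zip(y_true, y_pred) if t == "fonds" and p == "fonds")
fp = sum(1 for t, p in zip(y_true, y_pred) if t != "fonds" and p == "fonds")
fn = sum(1 for t, p in zip(y_true, y_pred) if t == "fonds" and p != "fonds")

precision = tp / (tp + fp)  # 3 / 5 = 0.6: some "fonds" predictions are wrong
recall = tp / (tp + fn)     # 3 / 3 = 1.0: every true fonds sample was found

print(precision, recall)  # -> 0.6 1.0
```

Here the two allgemein samples misclassified as fonds are exactly the "stolen" samples: they lower fonds' precision and simultaneously lower allgemein's recall.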
Question 2
I looked around but could not find an answer: is there a threshold at which one would say a model is good enough? Let's say everything above 70% is pleasing. Or is it necessary to check during inference whether the model is behaving correctly?
Question 3
In one case, my test data for the auslandskrankenversicherung class has 0 samples. Thus precision as well as recall are zero, and these zeros have a disproportionate impact on the macro average.
If I got it right, macro averaging focuses on the average performance of each class, while micro averaging takes the total number of samples into consideration. Thus, micro would be the optimistic estimate of overall performance and macro the pessimistic one.
                             precision  recall  f1-score  support
none_cat                          0.94    0.85      0.89      722
fonds                             0.65    0.87      0.75       39
kontaktdaten                      0.60    0.60      0.60        5
lebensversicherung                0.96    0.94      0.95      236
kundenberater                     0.92    0.98      0.95      146
altersvorsorge                    0.94    0.90      0.92      129
anschrift                         0.73    1.00      0.84        8
allgemein                         0.52    0.87      0.65      118
...                                ...     ...       ...      ...
auslandskrankenversicherung       0.00    0.00      0.00        0
none_int                          0.88    0.88      0.88      901
widerruf                          1.00    0.60      0.75        5
steuer-id                         1.00    1.00      1.00        2
dynamik                           1.00    1.00      1.00       10
vertragsauskunfte                 0.71    0.73      0.72      245
kundigung                         0.92    0.96      0.94      102
micro avg                         0.86    0.87      0.86     3416
macro avg                         0.79    0.79      0.78     3416
weighted avg                      0.87    0.87      0.87     3416
samples avg                       0.86    0.87      0.87     3416
Looking at the micro and macro averages, would you say the model is not that bad?
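Your intuition about the zero-support class can be checked numerically. A small sketch with invented per-class counts (not the numbers from the report above) shows how an empty class drags the macro average down while leaving the micro average untouched, because micro pools the raw counts before dividing:

```python
# Sketch: micro vs macro precision with one zero-support class.
# The tp/fp counts are invented for illustration.
classes = {
    "big_class":   {"tp": 90, "fp": 5},
    "small_class": {"tp": 3,  "fp": 2},
    "empty_class": {"tp": 0,  "fp": 0},  # zero support -> score of 0.0
}

def prec(c):
    denom = c["tp"] + c["fp"]
    return c["tp"] / denom if denom else 0.0  # convention: 0.0 on empty class

# Macro: unweighted mean over classes; the empty class pulls it down hard.
macro_p = sum(prec(c) for c in classes.values()) / len(classes)

# Micro: pool all counts first; the empty class contributes nothing.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro_p = tp / (tp + fp)

print(round(macro_p, 3), round(micro_p, 3))  # -> 0.516 0.93
```

If the empty class is an artifact of the test split rather than a real target, scikit-learn's `classification_report` accepts a `labels` argument that lets you restrict which classes enter the averages, which would stop the zero rows from skewing the macro figures.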