Integrated gradients for negative labeled examples

chsher · May 21, 2021, 4:50pm

How do I compute feature attributions for negatively labeled examples (i.e., target = 0) when the model output is a scalar? Currently, integrated gradients gives me the feature attributions for all examples with respect to target = 1.

hoangle_tttm · June 22, 2021, 10:14am

I guess your problem is binary classification and you are using a sigmoid in your output layer.
As far as I can tell, selecting a specific target is not supported yet when the output layer is a sigmoid. If you want integrated gradients (Captum) to give you the feature attributions with respect to a specific target, you should use softmax instead of sigmoid.

chsher · June 22, 2021, 3:34pm

Thanks for your reply! I opted for sigmoid instead of softmax for binary classification to avoid doubling the number of parameters in the final layer. I wonder if there is a simple workaround, e.g., using 1 - sigmoid(x) as the output so that target 1 becomes target 0, and vice versa.

hoangle_tttm · June 23, 2021, 3:15am

In fact, in the case of binary classification, a positive (+) attribution pushes the output toward the positive class, and a negative (-) attribution pushes it toward the negative class. Hope this helps.
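
To make the sign argument concrete, here is a minimal, dependency-free sketch (plain Python, not Captum code). It assumes a hypothetical toy model p(class 1) = sigmoid(w . x + b) with made-up weights and an all-zeros baseline, and approximates integrated gradients with a midpoint Riemann sum. Under those assumptions, the attributions for the negative-class output 1 - sigmoid(...) come out as the negation of the attributions for the positive-class output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by (input - baseline), feature by feature.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model (weights and bias are made up for illustration).
W = [1.5, -2.0, 0.5]
B = 0.1


def prob_pos(x):
    """Scalar model output: probability of the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + B)


def grad_pos(x):
    """Analytic gradient of prob_pos with respect to the input."""
    p = prob_pos(x)
    return [p * (1.0 - p) * w for w in W]


def grad_neg(x):
    """Gradient of the negative-class output 1 - prob_pos(x)."""
    return [-g for g in grad_pos(x)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]

attr_pos = integrated_gradients(grad_pos, x, baseline)  # w.r.t. class 1
attr_neg = integrated_gradients(grad_neg, x, baseline)  # w.r.t. class 0

print("class 1:", attr_pos)
print("class 0:", attr_neg)
```

So for a scalar sigmoid output, flipping the sign of the attributions should give the view for target = 0. The attributions for class 1 also sum (approximately) to prob_pos(x) - prob_pos(baseline), which is a handy sanity check.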