How do I compute the feature attributions for negatively labeled examples (i.e., target = 0) when the model output is a scalar? Currently, integrated gradients gives me the feature attributions for all examples with respect to target = 1.
I assume this is a binary classification problem and you are using a sigmoid in your output layer.
As far as I can tell, choosing a specific output class as the target is not supported yet when the output layer is a single sigmoid unit. If you want integrated gradients (Captum) to give you the feature attributions with respect to a specific target, you should use softmax instead of sigmoid.
Thanks for your reply! I opted for sigmoid instead of softmax for binary classification to avoid doubling the number of parameters in the final layer. I wonder if there is a simple workaround, e.g., 1 - sigmoid(x) so target 1 becomes target 0, and vice versa.
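In fact, in the case of binary classification, a positive (+) attribution pushes the prediction toward the positive class, and a negative (-) attribution pushes it toward the negative class. Since the gradient of 1 - sigmoid(x) is just the negated gradient of sigmoid(x), the attributions for target 0 are the sign-flipped attributions for target 1 (given the same baseline). Hope this helps.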
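For anyone who wants to check the 1 - sigmoid(x) idea from this thread numerically, here is a minimal sketch. It does not use Captum: it is a plain-Python approximation of integrated gradients (midpoint Riemann sum with finite-difference gradients) on a made-up logistic model. The weights, the bias, the input, the all-zeros baseline, and every function name below are assumptions chosen for illustration only.

```python
import math

# Hypothetical toy model: logistic regression with fixed, made-up parameters.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prob_positive(x):
    """Scalar sigmoid output, read as P(class 1)."""
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS)


def prob_negative(x):
    """The wrapper discussed in the thread: P(class 0) = 1 - P(class 1)."""
    return 1.0 - prob_positive(x)


def integrated_gradients(f, x, baseline, n_steps=200, eps=1e-5):
    """Approximate integrated gradients of scalar function f at x.

    Uses a midpoint Riemann sum along the straight path from baseline to x,
    with central finite differences standing in for the true gradient.
    """
    attributions = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(len(x)):
            upper = list(point)
            lower = list(point)
            upper[i] += eps
            lower[i] -= eps
            grad_i = (f(upper) - f(lower)) / (2 * eps)
            attributions[i] += grad_i * (x[i] - baseline[i]) / n_steps
    return attributions


x = [1.0, 0.5, 1.0]          # made-up input
baseline = [0.0, 0.0, 0.0]   # assumed all-zeros baseline

attr_pos = integrated_gradients(prob_positive, x, baseline)
attr_neg = integrated_gradients(prob_negative, x, baseline)

for i, (p, n) in enumerate(zip(attr_pos, attr_neg)):
    print(f"feature {i}: wrt class 1 = {p:+.4f}, wrt class 0 = {n:+.4f}")
```

The two attribution vectors should come out as mirror images of each other, so in this toy setting the sign of the target-1 attribution already tells you the direction for target 0. With Captum itself, the analogous workaround would presumably be to pass a wrapper such as `lambda inp: 1 - model(inp)` as the forward function to `IntegratedGradients`; I have not verified that here, so treat it as a suggestion rather than a tested recipe.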