Integrated Gradients and Text Generation

Hi all,

I am currently trying to explain the generated text of a GPT-2 model using Integrated Gradients and Captum (a rough sketch of my setup is below; I can create a full minimal example if helpful).

My question is: is it at all possible to explain text generation? My current approach is to explain each generated token individually, by viewing the logits of the output layer as a classification problem and varying the context the model receives. As the target, I use the index of the token that is generated without perturbation. However, the attributions to the input are all zero. Am I making a conceptual mistake here? Is there a resource somewhere on explaining text generation? I couldn’t seem to find anything…
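My setup is roughly along the following lines (a simplified sketch, not my exact code; the model size, prompt, and the EOS baseline are just placeholder choices):

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def forward_last_token(input_ids):
    # Treat next-token prediction as a classification over the vocabulary:
    # return only the logits at the last position.
    return model(input_ids).logits[:, -1, :]

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

# The token the unperturbed model would generate next (greedy decoding).
with torch.no_grad():
    target_id = int(forward_last_token(input_ids).argmax(dim=-1))

# Attribute that token's logit to the token-embedding layer; a baseline of
# EOS tokens is one common but arbitrary choice.
lig = LayerIntegratedGradients(forward_last_token, model.transformer.wte)
baseline_ids = torch.full_like(input_ids, tokenizer.eos_token_id)
attributions = lig.attribute(input_ids, baselines=baseline_ids, target=target_id)

# One score per input token: sum over the embedding dimension.
scores = attributions.sum(dim=-1).squeeze(0)
for token, score in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), scores.tolist()):
    print(token, score)
```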

I’m having a really similar problem. Did you manage to find anything on the topic?

The way I solved it was basically what I had written above (I am assuming an autoregressive text generation model like GPT here):

  • run the model to generate one token, use the index of that token as your target for the token-classification output
  • calculate the attribution for that token
  • add the generated token to your input and repeat the two steps above to calculate the attributions for the next token

That will give you an attribution for every token that was generated; you basically just loop autoregressively until your text generation is finished. It is then up to you what you do with those attributions, but you could, for example, calculate the attributions only for the prompt and, at the end, average the attributions you calculated for each generated token to get an attribution for the whole generated text. Note that in that case you only need to calculate attributions for your initial prompt; if you are using Captum, you can feed in the generated tokens via additional_forward_args (a simplified sketch follows below).
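Something along these lines (a simplified sketch, not my actual code; the model, prompt, greedy decoding, EOS baseline, and the 20-token limit are all illustrative choices):

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def forward_func(prompt_ids, generated_ids):
    # Attribution is computed only w.r.t. the prompt; the tokens generated so
    # far come in via additional_forward_args and are simply appended.
    input_ids = torch.cat([prompt_ids, generated_ids], dim=1)
    # Next-token prediction as classification: logits at the last position.
    return model(input_ids).logits[:, -1, :]

lig = LayerIntegratedGradients(forward_func, model.transformer.wte)

prompt_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
baseline_ids = torch.full_like(prompt_ids, tokenizer.eos_token_id)  # arbitrary baseline
generated_ids = torch.empty((1, 0), dtype=torch.long)
prompt_len = prompt_ids.size(1)

per_token_attributions = []
for _ in range(20):  # generate at most 20 tokens
    # 1) run the model once to get the next token (greedy decoding here)
    with torch.no_grad():
        next_id = forward_func(prompt_ids, generated_ids).argmax(dim=-1, keepdim=True)
    # 2) attribute that token's logit to the prompt tokens
    attr = lig.attribute(
        prompt_ids,
        baselines=baseline_ids,
        target=int(next_id),
        additional_forward_args=(generated_ids,),
    )
    # The embedding-layer output covers prompt + generated tokens, so keep only
    # the prompt part; sum over the embedding dimension for one score per token.
    per_token_attributions.append(attr.sum(dim=-1).squeeze(0)[:prompt_len])
    # 3) append the generated token and repeat
    generated_ids = torch.cat([generated_ids, next_id], dim=1)
    if int(next_id) == tokenizer.eos_token_id:
        break

# e.g. average over all generated tokens -> one attribution per prompt token
prompt_attribution = torch.stack(per_token_attributions).mean(dim=0)
print(list(zip(tokenizer.convert_ids_to_tokens(prompt_ids[0]), prompt_attribution.tolist())))
```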

Hope this helps and is understandable. The sketch above is simplified; unfortunately my actual code became a bit too complicated to use directly as an example…

A bit late to the party, but I would be highly interested in how you solved this code-wise!