In the paper “Crafting Adversarial Input Sequences for Recurrent Neural Networks”, the authors evaluate the importance of a word in sequence by looking at the gradients of the embeddings w.r.t. to the outcome. It essentially boils down to calculate the Jacobian of the unfolded representation of the RNN.
How should I go about to do the same with a trained PyTorch RNN model for simple classification? I can see to check
embedding.weight.grad, but I struggle with the recursive nature. I assume/guess that I would have to check the gradient of the respective embedding of a word at each time step during
backward(). Or am I completely on the wrong? I can vaguely follow the formulas in the paper but I don’t know how to implement this approach.