I am trying a CNN + LSTM model, using the CNN for feature extraction and then passing sequences of the features to an LSTM for classification. The CNN is trained separately and is just used as a bottleneck layer.
I am mainly trying to find how the class activation heat map changes as frames are passed to an LSTM.
So if I have a 53x1024 matrix, where 53 is the sequence length, how do I find which part of this matrix, or which part of each of the vectors in the 53-length sequence, activated the LSTM + MLP part? (The LSTM has a hidden dim of 1000, and the MLP part is nn.Linear(1000, 15), where 15 is the number of classes.)
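For concreteness, here is a minimal sketch of the classifier head as I understand it from the description above. The class name, batch handling, and shapes are my assumptions, not actual code from the model:

```python
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    """Hypothetical LSTM + MLP head over precomputed CNN features."""
    def __init__(self, feat_dim=1024, hidden_dim=1000, n_classes=15):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, feats):          # feats: (batch, 53, 1024)
        out, _ = self.lstm(feats)      # out:   (batch, 53, 1000)
        return self.fc(out)            # scores: (batch, 53, 15), one per frame

model = SeqClassifier()
feats = torch.randn(1, 53, 1024)       # one 53-frame sequence of CNN features
scores = model(feats)
print(scores.shape)  # torch.Size([1, 53, 15])
```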
After I find this, I thought maybe I'll do visual backprop through the CNN, using these most activated parts as a binary map.
Thank you for your response Tom!
Just to be clear, pseudo-syntactically this would be: passing my feature vector to the LSTM classifier, finding the class scores before the log-softmax (which form a 53x15 matrix), and backwarding that score through the network using .backward(score).
Then, using this, I would find the maximally activating points in the feature vector, perhaps use them as a binary-encoded mask, and backward each of those vectors (since a single vector represents an image) through my CNN for all images (vectors in the 53-length sequence).
I would probably use the torch.autograd.grad function.
Say you have the inputs to the LSTM as features (if you don't train anymore and want to have grads, you would need to re-wrap it as features_needing_grad = Variable(features.data, requires_grad=True) or so) and the score of the predicted class as score[i]; you should be able to do
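A sketch of what that call might look like in current PyTorch (the `Variable` wrapping above is the old API). The scoring function here is a dummy stand-in for the real LSTM classifier, just to make the shapes concrete:

```python
import torch

# Re-wrap the features as a leaf tensor that records gradients
# (modern equivalent of Variable(features.data, requires_grad=True)).
features = torch.randn(53, 1024)
features_needing_grad = features.detach().clone().requires_grad_(True)

# Dummy per-frame "class score" standing in for the LSTM + MLP forward pass;
# in the real model this would be the predicted class's score at each frame.
score = (features_needing_grad ** 2).sum(dim=1)   # shape (53,)

i = 0  # frame of interest
grad, = torch.autograd.grad(score[i], features_needing_grad)
print(grad.shape)  # torch.Size([53, 1024]); only row i is nonzero here
```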
Sounds good, one question though - during training, for a 53-length sequence feature matrix, my ground truth was 53 labels long.
During testing, when I pass a 53-length sequence, the output of the network won't be the same class throughout; it may be that the first 16 outputs of the sequence were of class 'A' and the next 37 were of class 'B'. For my final result I take the majority vote of these predictions.
So in the case of taking gradients, conceptually what should be done? Should I take the derivative of output[class_A_idx] w.r.t. the input, slice it so that I only have the first 16 feature vectors, and concatenate it with the derivative of the rest?
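The splicing idea above could be sketched like this. Everything here is illustrative: `model` is assumed to map a (53, 1024) feature matrix to (53, 15) per-frame scores, and the function names are mine:

```python
import torch

def spliced_saliency(model, feats, class_a, class_b, split=16):
    """Gradient of class A's score for the first `split` frames,
    spliced with the gradient of class B's score for the rest."""
    feats = feats.detach().clone().requires_grad_(True)
    scores = model(feats)                                      # (53, 15)
    grad_a, = torch.autograd.grad(scores[:split, class_a].sum(), feats,
                                  retain_graph=True)
    grad_b, = torch.autograd.grad(scores[split:, class_b].sum(), feats)
    return torch.cat([grad_a[:split], grad_b[split:]], dim=0)  # (53, 1024)

# Usage with a stand-in linear "classifier" just to check shapes:
model = torch.nn.Linear(1024, 15)
sal = spliced_saliency(model, torch.randn(53, 1024), class_a=0, class_b=1)
print(sal.shape)  # torch.Size([53, 1024])
```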
The whole reason I want such a saliency map is that I want to see if the transient "heat map" of the first 16 frames would gradually converge to the "steady state heat map" of a training video belonging to class B.
Either use x = torch.tensor(x, requires_grad=True) (this also works with other factory functions, i.e. things returning tensors) or use my_tensor.requires_grad = True (if that throws an error or doesn't work, go back to the first one). The key here is to impose requires_grad on a leaf node that has not been used in a calculation yet.
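Both variants in a minimal example (leaf tensors, gradients recorded after backward):

```python
import torch

# Factory-function way: requires_grad set at creation time.
x = torch.tensor([1.0, 2.0], requires_grad=True)

# Flag-setting way: works because y is still a leaf,
# not yet used in any calculation.
y = torch.randn(3)
y.requires_grad = True

z = x.sum() + y.sum()
z.backward()
print(x.grad)  # tensor([1., 1.])
print(y.grad)  # tensor([1., 1., 1.])
```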
Best regards
Thomas
P.S.: But on the original topic of the "most activated part of the feature matrix": I think the gradient does not work terribly well for that, because the tanh nonlinearities have a low gradient when they activate strongly (the small gradient is known as saturation in the training context, and it bites here, too). There are people proposing a different relevance propagation that has "skip the nonlinearities during backward" as a key ingredient: http://aclweb.org/anthology/W/W17/W17-5221.pdf . In fact, I might have a demo of that in PyTorch to share in a while.