How to find the most activated part of the feature matrix that activated LSTM

Hi Pytorch Community,

I am trying a CNN + LSTM model: the CNN extracts features, and sequences of those features are passed to an LSTM for classification. The CNN is trained separately and is used only as a bottleneck layer.
I am mainly trying to find how the class activation heat map changes as frames are passed to an LSTM.

So if I have a 53x1024 matrix, where 53 is the sequence length, how do I find which part of this matrix (or which part of each vector in the length-53 sequence) activated the LSTM + MLP part? The LSTM has a hidden dimension of 1000 and the MLP part is nn.Linear(1000,15), where 15 is the number of classes.
After I find this I thought maybe I’ll do visual back prop through the CNN using these most activated parts as a binary map.

Hi Gabriel,

A starting point can be the gradient of the score (or of the cross entropy with the predicted label, but that might run into saturation issues) with respect to the LSTM input (see "Image-Specific Class Saliency Visualisation" in Simonyan et al., "Deep Inside Convolutional Networks"). You can compute the gradient using torch.autograd.grad.

Best regards


Thank you for your response Tom!
Just to be clear, in pseudo-code terms this would mean passing my feature vector to the LSTM classifier, taking the class score before the logsoftmax (which is a 53x15 matrix), and backpropagating this score through the network using .backward(score).

Then, using this, I would find the maximally activating points in the feature vector, perhaps use them as a binary encoded mask, and backpropagate each of those vectors (since a single vector represents an image) through my CNN for all images (the vectors in the length-53 sequence).

Is that right?
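For illustration, the masking step described above could look roughly like the sketch below. It assumes the gradients with respect to the 53x1024 feature matrix are already available (random placeholders here) and keeps the k largest-magnitude entries per frame; the value of k and the top-k rule are assumptions, not something fixed by this thread.

```python
import torch

torch.manual_seed(0)

# Placeholder for gradients w.r.t. the 53x1024 feature matrix (random here).
grads = torch.randn(53, 1024)
k = 50  # hypothetical number of feature entries to keep per frame

magnitude = grads.abs()
topk_idx = magnitude.topk(k, dim=1).indices   # (53, k) column indices per row

# Binary mask: 1 at the k largest-magnitude entries of each row, 0 elsewhere.
mask = torch.zeros_like(magnitude)
mask.scatter_(1, topk_idx, 1.0)
```

Each row of mask could then serve as the starting signal for one frame when going back through the CNN.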

I would probably use the torch.autograd.grad function.
Say you have the inputs to the LSTM as features (if you don’t train anymore and want to have grads, you would need to re-wrap it as features_needing_grad = Variable(, requires_grad=True) or so) and the score of the predicted class as score[i], you should be able to do

grads = torch.autograd.grad(score[i], features_needing_grad)

and use these gradients.
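As a rough end-to-end sketch of this in current PyTorch (0.4 and later, where tensors carry requires_grad directly): the sizes come from this thread, while the random features, the freshly initialised nn.LSTM and nn.Linear, and the choice of the last time step are assumptions made only so the example runs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes from the thread: 53 steps, 1024 features, 1000 hidden units, 15 classes.
seq_len, feat_dim, hidden_dim, n_classes = 53, 1024, 1000, 15

lstm = nn.LSTM(feat_dim, hidden_dim)            # expects (seq_len, batch, feat_dim)
classifier = nn.Linear(hidden_dim, n_classes)

# Stand-in for the CNN bottleneck features (random, purely illustrative).
features = torch.randn(seq_len, 1, feat_dim)
features_needing_grad = features.clone().requires_grad_(True)

hidden_states, _ = lstm(features_needing_grad)  # (53, 1, 1000)
scores = classifier(hidden_states).squeeze(1)   # (53, 15), before any softmax

# Score of the predicted class at the last time step (an arbitrary choice here).
last_scores = scores[-1]
i = last_scores.argmax()
(grads,) = torch.autograd.grad(last_scores[i], features_needing_grad)

saliency = grads.squeeze(1).abs()               # (53, 1024) gradient magnitudes
```

Large entries of saliency mark the features whose small changes would move the chosen score the most.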

Best regards


Sounds good, one question though: during training, for a feature matrix with sequence length 53, my ground truth also had length 53.

During testing, when I pass a length-53 sequence, the output of the network won’t be the same class throughout. It may be that the first 16 outputs for the sequence are of class ‘A’ and the next 37 are of class ‘B’; for my final result I take the majority vote of these predictions.

So in the case of taking gradients, conceptually what would be done? Should I take the derivative of output[class_A_idx] with respect to the input, slice it so that I only have the first 16 feature vectors, and concatenate it with the derivative for the rest?

The whole reason why I want such a saliency map is that I want to see whether the transient ‘heat map’ of the first 16 frames would gradually converge to the ‘steady state heat map’ of a training video belonging to class B.
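One possible way to look at this, sketched below, is to take a separate gradient for every frame: for frame t, differentiate the score of the class predicted at t with respect to the whole input sequence. This is only one reading of the question, not an approach confirmed in this thread. The feature and hidden sizes are shrunk to keep it fast, and the model and features are random stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 53 steps and 15 classes as in the thread; feature and hidden sizes are
# reduced here only to keep the sketch quick to run.
seq_len, feat_dim, hidden_dim, n_classes = 53, 64, 32, 15

lstm = nn.LSTM(feat_dim, hidden_dim)
classifier = nn.Linear(hidden_dim, n_classes)

features = torch.randn(seq_len, 1, feat_dim, requires_grad=True)
hidden_states, _ = lstm(features)
scores = classifier(hidden_states).squeeze(1)   # (53, 15)
predicted = scores.argmax(dim=1)                # predicted class for each frame

# saliency_over_time[t] holds the gradient magnitude of frame t's
# predicted-class score with respect to every input frame.
saliency_over_time = []
for t in range(seq_len):
    (g,) = torch.autograd.grad(scores[t, predicted[t]], features, retain_graph=True)
    saliency_over_time.append(g.squeeze(1).abs())
saliency_over_time = torch.stack(saliency_over_time)   # (53, 53, 64)
```

With a unidirectional LSTM the output at frame t depends only on frames up to t, so the entries for later input frames should come out as zero. Comparing saliency_over_time[t] across t is one way to watch how the map changes as frames arrive.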

This returns a runtime error: One of the differentiated Variables appears to not have been used in the graph.

I did

img_tensor = transforms.Compose(...)(numpy_img.transpose(1,2,0))
out = densenet(Variable(img_tensor.unsqueeze(0).cuda(),requires_grad=True))
out = out.view(-1)
grads = grad(out[1],Variable(,requires_grad=True))

where out is a 7 dim variable which is the output of the network, and img_tensor_input is a (1,3,224,224) tensor

Only Variables are tracked, so you want to wrap it once.

img_tensor = transforms.Compose(...)(numpy_img.transpose(1,2,0))
img_tensor = Variable(img_tensor.unsqueeze(0).cuda(),requires_grad=True)
out = densenet(img_tensor)
out = out.view(-1)
grads = grad(out[1],img_tensor)

or somesuch.

Best regards



How might one fix this in 0.4 when the difference between a tensor and variable is no longer explicit?

Either use x = torch.tensor(x, requires_grad=True) (this also works with other factory functions, i.e. functions returning tensors) or set my_tensor.requires_grad = True (if that throws an error or doesn’t work, go back to the first option). The key here is to impose requires_grad on a leaf node that has not been used in a calculation yet.
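A minimal sketch of the second option, for PyTorch 0.4 or later. The small nn.Sequential model is a hypothetical stand-in for densenet (with 7 outputs, as in the post above), and the random image replaces the real preprocessed input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny hypothetical stand-in for densenet, with 7 outputs as in the post above.
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 7),
)

img_tensor = torch.rand(1, 3, 224, 224)   # placeholder for the preprocessed image
img_tensor.requires_grad = True           # set on the leaf before it is used

out = model(img_tensor).view(-1)          # 7 values
(grads,) = torch.autograd.grad(out[1], img_tensor)
```

The gradient is requested with respect to the very tensor that went into the model, which is what avoids the "not have been used in the graph" error.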

Best regards


P.S.: But on the original topic of the “most activated part of the feature matrix”: I think that the gradient does not work terribly well for that, because the tanh nonlinearities have a low gradient when they activate strongly (this is known as saturation in the training context, and it bites here, too). There are people proposing a different relevance propagation that has “skip the nonlinearities during backward” as a key ingredient. In fact, I might have a demo of that in PyTorch to share in a while.
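The saturation effect is easy to see on tanh alone. The small check below (input values chosen arbitrarily) uses the identity d/dx tanh(x) = 1 - tanh(x)^2 and shows the gradient shrinking as the activation approaches 1.

```python
import torch

# Arbitrary inputs: no activation, moderate activation, strong activation.
x = torch.tensor([0.0, 2.0, 5.0], requires_grad=True)
y = torch.tanh(x)

# Gradient of each tanh output with respect to its own input.
(grads,) = torch.autograd.grad(y.sum(), x)

# Analytic derivative for comparison: 1 - tanh(x)^2.
expected = 1 - y.detach() ** 2
```

The gradient is 1 at x = 0 and falls below 1e-3 at x = 5, even though that unit is the most strongly activated of the three.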