Adding mask to LSTM with attention causes gradient computation exception

def forward(self, input, hidden, encoder_outputs, mask):
    """
    Run LSTM through 1 time step

    - input: <1 x batch_size x N_LETTER>
    - hidden: (<num_layer x batch_size x hidden_size>, <num_layer x batch_size x hidden_size>)
    - lstm_out: <1 x batch_size x N_LETTER>
    """
    # Incorporate attention into LSTM input
    hidden_cat = torch.cat((hidden[0], hidden[1]), dim=2)
    # attn_weights is 1 x batch_sz x MAX_NAME_LEN
    attn_weights = F.softmax(self.attn(torch.cat((input, hidden_cat), 2)), dim=2)
    # Set all pad characters to negative infinity
    attn_weights[mask] = float('-inf')
    # Softmax to re-adjust weights so pad chars have no weight (dim=2 corresponds to the name)
    attn_weights = torch.softmax(attn_weights, dim=2)
    attn_applied = torch.bmm(attn_weights.transpose(0, 1), encoder_outputs.transpose(0, 1)).transpose(0, 1)
    attn_output = torch.cat((input, attn_applied), 2)
    attn_output = F.relu(self.attn_combine(attn_output))
    # Run LSTM
    lstm_out, hidden = self.lstm(attn_output, hidden)
    lstm_out = self.fc1(lstm_out)
    lstm_out = self.softmax(lstm_out)
    return lstm_out, hidden

This is my forward function. As you can see, I'm passing in a mask that sets all pad characters in the input to negative infinity, then applying a softmax over the weights. I read in a blog that this is how you're supposed to apply masking to attention, but when I try to compute the loss I get this exception:

“Exception has occurred: RuntimeError
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 2048, 40]], which is output 0 of SoftmaxBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).”

You cannot use indexed assignment here, because the backward pass of the preceding softmax needs the values you are overwriting.
The typical solution is to use torch.where(mask, neginf, attn_weights), where neginf is torch.tensor(float('-inf'), device=device) (I would recommend precomputing that). This builds a new tensor instead of modifying the softmax output in place.
I'm not sure your double softmax makes much sense to me, but that's probably my limited imagination.
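Here is a minimal sketch of the out-of-place masking, using small hypothetical shapes (1 x batch x MAX_NAME_LEN) and assuming the question's convention that mask is True at pad positions:

```python
import torch

# Hypothetical shapes: 1 x batch_size x MAX_NAME_LEN
attn_scores = torch.randn(1, 4, 6, requires_grad=True)
attn_weights = torch.softmax(attn_scores, dim=2)

# mask is True where the position is a pad character (assumed convention)
mask = torch.zeros(1, 4, 6, dtype=torch.bool)
mask[:, :, 4:] = True

# Precompute -inf once (move to the right device in real code)
neginf = torch.tensor(float('-inf'))

# Out-of-place masking: torch.where builds a NEW tensor, so the original
# softmax output is left untouched and autograd can still use it.
masked = torch.where(mask, neginf, attn_weights)
remasked = torch.softmax(masked, dim=2)

# backward() runs without the in-place-modification error
remasked.sum().backward()
```

Note the argument order follows the mask convention above (True means pad, so the True branch selects neginf); flip the first two arguments if your mask marks valid positions instead.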

Best regards



Don’t I need to do the softmax again after applying the mask in order to get the new distribution of weights? Won’t doing batch matrix multiplication with negative-infinity values really mess with it? Unless maybe the ReLU layer adjusts for that?

Well, maybe masking things to -inf before the first softmax is a better strategy. But I wouldn’t know. In my experience, two softmaxes in a row usually are not a good idea.
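That strategy can be sketched as follows: mask the raw scores before the one and only softmax. The softmax maps -inf entries to exactly zero weight, so the subsequent bmm never sees an infinity. Shapes and the mask convention (True = pad) are assumptions, as above:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw attention scores, before any softmax
scores = torch.randn(1, 4, 6, requires_grad=True)
pad_mask = torch.zeros(1, 4, 6, dtype=torch.bool)
pad_mask[:, :, 4:] = True  # True = pad position (assumed convention)

# masked_fill (unlike masked_fill_) is out of place, so it is autograd-safe.
# A single softmax then yields a proper distribution over valid positions,
# with exactly zero weight on pads.
weights = F.softmax(scores.masked_fill(pad_mask, float('-inf')), dim=2)

weights.sum().backward()
```

This avoids both the in-place modification and the double softmax.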