While implementing the attention block used in transformers, I noticed that the output of the following layer has “grad_fn=<UnsafeViewBackward>”:
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProduct(nn.Module):
    def __init__(self, dropout=0.1):
        super(ScaledDotProduct, self).__init__()
        self.drop_out = nn.Dropout(p=dropout)

    def forward(self, Q, K, V, d_k, d_v):
        d_model = Q.size(2)
        # Note: creating the projections inside forward() gives them fresh
        # random weights on every call and keeps them out of the module's
        # registered parameters.
        W_Q = nn.Linear(d_model, d_k)
        W_K = nn.Linear(d_model, d_k)
        W_V = nn.Linear(d_model, d_v)
        dot1 = W_Q(Q)
        dot2 = W_K(K)
        dot3 = W_V(V)
        p_attention = F.softmax(
            dot1.matmul(dot2.transpose(1, 2)) / math.sqrt(d_k), dim=2
        )
        attention1 = self.drop_out(p_attention).matmul(dot3)
        # attention2 = self.drop_out(p_attention.matmul(dot3))
        return attention1


Q = torch.randn(1, 3, 5)
K = torch.randn(1, 3, 5)
V = torch.randn(1, 3, 5)
d_k = 2
d_v = 2
attn = ScaledDotProduct()
print(attn(Q, K, V, d_k, d_v))
```
Is UnsafeViewBackward bad? It seems to come from the line
attention1 = self.drop_out(p_attention).matmul(dot3)
in the forward function, where the dropout output is multiplied with the value matrix.

I also have a second, closely related question about where the dropout comes in in scaled dot-product attention. In the paper “Attention Is All You Need”, the authors say in the Residual Dropout section that “We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.” My interpretation is that the dropout should be applied after the product softmax(QK^T)V. Yet every implementation I have seen so far, including PyTorch’s on line 3367 of the code for multi_head_attention_forward, applies the dropout to softmax(QK^T) and then multiplies by V. Why is the dropout applied there instead of to softmax(QK^T)V? When I followed the PyTorch implementation, I got grad_fn=<UnsafeViewBackward>, whereas when I followed my interpretation, I got “grad_fn=<MulBackward0>”, which seems more normal.
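For context on the first question: as far as I can tell, UnsafeViewBackward is just autograd's record of an internal reshape that batched matmul performs, and gradients flow through it normally. A minimal sketch (my own illustrative code; the exact grad_fn node name can vary across PyTorch versions, so I don't assert it here):

```python
import torch

# Two batched tensors standing in for the attention weights and values.
x = torch.randn(1, 3, 3, requires_grad=True)
v = torch.randn(1, 3, 2, requires_grad=True)

# Batched matmul; depending on version this may go through an internal
# _unsafe_view, which is what shows up in grad_fn.
out = x.matmul(v)
print(out.grad_fn)

# Regardless of the grad_fn name, backward works as usual:
out.sum().backward()
print(x.grad.shape, v.grad.shape)
```

The point is that the grad_fn name reflects an implementation detail of the forward op, not a correctness problem.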
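On the second question, my understanding (an assumption worth checking, not something the paper states explicitly) is that these are two different dropouts: the residual dropout the paper describes is applied to each sub-layer's output, while the dropout on softmax(QK^T) is a separate "attention dropout" on the attention weights that many implementations, including PyTorch's, add as well. A sketch comparing the two placements (illustrative code, not PyTorch's implementation):

```python
import math

import torch
import torch.nn.functional as F

q = torch.randn(1, 3, 2)
k = torch.randn(1, 3, 2)
v = torch.randn(1, 3, 2)
d_k = q.size(-1)

# Attention weights: softmax(QK^T / sqrt(d_k)), shape (1, 3, 3).
weights = F.softmax(q.matmul(k.transpose(1, 2)) / math.sqrt(d_k), dim=-1)

# Placement A ("attention dropout", as in multi_head_attention_forward):
# randomly zero individual attention weights, then aggregate the values.
out_a = F.dropout(weights, p=0.1, training=True).matmul(v)

# Placement B (dropout on the product, closer to my reading of
# "Residual Dropout"): aggregate first, then drop output elements.
out_b = F.dropout(weights.matmul(v), p=0.1, training=True)

print(out_a.shape, out_b.shape)  # both (1, 3, 2)
```

Placement A drops entire key/value contributions for a given query, whereas placement B drops elements of the already-aggregated output, so they regularize differently even though the output shapes match.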