In this paper, the authors modify the Tacotron 2 GST module so that it can be trained in a semi-supervised way with a small amount of labeled data.
For reference, I started from the vanilla GST implementation from Mozilla.
They reduce the attention to a single head and use the attention weights as the outputs for the target class.
This is my code:
import torch
import torch.nn as nn
import torch.nn.functional as F
# LinearNorm is the linear layer from the Tacotron 2 code base (not shown here).


class MultiHeadAttention(nn.Module):
    # Despite the name, this version uses a single attention head.
    def __init__(self, query_dim, key_dim, num_units):
        super().__init__()
        self.num_units = num_units
        self.key_dim = key_dim
        self.W_query = LinearNorm(query_dim, num_units, bias=False)
        self.W_key = LinearNorm(key_dim, num_units, bias=False)
        self.W_value = LinearNorm(key_dim, num_units, bias=False)

    def forward(self, query, key):
        querys = self.W_query(query)  # [N, T_q, num_units]
        keys = self.W_key(key)  # [N, T_k, num_units]
        values = self.W_value(key)  # [N, T_k, num_units]

        # score = softmax(QK^T / (d_k ** 0.5))
        scores = torch.matmul(querys, keys.transpose(1, 2))  # [N, T_q, T_k]
        scores = scores / (self.key_dim ** 0.5)
        scores_before_softmax = scores  # raw scores, kept for the classification loss
        scores = F.softmax(scores, dim=-1)

        out = torch.matmul(scores, values)  # [N, T_q, num_units]
        # With T_q == 1, squeeze(1) turns the raw scores into [N, T_k].
        return out, scores_before_softmax.squeeze(1)
But I am not sure I got everything right.
Why is this line needed?
scores = scores / (self.key_dim ** 0.5)
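My current understanding of the scaling, as a minimal sketch: the dot product of two random vectors of dimension d has a standard deviation of roughly sqrt(d), so without the division the softmax inputs grow with d and the softmax saturates. The dimension `d = 256` and the random tensors below are made up for illustration, they are not taken from the model.

```python
import torch

torch.manual_seed(0)

d = 256  # hypothetical key dimension, chosen only for this illustration
q = torch.randn(10000, d)
k = torch.randn(10000, d)

# Row-wise dot products between unit-variance random vectors.
raw = (q * k).sum(dim=-1)
scaled = raw / d ** 0.5

raw_std = raw.std().item()        # grows like sqrt(d)
scaled_std = scaled.std().item()  # stays close to 1 regardless of d
print(raw_std, scaled_std)
```

If this is right, the division only keeps the scores in a range where the softmax still has useful gradients, and it should use the dimension of the projected keys.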
Another question: how should I handle the attention weights? Is it OK to return the weights before the softmax and apply a cross-entropy loss to them?
Does ignore_index in cross entropy work the way I think it does, i.e. I set the target to -100 for all unlabeled data?
The cross-entropy loss hovers around 2 and does not decrease.
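To make this concrete, here is a small check of what I believe `F.cross_entropy` does: it applies `log_softmax` internally, so it expects raw scores rather than probabilities. The shapes (5 samples, 10 classes) and the random tensors are placeholders, not my real data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

scores = torch.randn(5, 10)           # placeholder raw attention scores [N, T_k]
targets = torch.randint(0, 10, (5,))  # placeholder class labels [N]

# Cross entropy on raw scores ...
loss_from_logits = F.cross_entropy(scores, targets)
# ... matches log_softmax followed by the negative log-likelihood loss.
loss_manual = F.nll_loss(F.log_softmax(scores, dim=-1), targets)

# Feeding already-normalized weights applies softmax twice and gives a different value.
loss_double_softmax = F.cross_entropy(F.softmax(scores, dim=-1), targets)

print(loss_from_logits.item(), loss_manual.item(), loss_double_softmax.item())
```

If that holds, returning the pre-softmax scores is the right input for the loss, and passing post-softmax weights would squash the inputs into [0, 1] and keep the loss close to ln(number of classes).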
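This is how I assume the masking works, as a minimal sketch with made-up data: targets equal to `ignore_index` (default -100) contribute nothing to the loss, and with the default mean reduction the average is taken over the labeled samples only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(6, 4)  # placeholder scores: 6 samples, 4 classes
# -100 marks the unlabeled samples.
targets = torch.tensor([0, -100, 2, -100, -100, 1])

# Loss over the whole batch, with unlabeled samples ignored.
loss_masked = F.cross_entropy(logits, targets, ignore_index=-100)

# Loss computed on the labeled subset only, for comparison.
labeled = targets != -100
loss_labeled_only = F.cross_entropy(logits[labeled], targets[labeled])

print(loss_masked.item(), loss_labeled_only.item())
```

One thing I would watch for: if a batch contains no labeled samples at all, the mean is taken over zero elements, so such batches probably need to be skipped.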