Need help with recreating a paper


In this paper, they modify the Tacotron 2 GST module to train it in a semi-supervised way with a small amount of labeled data.
I am starting from the vanilla GST implementation from Mozilla TTS.

So they make the attention single-head and use the attention weights as the outputs (logits) for the target class.

This is my code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# LinearNorm is the linear layer wrapper from the Tacotron 2 codebase (not defined here)
class MultiHeadAttention(nn.Module):
	def __init__(self, query_dim, key_dim, num_units):
		super().__init__()
		self.num_units = num_units
		self.key_dim = key_dim
		self.W_query = LinearNorm(query_dim, num_units, bias=False)
		self.W_key = LinearNorm(key_dim, num_units, bias=False)
		self.W_value = LinearNorm(key_dim, num_units, bias=False)

	def forward(self, query, key):
		querys = self.W_query(query)  # [N, T_q, num_units]
		keys = self.W_key(key)  # [N, T_k, num_units]
		values = self.W_value(key)  # [N, T_k, num_units]

		# scaled dot-product attention: score = softmax(QK^T / (d_k ** 0.5))
		scores = torch.matmul(querys, keys.transpose(1, 2))  # [N, T_q, T_k]
		scores = scores / (self.key_dim ** 0.5)
		scores_before_softmax = scores
		scores = F.softmax(scores, dim=-1)

		# weighted sum of the values, weighted by the attention distribution
		out = torch.matmul(scores, values)  # [N, T_q, num_units] (single head, so no splitting)

		# T_q is 1 for the GST query, so squeezing gives [N, T_k] raw scores to use as class logits
		return out, scores_before_softmax.squeeze(1)
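
For context, this is roughly how I wire it into the style-token layer (the sizes and the reference-encoder side are my own placeholders, not from the paper):

import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
	def __init__(self, ref_enc_dim=128, num_tokens=10, token_dim=256, num_units=256):
		super().__init__()
		# learnable bank of style tokens, shared across the batch
		self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
		self.attention = MultiHeadAttention(query_dim=ref_enc_dim, key_dim=token_dim, num_units=num_units)

	def forward(self, ref_embedding):
		# ref_embedding: [N, ref_enc_dim] output of the reference encoder
		query = ref_embedding.unsqueeze(1)  # [N, 1, ref_enc_dim]
		keys = torch.tanh(self.tokens).unsqueeze(0).expand(ref_embedding.size(0), -1, -1)  # [N, num_tokens, token_dim]
		style_embed, logits = self.attention(query, keys)  # [N, 1, num_units], [N, num_tokens]
		return style_embed, logits

The style embedding goes to the decoder as usual; the logits are what I feed into the classification loss discussed below.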

But I am not sure I got everything right.

Why is this line needed: scores = scores / (self.key_dim ** 0.5)?
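
To make the question concrete, this is a tiny standalone check I did of what that line changes (made-up dimensions, not the real model):

import torch

torch.manual_seed(0)
key_dim = 256
q = torch.randn(1, 1, key_dim)   # one query vector
k = torch.randn(1, 10, key_dim)  # ten key vectors

raw = torch.matmul(q, k.transpose(1, 2))  # [1, 1, 10]
scaled = raw / (key_dim ** 0.5)
print(raw.abs().mean().item(), scaled.abs().mean().item())  # the unscaled scores come out much larger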

Another question: how should I handle the attention weights? Is it OK to return the weights before the softmax and apply a cross-entropy loss to them?
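
To make that concrete, this is how I am applying the loss right now (a sketch with made-up shapes; num_tokens plays the role of the number of style classes):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(8, 10)          # [batch, num_tokens], the scores before softmax
labels = torch.randint(0, 10, (8,))  # one style/emotion label per utterance
loss = criterion(logits, labels)     # CrossEntropyLoss applies log_softmax internally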

Does ignore_index in cross-entropy work the way I think it does: set the label to -100 for all unlabeled data?
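
This is what I mean by setting -100 for the unlabeled data (again a made-up batch):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=-100)

logits = torch.randn(4, 10)
labels = torch.tensor([3, -100, 7, -100])  # -100 marks the unlabeled utterances
loss = criterion(logits, labels)           # as far as I understand, these entries are excluded from the loss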

The cross-entropy loss gets stuck around 2 and does not decrease.