Need help with recreating a paper


In this paper, they modify the Tacotron 2 GST module to train it in a semi-supervised way with a small amount of labeled data.
I am starting from the vanilla GST implementation from Mozilla TTS.

So they make the attention single-head and use the attention weights as the outputs (logits) for the target class.

This is my code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# LinearNorm is the linear layer wrapper from the Tacotron 2 codebase (not defined here)
class MultiHeadAttention(nn.Module):
	def __init__(self, query_dim, key_dim, num_units):
		super().__init__()
		self.num_units = num_units
		self.key_dim = key_dim
		self.W_query = LinearNorm(query_dim, num_units, bias=False)
		self.W_key = LinearNorm(key_dim, num_units, bias=False)
		self.W_value = LinearNorm(key_dim, num_units, bias=False)

	def forward(self, query, key):
		querys = self.W_query(query)  # [N, T_q, num_units]
		keys = self.W_key(key)  # [N, T_k, num_units]
		values = self.W_value(key)  # [N, T_k, num_units]

		# scaled dot-product attention: score = softmax(QK^T / (d_k ** 0.5))
		scores = torch.matmul(querys, keys.transpose(1, 2))  # [N, T_q, T_k]
		scores = scores / (self.key_dim ** 0.5)
		scores_before_softmax = scores
		scores = F.softmax(scores, dim=-1)

		# weighted sum of the values, weighted by the attention distribution
		out = torch.matmul(scores, values)  # [N, T_q, num_units] (single head, so no splitting)

		# T_q is 1 for the GST query, so squeezing gives [N, T_k] raw scores to use as class logits
		return out, scores_before_softmax.squeeze(1)
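
For context, this is roughly how I wire it into the style-token layer (the sizes and the reference-encoder side are my own placeholders, not from the paper):

import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
	def __init__(self, ref_enc_dim=128, num_tokens=10, token_dim=256, num_units=256):
		super().__init__()
		# learnable bank of style tokens, shared across the batch
		self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
		self.attention = MultiHeadAttention(query_dim=ref_enc_dim, key_dim=token_dim, num_units=num_units)

	def forward(self, ref_embedding):
		# ref_embedding: [N, ref_enc_dim] output of the reference encoder
		query = ref_embedding.unsqueeze(1)  # [N, 1, ref_enc_dim]
		keys = torch.tanh(self.tokens).unsqueeze(0).expand(ref_embedding.size(0), -1, -1)  # [N, num_tokens, token_dim]
		style_embed, logits = self.attention(query, keys)  # [N, 1, num_units], [N, num_tokens]
		return style_embed, logits

The style embedding goes to the decoder as usual; the logits are what I feed into the classification loss discussed below.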

But I am not sure I got everything right.

Why is this line needed: scores = scores / (self.key_dim ** 0.5)?
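
To make the question concrete, this is a tiny standalone check I did of what that line changes (made-up dimensions, not the real model):

import torch

torch.manual_seed(0)
key_dim = 256
q = torch.randn(1, 1, key_dim)   # one query vector
k = torch.randn(1, 10, key_dim)  # ten key vectors

raw = torch.matmul(q, k.transpose(1, 2))  # [1, 1, 10]
scaled = raw / (key_dim ** 0.5)
print(raw.abs().mean().item(), scaled.abs().mean().item())  # the unscaled scores come out much larger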

Another question: how should I handle the attention weights? Is it OK to return the weights before the softmax and apply a cross-entropy loss to them?
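
To make that concrete, this is how I am applying the loss right now (a sketch with made-up shapes; num_tokens plays the role of the number of style classes):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(8, 10)          # [batch, num_tokens], the scores before softmax
labels = torch.randint(0, 10, (8,))  # one style/emotion label per utterance
loss = criterion(logits, labels)     # CrossEntropyLoss applies log_softmax internally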

Does ignore_index in cross-entropy work the way I think it does: set the label to -100 for all unlabeled data?
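
This is what I mean by setting -100 for the unlabeled data (again a made-up batch):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=-100)

logits = torch.randn(4, 10)
labels = torch.tensor([3, -100, 7, -100])  # -100 marks the unlabeled utterances
loss = criterion(logits, labels)           # as far as I understand, these entries are excluded from the loss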

The cross-entropy loss gets stuck around 2 and does not decrease.