How to speed up pairwise multiplication?

We are implementing a Factorization Machine that applies attention to each interaction between fields.
So we have to compute the element-wise product of every pair of field embeddings.
The code below shows how we compute these products from the result of the previous computation:

    # shape of embedded_features : (batch_size, field_num, embedding_size)
    second_order_result = Variable(torch.zeros(batch_size, self.two_level_feature_fields, self.embedding_size)).cuda()
    attention_list = [embedded_features[:, i, :] for i in range(self.feature_fields)]

    count = 0
    for i in range(self.feature_fields):
        for j in range(i + 1, self.feature_fields):
            second_order_result[:, count, :] = attention_list[i] * attention_list[j]
            count += 1

But the code is very slow during backpropagation.
Is there any way to speed up the pairwise multiplication?
Thanks for the help!
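
One possible direction, sketched here as an assumption rather than a confirmed fix: the nested loop writes into `second_order_result` one slice at a time, so autograd likely has to track a separate in-place operation for every pair. The same products can be computed with two index lookups and a single multiply. The helper name `pairwise_products` is hypothetical, and the sketch assumes a PyTorch version that provides `torch.triu_indices`.

```python
import torch


def pairwise_products(embedded_features: torch.Tensor) -> torch.Tensor:
    """Element-wise products of all field pairs (i, j) with i < j.

    embedded_features: (batch_size, field_num, embedding_size)
    returns:           (batch_size, field_num * (field_num - 1) // 2, embedding_size)
    """
    field_num = embedded_features.size(1)
    # Indices of the strict upper triangle: row 0 holds i, row 1 holds j.
    # The order matches the nested loop (i outer, j inner, j > i).
    rows, cols = torch.triu_indices(
        field_num, field_num, offset=1, device=embedded_features.device
    )
    # Two indexed lookups and one multiply replace the per-pair slice writes.
    return embedded_features[:, rows, :] * embedded_features[:, cols, :]
```

Calling `pairwise_products(embedded_features)` should give the same tensor as the loop that fills `second_order_result`. If the number of fields is fixed, the index tensors could be computed once in the module's constructor and reused in every forward pass. Whether this removes the slowdown depends on the model and hardware, so it is worth timing both versions.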