For example, I have a tensor of size (batch_size, hidden_size, hidden_size). Is there a way to compute the trace of the (hidden_size, hidden_size) matrix for every sample in the batch without a loop, so that the output is a vector of size (batch_size)?
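
To make the question concrete, here is a minimal sketch of the kind of loop-free computation I have in mind. It assumes PyTorch (the question itself does not depend on the framework), and the helper names `batch_trace` and `batch_trace_einsum` are made up for illustration. The idea is to take the main diagonal of each matrix and sum it along the last dimension.

```python
import torch


def batch_trace(x: torch.Tensor) -> torch.Tensor:
    """Trace of each (hidden_size, hidden_size) matrix in a batch (hypothetical helper)."""
    # torch.diagonal returns the main diagonal of every matrix,
    # giving a tensor of shape (batch_size, hidden_size).
    diag = torch.diagonal(x, dim1=-2, dim2=-1)
    # Summing over the last dimension leaves one value per sample.
    return diag.sum(dim=-1)


def batch_trace_einsum(x: torch.Tensor) -> torch.Tensor:
    """Same result via einsum: the repeated index 'i' selects the diagonal."""
    return torch.einsum("bii->b", x)


# Small example with assumed sizes.
batch_size, hidden_size = 4, 3
x = torch.arange(
    batch_size * hidden_size * hidden_size, dtype=torch.float32
).reshape(batch_size, hidden_size, hidden_size)

traces = batch_trace(x)                # shape: (batch_size,)
traces_einsum = batch_trace_einsum(x)  # shape: (batch_size,)
```

Both versions should return a tensor of shape (batch_size), which can be checked against an explicit loop such as `torch.stack([torch.trace(m) for m in x])`.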