I implemented a neural model for text segmentation that feeds sequences of word embeddings into a bidirectional LSTM, splits the output into its forward and backward halves, passes each half through a fully connected layer separately, and then computes the dot product between the two FC outputs at each timestep, in line with the ideas presented in this paper. The code is functional and I can overfit on training data; however, GPU utilization is low (around 40% according to gpustat) and training is generally quite slow. Could this be due to the use of for-loops for computing the time-step-wise dot products? If so, what would be a more efficient way to compute them? Any help would be greatly appreciated.
def forward(self, input):
    # apply dropout for regularization
    input = self.dropout(input)
    bi_output, bi_hidden = self.lstm(input)
    # iterate over batch elements
    n_elem = bi_output.size(0)
    outs = []
    for i in range(n_elem):
        forward_output = bi_output[i, :, :self.hidden_size]
        backward_output = bi_output[i, :, self.hidden_size:]
        # apply the FC layer to both directions
        forward_hidden = self.fc(forward_output)
        backward_hidden = self.fc(backward_output)
        products = []
        # compute the dot product of the FC outputs at every time step
        n_steps = forward_hidden.size(0)
        for step in range(n_steps):
            dot_product = torch.dot(forward_hidden[step], backward_hidden[step])
            products.append(dot_product.unsqueeze(0))
        output = torch.cat(products)
        outs.append(output)
    # reshape into batchwise format
    total_output = torch.stack(outs)
    return total_output
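For reference, here is a sketch of how the two loops could be vectorized. This is an assumption about what an answer might look like, not the paper's implementation: since `self.fc` is a linear layer, it can be applied to the whole `(batch, seq_len, hidden_size)` tensor at once, and the per-timestep dot product is just an elementwise multiply followed by a sum over the feature dimension. The standalone function `timestep_dot_products` below is a hypothetical helper (it takes the FC layer and hidden size as arguments so it can be tested in isolation); it assumes the LSTM was built with `batch_first=True`, matching the indexing in the code above.

```python
import torch

def timestep_dot_products(bi_output, fc, hidden_size):
    """Vectorized replacement for the per-element / per-timestep loops.

    bi_output: (batch, seq_len, 2 * hidden_size) tensor from a
    bidirectional LSTM with batch_first=True.
    Returns a (batch, seq_len) tensor of dot products.
    """
    # apply the FC layer to both directions for the whole batch at once
    forward_hidden = fc(bi_output[:, :, :hidden_size])
    backward_hidden = fc(bi_output[:, :, hidden_size:])
    # batched per-timestep dot product: elementwise multiply,
    # then sum over the feature dimension
    return (forward_hidden * backward_hidden).sum(dim=-1)
```

With this, the whole body of `forward` after the LSTM call collapses to a single line, and the work runs as a few large kernels instead of many tiny ones, which is usually what low GPU utilization with Python-level loops points to.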