Inputting data with multiple variable dimensions to an LSTM


My data is of shape batch_size * num_paths * num_edges * emb_dim. Here:

  • batch_size: the batch size (32)
  • num_paths: the number of paths between the two given terms in a dependency tree
  • num_edges: the number of edges in a specific path
  • emb_dim: the embedding dimension (300)

Both num_paths and num_edges vary for every training sample. In other words, each training sample may have a different number of paths, and each path may have a different number of edges.

(Note that the data is an n-d list of lists in native Python, since two of its dimensions (num_paths and num_edges) are variable.)
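For concreteness, a toy version of such ragged data might look like the following (the shapes here are invented for illustration; only emb_dim = 300 comes from the question):

```python
import torch

# Ragged structure: a list (batch) of lists (paths) of tensors (num_edges x emb_dim).
# Sample 0 has 2 paths (3 and 5 edges); sample 1 has 1 path (2 edges).
emb_dim = 300
data = [
    [torch.randn(3, emb_dim), torch.randn(5, emb_dim)],  # sample 0
    [torch.randn(2, emb_dim)],                           # sample 1
]

num_paths = [len(sample) for sample in data]                   # [2, 1]
num_edges = [[p.size(0) for p in sample] for sample in data]   # [[3, 5], [2]]
```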

I want to pass each path of each training instance through an LSTM, since a path is a sequence of edges, and obtain the resulting path representation. After obtaining a representation for each path in a training example, I want to take the sum of these representations.

I know I can handle a variable number of edges with pack_padded_sequence. But what about a variable number of paths? How do I account for that? Is there any way to do it in native PyTorch, without resorting to messy solutions like explicit Python loops?

Any help would be greatly appreciated!


IIUC, the paths are independent sequences, so you can treat the path dimension as an additional batch dimension. I.e. map each (batch_idx, path_idx) pair to a seq_idx, build a compact tensor of sequences, run the RNN over it, then map the results back and aggregate (sum) by batch_idx.
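A minimal sketch of this idea, assuming the data arrives as a nested list of per-path tensors (all names here — `data`, `hidden_dim`, `owners` — are illustrative, not from the original post):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

emb_dim, hidden_dim = 300, 128
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

# data: list (batch) of lists (paths) of tensors of shape (num_edges, emb_dim)
data = [
    [torch.randn(3, emb_dim), torch.randn(5, emb_dim)],  # sample 0: 2 paths
    [torch.randn(2, emb_dim)],                           # sample 1: 1 path
]

# 1. Flatten (batch_idx, path_idx) -> seq_idx: every path becomes one
#    sequence in a compact batch, remembering which sample it came from.
paths, owners = [], []
for batch_idx, sample in enumerate(data):
    for path in sample:
        paths.append(path)
        owners.append(batch_idx)

lengths = torch.tensor([p.size(0) for p in paths])
padded = pad_sequence(paths, batch_first=True)   # (total_paths, max_edges, emb_dim)

# 2. Pack the variable-length edge sequences and run the LSTM once;
#    the final hidden state of each sequence is that path's representation.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
_, (h_n, _) = lstm(packed)
path_repr = h_n[-1]                              # (total_paths, hidden_dim)

# 3. Scatter-sum the path representations back to their training samples.
out = torch.zeros(len(data), hidden_dim)
out.index_add_(0, torch.tensor(owners), path_repr)   # (batch_size, hidden_dim)
```

With `enforce_sorted=False`, pack_padded_sequence handles the internal sorting and unsorting, so `h_n` lines up with the original path order and a single `index_add_` does the per-sample sum without any Python-level loop over the RNN.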

Thanks, that does sound good! Hopefully it should solve my problem.