Hello, I have an input of shape `(14, 10, 30, 300)`, where `14` is the `batch_size`, `10` is the `seq_len`, `30` is the `num_tokens` in each element of the sequence, and `300` is the `embedding_dim` for each token. (I know, it’s complicated!)

I want to process each of the `10` elements in the sequence through a `GRU` and convert it to a `512`-dimensional output. That is, I want to take each `30 x 300` slice in the sequence and learn to map it to a `512`-dimensional vector.

Here’s what I have. I’m using the GRU’s final hidden state as the output, because my input represents *textual features* that I’ll be combining with *visual features* later. Would love some feedback on whether this is the correct approach. Thank you.

```
import torch.nn as nn

# inputs: (14, 10, 30, 300) = (batch_size, seq_len, num_tokens, embedding_dim)
gru = nn.GRU(
    input_size=300,   # embedding_dim
    hidden_size=256,  # 256 per direction -> 512 after concatenating both
    batch_first=True,
    bidirectional=True,
)

# Merge batch and sequence dims: (14, 10, 30, 300) -> (14*10, 30, 300)
inputs = inputs.reshape(14 * 10, 30, 300)

# final_hidden_layer: (num_layers * num_directions, batch, hidden) = (2, 140, 256)
_, final_hidden_layer = gru(inputs)

# Concatenate forward and backward final states: (140, 2, 256) -> (140, 512)
outputs = final_hidden_layer.permute(1, 0, 2).contiguous().flatten(start_dim=1)

# Split batch and sequence dims back out: (140, 512) -> (14, 10, 512)
outputs = outputs.reshape(14, 10, 512)
```
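For reference, here is a self-contained sanity check of the same pipeline with random tensors (assuming PyTorch; the variable names `x`/`out` are just for this sketch). It avoids hardcoding the shapes so the reshapes stay consistent if the batch or sequence length changes:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=300, hidden_size=256, batch_first=True, bidirectional=True)

# Dummy input with the shapes from the post.
inputs = torch.randn(14, 10, 30, 300)
batch_size, seq_len, num_tokens, embedding_dim = inputs.shape

# Treat every sequence element as its own GRU "batch" entry.
x = inputs.reshape(batch_size * seq_len, num_tokens, embedding_dim)  # (140, 30, 300)

# final_hidden: (num_directions, 140, 256) for a single-layer bidirectional GRU.
_, final_hidden = gru(x)

# Concatenate the two directions, then restore the batch/sequence split.
out = final_hidden.permute(1, 0, 2).contiguous().flatten(start_dim=1)  # (140, 512)
out = out.reshape(batch_size, seq_len, -1)

assert out.shape == (14, 10, 512)
```

Using `-1` in the final `reshape` lets PyTorch infer the `512` from `2 * hidden_size`, so the code doesn’t silently break if `hidden_size` changes.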