What is the correct way of implementing a simple subsampling RNN network?

I’m trying to implement an RNN with Hierarchical Subsampling as shown here [chapter 9.0]

Is my understanding of how to implement this correct? I feel that this is too simple.

import torch
import torch.nn as nn

class subsampleRNN(nn.Module):
    def __init__(self) -> None:
        super().__init__()

        # stacked LSTMs that shrink the feature width at each layer
        self.lstm1 = nn.LSTM(512, 256)
        self.lstm2 = nn.LSTM(256, 128)
        self.lstm3 = nn.LSTM(128, 64)

        self.relu = nn.ReLU()

    def forward(self, x):
        x, h0 = self.lstm1(x)   # (T, B, 512) -> (T, B, 256)
        x = self.relu(x)
        x, h0 = self.lstm2(x)   # (T, B, 256) -> (T, B, 128)
        x = self.relu(x)
        x, h0 = self.lstm3(x)   # (T, B, 128) -> (T, B, 64)

        return x


model = subsampleRNN()
input = torch.rand(1, 1, 512)   # (T=1, B=1, features=512)
model(input).shape

No, the subsampling described there is along the time dimension; your code only shrinks the feature dimension while keeping the same number of timesteps. It would be equivalent to:

lstm1 = nn.LSTM(512, 256)
lstm2 = nn.LSTM(512, arbitrary_width)

if you permute + reshape windows of (time=2, width=256) into (1, 512) between the layers.
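A minimal sketch of that reshaping step between the layers (T=4 and B=1 are just example sizes, and T is assumed to be even):

import torch

h = torch.rand(4, 1, 256)                  # lstm1 output: (T=4, B=1, 256)
h = h.permute(1, 0, 2).reshape(1, 2, 512)  # concatenate pairs of timesteps: (B, T/2, 512)
h = h.permute(1, 0, 2)                     # back to time-major: (T/2=2, B=1, 512) for lstm2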

Thank you for your answer @googlebot. I’m not sure if I understand fully. Can you kindly expand on your previous post?

So, you have lstm1 = LSTM(I, H1), which maps shapes (T, B, I) → (T, B, H1), where the hidden size H1 is arbitrary and (T, B, I) can be the shape of the network input.

Now you want to process shorter sequences in lstm2. You have a few options:

  1. Drop some information by using stride>1, e.g. x[::2,:,:]
  2. Use pooling or something similar
  3. Do what your text suggests and use weighted combinations (if I understood correctly from quick skimming): reshape the data as (T,B,H1) → (T/window_size,B,H1*window_size) so that lstm2 takes bigger steps in time. To concatenate chunks along the T dimension, .reshape() a permuted (B,T,H1) tensor and permute it back, rather than reshaping the time-major tensor directly (keeping lstm2 time-major may also be faster); see the sketch below.
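A rough sketch of option 3 as a module, assuming a window of 2 and a sequence length divisible by the window at every layer (the class name, layer widths, and window size are arbitrary example choices, not taken from the linked text):

import torch
import torch.nn as nn

class HierarchicalSubsamplingRNN(nn.Module):
    # Each block halves the sequence length by concatenating pairs of
    # consecutive LSTM outputs before feeding the next layer (option 3 above).
    def __init__(self, window: int = 2):
        super().__init__()
        self.window = window
        self.lstm1 = nn.LSTM(512, 256)           # (T,   B, 512) -> (T,   B, 256)
        self.lstm2 = nn.LSTM(256 * window, 128)  # (T/2, B, 512) -> (T/2, B, 128)
        self.lstm3 = nn.LSTM(128 * window, 64)   # (T/4, B, 256) -> (T/4, B, 64)

    def _subsample(self, x):
        # (T, B, H) -> (T/window, B, H*window); assumes T is divisible by window
        t, b, h = x.shape
        x = x.permute(1, 0, 2)                   # (B, T, H) so reshape concatenates along time
        x = x.reshape(b, t // self.window, h * self.window)
        return x.permute(1, 0, 2)                # back to time-major

    def forward(self, x):
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(self._subsample(x))
        x, _ = self.lstm3(self._subsample(x))
        return x

model = HierarchicalSubsamplingRNN()
out = model(torch.rand(16, 4, 512))              # (T=16, B=4, 512)
print(out.shape)                                 # torch.Size([4, 4, 64]): T shrank 16 -> 4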