So, you have lstm1=LSTM(I,H1), which maps shapes as (T,B,I) → (T,B,H1), where the hidden size H1 is arbitrary and (T,B,I) can be the shape of the network input.
Now you want to process shorter sequences in lstm2. You have options:
Drop some information by using stride>1, e.g. x[::2,:,:]
Use pooling over the time dimension (e.g. average or max pooling) or something similar
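Do what your text suggests and use weighted combinations (if I understood it correctly from a quick skim). That is, reshape the data as (T,B,H1) → (T/window_size,B,H1*window_size) and take bigger steps in lstm2, whose input weights then act on each whole window. Note that calling .reshape() directly on a time-major (T,B,H1) tensor would mix up batch entries; to concatenate chunks along the T dimension, .reshape() a permuted (B,T,H1) tensor into (B,T/window_size,H1*window_size) and permute it back (a time-major lstm2 may be faster).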
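
Here is a rough sketch of the three options in PyTorch, in case it helps to see the shapes. The sizes, the window W, the second hidden size H2 and the helper name stack_windows are all made up for illustration, and T is assumed to be divisible by W; average pooling is just one possible choice for the second option.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only (assumptions); T must be divisible by W.
T, B, I, H1, H2, W = 12, 4, 8, 16, 32, 3

lstm1 = nn.LSTM(I, H1)   # time-major by default: (T, B, I) -> (T, B, H1)
x = torch.randn(T, B, I)
h1, _ = lstm1(x)         # (T, B, H1)

# Option 1: stride > 1, keep every W-th step and drop the rest.
strided = h1[::W].contiguous()   # (T // W, B, H1)

# Option 2: average pooling over time. avg_pool1d pools the last dimension,
# so move time there first: (T, B, H1) -> (B, H1, T) -> (B, H1, T // W).
pooled = F.avg_pool1d(h1.permute(1, 2, 0), kernel_size=W)
pooled = pooled.permute(2, 0, 1).contiguous()   # (T // W, B, H1)

# Option 3: concatenate W consecutive steps along the feature dimension.
def stack_windows(h, w):
    """Hypothetical helper: (T, B, F) -> (T // w, B, w * F)."""
    t, b, f = h.shape
    assert t % w == 0, "sequence length must be divisible by the window"
    h = h.permute(1, 0, 2)             # (B, T, F), batch-major
    h = h.reshape(b, t // w, w * f)    # merge w neighbouring steps per sample
    return h.permute(1, 0, 2).contiguous()   # back to time-major

stacked = stack_windows(h1, W)         # (T // W, B, W * H1)

# Options 1 and 2 keep the feature size; option 3 widens the input of lstm2.
lstm2_small = nn.LSTM(H1, H2)
lstm2_wide = nn.LSTM(W * H1, H2)
out_strided, _ = lstm2_small(strided)  # (T // W, B, H2)
out_pooled, _ = lstm2_small(pooled)    # (T // W, B, H2)
out_stacked, _ = lstm2_wide(stacked)   # (T // W, B, H2)
```

The permute before .reshape() is what keeps each window inside a single batch entry; reshaping the time-major tensor directly would return the right shape with the wrong contents.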