How to align multi-channel signals when the channels have different numbers of time steps

Hi,
I am training an attention-based model. Multiple extractors are applied to my time-series input, resulting in multi-channel signals with different numbers of timesteps. Apparently, I have to align/merge those channels before I can feed them to the model. I'd like to know what the common practice is for this kind of task.

I can think of a few ways to do this:
a) padding with 0s

[a0 a1 a2 a3 a4]
[b0 b1]
[c0 c1 c2]

will become

[a0 a1 a2 a3 a4]
[b0 b1  0  0  0]
[c0 c1 c2  0  0]

The most obvious problem with this approach is that the model itself has to learn how the timesteps of the different channels line up, especially when the same positional encoding is applied uniformly across channels.
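For concreteness, here is a minimal sketch of option (a) in PyTorch, assuming each channel is a 1-D tensor (the values are just placeholders for a/b/c above); the padding mask at the end is one common way to let attention layers ignore the padded steps:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical channels with different numbers of timesteps
# (stand-ins for a, b, c above).
a = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5])  # 5 steps
b = torch.tensor([0.1, 0.2])                 # 2 steps
c = torch.tensor([0.1, 0.2, 0.3])            # 3 steps

# Pad every channel with trailing zeros up to the longest one.
# padded has shape (num_channels, max_len) = (3, 5).
padded = pad_sequence([a, b, c], batch_first=True, padding_value=0.0)

# Boolean mask: True at real timesteps, False at padding. Inverted
# (~mask), it can serve as key_padding_mask in nn.MultiheadAttention
# so the padded positions are not attended to.
lengths = torch.tensor([len(a), len(b), len(c)])
mask = torch.arange(padded.size(1)) < lengths.unsqueeze(1)
```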

b) align the signals during preprocessing
[a0 a1 a2 a3 a4]
[b0 b1]
[c0 c1 c2]

will become

[a0 a1 a2 a3 a4]
[b0  0  0  0 b1]
[c0  0 c1  0 c2]

This makes more sense to me because the relative positions are preserved, but I wonder whether I have to worry about the higher-frequency components introduced by bluntly inserting zeros. Are there any papers or references that address these questions?
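For what it's worth, here is a minimal NumPy sketch of option (b), assuming each channel is a 1-D array and that placing the samples at evenly spaced positions (rounded to the nearest index) is an acceptable approximation; stretch_with_zeros and the values are purely illustrative:

```python
import numpy as np

def stretch_with_zeros(x, target_len):
    """Place the samples of x at (roughly) evenly spaced positions in a
    zero vector of length target_len, preserving relative position."""
    x = np.asarray(x, dtype=float)
    out = np.zeros(target_len)
    # Evenly spaced integer positions from the first to the last index.
    idx = np.round(np.linspace(0, target_len - 1, num=len(x))).astype(int)
    out[idx] = x
    return out

a = [0.1, 0.2, 0.3, 0.4, 0.5]
b = [0.1, 0.2]
c = [0.1, 0.2, 0.3]

target_len = max(len(a), len(b), len(c))
aligned = np.stack([stretch_with_zeros(ch, target_len) for ch in (a, b, c)])
# b -> [b0, 0, 0, 0, b1], c -> [c0, 0, c1, 0, c2], as in the example above.
```

If the high-frequency images introduced by the inserted zeros do turn out to matter, the usual remedy from multirate signal processing is to interpolate rather than insert raw zeros, e.g. with np.interp or scipy.signal.resample, which stretches the short channels smoothly to the target length instead of leaving gaps.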