First, I’m building a time series classifier to classify each chunk/block of time. The data I’ll be using has already been tried (and failed) with standard ML techniques, because the long-term temporal order is important.
The basic idea is that for each Y-second chunk/block of time, I use a series of CNNs with an AdaptiveMaxPool1d on top (to handle the varying block sizes) to extract X features. Then an “outer” CNN works across all of those blocks, and finally a time-distributed dense layer or two produces a classification per block. Previously I used a couple of LSTM layers in Keras for the “outer” part, but I’m intrigued by the recent findings replacing LSTMs with CNNs.
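Roughly, this is what I mean (just a sketch: the channel counts, kernel sizes, and number of classes are placeholders, not my real hyperparameters):

```python
import torch
import torch.nn as nn

class BlockEncoder(nn.Module):
    """Inner CNN: encodes one variable-length block into a fixed-size feature vector."""
    def __init__(self, in_channels=1, n_features=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, n_features, kernel_size=7, padding=3),
            nn.ReLU(),
        )
        # AdaptiveMaxPool1d collapses the time axis no matter how long the block is
        self.pool = nn.AdaptiveMaxPool1d(1)

    def forward(self, x):                       # x: (n_blocks, in_channels, block_len)
        return self.pool(self.conv(x)).squeeze(-1)   # -> (n_blocks, n_features)

class RecordingClassifier(nn.Module):
    """Outer CNN over the sequence of block features, then a per-block classifier."""
    def __init__(self, n_features=64, n_classes=5):
        super().__init__()
        self.block_encoder = BlockEncoder(n_features=n_features)
        self.outer = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # a 1x1 conv acts like a time-distributed dense layer: one prediction per block
        self.head = nn.Conv1d(128, n_classes, kernel_size=1)

    def forward(self, blocks):                  # blocks: (n_blocks, in_channels, block_len)
        feats = self.block_encoder(blocks)      # (n_blocks, n_features)
        feats = feats.t().unsqueeze(0)          # (1, n_features, n_blocks)
        return self.head(self.outer(feats)).squeeze(0).t()   # (n_blocks, n_classes)
```

Within a single recording every block has the same length, so I can stack them into one `(n_blocks, channels, block_len)` tensor; the AdaptiveMaxPool1d only has to absorb the length differences *between* recordings.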
My pared-down dataset is about 70 GB, with ~2500 recordings (samples, in the PyTorch sense) of varying lengths, each recorded at a different rate. Each recording is divided into blocks of a fixed number of seconds, but the number of samples per block depends on the sample rate. Example:
Recording 1: 2213 blocks at a sample rate of 128 Hz, so each block is 3,840 samples (in the classical time series sense) long.
Recording 2: 5127 blocks at a sample rate of 512 Hz, so each block is 15,360 samples long.
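To make the shapes concrete, here’s roughly how I’m chunking a recording into blocks (a sketch only: `load_recording` and the `index` list stand in for my actual loading code, and I’m assuming 30-second blocks to match the numbers above):

```python
import torch
from torch.utils.data import Dataset

BLOCK_SECONDS = 30  # fixed block duration; 30 s matches the examples above

class RecordingDataset(Dataset):
    def __init__(self, index):
        # index: list of (path, sample_rate, labels) -- loading details omitted
        self.index = index

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        path, rate, labels = self.index[i]
        signal = load_recording(path)            # placeholder: 1-D tensor of raw samples
        block_len = rate * BLOCK_SECONDS         # e.g. 128 Hz -> 3840, 512 Hz -> 15360
        n_blocks = signal.numel() // block_len
        blocks = signal[: n_blocks * block_len].reshape(n_blocks, 1, block_len)
        return blocks, torch.as_tensor(labels)   # labels: one class per block
```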
I would really like to avoid padding my data (in both “dimensions”), since a back-of-the-envelope calculation suggests the padded dataset would be at least 2x the size of the raw data. The overhead of transferring all of that extra data to the GPU seems like a massive waste of time.
*I know I could resample the raw data to a common sample rate, so that only the number of blocks differs. However, too much of my data is at 128 Hz, and I’m convinced I’d be losing information at that rate; I’m also hoping the higher-sample-rate recordings can help inform the network.
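So instead of padding, I’m leaning toward feeding one recording per step (batch_size=1 with a pass-through collate), something like this, using the sketches above:

```python
from torch.utils.data import DataLoader

# One recording per step avoids padding entirely; the inner CNN + AdaptiveMaxPool1d
# doesn't care that block_len differs between recordings.
loader = DataLoader(RecordingDataset(index), batch_size=1,
                    collate_fn=lambda batch: batch[0])

model = RecordingClassifier()
for blocks, labels in loader:       # blocks: (n_blocks, 1, block_len)
    logits = model(blocks)          # (n_blocks, n_classes)
```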