Is there a masking method for CNN input of variable size within a batch?

I am using a ResNet to do phone recognition from spectrograms. My input is a tensor of shape (N, C, H, W), where N is the batch size. Because the utterances differ in length in time, H differs across the samples in a batch, and I cannot crop the inputs because they are speech data. Right now I have to zero-pad within a batch. Is there something similar to the packed padded sequence used for RNNs that lets every module in the network know these zeros are meaningless, so that no gradient is computed for anything that results from them?

Hi, if you are not familiar with PyTorch, I suggest reading this data loading tutorial first.

I am assuming each of your speech samples is just a vector, right, and that the vectors are not all the same length? By default, PyTorch tries to stack the samples in a batch into a single tensor. If the samples in a batch have different sizes, that stacking fails.

The DataLoader class has a parameter called collate_fn which controls how the samples in a batch are packed together. For example, you can store the samples of a batch in a list, with each element being one speech sample, and store the corresponding labels in a Tensor. For more info, refer to this post.
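Here is a minimal sketch of the list-based collate_fn described above. ToySpeechDataset and list_collate are hypothetical names made up for illustration, and the random spectrograms of shape (C, H, W) with varying H are stand-ins for real data.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToySpeechDataset(Dataset):
    """Hypothetical dataset: each item is a (1, H, n_freq) spectrogram with varying H."""

    def __init__(self, lengths, n_freq=40):
        self.lengths = lengths
        self.n_freq = n_freq

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        spec = torch.randn(1, self.lengths[idx], self.n_freq)
        label = idx % 3  # dummy label, for illustration only
        return spec, label


def list_collate(batch):
    """Keep the variable-size samples in a list; stack only the labels."""
    specs = [spec for spec, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    return specs, labels


loader = DataLoader(
    ToySpeechDataset([50, 80, 65, 120]),
    batch_size=2,
    collate_fn=list_collate,
)
specs, labels = next(iter(loader))
```

With this collate_fn, specs is a plain Python list of tensors with different H, so the model code has to handle each element itself.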

Hi, jdhao,

Thank you for your reply. But my real question is this: suppose I have already formed a batch of shape (N, C, H, W), and to build it I zero-padded some of the images along the H dimension. How do I let the CNN know that these zeros are meaningless, so that no subsequent module computes gradients that result from them?

Thank you so much

I think it is not possible, at least for now. What you can do is record the padding information for each image and load it along with the image batch. You can then recover the original images from the padding info. That way, the CNN does not waste time and computation on the padded zeros.
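A rough sketch of recording the padding info, assuming each sample is a (C, H, W) tensor padded only at the end of H. pad_collate and unpad are hypothetical helper names, not PyTorch functions.

```python
import torch
import torch.nn.functional as F


def pad_collate(batch):
    """batch: list of (spec, label) pairs, spec of shape (C, H_i, W)."""
    lengths = torch.tensor([spec.shape[1] for spec, _ in batch])
    max_len = int(lengths.max())
    # F.pad takes pairs starting from the last dim: (W_left, W_right, H_top, H_bottom)
    padded = torch.stack([
        F.pad(spec, (0, 0, 0, max_len - spec.shape[1]))
        for spec, _ in batch
    ])
    labels = torch.tensor([label for _, label in batch])
    return padded, lengths, labels


def unpad(padded, lengths):
    """Recover the original variable-length samples from the recorded lengths."""
    return [x[:, :int(n), :] for x, n in zip(padded, lengths)]


batch = [(torch.ones(1, 3, 4), 0), (torch.ones(1, 5, 4), 1)]
padded, lengths, labels = pad_collate(batch)
recovered = unpad(padded, lengths)
```

pad_collate can be passed to a DataLoader as collate_fn, so the lengths travel with every batch.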

But if you recover the original images, the samples in a batch may have different sizes. You cannot feed them to the CNN as a single tensor, so you have to feed them one by one. Since you haven’t given any specific information about the input to the network, I can only give the above suggestions.
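Feeding the samples one by one could look roughly like this. The tiny network is only a stand-in for a real ResNet, and the sketch assumes the model ends in a global pooling layer so that any H is accepted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: global average pooling makes the output size independent of H.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 3),
)

# Two unpadded samples of shape (C, H, W) with different H.
specs = [torch.randn(1, 50, 40), torch.randn(1, 80, 40)]
labels = torch.tensor([0, 2])

# Run each sample as its own batch of size 1, then combine the outputs.
logits = torch.cat([model(s.unsqueeze(0)) for s in specs])  # shape (N, 3)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()  # gradients accumulate over all samples, as for a normal batch
```

Keep in mind that BatchNorm layers in training mode compute their statistics per forward call, so a network with BatchNorm will behave differently when it sees one sample at a time.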

Hi weedwind,

I didn’t see any masking work in your description. This is what I am doing: Padding and masking in convolution

Currently, my model seems to be working well, but I have some questions about the gradient too.
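For readers wondering what masking in a convolution can look like, here is one possible sketch. It is not necessarily what the linked post does. length_mask, MaskedConvBlock and masked_mean are hypothetical names, and the sketch assumes stride-1 convolutions so H does not change; with striding or pooling the mask would have to be downsampled as well.

```python
import torch
import torch.nn as nn


def length_mask(lengths, max_len):
    """Return an (N, 1, max_len, 1) float mask: 1 for real frames, 0 for padding."""
    idx = torch.arange(max_len).unsqueeze(0)       # (1, max_len)
    mask = (idx < lengths.unsqueeze(1)).float()    # (N, max_len)
    return mask[:, None, :, None]


class MaskedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, mask):
        x = torch.relu(self.conv(x * mask))
        # The conv bias and neighbouring frames make padded rows nonzero, so zero them again.
        return x * mask


def masked_mean(x, mask):
    """Average over H and W using only the real frames. x: (N, C, H, W)."""
    total = (x * mask).sum(dim=(2, 3))             # (N, C)
    count = mask.sum(dim=(2, 3)) * x.shape[3]      # (N, 1) valid positions per channel
    return total / count


torch.manual_seed(0)
x = torch.randn(2, 1, 6, 4, requires_grad=True)   # padded batch, H = 6
lengths = torch.tensor([3, 6])                     # true lengths along H
mask = length_mask(lengths, x.shape[2])
block = MaskedConvBlock(1, 8)
pooled = masked_mean(block(x, mask), mask)
pooled.sum().backward()
# x.grad is exactly zero at the padded rows of sample 0, because x is multiplied by the mask.
```

On the gradient question: because both the input and the output of the block are multiplied by the mask, the padded positions contribute nothing to the loss, so they add nothing to the weight gradients either. The computation on the padded region still happens, though; masking only removes its effect.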

I have a question: collate_fn seems to control how the batch is formed, but not the input shape. If I want a variable input feature map size (H and W), what should I do?
