I am trying to use a model that combines the ResNet and Transformer architectures. I feed the input image to the ResNet backbone and extract the features of an intermediate layer.
input (size: 1, 3, 256, 256) --------> ResNet-50 layer3 -----------> features (1, 1024, 16, 16)
From the 16x16 output feature map, I select 10 pixels by some criterion, which results in a tensor of size (1, 1024, 10).
Now, this is provided as input to the transformer encoder with batch size 1, source sequence length 10, and embedding size 1024.
This works fine when I don't use any data transforms beyond the basic ones. Now I am trying to use the
RandomErasing transform with an erasing value of 0. When I select the pixels from the ResNet-50 output to feed into the transformer, some of the pixels I am interested in may lie in the erased region of the input image. So I create a key_padding_mask of size (batch_size, src_sequence_length) so the transformer ignores those erased-region pixels. With this mask the transformer part behaves correctly during backpropagation, but the gradient still propagates back into the ResNet block, so its weights are affected by the erased pixels as well.
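What I am doing for the transformer part looks roughly like this (the masked indices are made up for illustration; True marks a position that attention should ignore as a key):

```python
import torch
import torch.nn as nn

batch_size, src_len, d_model = 1, 10, 1024
src = torch.randn(src_len, batch_size, d_model)  # (seq, batch, embed)

# suppose pixels 3 and 7 of this sample came from the erased region
erased = torch.zeros(batch_size, src_len, dtype=torch.bool)
erased[0, [3, 7]] = True

enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

# positions marked True are excluded as attention keys for all queries
out = encoder(src, src_key_padding_mask=erased)  # (10, 1, 1024)
```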
How can I prevent gradients for these specific pixels, extracted from the erased region, from backpropagating into the ResNet block? The key_padding_mask holds boolean values indicating which pixels should be excluded from gradient backpropagation. Since I am using RandomErasing, the pixels that should be ignored change from sample to sample.