Spatial-Aware Feature Aggregation

I need to convert the following code from TensorFlow to PyTorch. The code is supposed to generate a position embedding map: two fully connected layers are used to select among the prominent features as well as encode the spatial combinations of the feature responses. The feature that is given to the function is a convolutional feature map of size [N, C, H, W], which is the output of a VGG16 network at the last conv layer, before the last max pooling layer.

import tensorflow as tf

def spatial_aware(input_feature, dimension, trainable, name):
    batch, height, width, channel = input_feature.get_shape().as_list()
    # average over the channel axis and flatten the spatial grid -> [N, H*W]
    vec1 = tf.reshape(tf.reduce_mean(input_feature, axis=-1), [-1, height * width])

    with tf.variable_scope(name):
        weight1 = tf.get_variable(name='weights1', shape=[height * width, int(height * width/2), dimension],
                                 trainable=trainable,
                                 initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.005),
                                 regularizer=tf.contrib.layers.l2_regularizer(0.01))
        bias1 = tf.get_variable(name='biases1', shape=[1, int(height * width/2), dimension],
                               trainable=trainable, initializer=tf.constant_initializer(0.1),
                               regularizer=tf.contrib.layers.l1_regularizer(0.01))
        # first FC over the spatial dim: [N, H*W] x [H*W, H*W/2, d] -> [N, H*W/2, d]
        # vec2 = tf.matmul(vec1, weight1) + bias1
        vec2 = tf.einsum('bi, ijd -> bjd', vec1, weight1) + bias1


        weight2 = tf.get_variable(name='weights2', shape=[int(height * width / 2), height * width, dimension],
                                  trainable=trainable,
                                  initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.005),
                                  regularizer=tf.contrib.layers.l2_regularizer(0.01))
        bias2 = tf.get_variable(name='biases2', shape=[1, height * width, dimension],
                                trainable=trainable, initializer=tf.constant_initializer(0.1),
                                regularizer=tf.contrib.layers.l1_regularizer(0.01))
        # second FC maps back to the spatial size: [N, H*W/2, d] x [H*W/2, H*W, d] -> [N, H*W, d]
        vec3 = tf.einsum('bjd, jid -> bid', vec2, weight2) + bias2

        return vec3
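
For reference, the intended call looks roughly like this (a minimal usage sketch, not my actual pipeline; the 7x7x512 NHWC placeholder shape is just an example):

# hypothetical TF1.x usage of the function above
feat = tf.placeholder(tf.float32, [None, 7, 7, 512])   # VGG16 conv feature, channels-last
pos_embed = spatial_aware(feat, dimension=8, trainable=True, name='spatial_aware')
# pos_embed has shape [None, 49, 8]: one spatial map of size H*W per embedding dimension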

It seems you are using TensorFlow, so I would recommend posting this question in their forum, as you would find the experts there :wink:

Hi, thanks for your quick reply.

I think I have an equivalent implementation of SAFA. However, I keep getting an error when combining it with the VGG16 module.

import torch
import torch.nn as nn
import torchvision

class SAFA(nn.Module):
    """SAFA layer implementation"""
    def __init__(self, in_channel, d=8):
        super().__init__()
        c = in_channel
        self.fc = nn.ModuleList([
            nn.Sequential(
                nn.Linear(c, c // 2),
                nn.Linear(c // 2, c),
            ) for _ in range(d)
        ])

    def forward(self, x):
        # average over the channel dim
        x = torch.mean(x, dim=1)
        # apply each of the d FC branches and stack along a new last dim
        x = [b(x) for b in self.fc]
        x = torch.stack(x, -1)
        return x

class VGG16_SAFA(nn.Module):

    def __init__(self, in_channels, num_classes):
        super(VGG16_SAFA, self).__init__()
        self.in_channels = in_channels
        self.num_classes = num_classes
        self.SAFA = SAFA(512, d=8)
        self.pooling = nn.AvgPool2d(kernel_size=2, stride=2, padding=0)

        # pretrained VGG16 feature layers, dropping the last two (ReLU and max pool)
        layers = list(torchvision.models.vgg16(pretrained=True, progress=True).features.children())[:-2]

        # freeze everything except the last few conv layers
        for l in layers[:-5]:
            for p in l.parameters():
                p.requires_grad = False

        self.backbone = nn.Sequential(*layers, nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2)))

    def forward(self, x):
        
        H, W = x.shape[2]//16, x.shape[3]//16
        
        x = self.backbone(x) 
        x = x.flatten(2).transpose(1, 2)
        B, N, C = x.shape
        x = x.transpose(1, 2).view(B, C, H, W) 
        x = self.pooling(x) 
        
        #x = torch.flatten(x, start_dim=-2, end_dim=-1)
        x_sa = self.SAFA(x)
        
        # b c h*w @ b h*w d = b c d
        x = x @ x_sa
        x = torch.transpose(x, -1, -2).flatten(-2, -1) 
        return x

The error I am getting is the following:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x9 and 512x256)

I am still trying to reproduce the TensorFlow code originally posted. The feature size after the backbone is [N, 512, 7, 7].

Based on the error message, it seems the first linear layer in self.SAFA raises the shape mismatch, as 512 features are expected while your input seems to have the shape [batch_size, *, 9].
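
You can reproduce the message with a quick shape check (hypothetical shapes, assuming the pooled 3x3 map was flattened to [B, 512, 9] before SAFA):

import torch
import torch.nn as nn

x = torch.randn(2, 512, 9)   # e.g. a pooled [B, 512, 3, 3] map flattened along the spatial dims
x = torch.mean(x, dim=1)     # channel mean -> [2, 9], only 9 features left
nn.Linear(512, 256)(x)       # RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x9 and 512x256)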
I don’t quite understand why you are transposing x two times:

x = x.flatten(2).transpose(1, 2)
B, N, C = x.shape
x = x.transpose(1, 2).view(B, C, H, W) 

as the second transpose(1, 2) call should restore the shape produced by x.flatten(2), shouldn’t it?
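
As a quick sanity check (with made-up shapes), the round trip is a no-op:

import torch

a = torch.randn(2, 512, 7, 7).flatten(2)   # [2, 512, 49]
b = a.transpose(1, 2).transpose(1, 2)      # back to [2, 512, 49]
print(torch.equal(a, b))                   # True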

I also don’t know how B, C, H, W are defined, so check the shape of x before passing it to SAFA and make sure it has 512 features in the last dim.
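
If what you want is the behavior of the posted TensorFlow function, note that its FC layers act on the flattened spatial dimension H*W (after averaging over the channels), not on the 512 channels. Below is a minimal sketch along those lines (an assumption on my side: the pooled feature map is [B, 512, 3, 3], so H*W = 9; the custom initializers and regularizers from the TF code are omitted):

import torch
import torch.nn as nn

class SAFA(nn.Module):
    """FC layers over the flattened spatial dim, one branch per embedding dim."""
    def __init__(self, hw, d=8):
        super().__init__()
        self.fc = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hw, hw // 2),
                nn.Linear(hw // 2, hw),
            ) for _ in range(d)
        ])

    def forward(self, x):
        # x: [B, C, H, W] -> mean over channels, flatten spatial -> [B, H*W]
        v = torch.mean(x, dim=1).flatten(1)
        # one aggregation map per branch, stacked along a new last dim -> [B, H*W, d]
        return torch.stack([fc(v) for fc in self.fc], dim=-1)

# hypothetical shapes from this thread
feat = torch.randn(2, 512, 3, 3)              # pooled VGG16 features
attn = SAFA(hw=3 * 3, d=8)(feat)              # [2, 9, 8]
out = feat.flatten(2) @ attn                  # [2, 512, 9] @ [2, 9, 8] -> [2, 512, 8]
out = out.transpose(-1, -2).flatten(-2, -1)   # [2, 4096] descriptor

With this version SAFA is constructed with the spatial size (9 here) rather than 512, which should resolve the mat1/mat2 mismatch.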