Attention Mask: Show, Attend and Interact/tell

Hi, I am implementing the paper Show, Attend and Interact: Human-Robot Interaction (https://arxiv.org/pdf/1702.08626.pdf) in PyTorch. I have successfully implemented the architecture from the paper, and I now have an annotation vector z from the attention module (G).

My question is: how do I use the annotation vector to overlay an attention mask on the input image, indicating the attended region?

My ConvNet takes a 198x198 image and transforms it into 256 feature maps (a) of size 7x7.
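
For context, a minimal sketch of a ConvNet with those shapes (the layer choices below are illustrative assumptions, not my actual network; only the 198x198 input and 256x7x7 output come from my setup, and I assume a 3-channel image):

import torch
import torch.nn as nn

# Illustrative layers only; just the 198x198 input and the 256x7x7
# output are fixed. Assumes a 3-channel input image.
convnet = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=8, stride=4),     # (3, 198, 198) -> (32, 48, 48)
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2),    # -> (64, 23, 23)
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2),   # -> (128, 11, 11)
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=5, stride=1),  # -> (256, 7, 7)
    nn.ReLU(),
)

a = convnet(torch.randn(1, 3, 198, 198))  # a: (1, 256, 7, 7)
a = a.flatten(2).transpose(1, 2)          # (1, 49, 256), ready for attention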

My attention module takes input of shape 49x256 (the 7x7x256 feature maps flattened over the spatial dimensions) and outputs an annotation vector z.
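
The paper's exact attention module is not reproduced here, but a generic soft-attention sketch over the 49 locations could look like this (names and layer sizes are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    # Generic soft attention: score each of the 49 locations, softmax
    # over them, and return the weighted sum z plus the weights alpha.
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a):                  # a: (batch, 49, 256)
        e = self.score(a).squeeze(-1)      # (batch, 49) unnormalized scores
        alpha = F.softmax(e, dim=1)        # attention over the 49 locations
        z = torch.bmm(alpha.unsqueeze(1), a).squeeze(1)  # (batch, 256)
        return z, alpha

Note that for visualizing the attended region, the quantity to upsample is the 7x7 weight map alpha (reshaped to (1, 1, 7, 7)), not the annotation vector z itself.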

In the original Torch/Lua implementation, I used to display the attention mask via nn.SpatialSubSampling, as shown below:
require 'nn'
require 'image'

-- Stack of subsampling layers mapping 198x198 down to 7x7
-- (198 -> 64 -> 29 -> 12 -> 7 with these kernel/stride choices).
local upsample = nn.Sequential()
upsample:add(nn.SpatialSubSampling(1, 9, 9, 3, 3))
upsample:add(nn.SpatialSubSampling(1, 8, 8, 2, 2))
upsample:add(nn.SpatialSubSampling(1, 7, 7, 2, 2))
upsample:add(nn.SpatialSubSampling(1, 6, 6, 1, 1))
upsample:float()

-- Fill every parameter with 0.25 so the stack acts as a fixed averaging filter.
local w, dw = upsample:getParameters()
w:fill(0.25)
dw:zero()

-- Forward a dummy input once, then backpropagate the 7x7 attention map:
-- the gradient w.r.t. the 198x198 input is the upsampled attention mask.
local empty = torch.zeros(1, 198, 198):float()
upsample:forward(empty)
local attention, q = self:predict(state[m])
attention = upsample:updateGradInput(empty, attention:float())
attention = image.scale(attention, 198, 198, 'bilinear')
Please tell me how to display the attention mask in PyTorch, as I am not able to find any subsampling function.

The formula for nn.SpatialSubSampling looks like this:

output[i][j][k] = bias[k] + weight[k] * sum_{s=1}^kW sum_{t=1}^kH input[dW*(i-1)+s][dH*(j-1)+t][k]

We can rearrange the formula so it can be built from existing PyTorch modules:

O1[i][j][k] = 1/(kH * kW) * sum_{s=1}^kW sum_{t=1}^kH input[dW*(i-1)+s][dH*(j-1)+t][k] # This is AvgPool2d
O2[i][j][k] = O1[i][j][k] * (kH * kW) # Multiply the matrix by a constant
O3[i][j][k] = O2[i][j][k] * weight[k] + bias[k] # Use a learnable vector `weight` and `bias` to calculate O3
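
Putting that decomposition together, here is a sketch of a PyTorch equivalent of the Lua snippet. The updateGradInput trick is reproduced with autograd: backpropagate the 7x7 attention map through the stack and read off the gradient at the 198x198 input (the module and function names below are mine, and the 0.25 fill mirrors the original code):

import torch
import torch.nn as nn

class SpatialSubSampling(nn.Module):
    # Emulates Torch's nn.SpatialSubSampling via the O1/O2/O3 steps above:
    # average pool (O1), multiply by kH*kW to turn the mean into a sum (O2),
    # then a learnable per-channel weight and bias (O3).
    def __init__(self, channels, kW, kH, dW, dH):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=(kH, kW), stride=(dH, dW))
        self.k = kH * kW
        # Mirror w:fill(0.25) / dw:zero() from the Lua snippet.
        self.weight = nn.Parameter(torch.full((channels,), 0.25))
        self.bias = nn.Parameter(torch.full((channels,), 0.25))

    def forward(self, x):
        o1 = self.pool(x)  # O1: AvgPool2d
        o2 = o1 * self.k   # O2: multiply by the constant kH*kW
        return o2 * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)  # O3

# Same stack as the Lua code: 198x198 -> 64 -> 29 -> 12 -> 7x7.
upsample = nn.Sequential(
    SpatialSubSampling(1, 9, 9, 3, 3),
    SpatialSubSampling(1, 8, 8, 2, 2),
    SpatialSubSampling(1, 7, 7, 2, 2),
    SpatialSubSampling(1, 6, 6, 1, 1),
)

def attention_mask(attention):  # attention: (1, 1, 7, 7)
    # updateGradInput equivalent: backprop the attention map through the
    # stack; the gradient w.r.t. the 198x198 input is the upsampled mask.
    empty = torch.zeros(1, 1, 198, 198, requires_grad=True)
    out = upsample(empty)
    out.backward(attention)
    return empty.grad[0, 0]  # (198, 198)

The bias never influences the input gradient, so only the 0.25 weights matter for the mask, and the final image.scale call from the Lua code changes nothing here size-wise, since the gradient already comes out at 198x198.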