Passing a variable number of image features through an nn.Linear layer


I am building a VQA model with adaptive image features (10 to 100 per image). My batch size is 512 and each image feature has a dimension of 2048. Every image with fewer than 100 features is padded so that the shape of each minibatch is always (512, 100, 2048).
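To make the padding concrete, here is a minimal sketch of how a validity mask could be built from the per-image object counts. The name `make_valid_mask` is hypothetical (it is not the `generate_mask` helper used further down, whose implementation is not shown), and the tiny sizes are only for illustration.

```python
import torch

def make_valid_mask(n_objs: torch.Tensor, max_objs: int) -> torch.Tensor:
    """Return a (batch, max_objs) bool mask that is True for real (unpadded) slots."""
    positions = torch.arange(max_objs, device=n_objs.device)  # (max_objs,)
    # Slot j of image b is real when j < n_objs[b]
    return positions.unsqueeze(0) < n_objs.unsqueeze(1)       # (batch, max_objs)

# Toy batch: 3 images with 3, 1 and 4 real features, padded to 4 slots
n_objs = torch.tensor([3, 1, 4])
mask = make_valid_mask(n_objs, max_objs=4)
```

Indexing a padded feature tensor with such a mask, e.g. `v[mask]`, keeps only the real features, in image order.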

To compute the attention weights over these adaptive features, I follow this process:

            # relu and softmax are assumed to come from torch.nn.functional
            batch_size = n_objs.size(0)
            weight_mold = torch.zeros(batch_size, self.max_objs, 1).to(self.device)
            total_objs = int(n_objs.sum().item())
            q_mold = torch.zeros(total_objs, self.q_emb).to(self.device)
            # Repeat the n-th question once for every real object in the n-th image
            obj_p = 0
            for n, i in enumerate(n_objs):
                n_i = int(i.item())
                q_i = q2[n]
                q_i = q_i.repeat(n_i, 1)
                q_mold[obj_p:n_i + obj_p, :] = q_i
                obj_p += n_i
            # Keep only the unpadded image features: (total_objs, v_emb)
            mask = generate_mask(v2, n_objs, device=self.device)
            flattened_objs = torch.masked_select(v2, mask)
            # Store the projection under its own name (it previously overwrote total_objs)
            v_proj = self.v_proj(flattened_objs.view((-1, self.v_emb)))
            q_proj = self.q_proj(q_mold)
            # fusion = q_proj * v_proj  (earlier element-wise product variant)
            fusion = -(v_proj - q_proj) ** 2 + relu(v_proj + q_proj)
            fusion = self.dropout(fusion)
            fusion = self.linear(fusion)
            # Softmax the logits of each image separately and write them back
            obj_p = 0
            for n, i in enumerate(n_objs):
                n_i = int(i.item())
                objs_n = fusion[obj_p:n_i + obj_p, :]
                objs_n = softmax(objs_n, 0)
                weight_mold[n, :n_i, :] = objs_n
                obj_p += n_i
            return weight_mold

To walk through the code quickly, I:

  1. Get the batch size and the total number of feature objects (total_objs) in the batch.
  2. Create a zero tensor of size (512, 100, 1) that will hold the attention scores for each object in each image in the batch.
  3. Create a zero tensor of size (total_objs, 5000) to hold the question embeddings (dim = 5000).
  4. Use a for loop to repeat the ith question n times, once for each of the n objects in the ith image.
  5. Use a mask and a flattening step to reshape the image features from (512, 100, 2048) to (total_objs, 2048), essentially selecting only the unpadded image features.
  6. Project each image feature (total_objs, 2048) and each question (total_objs, 5000) to the same size (total_objs, 1024) so that I can fuse them element-wise. I originally used element-wise multiplication; the code above uses -(v_proj - q_proj)**2 + relu(v_proj + q_proj).
  7. Perform another projection to get one logit per fused item: (total_objs, 1).
  8. In another for loop, softmax the n logits of the ith image to compute their final scores and place them in weight_mold (512, 100, 1).
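The steps above can also be written without Python loops. The following is only a sketch under assumptions, not a drop-in replacement: the class name `AdaptiveAttention`, its constructor arguments, and the toy sizes are hypothetical, and the mask is built inline rather than with `generate_mask`. It uses `torch.repeat_interleave` to repeat each question once per real object and a softmax over `-inf`-filled padded slots in place of the per-image softmax loop. It assumes every image has at least one real object, which holds for 10 to 100 features per image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    # Hypothetical module; layer names mirror the question, sizes are parameters.
    def __init__(self, v_emb=2048, q_emb=5000, hidden=1024, max_objs=100, p_drop=0.0):
        super().__init__()
        self.max_objs = max_objs
        self.v_proj = nn.Linear(v_emb, hidden)
        self.q_proj = nn.Linear(q_emb, hidden)
        self.dropout = nn.Dropout(p_drop)
        self.linear = nn.Linear(hidden, 1)

    def forward(self, v, q, n_objs):
        # v: (B, max_objs, v_emb), q: (B, q_emb), n_objs: (B,) object counts
        n_objs = n_objs.long()
        slots = torch.arange(self.max_objs, device=v.device)
        mask = slots.unsqueeze(0) < n_objs.unsqueeze(1)       # (B, max_objs), True = real
        v_flat = v[mask]                                      # (total_objs, v_emb)
        q_flat = torch.repeat_interleave(q, n_objs, dim=0)    # (total_objs, q_emb)
        v_p = self.v_proj(v_flat)
        q_p = self.q_proj(q_flat)
        # Same fusion as in the question's code
        fusion = -(v_p - q_p) ** 2 + F.relu(v_p + q_p)
        logits = self.linear(self.dropout(fusion)).squeeze(-1)  # (total_objs,)
        # Scatter logits back to the padded layout; padded slots get -inf
        padded = torch.full(mask.shape, float("-inf"),
                            device=v.device, dtype=logits.dtype)
        padded[mask] = logits
        # Softmax over the slot dimension; exp(-inf) = 0, so padded slots get weight 0
        weights = F.softmax(padded, dim=1)
        return weights.unsqueeze(-1)                          # (B, max_objs, 1)

# Toy usage with small, made-up sizes
torch.manual_seed(0)
model = AdaptiveAttention(v_emb=8, q_emb=6, hidden=4, max_objs=5).eval()
v = torch.randn(3, 5, 8)
q = torch.randn(3, 6)
n_objs = torch.tensor([5, 2, 3])
weights = model(v, q, n_objs)
```

Because the softmax for each image only sees that image's real logits, this should give the same weights as the per-image softmax over dimension 0 in the loop version, with zeros in the padded slots.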

However, when I train, my accuracy converges to about 44%, so I suspect this process isn’t working. When I switch my network to fixed features, it achieves about 63%, which suggests that the logic above is the problem.

Does anyone have any ideas on how to compute the attention scores for these adaptive features or can correct my logic above? I have been mind boggled by this for too long lol.

Thank you in advance!

I actually can confirm the code above works. My model just took a bit longer to converge for some reason!