I am making a VQA model with a variable ("adaptive") number of image features (10 to 100) per image. My batch size is 512 and each image feature has a dimension of 2048. Every image with fewer than 100 features is zero-padded so that the shape of each minibatch is always (512, 100, 2048).
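For concreteness, this is roughly what the padding step looks like (a minimal sketch; `pad_features` and the variable names are illustrative, not my actual code):

```python
import torch

def pad_features(feats_list, max_objs=100, feat_dim=2048):
    """Pad a list of (n_i, feat_dim) tensors into one (B, max_objs, feat_dim) batch."""
    batch = torch.zeros(len(feats_list), max_objs, feat_dim)
    n_objs = torch.zeros(len(feats_list), dtype=torch.long)
    for i, f in enumerate(feats_list):
        batch[i, :f.size(0)] = f      # real features go first, zeros pad the rest
        n_objs[i] = f.size(0)         # remember how many features are real
    return batch, n_objs

feats = [torch.randn(10, 2048), torch.randn(37, 2048)]
batch, n_objs = pad_features(feats)
print(batch.shape)      # torch.Size([2, 100, 2048])
print(n_objs.tolist())  # [10, 37]
```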
To compute the attention weights over these adaptive features, I follow this process:
```python
batch_size = n_objs.size(0)
weight_mold = torch.zeros(batch_size, self.max_objs, 1).to(self.device)
total_objs = int(n_objs.sum().item())
q_mold = torch.zeros(total_objs, self.q_emb).to(self.device)

# Repeat the i-th question embedding once per object in the i-th image
obj_p = 0
for n, i in enumerate(n_objs):
    n_i = int(i.item())
    q_i = q2[n]
    q_i = q_i.repeat(n_i, 1)
    q_mold[obj_p:n_i + obj_p, :] = q_i
    obj_p += n_i

# Select only the unpadded image features: (512, 100, 2048) -> (total_objs, 2048)
mask = generate_mask(v2, n_objs, device=self.device)
flattened_objs = torch.masked_select(v2, mask)
v_proj = self.v_proj(flattened_objs.view((-1, self.v_emb)))
q_proj = self.q_proj(q_mold)

# fusion = q_mold * total_objs
fusion = -(v_proj - q_proj) ** 2 + relu(v_proj + q_proj)
fusion = self.dropout(fusion)
fusion = self.linear(fusion)

# Softmax the n_i logits of each image and scatter them into the padded weight matrix
obj_p = 0
for n, i in enumerate(n_objs):
    n_i = int(i.item())
    objs_n = fusion[obj_p:n_i + obj_p, :]
    objs_n = softmax(objs_n, 0)
    if n == 0:
        print(objs_n)  # debug: inspect the first image's weights
    weight_mold[n, :n_i, :] = objs_n
    obj_p += n_i

return weight_mold
```

(Note: in my original snippet the image projection was assigned to `total_objs`, shadowing the object count, while the fusion line referenced `v_proj`; I've renamed it to `v_proj` above.)
To walk through the code real quick, I:
- Get the batch size and the total number of feature objects (`total_objs`) in the batch
- Create an empty matrix of size (512, 100, 1) which will hold the attention scores for each object in each image in the batch
- Create an empty matrix of size (total_objs, 5000) to hold the question embeddings (dim = 5000).
- I then use a for loop to repeat the ith question n_i times, once for each of the n_i objects in the ith image
- I then use a mask and a flattening procedure to reshape the image features from (512, 100, 2048) to (total_objs, 2048), essentially selecting only the unpadded image features.
- Then I project each image feature (total_objs, 2048) and each question (total_objs, 5000) to the same size (total_objs, 1024) so that I can fuse them element-wise.
- I then perform another projection to get the logits for each fused item (total_objs, 1)
- Then, in another for loop, I softmax the n_i logits for the ith image to compute the final scores and place them in weight_mold (512, 100, 1).
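Isolated from the rest of the network, the last step above can be sketched like this (`scatter_softmax` is just an illustrative name for the loop, not a real API):

```python
import torch
import torch.nn.functional as F

def scatter_softmax(logits, n_objs, max_objs):
    """Softmax flat per-image logits (total_objs, 1) and scatter into (B, max_objs, 1)."""
    weight_mold = torch.zeros(len(n_objs), max_objs, 1)
    obj_p = 0
    for n, n_i in enumerate(n_objs):
        # softmax only over this image's n_i logits, leave padding at zero
        weight_mold[n, :n_i, :] = F.softmax(logits[obj_p:obj_p + n_i, :], dim=0)
        obj_p += n_i
    return weight_mold

logits = torch.randn(5, 1)                 # e.g. two images with 2 and 3 objects
w = scatter_softmax(logits, [2, 3], max_objs=4)
print(w[0, :2].sum())  # each image's real weights sum to ~1; padded slots stay 0
```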
However, when I train, my accuracy converges to about ~44%, so I know this process isn't working. When I switch the network to fixed features it reaches ~63%, which tells me the logic above is exactly the problem.
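One sanity check I've been considering is whether a masked softmax applied directly on the padded logits would give the same weights as my per-image loop (this is only a sketch of that check, with made-up shapes and names):

```python
import torch
import torch.nn.functional as F

B, max_objs = 2, 4
logits = torch.randn(B, max_objs, 1)
n_objs = torch.tensor([2, 3])

# Build a (B, max_objs) validity mask from the per-image object counts
idx = torch.arange(max_objs).unsqueeze(0)   # (1, max_objs)
valid = idx < n_objs.unsqueeze(1)           # (B, max_objs) bool

# Padded positions get -inf so the softmax assigns them exactly zero weight
masked = logits.masked_fill(~valid.unsqueeze(-1), float('-inf'))
weights = F.softmax(masked, dim=1)          # softmax over the object axis per image

print(weights[0, :2].sum())  # real objects sum to ~1
print(weights[0, 2:].sum())  # padded positions get weight 0
```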
Does anyone have ideas on how to compute the attention scores for these adaptive features, or can anyone correct my logic above? I have been mind-boggled by this for too long lol.
Thank you in advance!