Hello,
I am building a VQA model with co-attention over Y adaptive image features (Y ranges from 10 to 100). To compute the co-attention, I first project my question from (batch * q_len * q_dim) to (batch * q_len * new_dim).
Then, in the following for loop, I project each image's features from (Y * feature_dim) to (Y * new_dim):
def attn_weights(self, q2, v2, n_objs):
    batch_size = n_objs.size(0)
    weights = torch.zeros(batch_size, self.max_objs, 1).to(self.device)
    q_proj = self.q_proj(q2)
    for i in range(batch_size):
        n_i = int(n_objs[i].item())     # number of objects for the ith image in batch
        v_i = v2[i]                     # the ith image in batch
        v_i = v_i[:n_i-1, :]            # selecting number of objects in image
        v_i = self.v_proj(v_i)          # projecting feature_dim to new_dim
        q_i = q_proj[i]                 # the ith question in batch
        fusion = v_i * q_i.repeat(n_i-1, 1)   # repeat the question Y times
        fusion = self.dropout(fusion)
        scores = self.linear(fusion)
        att_weights = softmax(scores, 0)
        weights[i, :n_i-1] = att_weights
    return weights
During training, this causes CUDA memory usage to skyrocket. I have checked nvidia-smi, and this function alone drives memory usage to 14113MiB / 15079MiB.
This is the error I have received:
File "main.py", line 181, in <module>
main()
File "main.py", line 166, in main
run(mod, train_loader, optimizer, train=[], prefix='train', epoch=i)
File "main.py", line 79, in run
loss.backward()
File "/opt/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.43 GiB (GPU 0; 14.73 GiB total capacity; 8.45 GiB already allocated; 1.04 GiB free; 4.54 GiB cached)
Is there a reason why this happens, and is there a known way around it? If nn.Linear layers are not supposed to be called in a for loop, my follow-up question is: how can I project the Y image features for every image in the batch at once, i.e. from (batch * 100 * feature_dim) to (batch * 100 * new_dim), when each image's trailing (100 - Y) feature slots are zero-padded, without the zero padding affecting the gradient of the projection?
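For reference, here is a loop-free sketch of what I am trying to achieve (names like BatchedAttn, q_dim, feature_dim, new_dim are placeholders, and I am assuming the question has already been pooled to a single vector per example, since that is what the q_i.repeat(...) in my loop effectively requires). nn.Linear accepts extra leading batch dimensions, so the whole (batch, max_objs, feature_dim) tensor can be projected in one call, and a mask built from n_objs could keep the padded slots out of the softmax:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BatchedAttn(nn.Module):
    """Loop-free sketch of attn_weights. Assumes q2 is a pooled question
    vector per example: (batch, q_dim)."""
    def __init__(self, q_dim, feature_dim, new_dim, dropout=0.2):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, new_dim)
        self.v_proj = nn.Linear(feature_dim, new_dim)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(new_dim, 1)

    def forward(self, q2, v2, n_objs):
        # v2: (batch, max_objs, feature_dim); n_objs: (batch,)
        batch_size, max_objs, _ = v2.shape
        v_proj = self.v_proj(v2)                  # (batch, max_objs, new_dim)
        q_proj = self.q_proj(q2).unsqueeze(1)     # (batch, 1, new_dim), broadcasts over objects
        fusion = self.dropout(v_proj * q_proj)
        scores = self.linear(fusion)              # (batch, max_objs, 1)
        # Mask padded object slots with -inf before the softmax so they get
        # exactly zero attention weight and contribute no gradient there.
        mask = torch.arange(max_objs, device=v2.device).unsqueeze(0) >= n_objs.unsqueeze(1)
        scores = scores.masked_fill(mask.unsqueeze(2), float('-inf'))
        return F.softmax(scores, dim=1)           # (batch, max_objs, 1)
```

I am not sure whether masking the scores like this is enough, or whether the padded rows passing through v_proj still affect its gradient.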
Any help would be greatly appreciated!