I am building a VQA model with co-attention over Y adaptive image features per image (Y ranges from 10 to 100). To compute the co-attention, I first project my question (batch × q_len × q_Dim) to (batch × q_len × new_dim).
Then I have the following for loop, in which I project each of my image features (Y × feature_dim) to (Y × new_dim):
    def attn_weights(self, q2, v2, n_objs):
        batch_size = n_objs.size(0)
        weights = torch.zeros(batch_size, self.max_objs, 1).to(self.device)
        q_proj = self.q_proj(q2)
        for i in range(batch_size):
            n_i = int(n_objs[i].item())            # number of objects for the i-th image in the batch
            v_i = v2[i]                            # the i-th image in the batch
            v_i = v_i[:n_i - 1, :]                 # select the objects present in the image
            v_i = self.v_proj(v_i)                 # project feature_dim to new_dim
            q_i = q_proj[i]                        # the i-th question in the batch
            fusion = v_i * q_i.repeat(n_i - 1, 1)  # repeat the question Y times
            fusion = self.dropout(fusion)
            scores = self.linear(fusion)
            att_weights = softmax(scores, 0)
            weights[i, :n_i - 1] = att_weights
        return weights
During training this causes CUDA memory usage to skyrocket. I have checked nvidia-smi, and this function alone drives usage up to 14113MiB / 15079MiB.
This is the error I have received:
    File "main.py", line 181, in <module>
      main()
    File "main.py", line 166, in main
      run(mod, train_loader, optimizer, train=, prefix='train', epoch=i)
    File "main.py", line 79, in run
      loss.backward()
    File "/opt/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
      torch.autograd.backward(self, gradient, retain_graph, create_graph)
    File "/opt/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
      allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: CUDA out of memory. Tried to allocate 1.43 GiB (GPU 0; 14.73 GiB total capacity; 8.45 GiB already allocated; 1.04 GiB free; 4.54 GiB cached)
Is there a reason why this happens, and is there a known way around it? If nn.Linear layers are not supposed to be called inside a for loop, my follow-up question is: how can I project the Y image features for every image in the batch, i.e. go from (batch × 100 × feature_dim) to (batch × 100 × new_dim), where the positions beyond the Y real image features (the remaining 100 − Y) are zero-padded, without the zero padding affecting the gradient of the projection?
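To make the second question concrete, here is a sketch of the kind of batched version I have in mind (names like `batched_attn_weights` are hypothetical, and I am assuming the question has already been pooled/projected to a single (batch × new_dim) vector per example). Padded positions are masked with -inf before the softmax, so they receive exactly zero attention weight and, as far as I understand, contribute nothing to the gradient:

```python
import torch
import torch.nn.functional as F

def batched_attn_weights(v_proj, linear, dropout, v2, q_vec, n_objs):
    # v2:    (batch, max_objs, feature_dim), zero-padded past n_objs[i]
    # q_vec: (batch, new_dim) -- question already projected to new_dim
    batch, max_objs, _ = v2.shape
    v = v_proj(v2)                            # (batch, max_objs, new_dim); nn.Linear
                                              # applies to the last dim, so no loop needed
    fusion = dropout(v * q_vec.unsqueeze(1))  # broadcast the question over the objects
    scores = linear(fusion)                   # (batch, max_objs, 1)
    # Mask out padded positions so they get zero attention (and zero gradient).
    idx = torch.arange(max_objs, device=v2.device).unsqueeze(0)  # (1, max_objs)
    mask = idx < n_objs.unsqueeze(1)                             # (batch, max_objs)
    scores = scores.masked_fill(~mask.unsqueeze(2), float('-inf'))
    return F.softmax(scores, dim=1)           # (batch, max_objs, 1)
```

Is something along these lines the standard way to handle the padding, or does the zero-padded input to `v_proj` still cause problems?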
Any help would be greatly appreciated!