Using nn.Linear inside a for loop causes CUDA to run out of memory


I am building a VQA model with co-attention over Y adaptive image features (10-100). To compute the co-attention, I first project my question from (batch * q_len * q_dim) to (batch * q_len * new_dim).

Then I have the following for loop, in which I project each of my image features from (Y * feature_dim) to (Y * new_dim):

    def attn_weights(self, q2, v2, n_objs):
        batch_size = n_objs.size(0)
        weights = torch.zeros(batch_size, self.max_objs, 1).to(self.device)

        q_proj = self.q_proj(q2)
        for i in range(batch_size):
            n_i = int(n_objs[i].item())  ## number of objects for the ith image in the batch
            v_i = v2[i]                  ## the ith image in the batch
            v_i = v_i[:n_i-1, :]         ## selecting the objects present in the image
            v_i = self.v_proj(v_i)       ## projecting feature_dim to new_dim
            q_i = q_proj[i]              ## the ith question in the batch
            fusion = v_i * q_i.repeat(n_i-1, 1)  ## repeat the question Y times
            fusion = self.dropout(fusion)
            scores = self.linear(fusion)
            att_weights = softmax(scores, 0)     ## torch.nn.functional.softmax
            weights[i, :n_i-1] = att_weights
        return weights

During training this causes CUDA memory usage to skyrocket: nvidia-smi shows this function alone consuming 14113MiB / 15079MiB.

This is the error I have received:

  File "", line 181, in <module>
  File "", line 166, in main
    run(mod, train_loader, optimizer, train=[], prefix='train', epoch=i)
  File "", line 79, in run
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/autograd/", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.43 GiB (GPU 0; 14.73 GiB total capacity; 8.45 GiB already allocated; 1.04 GiB free; 4.54 GiB cached)

Is there a reason why this is happening, and is there a known way around it? If nn.Linear layers are not supposed to be called in a for loop, my next question would be: how do I project the Y image features for every image in the batch, i.e. from (batch * 100 * feature_dim) to (batch * 100 * new_dim), where everything beyond the Y real features (the remaining 100 - Y slots) is zero padded, without the zero padding affecting the gradient of the projection?

Any help would be greatly appreciated!

Hi, nn.Linear works with an arbitrary number of dimensions: you can pass any tensor of shape (BATCH, \*, dim) to obtain (BATCH, \*, new_dim), since the layer is applied along the last dimension only.
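A minimal sketch of that behaviour (the shapes below are just illustrative, not taken from your model):

```python
import torch
import torch.nn as nn

proj = nn.Linear(2048, 512)        # feature_dim -> new_dim
v = torch.randn(32, 100, 2048)     # (batch, max_objs, feature_dim)
out = proj(v)                      # applied to the last dimension only

assert out.shape == (32, 100, 512)
```

So there is no need to loop over the batch just to project the object features.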

Never do for loops in PyTorch if you can avoid them: calling a module in a loop is equivalent to generating Siamese modules. It duplicates the computational graph as many times as you call the module, and all the intermediate activations are kept alive until backward.

If you would like to do something similar (Linear is "spatial" in the sense that you can pass arbitrary leading dimensions), the proper way is to squeeze everything into the BATCH dimension and handle the padding with a mask.
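As a sketch of what that could look like for your `attn_weights`: project all objects at once, broadcast the question over them, and mask the padded slots with `-inf` before the softmax so they receive zero weight (and zero gradient through the attention). This assumes the question has already been pooled to (batch, q_dim), as your `q_proj[i].repeat(...)` suggests; the class below is a hypothetical stand-in for your module, with the same layer names (`q_proj`, `v_proj`, `linear`, `dropout`, `max_objs`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BatchedAttn(nn.Module):
    """Batched, mask-based version of the per-image attention loop."""
    def __init__(self, q_dim, feature_dim, new_dim, max_objs):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, new_dim)
        self.v_proj = nn.Linear(feature_dim, new_dim)
        self.linear = nn.Linear(new_dim, 1)
        self.dropout = nn.Dropout(0.2)
        self.max_objs = max_objs

    def forward(self, q2, v2, n_objs):
        # q2: (batch, q_dim), v2: (batch, max_objs, feature_dim), n_objs: (batch,)
        q = self.q_proj(q2).unsqueeze(1)      # (batch, 1, new_dim)
        v = self.v_proj(v2)                   # Linear acts on the last dim: (batch, max_objs, new_dim)
        fusion = self.dropout(v * q)          # question broadcast over all objects
        scores = self.linear(fusion)          # (batch, max_objs, 1)
        # mask padded object slots so they get zero attention weight
        idx = torch.arange(self.max_objs, device=scores.device)
        pad = idx.unsqueeze(0) >= n_objs.unsqueeze(1)        # (batch, max_objs), True where padded
        scores = scores.masked_fill(pad.unsqueeze(2), float('-inf'))
        return F.softmax(scores, dim=1)       # (batch, max_objs, 1), sums to 1 over real objects
```

A quick usage check: each row of weights sums to 1 and the padded positions are exactly zero, so the padding never influences the result.

```python
m = BatchedAttn(q_dim=8, feature_dim=16, new_dim=12, max_objs=10).eval()
w = m(torch.randn(4, 8), torch.randn(4, 10, 16), torch.tensor([3, 10, 1, 5]))
# w.shape == (4, 10, 1); w[0, 3:] is all zeros
```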
