RuntimeError: CUDA out of memory with Self-Attention in GANs

I am trying to implement the self-attention mechanism in MSG-GAN for grayscale images. I am using the code from mdraw/BMSG-GAN (img_channels branch) on GitHub to generate X-ray images, and I am integrating the self-attention layer into the generator and discriminator as in akanimax/some-randon-gan-1 (master branch) on GitHub.

I get the following out-of-memory RuntimeError. I tried reducing the batch size to 1, but that didn't work. I also tried training on only 10 images, but that didn't work either.

The error log:

Traceback (most recent call last):
  File "", line 281, in <module>
  File "", line 275, in main
  File "/home/r00206978/AICS/MSG_X/SA/MSG_GAN/", line 556, in train
    images, loss_fn)
  File "/home/r00206978/AICS/MSG_X/SA/MSG_GAN/", line 413, in optimize_discriminator
    loss = loss_fn.dis_loss(real_batch, fake_samples)
  File "/home/r00206978/AICS/MSG_X/SA/MSG_GAN/", line 202, in dis_loss
    f_preds = self.dis(fake_samps)
  File "/home/r00206978/.local/lib/python3.7/site-packages/torch/nn/modules/", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/r00206978/AICS/MSG_X/SA/MSG_GAN/", line 304, in forward
    y = self.layers[self.depth - 2](y)
  File "/home/r00206978/.local/lib/python3.7/site-packages/torch/nn/modules/", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/r00206978/.local/lib/python3.7/site-packages/torch/nn/parallel/", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/r00206978/.local/lib/python3.7/site-packages/torch/nn/modules/", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/r00206978/AICS/MSG_X/SA/MSG_GAN/", line 566, in forward
    y, _ = self.self_attention(x)
  File "/home/r00206978/.local/lib/python3.7/site-packages/torch/nn/modules/", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/r00206978/AICS/MSG_X/SA/MSG_GAN/", line 87, in forward
    energy = th.bmm(proj_query, proj_key)  # energy
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 15.78 GiB total capacity; 10.18 GiB already allocated; 846.00 MiB free; 13.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting m$

Self Attention Layer:

import torch as th
from torch.nn import Conv2d, LeakyReLU, Parameter, Softmax


class SelfAttention(th.nn.Module):
    """
    Layer implements the self-attention module
    which is the main logic behind this architecture.
    Mechanism described in the paper ->
    Self Attention GAN: refer /literature/Zhang_et_al_2018_SAGAN.pdf
    args:
        channels: number of channels in the image tensor
        activation: activation function to be applied (default: lrelu(0.2))
        squeeze_factor: squeeze factor for query and keys (default: 8)
        bias: whether to apply bias or not (default: True)
    """

    def __init__(self, channels, activation=LeakyReLU(0.2), squeeze_factor=8, bias=True):
        """ constructor for the layer """

        # base constructor call
        super().__init__()

        # state of the layer
        self.activation = activation
        self.gamma = Parameter(th.zeros(1))

        # Modules required for computations
        self.query_conv = Conv2d(  # query convolution
            in_channels=channels,
            out_channels=channels // squeeze_factor,
            kernel_size=(1, 1),
            bias=bias,
        )

        self.key_conv = Conv2d(  # key convolution
            in_channels=channels,
            out_channels=channels // squeeze_factor,
            kernel_size=(1, 1),
            bias=bias,
        )

        self.value_conv = Conv2d(  # value convolution
            in_channels=channels,
            out_channels=channels,
            kernel_size=(1, 1),
            bias=bias,
        )

        # softmax module for applying attention
        self.softmax = Softmax(dim=-1)

    def forward(self, x):
        """
        forward computations of the layer
        :param x: input feature maps (B x C x H x W)
        :return:
            out: self attention value + input feature (B x O x H x W)
            attention: attention map (B x H x W x H x W)
        """

        # extract the shape of the input tensor
        m_batchsize, c, height, width = x.size()

        # create the query projection
        proj_query = self.query_conv(x).view(
            m_batchsize, -1, width * height).permute(0, 2, 1)  # B x (N) x C

        # create the key projection
        proj_key = self.key_conv(x).view(
            m_batchsize, -1, width * height)  # B x C x (N)

        # calculate the attention maps
        energy = th.bmm(proj_query, proj_key)  # energy B x (N) x (N), N = H * W
        attention = self.softmax(energy)  # attention B x (N) x (N)

        # create the value projection
        proj_value = self.value_conv(x).view(
            m_batchsize, -1, width * height)  # B X C X (N)

        # calculate the output
        out = th.bmm(proj_value, attention.permute(0, 2, 1))
        out = out.view(m_batchsize, c, height, width)

        attention = attention.view(m_batchsize, height, width, height, width)

        if self.activation is not None:
            out = self.activation(out)

        # apply the residual connection
        out = (self.gamma * out) + x
        return out, attention

Could you please help me solve this problem?
Thanks in advance…

Since reducing the batch size didn’t work, try reducing the spatial size of your images and check the maximum size that still allows the model to train. Alternatively, you could also try torch.utils.checkpoint to trade compute for memory.
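To see why the spatial size matters so much here, a rough back-of-the-envelope estimate can help. The sketch below assumes float32 activations and counts only a single B x N x N tensor (N = H * W), such as the energy produced by th.bmm(proj_query, proj_key) in the traceback. The helper name attention_map_bytes is made up for illustration, and the real layer holds at least two such tensors at once (the energy and its softmax), so treat the numbers as a lower bound.

```python
def attention_map_bytes(batch_size, height, width, bytes_per_element=4):
    """Bytes needed for one B x N x N tensor, where N = height * width.

    bytes_per_element=4 assumes float32; this is an assumption, not
    something stated in the thread.
    """
    n = height * width
    return batch_size * n * n * bytes_per_element


if __name__ == "__main__":
    gib = 2 ** 30
    # Feature-map side lengths are illustrative; the memory grows with side**4.
    for side in (32, 64, 128, 256):
        size = attention_map_bytes(1, side, side)
        print(f"{side}x{side}: {size / gib:.3f} GiB per B x N x N tensor")
```

Under these assumptions a 128x128 feature map already needs 1 GiB per tensor at batch size 1, and 256x256 needs 16 GiB. The 4.00 GiB request in the traceback would be consistent with, for example, a batch of 4 at 128x128, but that is only one possible combination; the actual shape at the failing layer is not shown here. The estimate depends on the height and width of the feature map that reaches the attention layer, not on the number of images in the dataset.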

Hi @ptrblck, thanks for your attention. I tried both of these, but the error is still there. I reduced the image size down to 6 KB. I also tried training on only 10 images, but that did not work. I added spectral normalization as well, but that did not work either. Why does the model not start training with self-attention? Without the self-attention mechanism the model works perfectly and trains well.

Your self-attention layer might use too much memory for your GPU, so check your implementation in isolation and profile its memory usage.
The memory usage could also indicate whether the implementation is wrong.
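One possible way to run that check, sketched under a few assumptions: do a single forward/backward pass of the layer on its own and read the peak allocation from torch.cuda.max_memory_allocated. The helper name peak_cuda_memory_mib and the Conv2d stand-in are hypothetical and only keep the snippet self-contained; the SelfAttention class from the question would be substituted for the stand-in.

```python
import torch


def peak_cuda_memory_mib(module, input_shape, device="cuda"):
    """Run one forward/backward pass and return the peak allocated CUDA memory in MiB."""
    device = torch.device(device)
    module = module.to(device)
    x = torch.randn(*input_shape, device=device, requires_grad=True)

    # Reset the peak counter so earlier allocations do not mask this measurement.
    torch.cuda.reset_peak_memory_stats(device)

    out = module(x)
    # The layer in the question returns (out, attention); plain modules return a tensor.
    if isinstance(out, tuple):
        out = out[0]
    out.mean().backward()

    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 2 ** 20


if __name__ == "__main__":
    if torch.cuda.is_available():
        # Stand-in module (assumption); replace with SelfAttention(64) to profile the real layer.
        layer = torch.nn.Conv2d(64, 64, kernel_size=1)
        for side in (16, 32, 64):
            mib = peak_cuda_memory_mib(layer, (1, 64, side, side))
            print(f"{side}x{side}: {mib:.1f} MiB peak")
    else:
        print("No CUDA device available; nothing to profile.")
```

Starting from a small input and doubling the side length shows how quickly the peak grows. If it rises roughly sixteen-fold per doubling, the B x N x N attention map is dominating, and the input resolution at that layer is the thing to reduce.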