Increased GPU memory demand when loading from checkpoint

Hi,

I have been observing some memory issues for quite a while. Whenever I resume training from a checkpoint, more GPU memory is required than when training from scratch. I have observed this independently of architecture/application (NLP, vision). It’s hard to quantify because it depends on the model, but I’d estimate it is around 5-10%. Consequently, I have to reduce the batch size in order to continue training from a checkpoint, which is a bit painful.
Any idea?


How are you measuring the memory usage?
Note that PyTorch uses a custom caching memory allocator, which holds on to device memory instead of returning it to the driver.
nvidia-smi will therefore show the memory used by the CUDA context as well as the allocated and the cached memory.
Since you have to reduce the batch size, it seems that the memory usage might indeed be growing.
Since you are observing this behavior in a lot of use cases, could you post a simple code snippet to reproduce it, please?
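
In the meantime, to separate the framework's cached memory from what is actually in use, a small helper like this can be called right after building the model and again right after loading the checkpoint (just a sketch; the tags are placeholders):

import torch

def report_gpu_memory(tag):
    # memory occupied by live tensors
    allocated = torch.cuda.memory_allocated() / 1024**2
    # memory held by the caching allocator (a superset of the above)
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

report_gpu_memory("after model init")
# ... torch.load / load_state_dict ...
report_gpu_memory("after loading the checkpoint")

nvidia-smi will always show a higher number than these, since it also includes the CUDA context.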

# model definition and checkpoint restore (VAE constructor args elided)
model = VAE(...)
pretrained_dict = torch.load(os.path.join(FLAGS.savedir, FLAGS.loadmodel))
model.load_state_dict(pretrained_dict)

I also observed this using the Hugging Face BERT implementations. Yes, I am measuring with nvidia-smi.

Here is the memory footprint when training from scratch:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     32630      C   python                                     11985MiB |
+-----------------------------------------------------------------------------+

Memory footprint when continuing from checkpoint:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     32707      C   python                                     15045MiB |
+-----------------------------------------------------------------------------+
 File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 15.75 GiB total capacity; 12.94 GiB already allocated; 392.88 MiB free; 14.38 GiB reserved in total by PyTorch)

Could you try to delete the pretrained_dict after loading it?
The difference in the memory usage seems to be quite large.
Also, could you post the definition of VAE?
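
For example, something along these lines (a rough sketch based on your snippet; VAE(...) stands in for your actual constructor call) would delete the dict after copying it into the model, and additionally map the checkpoint to the CPU first so that torch.load does not create an extra GPU copy of every tensor:

import gc
import os
import torch

model = VAE(...).to('cuda')
# load the checkpoint into host memory instead of onto the GPU it was saved from
pretrained_dict = torch.load(os.path.join(FLAGS.savedir, FLAGS.loadmodel),
                             map_location='cpu')
model.load_state_dict(pretrained_dict)

# free the CPU copy of the state dict and return cached blocks to the driver
del pretrained_dict
gc.collect()
torch.cuda.empty_cache()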

Yes, in this case the difference is quite large. Deleting the dictionary does not change the memory footprint, unfortunately.

The VAE code is a bit more complex. It consists of a Conv2d encoder stack, a bunch of fully connected layers, and a deconvolutional decoder stack. Maybe at some point it would be easier to look at the NLP problem, as it is more accessible.

class VAE2(nn.Module):
    '''
    Variational Autoencoder
    '''
    def __init__(self, img_channels, img_dim, latent_dim, filters,
                 kernel_sizes, strides, activation=nn.LeakyReLU,
                 out_activation=nn.Tanh, batch_norm=True, no_samples=10, sp_activation=None, public_stream=None, private_stream=None):
        '''
        img_dim (int): number of pixels on each row / column of the images
                       (assumes the images are square).
        img_channels (int): number of channels of the images (e.g.: 1 for
                            grayscale, 3 for color images).
        latent_dim (int): dimension of the latent space.
        filters (list of length n_conv): number of filters for each conv.
                                         layer.
        kernel_sizes (list of length n_conv): kernel size for each conv.
        strides (list of length n_conv): strides for each conv. layer.
        activation (nn.Module): activation used in all layers (default:
                                LeakyReLU).
        out_activation (subclass of nn.Module): activation used in the output
                                                layer (default: Tanh).
        batch_norm (boolean): if True, batch normalization is applied in every
                              layer before the activation (default: True).
        '''
        super(VAE2, self).__init__()

        self.img_dim = img_dim
        self.img_channels = img_channels
        self.latent_dim = latent_dim
        self.filters = filters
        self.kernel_sizes = kernel_sizes
        self.strides = strides
        self.activation = activation
        self.out_activation = out_activation
        self.batch_norm = batch_norm
        self.no_samples = no_samples

        n_conv = len(self.filters)

        # compute the paddings and the flattened dimension at the output of the
        # last conv.
        paddings = []
        dims = [self.img_dim]
        for i in range(n_conv):
            if (dims[i] - self.kernel_sizes[i]) % strides[i] == 0:
                paddings.append((self.kernel_sizes[i] - 1)//2)
            else:
                paddings.append((self.kernel_sizes[i] - strides[i] + 1)//2)

            dims.append((dims[i] + 2*paddings[i] - self.kernel_sizes[i])
                        // self.strides[i] + 1)
        flat_dim = self.filters[-1] * (dims[-1]**2)

        self.encoder = Encoder(self.img_channels, self.img_dim,
                               self.latent_dim, self.filters,
                               self.kernel_sizes, self.strides,
                               paddings, flat_dim,
                               activation=self.activation,
                               batch_norm=self.batch_norm)

        # the decoder architecture will be the transpose of the encoder's
        filters_dec = (list(reversed(self.filters[0:n_conv-1]))
                       + [img_channels])
        kernel_sizes_dec = list(reversed(self.kernel_sizes))
        strides_dec = list(reversed(self.strides))
        paddings = list(reversed(paddings))
        dims = list(reversed(dims))

        # compute the output paddings
        out_paddings = []
        for i in range(n_conv):
            out_dim = ((dims[i] - 1)*strides_dec[i] - 2*paddings[i] +
                       kernel_sizes_dec[i])
            out_paddings.append(dims[i+1] - out_dim)

        self.decoder = Decoder(self.latent_dim, self.filters[-1], dims[0],
                               filters_dec, kernel_sizes_dec, strides_dec,
                               paddings=paddings, out_paddings=out_paddings,
                               activation=self.activation,
                               out_activation=self.out_activation,
                               batch_norm=self.batch_norm)

        self.bottleneck_fc = SpLinear(in_features=flat_dim, out_features=int(np.ceil(flat_dim / 4)), bias=True, activation=sp_activation)
        # ... (further layers elided) ...
        self.fc_model_public = nn.Sequential(stream2dict(public_stream))
        self.fc_model_private = nn.Sequential(stream2dict(private_stream))

Could you post the arguments as well as the input shapes?
If it’s easier for you to point to another repository, that would be fine as well.
I just want to make sure that the memory increase shows up with a specific setup which I’m able to run locally.

The inputs are RGB images:
3 x 128 x 128

The 2D conv filters are:

filters: '64, 128, 256, 384'
kernel-sizes: '3, 3, 3, 3, 3'
strides: '2, 2, 2, 2, 2'

The FC that maps from the flattened convs to a compressed representation:

in: 12288, out: 6144

The FC streams (I am using a variant of linear layers, so I need more memory) each have dimensions:

'2x in: 6044, out: 2048; in: 2048, out: 512'

And another FC:

in: 1024, out: 8400

@TJKlein Did you ever figure out the cause of this?

@ptrblck I’m running into the same problem. Loading from a checkpoint takes a couple more gigabytes of VRAM than starting fresh, so I need to reduce the batch size to avoid OOMing. I originally blamed PyTorch Lightning, but the issue persists after reimplementing my entire training loop in vanilla PyTorch. Here’s my entire unedited training loop if specifics are needed.
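
The resume path follows the usual load-model-and-optimizer pattern; roughly like this (a simplified sketch with placeholder names and checkpoint keys, not the exact code):

import torch

# model and optimizer are built exactly as in the from-scratch run
model = build_model().to('cuda')          # build_model() is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# map everything to the CPU first; without map_location the tensors are
# restored onto the GPU they were saved from, on top of the live model
checkpoint = torch.load('checkpoint.pt', map_location='cpu')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint['epoch'] + 1

# free the CPU copy of the checkpoint before training continues
del checkpoint
torch.cuda.empty_cache()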


I’m hitting the same problem: training on 4 nodes with 4 GPUs per node runs fine, but when resuming from a checkpoint I get a CUDA out of memory error.
