I would like to use gradient checkpointing to train a CNN + RNN in an end-to-end fashion. Checkpointing is not strictly necessary for this problem, but I was hoping to backprop into the CNN through a very long sequence, e.g. 500 images. Without any further measures to reduce the memory footprint, this inevitably breaks training with a CUDA out-of-memory error.
If I understood gradient checkpointing correctly, it should be possible to skip storing the intermediate activations in the forward pass and, during the backward pass, recompute them by rerunning the forward pass for each segment - trading speed for lower memory consumption.
I am still a bit clueless on how to implement this as there seems to be no complete example with an RNN available.
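For reference, here is a minimal sketch of what I have in mind, using `torch.utils.checkpoint.checkpoint` around the per-frame CNN only (the module names, shapes, and hyperparameters are just placeholders I made up for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CNNRNN(nn.Module):
    """Hypothetical model: a per-frame CNN encoder followed by a GRU over time."""
    def __init__(self, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRU(32, hidden, batch_first=True)

    def forward(self, frames):
        # frames: (batch, seq_len, 3, H, W)
        b, t = frames.shape[:2]
        feats = []
        for i in range(t):
            # checkpoint(): the activations inside self.cnn are NOT kept for
            # this frame; they are recomputed during backward, trading extra
            # forward compute for a much smaller memory footprint.
            f = checkpoint(self.cnn, frames[:, i], use_reentrant=False)
            feats.append(f)
        feats = torch.stack(feats, dim=1)   # (batch, seq_len, 32)
        out, _ = self.rnn(feats)
        return out[:, -1]                   # last hidden state

model = CNNRNN()
x = torch.randn(2, 8, 3, 32, 32, requires_grad=True)
loss = model(x).sum()
loss.backward()
```

I am unsure whether checkpointing each frame individually like this is the right granularity, or whether one should checkpoint chunks of the sequence (and whether the RNN itself can be segmented the same way).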
Another resource I found is from fairscale. Is this a similar kind of activation checkpointing compared to PyTorch's `torch.utils.checkpoint`?
Any help or pointers to additional reading material is highly appreciated.