How do we trade time for memory?

Hi everyone,

First of all, I would like to thank the PyTorch development team for delivering such a quality framework.
Unlike most people here, I’m not into Deep Learning but come instead from the medical image registration field: a niche topic with specific constraints (smooth deformations between subjects, etc.) and specific mathematical tools to enforce them.
Now, most of the state-of-the-art algorithms in my field can be described as iterated linear algebra computations on large matrices. Here is a typical example, the “Sinkhorn scaling algorithm”:

NPOINTS = 10000

# Variable definitions, etc. for this minimal working example. ----------------------------------
# You don't need to read this to understand our problem.
# Import the relevant tools
import numpy as np          # standard array library
import torch
from   torch.autograd import Variable

use_cuda = torch.cuda.is_available()
dtype    = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor

# Computation of the Kernel Matrix :
def _squared_distances(x, y) :
	"Returns the matrix of |x_i-y_j|^2."
	x_col = x.unsqueeze(1) ; y_lin = y.unsqueeze(0)
	return torch.sum( (x_col - y_lin)**2 , 2 )
def _k(x, y, s) :
	"Returns the matrix of k(x_i,y_j)."
	sq = _squared_distances(x, y) / (s**2)
	return torch.exp( -sq )


t       = np.linspace(0, 2*np.pi, NPOINTS)
weights = np.ones( (NPOINTS,) ) /NPOINTS

X  = Variable(torch.from_numpy( np.vstack([np.cos(t),   np.sin(t)]).T ).type(dtype), requires_grad=True )
Y  = Variable(torch.from_numpy( np.vstack([np.cos(t)+1, np.sin(t)]).T ).type(dtype), requires_grad=True )
P  = Variable(torch.from_numpy( weights                               ).type(dtype), requires_grad=False)
Q  = Variable(torch.from_numpy( weights                               ).type(dtype), requires_grad=False)

# The meaningful part of the program, aka "The Sinkhorn Algorithm" ------------------------------
# Iterating this simple "for" loop is an efficient way of finding the optimal 
# scaling factors A and B such that  Gamma = diag(A)*K*diag(B)  has :
# - row-wise    sums equal to P
# - column-wise sums equal to Q
# Since the publication of the milestone paper
# "Sinkhorn Distances: Lightspeed Computation of Optimal Transport"
# (Marco Cuturi, 2013), which proposed to use this scheme to compute
# Optimal Transport plans (Monge-Kantorovitch theory), this algorithm
# has generated considerable interest in the applied maths and
# image registration communities.
#
# This algorithm showcases the typical features of state-of-the-art 
# methods in the field of medical image registration (major conference
# on the topic : MICCAI) :
#
# - "small" datasets, but each image has a large memory footprint 
#   (say, 256x256x256 images, or >50,000 triangles for a segmented
#    brain surface).
#
# - "diffeomorphic" (that is, tearing-free) registration algorithms, 
#   which iterate a simple linear algebra operation on the data. 
#   Here, we iterate a matrix-vector product + pointwise division,
#   but other pipelines will flow a vector field through time [t, t+dt]
#   using an ODE step  such as Euler (simplistic) or Runge-Kutta, etc.
#
# - output is a matching score, whose gradient shall be used  
#   to update the registration.
#

K  = _k( X, Y, .5) # Gaussian Kernel matrix (Npoints-by-Npoints), reg. param. = .5
A  = Variable(torch.ones( NPOINTS ).type(dtype), requires_grad = False) 
B  = Variable(torch.ones( NPOINTS ).type(dtype), requires_grad = False) 

for it in range(1000) :
	A = P / (K     @ B)
	B = Q / (K.t() @ A)
# Gamma_ij = a_i * b_j * k( x_i , y_j )
Gamma = (A.unsqueeze(1) @ B.unsqueeze(0)) * K 
Cost  = torch.sum( Gamma * _squared_distances(X,Y) )
# The above cost can be used as a distance measure between 
# the unlabeled points clouds X and Y.
print(Cost)

# In practice, target Y is fixed 
# and we're looking for the X-gradient of the Cost.
dX = torch.autograd.grad( Cost , X )
print(dX)

Problem: if NPOINTS is too large, you will get a GPU memory overflow.

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/THC/generic/THCStorage.cu:66

In theory, PyTorch, TF or Theano should be gifts from heaven for our community, as the gradients of our models no longer have to be computed by hand. In practice, however, every researcher who has toyed with autograd libraries has run into a huge memory problem.
Indeed, as evidenced by the example above, our methods make intensive use of:

  • “for” loops (10-1000 iterations) whose gradient is the key output of our algorithms (so using a volatile flag is not an option).
  • Large matrix products, which may not even fit in GPU memory (rough figures just below).
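
To put rough numbers on the second point (my own back-of-the-envelope figures, assuming float32 storage, i.e. 4 bytes per entry): with NPOINTS = 10,000, the kernel matrix K alone weighs about 400 MB, the broadcasted difference tensor built inside _squared_distances about 800 MB, and autograd keeps several such buffers alive for the backward pass.

bytes_K    = NPOINTS * NPOINTS * 4        # kernel matrix K (float32)      : ~0.4 GB for NPOINTS = 10000
bytes_diff = NPOINTS * NPOINTS * 2 * 4    # broadcasted (x_i - y_j) tensor : ~0.8 GB
print(bytes_K / 1e9, bytes_diff / 1e9)    # autograd keeps several such buffers alive for the backward pass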

(Note however that, as we don’t need that many examples to “train” our models, we would be happy to be a little more patient and a little less “memory greedy” if this allowed us to handle real medical data.)

All of this means that today, in our research community, Theano and TensorFlow are nothing more than prototyping tools, used on minimalistic datasets (curves sampled with <1000 points, etc.). All of the heavy lifting still has to be done by hand: most research teams have developed CUDA routines that compute large matrix products “block-by-block”, plus handcrafted strategies for computing the gradients of “for” loops without storing all the iterations. Obviously, this is very cumbersome…
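
For concreteness, here is a minimal, purely illustrative sketch of the “block-by-block” idea, applied to the kernel matrix-vector product K @ B from the example above (the helper name _kernel_matvec_blocked and the block size are made up for this post). The full NPOINTS-by-NPOINTS matrix is never materialized, only one slab of rows at a time; note that, used naively inside an autograd graph, the slabs would still be kept alive for the backward pass, which is why the handcrafted CUDA routines also implement the block-wise gradients by hand.

def _kernel_matvec_blocked(x, y, s, b, block = 1024) :
	"Computes  _k(x, y, s) @ b  one slab of rows at a time, without storing the full kernel matrix."
	slabs = []
	for start in range(0, x.size(0), block) :
		K_slab = _k(x[start : start+block], y, s)   # (block, NPOINTS) slab of the kernel matrix
		slabs.append(K_slab @ b)                    # (block,) chunk of the result
	return torch.cat(slabs, 0)

# Drop-in replacement for the two matrix-vector products of the Sinkhorn loop:
# A = P / _kernel_matvec_blocked(X, Y, .5, B)
# B = Q / _kernel_matvec_blocked(Y, X, .5, A)   # i.e. K.t() @ A, since the Gaussian kernel is symmetric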

Now, the release of PyTorch, which brands itself as a flexible, GPU-oriented, numpy-like framework, has given us some hope!
My question is: is there a way to trade “time for memory” using PyTorch?
I’m thinking about:

  • an equivalent to Theano’s “scan_checkpoints” routine (that would be a killer feature);
  • routines that would compute large matrix operations (linear algebra or pointwise operators) “block-by-block” on the GPU, storing the (large) result in RAM, along the lines of the sketch just after this list. (We target applications on MRI machines, so we cannot expect to have a super-expensive GPU cluster at hand for every image processing task…)
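
Just to make the second wish concrete, here is a hypothetical (made-up) sketch of the kind of routine I have in mind, reusing the _k helper from the example above: the kernel matrix is computed in GPU-sized slabs and assembled in CPU memory. As written, the .data.cpu() call detaches each slab from the autograd graph, so a usable version would also need the corresponding block-wise backward pass.

def _kernel_matrix_to_cpu(x, y, s, block = 1024) :
	"Assembles the full kernel matrix in CPU memory, computed one GPU-sized slab at a time."
	K_cpu = torch.FloatTensor(x.size(0), y.size(0))     # lives in plain RAM
	for start in range(0, x.size(0), block) :
		K_slab = _k(x[start : start+block], y, s)       # computed on the GPU
		K_cpu[start : start+block] = K_slab.data.cpu()  # detached from the autograd graph
	return K_cpu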

I looked for them in the doc, in this forum and on Google, but couldn’t find them :-/

In the case where those features are not available just yet, can we expect to see them implemented in the future?

Allowing users to choose a tradeoff between computation time and memory usage would let PyTorch reach a whole new set of research communities: lots of people would benefit from a GPU-oriented, semi-symbolic scientific computing library that can handle models which do not fit into the standard Deep Learning paradigm.
PyTorch seems to make a huge step in this direction, so I hope we will find a way to make this work.

Once again, thank you so much for all of your hard work, you guys make GPUs accessible to everyone :slight_smile:
All of us are daydreaming about real industrial applications… But having such a great prototyping tool at hand is already an incredible boon!

We haven’t implemented checkpointing in PyTorch yet; we are thinking about it.
However, if you want to manually create some checkpointing logic, it’s not that difficult.

See this thread, where someone trades off memory for timesteps, sort of like a manual scan_checkpoints:
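
The rough idea behind such manual checkpointing, in a made-up sketch (the step and loss_of functions, the chunk size and all the names below are placeholders, assuming a loop  state_{t+1} = step(state_t, params)  on plain tensors, with n_iter divisible by chunk): the forward pass stores only one detached snapshot of the state per chunk, and the backward pass replays each chunk from its snapshot with the graph enabled, chaining the vector-Jacobian products by hand.

import torch
from torch.autograd import Variable

def step(state, params) :
	"One cheap iteration; stand-in for a Sinkhorn update, an ODE step, etc."
	return state + .1 * torch.tanh(params * state)

def loss_of(state) :
	"Stand-in for the final matching score."
	return torch.sum(state ** 2)

def grad_with_checkpoints(state0, params, n_iter = 1000, chunk = 100) :
	"Gradient of loss_of(state_T) w.r.t. params, storing only one snapshot per chunk."
	# Forward pass: no graph is kept, only a detached snapshot every `chunk` iterations.
	snapshots = [state0.clone()]
	state     = state0.clone()
	for t in range(n_iter) :
		state = step(Variable(state), Variable(params)).data
		if (t+1) % chunk == 0 : snapshots.append(state.clone())
	# Gradient of the loss with respect to the final state:
	last        = Variable(snapshots[-1], requires_grad = True)
	grad_state  = torch.autograd.grad(loss_of(last), last)[0]
	grad_params = torch.zeros(params.size()).type_as(params)
	# Backward pass: replay each chunk from its snapshot, with the graph enabled this time.
	for k in range(len(snapshots) - 2, -1, -1) :
		s, p = Variable(snapshots[k], requires_grad = True), Variable(params, requires_grad = True)
		z = s
		for _ in range(chunk) : z = step(z, p)
		grad_s, grad_p = torch.autograd.grad(z, (s, p), grad_outputs = grad_state)
		grad_state   = grad_s
		grad_params += grad_p.data
	return grad_params

# Memory usage is now driven by `chunk` (only one chunk's graph is alive at a time), while the
# total computation roughly doubles: the classic time-for-memory tradeoff.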
