How to save the computation graph of a gradient?

Hi, how should I save the computation graph of a gradient vector computed from torch.autograd.grad(loss, model.parameters(), create_graph=True)?

The background is that I want to compute Hessian-vector products for k vectors at once: H V, where H is the n × n Hessian of a neural network with n parameters, and V is a constant matrix with n rows and k columns. To do that, I use the double-backward trick: with g the gradient of the network's loss with respect to its parameters, differentiating the inner product between g and a column v of V with respect to the parameters again yields H v. An example that works for a tiny network is

import torch

# define the tiny "network"
class quadratic_fun(torch.nn.Module):
    def __init__(self):
        super(quadratic_fun, self).__init__()
        self.x = torch.nn.Parameter(torch.ones(5, requires_grad=True))
        self.y = torch.nn.Parameter(torch.ones(5, requires_grad=True))

    def forward(self):
        loss = torch.norm(self.x) ** 2 + torch.norm(self.y) ** 2
        return loss

# compute the flattened gradient with create_graph=True
model = quadratic_fun()
loss_quad = model.forward()
grad_ft = torch.autograd.grad(loss_quad, model.parameters(), create_graph=True)
flat_grad = torch.cat([g.contiguous().view(-1) for g in grad_ft])

# generate the constant matrix V, and compute the matrix-gradient product
torch.manual_seed(0)
V = torch.randn((10, 3))
h = torch.matmul(flat_grad, V)

# compute the matrix-Jacobian product by iterating over the columns of the constant matrix
for i in range(3):
    hvp = torch.autograd.grad(h[i], model.parameters(), retain_graph=True)
    hvp_flat = torch.cat([g.contiguous().view(-1) for g in hvp])
    print(hvp_flat)

which gives

tensor([-2.2517, -0.8678, -0.6320, -2.5267,  0.2397, -0.2232, -0.9854,  0.2248,
        -0.2046,  0.1050])
tensor([-2.3047,  1.6974, -4.2304,  0.7000,  2.4753, -1.2272,  0.4968, -1.6821,
         1.5849,  1.0457])
tensor([-0.5012,  1.3840,  0.6445,  0.6163, -0.2869,  0.0632,  0.8794, -4.6321,
        -0.5793,  4.6044])

However, this is not feasible on CUDA when H is the Hessian of a large neural network: with retain_graph=True in the torch.autograd.grad call inside the loop, the CUDA memory quickly fills up. If I don't retain the graph instead, it is freed after one iteration of the for loop, and I would need to compute the gradient again, which is time-consuming. Thus I wonder if I can save not only the gradient value but also its associated computation graph (both generated by grad_ft = torch.autograd.grad(loss_quad, model.parameters(), create_graph=True)) to a file or buffer, and reload it in a later iteration of the for loop.

Some other posts I looked into but that didn't answer my question:

  • This post suggests using JIT, but it is not clear to me how to use the API for the graph of a gradient vector.
  • A reply in this post suggests computing the matrix-Jacobian product with torch.autograd.functional.jacobian, but it looks like that API only works when the function to differentiate is explicitly defined (see the sketch after this list).
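For concreteness, here is a tiny sketch of what I mean by "explicitly defined" (my own toy example, not from the linked post):

import torch

# Toy stand-in for "a function whose Jacobian we want": an explicit
# function of the tensor we differentiate with respect to.
def g(x):
    return 2 * x

x = torch.ones(5)
J = torch.autograd.functional.jacobian(g, x)  # works: g is an explicit function of x
print(J)  # 2 * identity matrix

# But the gradient of an nn.Module's loss w.r.t. model.parameters() is not
# written as such a pure function of the parameters, so it is unclear to me
# how to apply this API to my problem.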

Thanks!

The short answer is: I think so! Using saved tensors default hooks.

The docs are still under review (#62362 and #62361) but the functionality is already merged to master as of today!

In particular, the first PR describes exactly your use case, where you want to save a computation graph to disk and retrieve it later when needed.
I think in your case you would want to do something like:

# compute the flattened gradient with create_graph=True and store the graph on disk
torch.autograd.graph.set_saved_tensors_default_hooks(pack_hook, unpack_hook)  
model = quadratic_fun()
loss_quad = model.forward()
grad_ft = torch.autograd.grad(loss_quad, model.parameters(), create_graph=True)
flat_grad = torch.cat([g.contiguous().view(-1) for g in grad_ft])

# generate the constant matrix V, and compute the matrix-gradient product
torch.manual_seed(0)
V = torch.randn((10, 3))
h = torch.matmul(flat_grad, V)
torch.autograd.graph.reset_saved_tensors_default_hooks()

# compute the matrix-Jacobian product by iterating over the columns of the constant matrix
for i in range(3):
    hvp = torch.autograd.grad(h[i], model.parameters(), retain_graph=True)
    hvp_flat = torch.cat([g.contiguous().view(-1) for g in hvp])
    print(hvp_flat)

where pack_hook and unpack_hook are defined here.

You can control exactly which part of the graph should be saved to disk by adapting the position of the calls to set_saved_tensors_default_hooks and reset_saved_tensors_default_hooks.
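To make the scoping concrete, here is a tiny standalone illustration (my own toy example, not from the PRs): only tensors that autograd saves for backward while the hooks are active go through pack_hook.

import torch

def pack_hook(tensor):
    print("packing a tensor of shape", tuple(tensor.shape))
    return tensor  # identity pack; a real hook would e.g. torch.save the tensor

def unpack_hook(packed):
    return packed  # identity unpack; a real hook would reload the tensor

x = torch.randn(3, requires_grad=True)

torch.autograd.graph.set_saved_tensors_default_hooks(pack_hook, unpack_hook)
y = x.sin()  # sin saves a tensor for backward, so pack_hook fires here
torch.autograd.graph.reset_saved_tensors_default_hooks()

z = y.exp()  # exp also saves a tensor for backward, but the hooks are inactive now
z.sum().backward()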

Alternatively, you can use the context manager torch.autograd.graph.save_on_cpu, cf. #62410.

Thanks Victor for the pointers to the new functionality! Is there a concrete example of what pack_hook and unpack_hook should look like? I tried your example here, but I don't understand why you use inc() and lambda x: x as the pack and unpack hooks, or why f("cpu") and f("cuda") take seemingly arbitrary large values between the set and reset hook calls.

Also, in your example

    class SelfClosingTempFile():
        def __init__(self):
            self.fp = tempfile.TemporaryFile()

        def __del__(self):
            self.fp.close()

    def pack_hook(tensor):
        sctf = SelfClosingTempFile()
        torch.save(tensor, sctf.fp)
        return sctf

    def unpack_hook(sctf):
        sctf.fp.seek(0)
        return torch.load(sctf.fp)

what is tempfile.TemporaryFile() supposed to be?

Hi Craig,

Please ignore the example of mine that you linked; that's a proof of concept of what an incorrect usage of the hooks would look like!
You can use the docs at Autograd mechanics — PyTorch 1.10.0 documentation and at Automatic differentiation package - torch.autograd — PyTorch 1.10.0 documentation.
To use the example you pasted, you need to import tempfile (tempfile — Generate temporary files and directories — Python 3.9.6 documentation).

Thanks Victor for the explanation. I imported tempfile as you suggested and have the following code that works:

import torch
import tempfile

class quadratic_fun(torch.nn.Module):
    def __init__(self):
        super(quadratic_fun, self).__init__()
        self.x = torch.nn.Parameter(torch.ones(5, requires_grad=True))
        self.y = torch.nn.Parameter(torch.ones(5, requires_grad=True))

    def forward(self):
        loss = torch.norm(self.x) ** 2 + torch.norm(self.y) ** 2
        return loss

class SelfClosingTempFile():
    def __init__(self):
        self.fp = tempfile.TemporaryFile()

    def __del__(self):
        self.fp.close()

def pack_hook(tensor):
    sctf = SelfClosingTempFile()
    torch.save(tensor, sctf.fp)
    return sctf

def unpack_hook(sctf):
    sctf.fp.seek(0)
    return torch.load(sctf.fp)

torch.autograd.graph.set_saved_tensors_default_hooks(pack_hook, unpack_hook)
model = quadratic_fun()
loss_quad = model.forward()
grad_ft = torch.autograd.grad(loss_quad, model.parameters(), create_graph=True)
flat_grad = torch.cat([g.contiguous().view(-1) for g in grad_ft])

torch.manual_seed(0)
V = torch.randn((10, 3))
h = torch.matmul(flat_grad, V)

torch.autograd.graph.reset_saved_tensors_default_hooks()

for i in range(3):
    hvp = torch.autograd.grad(h[i], model.parameters(), retain_graph=True)
    hvp_flat = torch.cat([g.contiguous().view(-1) for g in hvp])
    print(hvp_flat)

However, I want to set retain_graph to False in the torch.autograd.grad call inside the loop: I cannot retain the graph because of memory limits (which is exactly why I want to save the graph when I create it). If I remove retain_graph=True from the above code, I still get

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

which is the same error as without the hook functions. Is it because I put reset_saved_tensors_default_hooks in the wrong place? How should I achieve what I want with these hooks?

What if you call torch.autograd.graph.reset_saved_tensors_default_hooks() after the for loop instead (but keep the retain_graph option on)? Does that still exceed your memory limits?

Hi Victor, sorry for getting back to you late (it took me some time to install the latest PyTorch onto a machine with CUDA).

In my code that does the actual training (not the example code above), I tried moving torch.autograd.graph.reset_saved_tensors_default_hooks() to after the for loop, but got

...
AttributeError: 'SelfClosingTempFile' object has no attribute 'fp'
...
RuntimeError: OSError: [Errno 24] Too many open files: '/tmp/tmpw5thrpry'

I have no control over how many files the process can open at the same time, though, and I did not find a way to limit the number of files the SelfClosingTempFile class creates (via tempfile.TemporaryFile()). Do you know of a workaround? Thanks!

Hi Craig,

Thanks for taking the time to test this new functionality! You can try increasing the limit on the number of files that can be opened; see for example Python Subprocess: Too Many Open Files - Stack Overflow.
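If you cannot change the limit from the shell, something like this should raise the soft limit from within the Python process itself (a sketch assuming a Unix system):

import resource

# Raise the soft limit on open file descriptors to the hard limit
# (Unix only; raising the hard limit itself requires root).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("open-file soft limit raised from", soft, "to", hard)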

Edit: actually, I’ll provide you with another version of the hooks that should handle this issue.

Thanks for the pointer! After setting ulimit -Sn 500000, there now seems to be a problem with writing the temp files when I run my code with torch.autograd.graph.reset_saved_tensors_default_hooks() after the for loop:

terminate called after throwing an instance of 'c10::Error'
 what():  [enforce fail at inline_container.cc:300] . unexpected pos 320 vs 252
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f627ac4e4b7 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x25844b0 (0x7f62c2c164b0 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x257fa8c (0x7f62c2c11a8c in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xb5 (0x7f62c2c196f5 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7f62c2c199e3 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x125 (0x7f62c2c19c55 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xb1cec3 (0x7f62d5797ec3 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x558988 (0x7f62d51d3988 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x559c8e (0x7f62d51d4c8e in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #12: <unknown function> + 0x5548f5 (0x7f62d51cf8f5 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xaa175 (0x7f62d65c0175 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #14: <unknown function> + 0xfbf034 (0x7f62c1651034 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x37083df (0x7f62c3d9a3df in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f62d51cab66 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #17: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x132 (0x7f62c3d98672 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f62c3d90589 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x53 (0x7f62d5729163 in /home/craig/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #20: <unknown function> + 0xd6de4 (0x7f62d65ecde4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #21: <unknown function> + 0x9609 (0x7f62e658b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #22: clone + 0x43 (0x7f62e64b2293 in /lib/x86_64-linux-gnu/libc.so.6)

Do you have any idea what the cause might be?

This seems to be a serialization error. Do you get the same error with these hooks:

import torch
import os
import uuid

tmp_dir = "temp"
os.makedirs(tmp_dir, exist_ok=True)  # make sure the directory exists before torch.save writes into it

class quadratic_fun(torch.nn.Module):
    def __init__(self):
        super(quadratic_fun, self).__init__()
        self.x = torch.nn.Parameter(torch.ones(5, requires_grad=True))
        self.y = torch.nn.Parameter(torch.ones(5, requires_grad=True))

    def forward(self):
        loss = torch.norm(self.x) ** 2 + torch.norm(self.y) ** 2
        return loss

class SelfDeletingTempFile():
    def __init__(self):
        self.name = os.path.join(tmp_dir, str(uuid.uuid4()))

    def __del__(self):
        os.remove(self.name)

def pack_hook(tensor):
    temp_file = SelfDeletingTempFile()
    torch.save(tensor, temp_file.name)
    return temp_file

def unpack_hook(temp_file):
    return torch.load(temp_file.name)


torch.autograd.graph.set_saved_tensors_default_hooks(pack_hook, unpack_hook)

model = quadratic_fun()
loss_quad = model.forward()
grad_ft = torch.autograd.grad(loss_quad, model.parameters(), create_graph=True)
flat_grad = torch.cat([g.contiguous().view(-1) for g in grad_ft])

# generate the constant matrix V, and compute the matrix-gradient product
torch.manual_seed(0)
V = torch.randn((10, 3))
h = torch.matmul(flat_grad, V)

# compute the matrix-Jacobian product by iterating over the columns of the constant matrix
for i in range(3):
    hvp = torch.autograd.grad(h[i], model.parameters(), retain_graph=True)
    hvp_flat = torch.cat([g.contiguous().view(-1) for g in hvp])
    print(hvp_flat)

torch.autograd.graph.reset_saved_tensors_default_hooks()

Here are two other thoughts:

Maybe you can keep the graph of h on the GPU and move to disk only the part that computes the matrix-Jacobian product.

model = quadratic_fun()
loss_quad = model.forward()
grad_ft = torch.autograd.grad(loss_quad, model.parameters(), create_graph=True)
flat_grad = torch.cat([g.contiguous().view(-1) for g in grad_ft])

# generate the constant matrix V, and compute the matrix-gradient product
torch.manual_seed(0)
V = torch.randn((10, 3))
h = torch.matmul(flat_grad, V)

torch.autograd.graph.set_saved_tensors_default_hooks(pack_hook, unpack_hook)
# compute the matrix-Jacobian product by iterating over the columns of the constant matrix
for i in range(3):
    hvp = torch.autograd.grad(h[i], model.parameters(), retain_graph=True)
    hvp_flat = torch.cat([g.contiguous().view(-1) for g in hvp])
    print(hvp_flat)

torch.autograd.graph.reset_saved_tensors_default_hooks()

or even, instead of moving to disk, moving to CPU:

model = quadratic_fun()
loss_quad = model.forward()
grad_ft = torch.autograd.grad(loss_quad, model.parameters(), create_graph=True)
flat_grad = torch.cat([g.contiguous().view(-1) for g in grad_ft])

# generate the constant matrix V, and compute the matrix-gradient product
torch.manual_seed(0)
V = torch.randn((10, 3))
h = torch.matmul(flat_grad, V)

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    for i in range(3):
        hvp = torch.autograd.grad(h[i], model.parameters(), retain_graph=True)
        hvp_flat = torch.cat([g.contiguous().view(-1) for g in hvp])
        print(hvp_flat)

Thanks Victor for the additional thoughts. These two code snippets do work!

There is one issue, though: I tried the functions you provided with a recent nightly build (1.10.0.dev20210805+cu102). Probably because of some incompatibility between this version and my cudatoolkit, torch.autograd.grad() is much slower than before (in one instance, 40 seconds vs. 5 seconds). Is there a PyTorch + cudatoolkit version combination that you would recommend I try? Thanks!

Sorry, I’m not very familiar with the various PyTorch and cudatoolkit versions, but there should not be a slowdown with the more recent versions!
Is torch.autograd.grad() much slower with the latest version of PyTorch even when you don’t use the hooks?
It is expected that using the hooks incurs a performance penalty.
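If you want to quantify that penalty, a rough standalone sketch like this (a toy model of mine, not your training code) times torch.autograd.grad with and without the save_on_cpu hooks; the comparison is only meaningful on a CUDA device:

import contextlib
import time
import torch

def timed_grad(use_hooks):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(512, 512).to(device)  # stand-in model
    x = torch.randn(64, 512, device=device)
    # save_on_cpu packs the tensors saved for backward during this forward pass to CPU
    ctx = torch.autograd.graph.save_on_cpu(pin_memory=True) if use_hooks else contextlib.nullcontext()
    with ctx:
        loss = model(x).pow(2).mean()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    torch.autograd.grad(loss, model.parameters())
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

print("without hooks: %.4fs, with save_on_cpu: %.4fs" % (timed_grad(False), timed_grad(True)))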

Gotcha, thanks! Yes, torch.autograd.grad() has become much slower even when I don’t use the hooks. I am reinstalling everything and just wondered if you had any idea which version combination would work.

I’ll tell you how much of a penalty the hooks incur once I get the proper PyTorch + CUDA toolkit versions installed and up and running.