Can I put the input tensor back into the computation graph?

Hi, I have a question related to autograd. After finishing the forward pass (FP), I want to transfer the input tensor from GPU to CPU, and then after a while transfer it back to GPU for the backward pass to obtain grads. The problem is that if I transfer the tensor back, it doesn’t go back into the computation graph, so I am unable to obtain grads from the tensor.

I know I can register a hook on the original GPU tensor to obtain the grad. But I wonder if it’s possible to obtain the grad at the stage where the tensor is transferred back.

Here is a simple example:

import torch

device = torch.device("cuda:0")
cpu = torch.device("cpu")

a = torch.randn(5, requires_grad=True, device=device)
b = torch.randn(5, requires_grad=True, device=device)

with torch.autograd.graph.save_on_cpu():
    d = a * b

grads = []

def hook_wrapper(grads):
    def hook(input_gradient):
        # Record every gradient that autograd passes to this hook.
        grads.append(input_gradient)

    return hook

a =
a =



Here, len(grads) is 0, so I cannot obtain the grad.
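Since the snippet above is truncated, the following is only a guess at the situation it describes, not the original code. It assumes the tensor is round-tripped with `.to(...)` after the forward pass; the names `a_copy`, `leaf_grads`, and `copy_grads` are made up for illustration, and the device falls back to CPU when no GPU is present. The sketch shows that a hook on the original leaf fires, while a hook on the round-tripped copy does not, because the copy was never an input to `d`.

```python
import torch

# Fall back to CPU so the sketch also runs without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.randn(5, requires_grad=True, device=device)
b = torch.randn(5, requires_grad=True, device=device)

with torch.autograd.graph.save_on_cpu():
    d = a * b

leaf_grads, copy_grads = [], []

# Hook on the original leaf: this is the tensor d's graph refers to.
a.register_hook(leaf_grads.append)

# Round trip through the CPU. copy=True forces a new tensor even when
# the source and target devices are the same.
a_copy = a.to("cpu", copy=True).to(device, copy=True)

# Hook on the copy: the copy is derived from a, but d does not depend on it.
a_copy.register_hook(copy_grads.append)

d.sum().backward()

print(len(leaf_grads), len(copy_grads))  # 1 0
```

In this sketch the gradient for `a` is still available through the original leaf (or through `a.grad`); the copy only sits downstream of `a` and is never visited when backpropagating from `d`.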

@ptrblck Would you please give me some suggestions? Thanks!

A better way I’ve found to change the graph after the fact is to:

  1. .detach() the model outputs,
  2. use those to alter the targets (these aren’t in the graph and can be altered at this stage, before the loss function),
  3. Run your loss function on the altered targets.
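The three steps above could look roughly like the following sketch. The model, the data shapes, and the blending rule used to alter the targets are all placeholders chosen for illustration; the reply does not say how the targets should be altered.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(4, 2)        # placeholder model
x = torch.randn(8, 4)          # placeholder inputs
targets = torch.randn(8, 2)    # placeholder targets

outputs = model(x)

# 1. Detach the outputs so that using them does not extend the graph.
detached = outputs.detach()

# 2. Alter the targets using the detached outputs (hypothetical rule).
altered_targets = 0.9 * targets + 0.1 * detached

# 3. Run the loss on the altered targets; gradients flow only via outputs.
loss = F.mse_loss(outputs, altered_targets)
loss.backward()
```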

You won’t need to take the outputs or targets off the GPU. Just keep all of the tensors on the GPU and run your calculations there for speed.

@J_Johnson Thanks for your reply. In my case, though, the output tensors are too large to keep on the GPU, so I have to swap some outputs out of GPU memory to make room for other batches’ computation. BTW, I need to run all batches forward first and only then run them backward, so GPU memory is the key constraint.

Do you have any suggestions for this case?

The graph ends up taking two-thirds or more of the total memory during training; some optimizers take more and some less. I’d be surprised to see the outputs larger than the model and the gradients. Are you storing all of the outputs during training?

Do you have a second GPU? You may want to time each part of the process to make sure you are fine with any latency overhead.

Without knowing more about what you’re trying to accomplish, that is all I have for suggestions.

@J_Johnson Thanks. Let me explain my process more specifically. I want to achieve this process:

For a given large dataset, I slice the dataset into minibatches and run them through the forward pass batch by batch. (Yes, no batch runs backward before all batches have finished the forward pass, which results in many computation graphs residing on the GPU; that’s why I use torch.autograd.graph.save_on_cpu to move the saved tensors to the CPU.) After all batches finish FP, I move the output tensors from the CPU back to the GPU and run BP.

My current question is: in the backward pass, I don’t know how to put an output tensor from the CPU back into its place in the corresponding computation graph on the GPU. I think PyTorch might provide hook APIs for this, and I wonder if you could give me any suggestions.
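One possible shape for this workflow is sketched below. It relies on two documented behaviours: `save_on_cpu` moves saved tensors back to their original device when backward needs them, and `.to(...)` is a differentiable operation, so an output moved to the CPU without `.detach()` keeps its `grad_fn`. The model, the batch sizes, and the loss are placeholders, and the sketch assumes the dependency between samples only enters through the final loss.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Placeholder model and data.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
batches = [torch.randn(8, 16) for _ in range(4)]

outputs = []
with torch.autograd.graph.save_on_cpu():
    for x in batches:
        out = model(x.to(device))
        # Move the output off the GPU WITHOUT detaching it, so the copy
        # stays connected to the graph of this batch.
        outputs.append(out.to("cpu"))
        del out  # drop the local reference to the GPU-side output

# Later: bring the outputs back and compute one loss over all batches.
all_out = torch.cat([o.to(device) for o in outputs])
loss = all_out.pow(2).mean()  # placeholder loss

# Saved activations are copied back to the device as backward needs them.
loss.backward()
```

In this sketch no manual re-attachment is needed, because the tensors that were moved never left the graph. Whether the extra transfers are acceptable depends on the host-to-device bandwidth available.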

With Stochastic Gradient Descent (SGD) you’re just subtracting the gradient from each weight/bias on the fly.

Why not just compute the gradients for each batch, sum them together (so you have just one set of gradients), and then manually subtract them (times a learning rate; I’d set this much higher, since you’re running it all at once) from your parameters once all of your data has been run through the model? That is what you’re effectively doing anyway. You do have to keep each layer’s gradients separate: don’t sum across layers, just across batches.

You won’t need momentum, since you are doing it all in one go.

Because you’re adding them at each iteration, the memory overhead would be equal to your model size, so you could keep it on the GPU.

Yes, right… But for some reason (e.g., the samples depend on each other), I have to finish the whole dataset’s FP first and then run backward. Are there any suggestions for swapping the output tensors off the GPU and then back onto the GPU afterwards?
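The accumulation idea described above can be sketched as follows. The model, the data, and the learning rate are placeholders. Calling `backward()` per batch adds into each parameter’s `.grad`, so gradients are summed across batches but stay separate per parameter, and each batch’s graph can be freed right after its backward call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(4, 1)  # placeholder model
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
lr = 0.1                 # placeholder learning rate

before = [p.detach().clone() for p in model.parameters()]

model.zero_grad()
for x, y in batches:
    loss = F.mse_loss(model(x), y)
    # backward() accumulates into p.grad, summing over batches.
    loss.backward()

# One manual SGD step using the summed gradients.
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad
```

The extra memory here is one gradient tensor per parameter, which matches the remark that the overhead is about the size of the model.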

If the data inputs are sequentially dependent on each other (such as in a time series), perhaps what you need is a recurrent neural network (e.g. LSTM, GRU, etc.) or a transformer. RNNs have a gated memory in the layer, which helps the model learn sequential information. Transformers, on the other hand, keep positional information about the sequence and allow for more asynchronous processing of sequential information. For example, with an RNN you might feed in the last frame as the input, while with a Transformer you’d feed in 60 frames as one input (batches would then be windows of 60 frames, with the choice of window randomized).
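As a rough illustration of the windowing described above, the sketch below cuts a series into overlapping 60-frame windows and draws a random batch of them. The series length, feature count, and batch size are made-up values.

```python
import torch

torch.manual_seed(0)

num_frames, num_features, window = 200, 3, 60  # placeholder sizes
series = torch.randn(num_frames, num_features)

# unfold yields (num_windows, num_features, window); permute puts the
# time axis second so each window is (window, num_features).
windows = series.unfold(0, window, 1).permute(0, 2, 1)

# Randomize which windows make up a batch.
order = torch.randperm(windows.shape[0])
batch = windows[order[:16]]

print(windows.shape, batch.shape)
```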

What you’re asking might not be possible as autograd stores the operations in a DAG during forward propagation. You can read more about autograd here:

But I think if you become more familiar with the math behind the scenes, you’ll find there are more elegant solutions to your problem.
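To make the point about the DAG concrete, here is a small sketch that inspects the graph autograd records for `d = a * b`. It only uses `grad_fn` and `next_functions`, and it runs on the CPU for simplicity.

```python
import torch

a = torch.randn(5, requires_grad=True)
b = torch.randn(5, requires_grad=True)
d = a * b

# The node that produced d.
root = d.grad_fn
print(type(root).__name__)

# Its inputs: one gradient-accumulation node per leaf tensor.
for fn, _ in root.next_functions:
    print(type(fn).__name__)

# A copy made after the forward pass gets its own node, which is not
# among the inputs of d's node.
a_copy = a.to("cpu", copy=True)
print(a_copy.grad_fn in [fn for fn, _ in root.next_functions])
```

The graph is fixed at forward time and refers to the tensors that were actually used, which is why a tensor created afterwards is not visited when backpropagating from `d`.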