Optimizing for-loops in PyTorch

Hi, I have a training loop where the number of iterations (i.e. the number of times a tensor goes through the same block of layers) is dynamic: it's bounded by some maximum value, but can change from step to step during training.

This repo demonstrates it well: GitHub Repo

(the relevant lines are 76 and 80-81; the rest can be ignored)

Here’s a small snippet to further illustrate what I’m using:

```python
import torch
from random import randrange

iters = randrange(1, 10)        # number of passes through the block; changes every step
input: torch.Tensor = ...       # input tensor
outputs = ...                   # preallocated `iters`-length tensor, filled at each iteration

for i in range(iters):
    interim_tensor = myblock(input)  # myblock is a standard torch nn.Module
    outputs[i] = interim_tensor
```

In this case, the overhead from compilation is pretty large. I’m using :hugs: Accelerate and GPU utilization spikes anywhere from 2% to 98% several times a second. Clearly, this is really inefficient.

Any ideas on how I can optimize this? The only approach I can think of is building a static graph for the maximum number of iterations and then exiting the graph at some point, letting autograd automagically figure out what to calculate and backprop accordingly.
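Something like the rough sketch below is what I have in mind. It's purely illustrative: `myblock`, `max_iters`, `x`, and the dummy loss are made-up placeholders, not the code from the repo.

```python
import torch
import torch.nn as nn
from random import randrange

# Made-up placeholders, just to make the sketch runnable
max_iters = 10
myblock = nn.Linear(64, 64)
x = torch.randn(8, 64)

# Always run the maximum number of iterations so the loop structure is static
outs = []
for _ in range(max_iters):
    outs.append(myblock(x))

# "Exit" the graph: only the first `iters` outputs feed the loss, so the
# remaining iterations are simply not part of the loss's autograd graph
iters = randrange(1, max_iters + 1)
loss = torch.stack(outs[:iters]).pow(2).mean()   # dummy loss
loss.backward()
```

The obvious downside is that the forward pass still pays for all max_iters iterations even when iters is small.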

Thanks,
N

@ptrblck sorry for the ping, but would you have any idea how I can achieve this?

:pray: Does anyone have any idea at all? :slightly_frowning_face:

My idea would also be to try to script or compile the model/function, but it seems you have already tried it and the overhead is not acceptable.

How many different values for iters do you have? Could the overhead be amortized by a longer runtime assuming the number of different values for iters is not huge?

torch.compile didn’t give too much of a speedup; I reckon it might be about 10%?

Well, iters is bounded in [1, max_iters] (both inclusive); I don't get what you mean by a longer runtime though…

Well, if max_iters is e.g. 10, but you are training your model for 2 weeks, the initial overhead might be noise compared to the long runtime of your training.

Ah, no, my training times are usually under 20 h on an RTX 3090.
I'm not fully sure where the bottleneck here is, but I don't think changing the iterations like that would help the training process.

Is there a way I can have a static graph for max_iters, traverse the first iters part of it, and then exit it? The graph would still be static; autograd just wouldn't consider the remaining iterations part of the computation graph for that specific training step…

I’m unsure if torch.compile is able to do so, but @marksaroufim might know.


Yeah I don’t think torch.compile would help much here, you’re probably better off rewriting your code in a batched way


Thanks! Could you elaborate on this a little bit? What do you mean by a batched way? Also, doesn't torch.compile break the dynamic graph into pieces of static graphs?

Whoops, my bad, I just reread your code and noticed that each loop iteration depends on the previous one, so my batching suggestion won't work.

OK, I think you have a few options:

  1. Compile your model once for each size up front; Inductor has a code cache, which should speed up but not totally eliminate compilation (rough sketch below)
  2. Use dynamic=True when compiling the model
  3. Have a static graph for the max number of iterations

Also keep in mind that on consumer cards torch.compile() may or may not give you the best performance relative to server GPUs like the A100.
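Something along these lines for options 1 and 2 (just a sketch; `myblock`, `max_iters`, and `x` are made-up stand-ins for your actual module, the iteration bound, and a representative input):

```python
import torch
import torch.nn as nn

# Made-up placeholders so the sketch runs on its own
max_iters = 10
myblock = nn.Linear(64, 64)
x = torch.randn(8, 64)

def run_block(x, iters):
    outs = []
    for _ in range(iters):      # Dynamo unrolls this Python loop, so every
        x = myblock(x)          # distinct `iters` value gets its own graph
        outs.append(x)
    return torch.stack(outs)

# Option 2: dynamic=True asks the compiler not to specialize on input shapes
compiled_run = torch.compile(run_block, dynamic=True)

# Option 1: pay the compile cost up front, once per possible iteration count,
# instead of stalling mid-training whenever a new `iters` value first shows up
for n in range(1, max_iters + 1):
    compiled_run(x, n)
```

The warm-up loop still recompiles once per count, but at least it all happens before the timed part of your training run.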


Would that be something like what I mentioned above?

Is there a way I can have a static graph for max_iters, traverse the first iters part of it, and then exit it? The graph would still be static; autograd just wouldn't consider the remaining iterations part of the computation graph for that specific training step…

If so, how would I accomplish this?
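For concreteness, this is roughly what I'm picturing in eager mode (placeholder names again; I'm not sure whether the data-dependent if would keep torch.compile happy):

```python
import torch
import torch.nn as nn
from random import randrange

# Made-up placeholders, just for illustration
max_iters = 10
myblock = nn.Linear(64, 64)
x = torch.randn(8, 64)
iters = randrange(1, max_iters + 1)

outs = []
for i in range(max_iters):           # the loop always runs max_iters times
    if i < iters:
        x = myblock(x)               # recorded by autograd
    else:
        with torch.no_grad():
            x = myblock(x)           # not part of the autograd graph at all
    outs.append(x)

loss = torch.stack(outs[:iters]).pow(2).mean()   # dummy loss over the "real" iterations
loss.backward()
```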