Optimizing for-loops in PyTorch

Hi, I have a training loop where the number of iterations (i.e. the number of times a tensor goes through the same block of layers) is dynamic: it's bounded by some maximum value, but can change from step to step during training.

This repo demonstrates it well: GitHub Repo

(the relevant lines are 76 and 80-81; the rest can be ignored)

Here’s a small snippet to further illustrate what I’m using:

```python
import torch
from random import randrange

iters = randrange(1, 10)        # number of passes through the block; changes every step
input: torch.Tensor = ...       # input tensor
outputs = ...                   # preallocated `iters`-length tensor, filled at each iteration

for i in range(iters):
    interim_tensor = myblock(input)  # myblock is a standard torch nn.Module
    outputs[i] = interim_tensor
```

In this case, the overhead from compilation is pretty large. I’m using :hugs: Accelerate and GPU utilization spikes anywhere from 2% to 98% several times a second. Clearly, this is really inefficient.

Any ideas on how I can optimize this? The only approach I can think of is building a static graph for the maximum number of iterations and then exiting the graph at some point, letting autograd automagically figure out what to calculate and backprop accordingly.
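Something like the rough sketch below is what I have in mind. It's purely illustrative: `myblock`, `max_iters`, `x`, and the dummy loss are made-up placeholders, not the code from the repo.

```python
import torch
import torch.nn as nn
from random import randrange

# Made-up placeholders, just to make the sketch runnable
max_iters = 10
myblock = nn.Linear(64, 64)
x = torch.randn(8, 64)

# Always run the maximum number of iterations so the loop structure is static
outs = []
for _ in range(max_iters):
    outs.append(myblock(x))

# "Exit" the graph: only the first `iters` outputs feed the loss, so the
# remaining iterations are simply not part of the loss's autograd graph
iters = randrange(1, max_iters + 1)
loss = torch.stack(outs[:iters]).pow(2).mean()   # dummy loss
loss.backward()
```

The obvious downside is that the forward pass still pays for all max_iters iterations even when iters is small.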

Thanks,
N

@ptrblck sorry for the ping, but would you have any idea how I can achieve this?

:pray: Does anyone have any idea at all? :slightly_frowning_face:

My idea would also be to try to script or compile the model/function, but it seems you have already tried it and the overhead is not acceptable.

How many different values for iters do you have? Could the overhead be amortized by a longer runtime assuming the number of different values for iters is not huge?

torch.compile didn’t give too much of a speedup; I reckon it might be about 10%?

Well, iters is bounded in [1, max_iters] (both inclusive); I don't get what you mean by a longer runtime though…

Well, if max_iters is e.g. 10, but you are training your model for 2 weeks, the initial overhead might be noise compared to the long runtime of your training.

Ah, no, my training times are usually under 20 h on an RTX 3090.
I'm not fully sure where the bottleneck here is, but I don't think changing the iterations like that would help the training process.

Is there a way I can have a static graph for max_iters, traverse the first iters part of it, and then exit it? The graph would still be static; autograd just wouldn't consider the remaining iterations part of the computation graph for that specific training step…

I’m unsure if torch.compile is able to do so, but @marksaroufim might know.


Yeah I don’t think torch.compile would help much here, you’re probably better off rewriting your code in a batched way


Thanks! Could you elaborate on this a little bit? What do you mean by a batched way? Also, doesn't torch.compile break the dynamic graph into pieces of static graphs?

Whoops, my bad, I just reread your code and noticed that each loop iteration depends on the previous one, so my batching suggestion won't work.

OK, I think you have a few options:

  1. Compile your model once for each size up front; Inductor has a code cache, which should speed up but not totally eliminate compilation (rough sketch below)
  2. Use dynamic=True when compiling the model
  3. Have a static graph for the max number of iterations

Also keep in mind that on consumer cards torch.compile() may or may not give you the best performance relative to server GPUs like the A100.
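Something along these lines for options 1 and 2 (just a sketch; `myblock`, `max_iters`, and `x` are made-up stand-ins for your actual module, the iteration bound, and a representative input):

```python
import torch
import torch.nn as nn

# Made-up placeholders so the sketch runs on its own
max_iters = 10
myblock = nn.Linear(64, 64)
x = torch.randn(8, 64)

def run_block(x, iters):
    outs = []
    for _ in range(iters):      # Dynamo unrolls this Python loop, so every
        x = myblock(x)          # distinct `iters` value gets its own graph
        outs.append(x)
    return torch.stack(outs)

# Option 2: dynamic=True asks the compiler not to specialize on input shapes
compiled_run = torch.compile(run_block, dynamic=True)

# Option 1: pay the compile cost up front, once per possible iteration count,
# instead of stalling mid-training whenever a new `iters` value first shows up
for n in range(1, max_iters + 1):
    compiled_run(x, n)
```

The warm-up loop still recompiles once per count, but at least it all happens before the timed part of your training run.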


Would that be something like what I mentioned above?

Is there a way I can have a static graph for max_iters, traverse the first iters part of it, and then exit it? The graph would still be static; autograd just wouldn't consider the remaining iterations part of the computation graph for that specific training step…

If so, how would I accomplish this?
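For concreteness, this is roughly what I'm picturing in eager mode (placeholder names again; I'm not sure whether the data-dependent if would keep torch.compile happy):

```python
import torch
import torch.nn as nn
from random import randrange

# Made-up placeholders, just for illustration
max_iters = 10
myblock = nn.Linear(64, 64)
x = torch.randn(8, 64)
iters = randrange(1, max_iters + 1)

outs = []
for i in range(max_iters):           # the loop always runs max_iters times
    if i < iters:
        x = myblock(x)               # recorded by autograd
    else:
        with torch.no_grad():
            x = myblock(x)           # not part of the autograd graph at all
    outs.append(x)

loss = torch.stack(outs[:iters]).pow(2).mean()   # dummy loss over the "real" iterations
loss.backward()
```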