Hi, I have a training loop where the number of iterations (i.e. the number of times a tensor goes through the same block of layers) is dynamic: it's bounded by some maximum value, but during training it can change dynamically.
(relevant lines are 76 and 80-81; the rest can be ignored)
Here’s a small snippet to further illustrate what I’m using:
    from random import randrange
    import torch

    iters = randrange(1, 10)
    input: torch.Tensor = ...
    outputs = ...  # preallocate an `iters`-length tensor, to be filled at each iteration
    for i in range(iters):
        interim_tensor = myblock(input)  # myblock is a standard torch nn.Module
        outputs[i] = interim_tensor
In this case, the overhead from compilation is pretty large. I'm using Accelerate, and GPU utilization oscillates anywhere between 2% and 98% several times a second. Clearly, this is really inefficient.
Any ideas on how I can optimize this? The only way I can think of is perhaps having a static graph for the maximum number of iterations and then exiting the graph at some point, letting autograd automagically calculate and backprop accordingly.
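Roughly, what I imagine looks like the minimal sketch below (the block, the shapes, and the MSE loss are all placeholders, and I'm assuming the output is fed back into the block each iteration): the loop always runs MAX_ITERS times, but only the first iters outputs feed the loss, so the backward pass only traverses the iterations that were actually used (the forward pass still pays for the rest).

    import torch
    import torch.nn as nn
    from random import randrange

    MAX_ITERS = 10                       # hypothetical upper bound
    myblock = nn.Linear(64, 64)          # stand-in for the real block
    target = torch.zeros(8, 64)          # made-up target

    x = torch.randn(8, 64)
    iters = randrange(1, MAX_ITERS + 1)  # changes every training step

    outputs = []
    h = x
    for i in range(MAX_ITERS):           # loop structure is always the same
        h = myblock(h)
        outputs.append(h)

    # Only the first `iters` outputs contribute to the loss, so autograd's
    # backward pass never visits the remaining iterations.
    used = torch.stack(outputs[:iters])
    loss = nn.functional.mse_loss(used, target.expand_as(used))
    loss.backward()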
My idea would also be to try to script or compile the model/function, but it seems you have already tried it and the overhead is not acceptable:
How many different values for iters do you have? Could the overhead be amortized by a longer runtime assuming the number of different values for iters is not huge?
Well, if max_iters is e.g. 10, but you are training your model for 2 weeks, the initial overhead might be noise compared to the long runtime of your training.
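To make the amortization concrete, here is a rough warm-up sketch (all names and shapes are made up): if iters only ever takes the values 1..max_iters, you could trigger one compilation per value before training starts, so the cost is paid once up front. Depending on the PyTorch version you might also need to raise torch._dynamo.config.cache_size_limit so all specializations stay cached.

    import torch
    from random import randrange

    max_iters = 10
    myblock = torch.nn.Linear(64, 64)

    def step(x, iters):
        out = x
        for _ in range(iters):       # Python int: torch.compile specializes per value
            out = myblock(out)
        return out

    compiled_step = torch.compile(step)

    # Warm-up: compile once per possible value of `iters`
    # so no compilation happens mid-training.
    dummy = torch.randn(8, 64)
    for n in range(1, max_iters + 1):
        compiled_step(dummy, n)

    # From here on, every value of `iters` hits a cached graph.
    x = torch.randn(8, 64)
    out = compiled_step(x, randrange(1, max_iters + 1))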
Ah, no, my training times are usually <20h on an RTX 3090.
I’m not fully sure where the bottleneck is here, but I don’t think changing the iterations like that provides any help in the training process.
Is there a way I can have a static graph for max_iters, traverse the iters part of it, and then exit it? So the graph would still be static, just that autograd won’t consider the other iterations as part of the computation graph for that specific training step…
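Something like the hypothetical mask-based sketch below is what I have in mind (block, shapes, and loss are all made up): every step runs MAX_ITERS iterations and the loss has the same shape every time, with the unused iterations simply multiplied by zero. Note that autograd still walks back through the masked-out iterations, just with zero gradients, so this trades extra compute for a graph whose structure never changes.

    import torch
    import torch.nn as nn
    from random import randrange

    MAX_ITERS = 10
    myblock = nn.Linear(64, 64)
    target = torch.zeros(8, 64)

    x = torch.randn(8, 64)
    iters = randrange(1, MAX_ITERS + 1)

    # 0/1 mask of static length MAX_ITERS: 1 for iterations that "count".
    mask = (torch.arange(MAX_ITERS) < iters).float()

    outputs = []
    h = x
    for i in range(MAX_ITERS):                   # always MAX_ITERS iterations
        h = myblock(h)
        outputs.append(h)
    stacked = torch.stack(outputs)               # (MAX_ITERS, 8, 64)

    # Unused iterations are multiplied by 0, so they contribute nothing to
    # the loss, while every tensor shape stays identical from step to step.
    per_iter_loss = ((stacked - target) ** 2).mean(dim=(1, 2))   # (MAX_ITERS,)
    loss = (per_iter_loss * mask).sum() / mask.sum()
    loss.backward()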
Thanks! Could you elaborate on this a little bit? What do you mean by a batched way? Also, doesn’t torch.compile break the dynamic graph into pieces of static graphs?
Would that be something like what I mentioned above?
Is there a way I can have a static graph for max_iters, traverse the iters part of it, and then exit it? So the graph would still be static, just that autograd won’t consider the other iterations as part of the computation graph for that specific training step…
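As far as I understand (happy to be corrected), a Python-int loop bound like iters doesn’t so much break the graph as get specialized, i.e. one compiled graph per distinct value. One way around the recompilation would be to compile only myblock and leave the Python loop eager, so every call to the compiled block sees the same shapes regardless of iters. A rough sketch with made-up shapes:

    import torch
    import torch.nn as nn
    from random import randrange

    max_iters = 10
    myblock = nn.Linear(64, 64)

    # Compile only the block; the Python loop stays eager, so a different
    # `iters` value never forces a recompilation, since every call to the
    # compiled block sees the same input shape.
    compiled_block = torch.compile(myblock)

    x = torch.randn(8, 64)
    iters = randrange(1, max_iters + 1)

    h = x
    outputs = []
    for _ in range(iters):
        h = compiled_block(h)
        outputs.append(h)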