Nested computational graphs during forward for videos

Hello,

I am using models pre-trained on ImageNet to build custom video classification models.
In the first phase, we compute the forward pass on every frame:

outputs = []
for i in range(n_frames):
    outputs.append(pretrained_model(video[i]))

Given that “outputs” is updated at each iteration, this loop results in a nested computational graph, because the graph from every iteration is retained.

This causes a CUDA out-of-memory error for big models.

Is there a better way to do this?

Thanks

Hi,

Do you actually need all the outputs at the same time?
If so, you can try the checkpoint module to reduce the memory usage to only one branch at a time: see torch.utils.checkpoint — PyTorch 1.8.0 documentation

Hello albanD,

Thank you for your reply!!

Yes, I need to fill the outputs list because it will be used as input to a custom model later on.
I don’t fully understand the usage of checkpoints in this context: it is just the forward pass, the weights do not change, there is only a computational graph being created.
Can you be more specific, please?

Ideally, I would like to tell PyTorch to reuse the computational graph of the first frame for the following frames, since it does not change.

The thing is that what takes memory is not the graph itself, but the intermediate buffers that are needed to do the backward. And these buffers are different for each input you give.
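
To see this concretely, you can compare the peak memory of a forward with and without the buffers saved for backward (a rough sketch; the ResNet-18 and the input shape are just placeholders, not your actual setup):

import torch
import torchvision

model = torchvision.models.resnet18().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    out = model(x)  # no buffers are saved for backward
print("peak without grad:", torch.cuda.max_memory_allocated() // 2**20, "MiB")

del out
torch.cuda.reset_peak_memory_stats()
out = model(x)  # the saved buffers stay alive as long as out does
print("peak with grad:", torch.cuda.max_memory_allocated() // 2**20, "MiB")

The second number is much larger, and every output you keep in your list keeps its own buffers alive.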

You can try:

from torch.utils.checkpoint import checkpoint

outputs = []
for i in range(n_frames):
    outputs.append(checkpoint(pretrained_model, video[i]))

I tried it, but since I have to do a backward later on, it didn’t work.

Is there a way to save memory, given that it’s basically many forwards with no backward in between, while still keeping the buffers required to do a backward later?

Please tell me if it’s not clear, I can give more information.

It is not super clear. But if the checkpoint proposed above didn’t reduce the memory enough, then you most likely won’t have enough memory.
In particular, can you estimate the memory used by one such forward plus the state it keeps, and see if that will fit?
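
Something like this rough sketch can give you that estimate (it reuses the pretrained_model / video / n_frames names from your snippet and assumes everything already lives on the GPU):

import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()

out = pretrained_model(video[0])  # one forward; its buffers are kept for backward

kept = torch.cuda.memory_allocated() - before      # state retained per frame
peak = torch.cuda.max_memory_allocated() - before  # transient peak of one forward

print(f"kept per frame: {kept / 2**20:.1f} MiB")
print(f"transient peak of one forward: {peak / 2**20:.1f} MiB")
print(f"rough total for {n_frames} frames: {kept * n_frames / 2**20:.1f} MiB")

If that rough total is already above what your GPU has, then checkpointing alone won’t be enough.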

Thank you for making me discover checkpoints! They are very useful for trading compute for memory.

I’m afraid that since I’m checkpointing the whole pre-trained model, the output tensors no longer require grad.

I get this warning:

  UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

I made the input require grad. I hope this will solve the problem.
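
For reference, this is roughly what that workaround looks like (same names as in the snippets above; the .detach() is just there to make the frame a leaf before flipping requires_grad):

from torch.utils.checkpoint import checkpoint

outputs = []
for i in range(n_frames):
    frame = video[i].detach().requires_grad_()  # silences the warning
    outputs.append(checkpoint(pretrained_model, frame))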

Thank you
