Autograd error on ViT

I am running the ARoFace repository (https://github.com/msed-Ebrahimi/ARoFace, the official repository for "ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition", ECCV 2024) and trying to train a ViT model, but I get the following error.

Traceback (most recent call last):
  File "/workspace/ARoFace/train_v2.py", line 283, in <module>
    main(parser.parse_args())
  File "/workspace/ARoFace/train_v2.py", line 188, in main
    img, local_labels = adversarial_img_warping(backbone=backbone,
  File "/workspace/ARoFace/AdvWarp.py", line 66, in adversarial_img_warping
    grad_scale, grad_theta, grad_t = torch.autograd.grad(loss, [scale, theta, t])
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 412, in grad
    result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 267, in backward
    raise RuntimeError(
RuntimeError: Checkpointing is not compatible with .grad() or when an `inputs` parameter is passed to .backward(). Please use .backward() and do not pass its `inputs` argument.

I can run the same code with resnet100, so my guess is that the ViT has some layers that are not compatible with this function.

The error occurs because torch.utils.checkpoint with use_reentrant=True is not compatible with torch.autograd.grad(). Can you try use_reentrant=False instead?
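
For instance, wherever the backbone wraps a layer in torch.utils.checkpoint, the call would change along these lines. This is a minimal standalone sketch, not the actual ARoFace/ViT code; block and x are placeholders:

import torch
from torch.utils.checkpoint import checkpoint

# Placeholder for a transformer block that the ViT backbone checkpoints;
# the actual ARoFace code will look different.
def block(x):
    return torch.sin(x).sum()

x = torch.randn(4, requires_grad=True)

# Non-reentrant checkpointing recomputes the forward through the regular
# autograd engine, so restricting gradients to specific inputs, which is
# what torch.autograd.grad does, works as usual.
loss = checkpoint(block, x, use_reentrant=False)
(grad_x,) = torch.autograd.grad(loss, [x])
print(grad_x)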


It works, but can you explain why it works?

Sure, the two modes of activation checkpointing use different mechanisms under the hood. The use_reentrant=True approach, as the name suggests, secretly performs another reentrant .backward() during the backward pass. If the user does something special, such as passing the inputs= argument to the outer backward pass (torch.autograd.grad effectively does this, since it restricts the gradient computation to the given inputs), the inner backward pass currently cannot emulate that, as it always does a plain .backward().
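
A minimal standalone sketch (again with a placeholder block, not the ARoFace code) reproduces the failure mode: calling torch.autograd.grad through a reentrantly checkpointed function raises exactly the RuntimeError from the traceback above.

import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # stand-in for a checkpointed layer
    return torch.sin(x).sum()

x = torch.randn(4, requires_grad=True)

# The reentrant variant runs a nested, plain .backward() inside its own
# backward(); that nested pass cannot honor the inputs restriction that
# torch.autograd.grad needs, so the engine raises the RuntimeError above.
loss = checkpoint(block, x, use_reentrant=True)
try:
    torch.autograd.grad(loss, [x])
except RuntimeError as err:
    print(err)  # "Checkpointing is not compatible with .grad() ..."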