How to fix "RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed"?

Hi, I quantized a Llama 2 13B model using AutoAWQ and I'm trying to run it through vLLM, which uses torch.distributed's all_reduce. The environment details (torch/ray/vllm versions) are at the end of the stack trace screenshot.

The exception appears to be raised by this line:
https://github.com/pytorch/pytorch/blame/4e2e0437ea483a02aebefeab60e8870658990a5b/c10/core/TensorImpl.h#L370

but I'm not sure whether this is incorrect behavior, and if so, what the recommended way to fix it would be.
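
For reference, here is the smallest thing I could put together that reproduces the exact error message outside of vLLM. This is just my understanding of the mechanism (a tensor created under `torch.inference_mode()` being mutated in-place outside that context), not vLLM's actual code:

```python
import torch

# A tensor created inside inference mode becomes an "inference tensor".
with torch.inference_mode():
    x = torch.ones(4)

print(x.is_inference())  # True

# Any in-place mutation of an inference tensor outside inference mode fails:
x.add_(1)  # RuntimeError: Inplace update to inference tensor outside
           # InferenceMode is not allowed.
```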

The reason I suspect this is purely a torch.distributed issue is that the code runs perfectly fine when I set the number of GPUs to 1; with 2 GPUs I get this error. I'm running on a machine with eight 80GB A100 GPUs.
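
My working theory is that `dist.all_reduce` writes its result back into the input tensor in-place, and with 2+ GPUs that input happens to be an inference tensor while the call runs outside inference mode. Below is a sketch of the workarounds I'm considering; the function name and argument here are illustrative, not vLLM internals:

```python
import torch
import torch.distributed as dist

def reduce_hidden_states(hidden_states: torch.Tensor) -> torch.Tensor:
    # dist.all_reduce mutates its input tensor in-place, which is what I
    # assume trips the inference-tensor check when the tensor came out of a
    # torch.inference_mode() region.
    if hidden_states.is_inference():
        # Workaround A (assumption): clone outside inference mode; the clone
        # is a normal tensor and can be updated in-place.
        hidden_states = hidden_states.clone()
    dist.all_reduce(hidden_states)
    return hidden_states

# Workaround B (assumption): keep the whole forward pass, including the
# all_reduce, inside torch.inference_mode(), since in-place updates to
# inference tensors are allowed while inference mode is active:
#
# with torch.inference_mode():
#     out = model(input_ids)
```

Would either of these be the recommended approach, or is this something that should be handled inside vLLM/torch.distributed itself?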