Has “torch.compile” been tested with deepspeed?

I tested the combination of torch.compile and deepspeed with the LLaMA 2 model and ran into a problem:

  1. When running on a single machine with 8 GPUs, performance improves by roughly 30% or more.
  2. When running on 2 machines (16 GPUs total), there is no performance improvement.

I’m not sure how to debug this, and searching on Google didn’t turn up an answer.
Any suggestions? Thanks.
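For reference, a minimal sketch of how the two can be combined — the model, config path, and backend choice here are placeholders, not the exact setup from my run. The `backend="eager"` option is used only so the sketch runs without a C++ toolchain; in practice the default inductor backend is what gives the speedup:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the LLaMA 2 model from the post.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

# torch.compile wraps the module; compilation is lazy and only triggers
# on the first forward pass. backend="eager" avoids needing a compiler
# toolchain for this sketch.
compiled_model = torch.compile(model, backend="eager")

# With DeepSpeed, the compiled module would be handed to initialize()
# (commented out: requires a deepspeed install, launcher, and config):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=compiled_model,
#     config="ds_config.json",  # hypothetical config path
# )

out = compiled_model(torch.randn(2, 64))
```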

We’re in early discussions with the DeepSpeed team to figure this out; no concrete plans yet.