I tested the combination of torch.compile and DeepSpeed with the LLaMA2 model and ran into a problem:
- When running on a single machine with 8 GPUs, throughput improves by about 30% or more.
- When running on 2 machines with 16 GPUs, there is no performance improvement at all.
I'm not sure how to debug this. I searched a lot on Google but didn't find an answer.
Any suggestions? Thanks.
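
For reference, this is roughly how I wire torch.compile together with DeepSpeed. It is a simplified sketch, not my exact script: the toy model stands in for LLaMA2, and the config values and learning rate are placeholders.

```python
# Simplified sketch of the setup (toy model stands in for LLaMA2).
# Launched with the DeepSpeed launcher, e.g.: deepspeed --num_gpus 8 train.py
import torch
import torch.nn as nn
import deepspeed

class ToyModel(nn.Module):
    """Small stand-in for the real LLaMA2 model."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, x):
        # Return a scalar so it can be used directly as a loss.
        return self.net(x).mean()

# Placeholder DeepSpeed config; the real run uses a larger batch size and ZeRO settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Compile the model first, then hand the compiled module to DeepSpeed.
model = torch.compile(ToyModel())

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for _ in range(10):
    x = torch.randn(1, 1024, device=engine.device, dtype=torch.bfloat16)
    loss = engine(x)
    engine.backward(loss)
    engine.step()
```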