Torch 2.0 with DataParallel

It seems that torch.compile in PyTorch 2.0 gives little improvement when combined with torch.nn.DataParallel. The code looks like:

model_opt = torch.nn.DataParallel(model, device_ids=device_ids)
model_opt = torch.compile(model_opt)

With a single GPU, i.e., device_ids=[0], compiling gives roughly 1.5x faster training, but with multiple GPUs the speed falls back to that of the non-compiled model.
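
For context, here is a minimal self-contained sketch of the kind of training loop used for the timing comparison. The toy model, batch size, and step count below are placeholders for illustration, not the actual workload; the only change between runs is device_ids and whether torch.compile is applied.

import time
import torch
import torch.nn as nn

device_ids = [0, 1]  # set to [0] for the single-GPU case

# placeholder model; the real one is assumed to be different/larger
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

model_opt = torch.nn.DataParallel(model, device_ids=device_ids)
model_opt = torch.compile(model_opt)

optimizer = torch.optim.SGD(model_opt.parameters(), lr=1e-3)
x = torch.randn(256, 1024, device="cuda")

# warm-up steps so compilation time is excluded from the measurement
for _ in range(3):
    model_opt(x).sum().backward()
    optimizer.step()
    optimizer.zero_grad()

torch.cuda.synchronize()
start = time.time()
for _ in range(50):
    model_opt(x).sum().backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()
print(f"avg step time: {(time.time() - start) / 50:.4f} s")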